Machine Learning Question Bank [002].docx


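Machine Learning Question Bank

I. Maximum Likelihood

1. ML estimation of exponential model (10 points)
A Gaussian distribution is often used to model data on the real line, but it is sometimes inappropriate when the data are often close to zero yet constrained to be nonnegative. In such cases one can fit an exponential distribution, whose probability density function is given by
$p(x) = \frac{1}{b} e^{-x/b}, \quad x \ge 0.$
Given N observations $x_i$ drawn from such a distribution:
(a) Write down the likelihood as a function of the scale parameter b.
(b) Write down the derivative of the log likelihood.
(c) Give a simple expression for the ML estimate for b.

2. Repeat the exercise for the Poisson distribution: $p(x \mid \lambda) = \frac{\lambda^x e^{-\lambda}}{x!}, \quad x = 0, 1, 2, \ldots$

II. Bayes

1. Applying Bayes' rule
Suppose that on a multiple-choice exam question a candidate knows the correct answer with probability p and guesses with probability 1 - p. Assume that a candidate who knows the answer answers correctly with probability 1, and that a candidate who guesses answers correctly with probability $1/m$, where m is the number of choices. Given that the candidate answered the question correctly, find the probability that the candidate knew the correct answer.
Answer: $P(\text{knows} \mid \text{correct}) = \frac{p}{p + (1 - p)/m} = \frac{mp}{1 + (m - 1)p}$

2. Conjugate priors
Given a likelihood $p(x \mid \theta)$ for a class of models with parameters $\theta$, a conjugate prior is a distribution $p(\theta \mid \gamma)$ with hyperparameters $\gamma$, such that the posterior distribution $p(\theta \mid X, \gamma)$ belongs to the same family of distributions as the prior.
(a) Suppose that the likelihood is given by the exponential distribution with rate parameter $\lambda$: $p(x \mid \lambda) = \lambda e^{-\lambda x}$. Show that the gamma distribution $\mathrm{Gamma}(\lambda \mid \alpha, \beta) = \frac{\beta^{\alpha}}{\Gamma(\alpha)} \lambda^{\alpha - 1} e^{-\beta \lambda}$ is a conjugate prior for the exponential. Derive the parameter update given observations $x_1, \ldots, x_N$ and the prediction distribution $p(x_{N+1} \mid x_1, \ldots, x_N)$.
(b) Show that the beta distribution is a conjugate prior for the geometric distribution $p(x = k \mid \theta) = (1 - \theta)^{k - 1} \theta$, which describes the number of times a coin is tossed until the first heads appears, when the probability of heads on each toss is $\theta$. Derive the parameter update rule and prediction distribution.
(c) Suppose $p(\theta \mid \gamma)$ is a conjugate prior for the likelihood $p(x \mid \theta)$; show that the mixture prior $p(\theta \mid \gamma_1, \ldots, \gamma_M) = \sum_{m=1}^{M} w_m \, p(\theta \mid \gamma_m)$ is also conjugate for the same likelihood, assuming the mixture weights $w_m$ sum to 1.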
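(d) Repeat part (c) for the case where the prior is a single distribution and the likelihood is a mixture, and the prior is conjugate for each mixture component of the likelihood. Note that some priors can be conjugate for several different likelihoods; for example, the beta is conjugate for the Bernoulli and the geometric distributions, and the gamma is conjugate for the exponential and for the gamma with fixed shape parameter.
(e) (Extra credit, 20 points) Explore the case where the likelihood is a mixture with fixed components and unknown weights; i.e., the weights are the parameters to be learned.

III. True/False Questions

(1) Given n data points, if half of them are used for training and the other half for testing, the gap between training error and test error decreases as n increases.
(2) The maximum likelihood estimate is unbiased and has the smallest variance among all unbiased estimators, so the maximum likelihood estimate has the lowest risk.
(3) Given two regression functions A and B, if A is simpler than B, then A will almost certainly perform better than B on the test set.
(4) Global linear regression uses all sample points to predict the output for a new input, whereas local linear regression uses only the samples near the query point. Therefore global linear regression is computationally more expensive than local linear regression.
(5) Boosting and Bagging are both methods that combine multiple classifiers by voting, and both determine the weight of each individual classifier from its accuracy.
(6) In the boosting iterations, the training error of each new decision stump and the training error of the combined classifier vary roughly in concert. (F)
While the training error of the combined classifier typically decreases as a function of boosting iterations, the error of the individual decision stumps typically increases, since the example weights become concentrated at the most difficult examples.
(7) One advantage of Boosting is that it does not overfit. (F)
(8) Support vector machines are resistant to outliers, i.e., very noisy examples drawn from a different distribution.
(9) In regression analysis, best-subset selection can perform feature selection but is computationally expensive when the number of features is large; ridge regression and the Lasso are computationally cheap, and the Lasso can also perform feature selection.
(10) Overfitting is more likely when the training data are scarce.
(11) Gradient descent sometimes gets stuck in local minima, but the EM algorithm does not.
(12) In kernel regression, the parameter that most affects the trade-off between overfitting and underfitting is the width of the kernel function.
(13) In the AdaBoost algorithm, the weights on all the misclassified points will go up by the same multiplicative factor. (T)
(14) True/False: In a least-squares linear regression problem, adding an L2 regularization penalty cannot decrease the L2 error of the solution w on the training data.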
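(F)
(15) True/False: In a least-squares linear regression problem, adding an L2 regularization penalty always decreases the expected L2 error of the solution w on unseen test data. (F)
(16) Besides the EM algorithm, gradient descent can also be used to estimate the parameters of a Gaussian mixture model. (T)
(20) Any decision boundary that we get from a generative model with class-conditional Gaussian distributions could in principle be reproduced with an SVM and a polynomial kernel.
True! In fact, since class-conditional Gaussians always yield quadratic decision boundaries, they can be reproduced with an SVM with a kernel of degree less than or equal to two.
(21) AdaBoost will eventually reach zero training error, regardless of the type of weak classifier it uses, provided enough weak classifiers have been combined.
False! If the data is not separable by a linear combination of the weak classifiers, AdaBoost can't achieve zero training error.
(22) The L2 penalty in a ridge regression is equivalent to a Laplace prior on the weights. (F)
(23) The log-likelihood of the data will always increase through successive iterations of the expectation maximization algorithm. (F)
(24) In training a logistic regression model by maximizing the likelihood of the labels given the inputs we have multiple locally optimal solutions. (F)

IV. Regression

1. Consider a regularized regression problem. The figure below shows the log-likelihood (mean log-probability) on the training set and the test set for different values of the regularization parameter C, where the penalty is a quadratic regularization function. (10 points)
(1) Is the statement "as C increases, the log-likelihood on the training set in Figure 2 never increases" correct? Explain why.
(2) Explain why the log-likelihood on the test set in Figure 2 decreases when C takes large values.

2. Consider the linear regression model [equation not preserved in the source], with the training data shown in the figure below. (10 points)
(1) Estimate the parameters by maximum likelihood and draw the resulting model in figure (a). (3 points)
(2) Estimate the parameters by regularized maximum likelihood, i.e., add a regularization penalty to the log-likelihood objective, and draw in figure (b) the model obtained when the parameter C is very large. (3 points)
(3) After regularization, does the variance of the Gaussian distribution become larger, become smaller, or stay the same? (4 points)
[Figure (a)] [Figure (b)]

3.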
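Consider a regression problem over points in a two-dimensional input space, where the inputs lie in the unit square. The training and test samples are uniformly distributed in the unit square, and the outputs are generated by the model [equation not preserved in the source]. We use polynomial features of order 1 to 10 and a linear regression model to learn the relationship between x and y (a higher-order feature model contains all lower-order features), with squared-error loss.
(1) Models with 1st-, 2nd-, 8th- and 10th-order features are trained on [sample count not preserved in the source] samples and then tested on a large independent test set. In each of the three columns below, select the appropriate model (there may be more than one choice), and explain why the model you chose in the third column has low test error. (10 points)
Columns: lowest training error | highest training error | lowest test error
Rows (the source marks each row with X; the column position of each mark is not preserved):
  Linear model with 1st-order features: X
  Linear model with 2nd-order features: X
  Linear model with 8th-order features: X
  Linear model with 10th-order features: X
(2) Models with 1st-, 2nd-, 8th- and 10th-order features are trained on [sample count not preserved in the source] samples and then tested on a large independent test set. In each of the three columns below, select the appropriate model (there may be more than one choice), and explain why the model you chose in the third column has low test error. (10 points)
Columns: lowest training error | highest training error | lowest test error
Rows (the source marks each row with X; the column position of each mark is not preserved):
  Linear model with 1st-order features: X
  Linear model with 2nd-order features: (no mark)
  Linear model with 8th-order features: X X
  Linear model with 10th-order features: X
(3) The approximation error of a polynomial regression model depends on the number of training points. (T)
(4) The structural error of a polynomial regression model depends on the number of training points. (F)

4. We are trying to learn regression parameters for a dataset which we know was generated from a polynomial of a certain degree, but we do not know what this degree is. Assume the data was actually generated from a polynomial of degree 5 with some added Gaussian noise (that is, [equation not preserved in the source]). For training we have 100 (x, y) pairs and for testing we are using an additional set of 100 (x, y) pairs. Since we do not know the degree of the polynomial we learn two models from the data. Model A learns parameters for a polynomial of degree 4 and model B learns parameters for a polynomial of degree 6. Which of these two models is likely to fit the test data better?
Answer: The degree 6 polynomial. Since the true model is a degree 5 polynomial and we have enough training data, the model we learn for a degree 6 polynomial will likely fit a very small coefficient for $x^6$. Thus, even though it is a degree 6 polynomial, it will behave very much like a degree 5 polynomial, which is the correct model, leading to a better fit to the data.

5. Input-dependent noise in regression
Ordinary least-squares regression is equivalent to assuming that each data point is generated according to a linear function of the input plus zero-mean, constant-variance Gaussian noise.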
In many systems, however, the noise variance is itself a positive linear function of the input (which is assumed to be non-negative, i.e., x >= 0).
a) Which of the following families of probability models correctly describes this situation in the univariate case? (Hint: only one of them does.)
[The three candidate families (i), (ii) and (iii) are not preserved in the source.]
(iii) is correct. In a Gaussian distribution over y, the variance is determined by the coefficient of $y^2$; so by replacing $\sigma^2$ by $x\sigma^2$, we get a variance that increases linearly with x. (Note also the change to the normalization "constant".) (i) has quadratic dependence on x; (ii) does not change the variance at all, it just renames w1.
b) Circle the plots in Figure 1 that could plausibly have been generated by some instance of the model family(ies) you chose.
(ii) and (iii). (Note that (iii) works for [condition not preserved in the source].) (i) exhibits a large variance at x = 0, and the variance appears independent of x.
c) True/False: Regression with input-dependent noise gives the same solution as ordinary regression for an infinite data set generated according to the corresponding model.
True. In both cases the algorithm will recover the true underlying model.
d) For the model you chose in part (a), write down the derivative of the negative log likelihood with respect to w1.

V. Classification

1. Generative vs. discriminative models
(a) Your billionaire friend needs your help. She needs to classify job applications into good/bad categories, and also to detect job applicants who lie in their applications using density estimation to detect outliers. To meet these needs, do you recommend using a discriminative or generative classifier? Why?
Answer: a generative model, because a density has to be estimated.
(b) Your billionaire friend also wants to classify software applications to detect bug-prone applications using features of the source code. This pilot project only has a few applications to be used as training data, though. To create the most accurate classifier, do you recommend using a discriminative or generative classifier?
Why?
Answer: a discriminative model. When the number of samples is small, classifying directly with a discriminative model usually works better.
(d) Finally, your billionaire friend also wants to classify companies to decide which one to acquire. This project has lots of training data based on several decades of research. To create the most accurate classifier, do you recommend using a discriminative or generative classifier? Why?
Answer: a generative model. When there are many samples, the correct generative model can be learned.

2. Logistic regression
[Figure 2: Log-probability of labels as a function of regularization parameter C]
Here we use a logistic regression model to solve a classification problem. In Figure 2, we have plotted the mean log-probability of labels in the training and test sets after having trained the classifier with a quadratic regularization penalty and different values of the regularization parameter C.
1. In training a logistic regression model by maximizing the likelihood of the labels given the inputs we have multiple locally optimal solutions. (F)
Answer: The log-probability of labels given examples implied by the logistic regression model is a concave (convex down) function with respect to the weights. The (only) locally optimal solution is also globally optimal.
2. A stochastic gradient algorithm for training logistic regression models with a fixed learning rate will find the optimal setting of the weights exactly. (F)
Answer: A fixed learning rate means that we are always taking a finite step towards improving the log-probability of any single training example in the update equation. Unless the examples are somehow "aligned", we will continue jumping from side to side of the optimal solution, and will not be able to get arbitrarily close to it. The learning rate has to approach zero in the course of the updates for the weights to converge.
3. The average log-probability of training labels as in Figure 2 can never increase as we increase C.
(T) Stronger regularization means more constraints on the solution, and thus the (average) log-probability of the training examples can only get worse.
4. Explain why in Figure 2 the test log-probability of labels decreases for large values of C.
As C increases, we give more weight to constraining the predictor, and thus give less flexibility to fitting the training set. The increased regularization guarantees that the test performance gets closer to the training performance, but as we over-constrain our allowed predictors, we are not able to fit the training set at all, and although the test performance is now very close to the training performance, both are low.
5. The log-probability of labels in the test set would decrease for large values of C even if we had a large number of training examples. (T)
The above argument still holds, but the value of C for which we will observe such a decrease will scale up with the number of examples.
6. Adding a quadratic regularization penalty for the parameters when estimating a logistic regression model ensures that some of the parameters (weights associated with the components of the input vectors) vanish. (F)
A regularization penalty for feature selection must have a non-zero derivative at zero. Otherwise, the regularization has no effect at zero, and the weights will tend to be slightly non-zero, even when this does not improve the log-probabilities by much.

3. Regularized logistic regression
In this problem we will refer to the binary classification task depicted in Figure 1(a), which we attempt to solve with the simple linear logistic regression model
$P(y = 1 \mid \mathbf{x}, w_1, w_2) = g(w_1 x_1 + w_2 x_2) = \frac{1}{1 + \exp(-w_1 x_1 - w_2 x_2)}$
(for simplicity we do not use the bias parameter w0). The training data can be separated with zero training error; see line L1 in Figure 1(b) for instance.
[Figure 1(a): The 2-dimensional data set used in Problem 2]
[Figure 1(b): The points can be separated by L1 (solid line). Possible other decision boundaries are shown by L2, L3, L4.]
(1) Consider a regularization approach where we try to maximize
$\sum_{i=1}^{n} \log P(y_i \mid \mathbf{x}_i, w_1, w_2) - \frac{C}{2} w_2^2$
for large C. Note that only w2 is penalized.
We'd like to know which of the four lines in Figure 1(b) could arise as a result of such regularization. For each potential line L2, L3 or L4 determine whether it can result from regularizing w2. If not, explain very briefly why not.
L2: No. When we regularize w2, the resulting boundary can rely less on the value of x2 and therefore becomes more vertical. L2 here seems to be more horizontal than the unregularized solution, so it cannot come as a result of penalizing w2.
L3: Yes. Here $w_2^2$ is small relative to $w_1^2$ (as evidenced by the high slope), and even though it would assign a rather low log-probability to the observed labels, it could be forced by a large regularization parameter C.
L4: No. For very large C, we get a boundary that is entirely vertical (the line x1 = 0, i.e., the x2 axis). L4 here is reflected across the x2 axis and represents a poorer solution than its counterpart on the other side. For moderate regularization we have to get the best solution that we can construct while keeping w2 small. L4 is not the best and thus cannot come as a result of regularizing w2.
(2) If we change the form of regularization to one-norm (absolute value) and also regularize w1, we get the following penalized log-likelihood:
$\sum_{i=1}^{n} \log P(y_i \mid \mathbf{x}_i, w_1, w_2) - \frac{C}{2} \left( |w_1| + |w_2| \right)$
Consider again the problem in Figure 1(a) and the same linear logistic regression model. As we increase the regularization parameter C, which of the following scenarios do you expect to observe (choose only one):
( x ) First w1 will become 0, then w2.
(   ) w1 and w2 will become zero simultaneously.
(   ) First w2 will become 0, then w1.
(   ) None of the weights will become exactly zero, only smaller as C increases.
The data can be classified with zero training error, and therefore also with high log-probability, by looking at the value of x2 alone, i.e., making w1 = 0. Initially we might prefer to have a non-zero value for w1, but it will go to zero rather quickly as we increase regularization. Note that we pay a regularization penalty for a non-zero value o
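The contrast between the quadratic penalty and the one-norm penalty discussed above can be illustrated with a minimal sketch. It does not use the logistic log-likelihood from the problem; as a simplifying assumption it replaces the data-fit term for a single weight with the quadratic 0.5*(w - a)^2, where a is a hypothetical unregularized optimum. The function names `l1_minimizer` and `l2_minimizer` are illustrative only. Under this assumption the one-norm penalty drives the weight exactly to zero once C reaches |a|, while the quadratic penalty only shrinks it.

```python
def l1_minimizer(a: float, c: float) -> float:
    """Minimizer over w of 0.5*(w - a)**2 + c*abs(w) (soft thresholding)."""
    if a > c:
        return a - c
    if a < -c:
        return a + c
    # The penalty has a non-zero derivative at zero, so zero is a true optimum here.
    return 0.0


def l2_minimizer(a: float, c: float) -> float:
    """Minimizer over w of 0.5*(w - a)**2 + 0.5*c*w**2."""
    # The penalty's derivative vanishes at zero, so w shrinks but never reaches zero.
    return a / (1.0 + c)


if __name__ == "__main__":
    a = 0.8  # hypothetical unregularized optimum of one weight
    for c in (0.1, 0.5, 1.0, 5.0):
        print(f"C={c}: one-norm -> {l1_minimizer(a, c)}, quadratic -> {l2_minimizer(a, c)}")
```

The one-dimensional surrogate is chosen only because both minimizers have closed forms; the qualitative behavior, not the exact values, is what carries over to the logistic regression setting.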

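For the exponential-model question in the maximum likelihood section, the three requested quantities can be sketched numerically. This is a sketch under the assumption that the density is the scale-parameter form p(x) = (1/b) exp(-x/b); the function names and the simulated data are hypothetical and not part of the original question bank. With that density the log-likelihood is -N log b - (sum of x_i)/b, its derivative is -N/b + (sum of x_i)/b^2, and setting the derivative to zero gives the sample mean as the ML estimate.

```python
import math
import random


def log_likelihood(b: float, xs: list[float]) -> float:
    """Log-likelihood of scale b under the density (1/b) * exp(-x/b)."""
    return -len(xs) * math.log(b) - sum(xs) / b


def d_log_likelihood(b: float, xs: list[float]) -> float:
    """Derivative of the log-likelihood with respect to b."""
    return -len(xs) / b + sum(xs) / b**2


def ml_scale(xs: list[float]) -> float:
    """ML estimate of b: the root of the derivative, i.e. the sample mean."""
    return sum(xs) / len(xs)


if __name__ == "__main__":
    rng = random.Random(0)
    true_b = 2.0  # hypothetical scale used only to simulate data
    # random.expovariate takes the rate, which is 1 / scale.
    xs = [rng.expovariate(1.0 / true_b) for _ in range(10_000)]
    b_hat = ml_scale(xs)
    print("ML estimate:", b_hat)
    print("derivative at the estimate:", d_log_likelihood(b_hat, xs))
```

Evaluating the derivative at the estimate is a quick sanity check: it should be zero up to floating-point error, and the log-likelihood should be lower at any other value of b.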