Data Mining Homework 2 (11 pages).doc

Uploaded by: 飞****2

Document ID: 14050427

Uploaded: 2022-05-02

Format: DOC

Pages: 11

Size: 606KB


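Data Mining: Homework 2

Question 1

1. a) Compute the Information Gain for Gender, Car Type and Shirt Size.
b) Construct a decision tree with Information Gain.

Answer:

a) The class attribute takes two values, C0 and C1, with 10 records each, so the expected information (entropy) of the class distribution is Info(D) = 1.

1. Splitting on Gender: Info_Gender(D) = 0.971, so Gain(Gender) = 1 - 0.971 = 0.029.
2. Splitting on Car Type: Info_CarType(D) = 0.314, so Gain(Car Type) = 1 - 0.314 = 0.686.
3. Splitting on Shirt Size: Info_ShirtSize(D) = 0.988, so Gain(Shirt Size) = 1 - 0.988 = 0.012.

b) The results in (a) show that splitting on Car Type gives the largest information gain, so Car Type is the root of the decision tree:

[Figure: decision tree with root "Car Type?" and branches family, Sport and luxury; one branch is split further on "Shirt Size?" into small versus medium, large, extra large; the leaves are labeled C0 and C1.]

Question 2

2. (a) Design a multilayer feed-forward neural network (one hidden layer) for the data set in Q1. Label the nodes in the input and output layers.
(b) Using the neural network obtained above, show the weight values after one iteration of the back propagation algorithm, given the training instance “(M, Family, Small)”. Indicate your initial weight values and biases and the learning rate used.

Answer:

a) [Figure: network diagram with input units 1 to 9, hidden units 10 and 11, and output unit 12.]

b) Following (a), each input unit represents one attribute value. For the training instance the inputs are:

Unit       X11  X12  X21     X22     X23     X31    X32     X33    X34
Attribute  F    M    Family  Sports  Luxury  Small  Medium  Large  Extra Large
Input      0    1    1       0       0       1      0       0      0

Initial weights and biases are normally generated at random, so the following initial values are chosen here:

Weight  W1,10  W1,11  W2,10  W2,11  W3,10  W3,11  W4,10  W4,11  W5,10  W5,11
Value   0.2    0.2    -0.2   -0.1   0.4    0.3    -0.2   -0.1   0.1    -0.1

Weight  W6,10  W6,11  W7,10  W7,11  W8,10  W8,11  W9,10  W9,11  W10,12  W11,12
Value   0.1    -0.2   -0.4   0.2    0.2    0.2    -0.1   0.3    -0.3    -0.1

Bias of unit  10    11   12
Value         -0.2  0.2  0.3

Net input and output:

Unit j  Net input Ij  Output Oj
10      0.1           0.52
11      0.2           0.55
12      0.089         0.48

Error at each node:

Unit j  Errj
10      0.0089
11      0.0030
12      -0.12

Updated weights and biases:

Weight  W1,10  W1,11  W2,10   W2,11   W3,10  W3,11  W4,10   W4,11   W5,10  W5,11
Value   0.201  0.198  -0.211  -0.099  0.4    0.308  -0.202  -0.098  0.101  -0.100

Weight  W6,10  W6,11   W7,10   W7,11  W8,10  W8,11  W9,10   W9,11  W10,12  W11,12
Value   0.092  -0.211  -0.400  0.198  0.201  0.190  -0.110  0.300  -0.304  -0.099

Bias of unit  10      11     12
Value         -0.287  0.179  0.344

Question 3

3. a) Suppose the fraction of undergraduate students who smoke is 15% and the fraction of graduate students who smoke is 23%. If one-fifth of the college students are graduate students and the rest are undergraduates, what is the probability that a student who smokes is a graduate student?
b) Given the information in part (a), is a randomly chosen college student more likely to be a graduate or undergraduate student?
c) Suppose 30% of the graduate students live in a dorm but only 10% of the undergraduate students live in a dorm. If a student smokes and lives in the dorm, is he or she more likely to be a graduate or undergraduate student? You can assume independence between students who live in a dorm and those who smoke.

Answer:

a) Let A = {A1, A2}, where A1 denotes undergraduate students and A2 denotes graduate students, and let B denote the event that a student smokes. From the problem statement:

P(B|A1) = 15%, P(B|A2) = 23%, P(A1) = 4/5, P(A2) = 1/5

The quantity required is P(A2|B). By Bayes' theorem:

P(A2|B) = P(B|A2)P(A2) / [P(B|A1)P(A1) + P(B|A2)P(A2)] = (0.23 × 0.2) / (0.15 × 0.8 + 0.23 × 0.2) = 0.046 / 0.166 ≈ 0.277

b) From (a), a randomly chosen student who smokes is a graduate student with probability 0.277 and an undergraduate with probability 0.723, so the student is much more likely to be an undergraduate. The same conclusion holds without the smoking information, since P(A1) = 0.8 > P(A2) = 0.2.

c) Let C denote the event that a student lives in a dorm, so P(C|A2) = 30% and P(C|A1) = 10%. With B and C independent within each group:

P(A2|B,C) = P(B|A2)P(C|A2)P(A2) / [P(B|A1)P(C|A1)P(A1) + P(B|A2)P(C|A2)P(A2)] = 0.0138 / (0.0120 + 0.0138) ≈ 0.535

so a student who smokes and lives in the dorm is more likely to be a graduate student.

Question 4

4. Suppose that the data mining task is to cluster the following ten points (with (x, y, z) representing location) into three clusters:
A1(4,2,5), A2(10,5,2), A3(5,8,7), B1(1,1,1), B2(2,3,2), B3(3,6,9), C1(11,9,2), C2(1,4,6), C3(9,1,7), C4(5,6,7)
The distance function is Euclidean distance. Suppose initially we assign A1, B1, C1 as the center of each cluster, respectively. Use the K-Means algorithm to show only
(a) The three cluster centers after the first round of execution
(b) The final three clusters

Answer:

a) Squared Euclidean distance from each point to each center (squaring does not change which center is nearest).

Round 1:

Point  A1  B1   C1
A2     54  98   17
A3     41  101  62
B2     14  6    117
B3     33  93   122
C2     14  34   141
C3     30  100  93
C4     21  77   70

The resulting three clusters are:
{A1, A3, B3, C2, C3, C4}, {B1, B2}, {C1, A2}
so the three new cluster centers are (4.5, 4.5, 6.83), (1.5, 2, 1.5), (10.5, 7, 2).

Round 2, using the new cluster means (4.5, 4.5, 6.83), (1.5, 2, 1.5), (10.5, 7, 2):

Point  (4.5,4.5,6.83)  (1.5,2,1.5)  (10.5,7,2)
A1     9.86111         18.5         76.25
A2     53.86111        81.5         4.25
A3     12.52778        78.5         56.25
B1     58.52778        1.5          127.25
B2     31.86111        1.5          88.25
B3     9.19444         74.5         106.25
C1     85.86111        139.5        4.25
C2     13.19444        24.5         115.25
C3     32.52778        87.5         63.25
C4     2.52778         58.5         56.25

The new clusters are therefore:
{A1, A3, B3, C2, C3, C4}, {B1, B2}, {C1, A2}
These are identical to the clusters at the end of round 1, so the assignment no longer changes and the clusters above are the final result.

Part II: Lab

Question 1

Assume this supermarket would like to promote milk. Use the data in “transactions” as training data to build a decision tree (C5.0 algorithm) model to predict whether the customer would buy milk or not.

1. Build a decision tree using data set “transactions” that predicts milk as a function of the other fields.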

Set the “type” of each field to “Flag”, set the “direction” of “milk” as “out”, set the “type” of COD as “Typeless”, select “Expert” and set the “pruning severity” to 65, and set the “minimum records per child branch” to be 95. Hand-in: A figure showing your tree.
2. Use the model (the full tree generated by Clementine in step 1 above) to make a prediction for each of the 20 customers in the “rollout” data to determine whether the customer would buy milk. Hand-in: your prediction for each of the 20 customers.
3. Hand-in: rules for positive (yes) prediction of milk purchase identified from the decision tree (up to the fifth level. The root is considered as level 1). Compare with the rules generated by Apriori in Homework 1, and submit your brief comments on the rules (e.g., pruning effect).

Answer:

1. The generated decision tree is:

The generated decision tree model is:

juices = 1 Mode: 1
  water = 1 Mode: 1 => 1
  water = 0 Mode: 0
    pasta = 1 Mode: 1 => 1
    pasta = 0 Mode: 0
      tomato souce = 1 Mode: 1 => 1
      tomato souce = 0 Mode: 0
        biscuits = 1 Mode: 1 => 1
        biscuits = 0 Mode: 0 => 0
juices = 0 Mode: 0
  yoghurt = 1 Mode: 1
    water = 1 Mode: 1 => 1
    water = 0 Mode: 0
      biscuits = 1 Mode: 1 => 1
      biscuits = 0 Mode: 0
        brioches = 1 Mode: 1 => 1
        brioches = 0 Mode: 0
          beer = 1 Mode: 1 => 1
          beer = 0 Mode: 0 => 0
  yoghurt = 0 Mode: 0
    beer = 1 Mode: 0
      biscuits = 1 Mode: 1 => 1
      biscuits = 0 Mode: 0
        rice = 1 Mode: 1 => 1
        rice = 0 Mode: 0
          coffee = 1 Mode: 1
            water = 1 Mode: 1 => 1
            water = 0 Mode: 0 => 0
          coffee = 0 Mode: 0 => 0
    beer = 0 Mode: 0
      frozen vegetables = 1 Mode: 0
        biscuits = 1 Mode: 1
          pasta = 1 Mode: 1 => 1
          pasta = 0 Mode: 0 => 0
        biscuits = 0 Mode: 0
          oil = 1 Mode: 1 => 1
          oil = 0 Mode: 0
            brioches = 1 Mode: 0
              water = 1 Mode: 1 => 1
              water = 0 Mode: 0 => 0
            brioches = 0 Mode: 0 => 0
      frozen vegetables = 0 Mode: 0
        pasta = 1 Mode: 0
          mozzarella = 1 Mode: 1 => 1
          mozzarella = 0 Mode: 0
            water = 1 Mode: 1
              biscuits = 1 Mode: 1 => 1
              biscuits = 0 Mode: 0
                brioches = 1 Mode: 1 => 1
                brioches = 0 Mode: 0
                  coffee = 1 Mode: 1 => 1
                  coffee = 0 Mode: 0 => 0
            water = 0 Mode: 0
              coke = 1 Mode: 0
                coffee = 1 Mode: 1 => 1
                coffee = 0 Mode: 0 => 0
              coke = 0 Mode: 0 => 0
        pasta = 0 Mode: 0
          water = 1 Mode: 0
            coffee = 1 Mode: 1 => 1
            coffee = 0 Mode: 0 => 0
          water = 0 Mode: 1
            rice = 1 Mode: 0 => 0
            rice = 0 Mode: 1
              tunny = 1 Mode: 0
                biscuits = 1 Mode: 1 => 1
                biscuits = 0 Mode: 0 => 0
              tunny = 0 Mode: 1
                brioches = 1 Mode: 0 => 0
                brioches = 0 Mode: 1
                  coke = 1 Mode: 0 => 0
                  coke = 0 Mode: 1
                    coffee = 1 Mode: 0 => 0
                    coffee = 0 Mode: 1
                      biscuits = 1 Mode: 0 => 0
                      biscuits = 0 Mode: 1
                        oil = 1 Mode: 0 => 0
                        oil = 0 Mode: 1
                          tomato souce = 1 Mode: 0 => 0
                          tomato souce = 0 Mode: 1
                            mozzarella = 1 Mode: 0 => 0
                            mozzarella = 0 Mode: 1
                              crackers = 1 Mode: 0 => 0
                              crackers = 0 Mode: 1
                                frozen fish = 1 Mode: 0 => 0
                                frozen fish = 0 Mode: 1 => 1

2. Predictions obtained with the decision tree generated in step 1:

3. The generated association rules are:

Question 2: Churn Management

The goal of this assignment is to introduce churn management using decision trees, logistic regression and neural network. You will try different combinations of the parameters to see their impacts on the accuracy of your models for this specific data set. This data set contains summarized data records for each customer for a phone company. Our goal is to build a model so that this company can predict potential churners. Two data sets are available, churn_training.txt and churn_validation.txt. Each data set has 21 variables. They are:

(1) Confusion matrix produced by the decision tree:

Confusion matrix  1   0
1                 21  22
0                 9   981

(2) Confusion matrix produced by the neural network:

Confusion matrix  1   0
1                 31  12
0                 24  966

(3) Confusion matrix produced by logistic regression:

Confusion matrix  1   0
1                 11  32
0                 34  956

(4) The accuracy figures below show that the decision tree and the neural network classify more accurately than logistic regression.

Measure     decision tree  neural network  logistic regression
Accuracy    97.00%         96.51%          93.61%
Error rate  3.00%          3.49%           6.39%
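The entropy and information-gain arithmetic in Part I, Question 1 can be checked with a few lines of Python. This is only a sketch: the homework's data table is not reproduced in this document, so the per-value class counts for Gender below are an assumption, chosen because they reproduce the reported Info_Gender(D) = 0.971. The function names are illustrative.

```python
from math import log2

def entropy(counts):
    """Shannon entropy in bits of a list of class counts, e.g. [10, 10]."""
    total = sum(counts)
    return -sum(c / total * log2(c / total) for c in counts if c > 0)

def information_gain(parent_counts, partitions):
    """Gain = Info(D) - sum over partitions of |Dj|/|D| * Info(Dj)."""
    total = sum(parent_counts)
    info_after_split = sum(sum(p) / total * entropy(p) for p in partitions)
    return entropy(parent_counts) - info_after_split

# Class distribution stated in the answer: 10 records of C0, 10 of C1.
parent = [10, 10]

# ASSUMED class counts [C0, C1] for each Gender value (not in the text above).
gender_partitions = [[6, 4], [4, 6]]

info_d = entropy(parent)
gain_gender = information_gain(parent, gender_partitions)
print(round(info_d, 3), round(gain_gender, 3))
```

The same `information_gain` call applies to Car Type and Shirt Size once their class counts are supplied.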
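For Part I, Question 2, one backpropagation step on the 9-2-1 network can be sketched as follows, using the initial weights and biases listed in the answer. Two values are assumptions because the text does not state them: the target output (taken as 0, which is consistent with the negative error reported for unit 12) and the learning rate (0.9, purely illustrative). With a standard logistic activation the sketch gives an output of about 0.52 for unit 12, so the tabulated 0.48 and the quantities derived from it may be worth rechecking.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

# Inputs for (M, Family, Small): units 1..9 = X11, X12, X21, X22, X23, X31..X34.
x = [0, 1, 1, 0, 0, 1, 0, 0, 0]
# Initial weights from the answer: w_hidden[i] = (W(i+1),10, W(i+1),11).
w_hidden = [(0.2, 0.2), (-0.2, -0.1), (0.4, 0.3), (-0.2, -0.1), (0.1, -0.1),
            (0.1, -0.2), (-0.4, 0.2), (0.2, 0.2), (-0.1, 0.3)]
w_out = [-0.3, -0.1]        # W10,12 and W11,12
theta_hidden = [-0.2, 0.2]  # biases of units 10 and 11
theta_out = 0.3             # bias of unit 12
target = 0.0                # ASSUMED target output (class C0 encoded as 0)
rate = 0.9                  # ASSUMED learning rate

# Forward pass: net input and output of units 10, 11 and 12.
net_hidden = [sum(x[i] * w_hidden[i][j] for i in range(9)) + theta_hidden[j]
              for j in range(2)]
out_hidden = [sigmoid(n) for n in net_hidden]
net_out = sum(o * w for o, w in zip(out_hidden, w_out)) + theta_out
out = sigmoid(net_out)

# Backward pass: error of the output unit, then of the hidden units.
err_out = out * (1 - out) * (target - out)
err_hidden = [o * (1 - o) * err_out * w for o, w in zip(out_hidden, w_out)]

# Updates: w_ij += rate * Err_j * O_i and theta_j += rate * Err_j.
new_w_out = [w + rate * err_out * o for w, o in zip(w_out, out_hidden)]
new_theta_out = theta_out + rate * err_out
new_w_hidden = [tuple(w[j] + rate * err_hidden[j] * x[i] for j in range(2))
                for i, w in enumerate(w_hidden)]
new_theta_hidden = [t + rate * e for t, e in zip(theta_hidden, err_hidden)]

print(net_hidden, out_hidden, net_out, out)
```

Under this update rule, a weight leaving an input unit whose value is 0 stays unchanged, which is a quick sanity check for any hand-computed table.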
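The Bayes computations in Part I, Question 3 can be reproduced with a short script. It is a minimal sketch that uses only the probabilities given in the problem; the dictionary keys and the `posterior` helper are illustrative names. Part (c) multiplies the two likelihoods, which relies on the independence assumption the question allows.

```python
# Priors and likelihoods from the problem statement.
prior = {"grad": 0.2, "undergrad": 0.8}
p_smoke = {"grad": 0.23, "undergrad": 0.15}
p_dorm = {"grad": 0.30, "undergrad": 0.10}

def posterior(likelihood, prior):
    """Bayes' theorem: normalize likelihood * prior over all hypotheses."""
    joint = {k: likelihood[k] * prior[k] for k in prior}
    evidence = sum(joint.values())
    return {k: v / evidence for k, v in joint.items()}

# (a) P(grad | smokes)
post_smoke = posterior(p_smoke, prior)

# (c) P(grad | smokes and lives in dorm), with smoking and dorm residence
# treated as independent within each group.
p_both = {k: p_smoke[k] * p_dorm[k] for k in prior}
post_both = posterior(p_both, prior)

print(round(post_smoke["grad"], 3), round(post_both["grad"], 3))
```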
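The K-Means rounds in Part I, Question 4 can be replayed in plain Python. The sketch below uses the ten points and the initial centers A1, B1, C1 from the problem; the helper names and the `history` list are illustrative. It compares squared Euclidean distances, as the distance tables in the answer do, since squaring does not change which center is nearest.

```python
points = {
    "A1": (4, 2, 5), "A2": (10, 5, 2), "A3": (5, 8, 7),
    "B1": (1, 1, 1), "B2": (2, 3, 2), "B3": (3, 6, 9),
    "C1": (11, 9, 2), "C2": (1, 4, 6), "C3": (9, 1, 7), "C4": (5, 6, 7),
}

def sqdist(p, q):
    """Squared Euclidean distance between two 3-D points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def assign(centers):
    """Assign every point to its nearest center; returns lists of names."""
    clusters = [[] for _ in centers]
    for name, p in points.items():
        nearest = min(range(len(centers)), key=lambda k: sqdist(p, centers[k]))
        clusters[nearest].append(name)
    return clusters

def centroid(names):
    """Coordinate-wise mean of the named points."""
    return tuple(sum(points[n][d] for n in names) / len(names) for d in range(3))

centers = [points["A1"], points["B1"], points["C1"]]
history = []  # centers computed at the end of each round
while True:
    clusters = assign(centers)
    new_centers = [centroid(c) for c in clusters]
    history.append(new_centers)
    if new_centers == centers:  # assignment is stable, so stop
        break
    centers = new_centers

print(history[0])
print(clusters)
```

`history[0]` holds the centers after the first round and `clusters` the final assignment, matching parts (a) and (b).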
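The accuracy and error rates for the three churn models in Part II, Question 2 follow directly from the confusion matrices. A small sketch, assuming each matrix is read as two rows of two counts with the correctly classified records on the diagonal (the text does not say which axis is the actual class and which the predicted class, and accuracy does not depend on it):

```python
# Confusion matrices from the answer, rows and columns ordered as (1, 0).
matrices = {
    "decision tree": [[21, 22], [9, 981]],
    "neural network": [[31, 12], [24, 966]],
    "logistic regression": [[11, 32], [34, 956]],
}

def accuracy(matrix):
    """Fraction of records on the diagonal of a square confusion matrix."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total

results = {name: accuracy(m) for name, m in matrices.items()}
for name, acc in results.items():
    print(f"{name}: accuracy {acc:.2%}, error rate {1 - acc:.2%}")
```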

