《数据仓库与数据挖掘》 (Data Warehousing and Data Mining), Chapter 9 — slide notes (603.pptx)





Chapter 7: Classification and Prediction
(from the Data Mining: Concepts and Techniques slide set)

- What is classification? What is prediction?
- Issues regarding classification and prediction
- Classification by decision tree induction
- Bayesian classification
- Classification by neural networks
- Classification by support vector machines (SVM)
- Classification based on concepts from association rule mining
- Other classification methods
- Prediction
- Classification accuracy
- Summary

Classification vs. Prediction

- Classification:
  - predicts categorical class labels (discrete or nominal)
  - classifies data (constructs a model) based on the training set and the values (class labels) of a classifying attribute, and uses it to classify new data
- Prediction:
  - models continuous-valued functions, i.e., predicts unknown or missing values
- Typical applications:
  - credit approval
  - target marketing
  - medical diagnosis
  - treatment effectiveness analysis
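The distinction can be made concrete with a toy sketch (not from the slides): a classifier outputs a discrete class label, while a predictor models a continuous value. The majority-vote and mean "models" below are hypothetical stand-ins for a real learner.

```python
from collections import Counter

def train_majority_classifier(labels):
    """Classification: learn the most frequent class label (discrete)."""
    majority = Counter(labels).most_common(1)[0][0]
    return lambda x: majority

def train_mean_predictor(values):
    """Prediction: model a continuous value (here, simply the mean)."""
    mean = sum(values) / len(values)
    return lambda x: mean

clf = train_majority_classifier(["yes", "yes", "no"])  # e.g. credit approval
reg = train_mean_predictor([1.0, 2.0, 3.0])            # e.g. a missing amount

print(clf(None))   # -> yes  (a categorical label)
print(reg(None))   # -> 2.0  (a continuous value)
```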
Classification: A Two-Step Process

- Model construction: describing a set of predetermined classes
  - Each tuple/sample is assumed to belong to a predefined class, as determined by the class label attribute
  - The set of tuples used for model construction is the training set
  - The model is represented as classification rules, decision trees, or mathematical formulae
- Model usage: classifying future or unknown objects
  - Estimate the accuracy of the model
    - The known label of each test sample is compared with the classified result from the model
    - Accuracy rate is the percentage of test-set samples that are correctly classified by the model
    - The test set is independent of the training set; otherwise over-fitting will occur
  - If the accuracy is acceptable, use the model to classify data tuples whose class labels are not known

Classification Process (1): Model Construction

- Training data is fed to a classification algorithm, which outputs the classifier (model), e.g.:
  IF rank = "professor" OR years > 6 THEN tenured = "yes"

Classification Process (2): Use the Model in Prediction

- The classifier is applied to testing data, and then to unseen data, e.g., (Jeff, Professor, 4) -> Tenured?
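A minimal sketch of the two steps, taking the slide's tenure rule as the already-constructed model; the test tuples below are made up for illustration, not taken from the slides.

```python
def classify(rank, years):
    """The constructed model: IF rank = "professor" OR years > 6 THEN tenured = "yes"."""
    return "yes" if rank == "professor" or years > 6 else "no"

# Step 2a: estimate accuracy on an independent test set
# (hypothetical tuples: (name, rank, years, known label)).
test_set = [
    ("Tom",  "assistant professor", 2, "no"),
    ("Mary", "associate professor", 7, "yes"),   # years > 6
    ("Bill", "professor",           3, "yes"),   # rank = professor
]
correct = sum(classify(rank, years) == label
              for _, rank, years, label in test_set)
accuracy = correct / len(test_set)
print(accuracy)   # fraction of test tuples the model classifies correctly

# Step 2b: if the accuracy is acceptable, classify unseen data,
# e.g. the slide's (Jeff, Professor, 4).
print(classify("professor", 4))   # -> yes
```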
Supervised vs. Unsupervised Learning

- Supervised learning (classification)
  - Supervision: the training data (observations, measurements, etc.) are accompanied by labels indicating the class of each observation
  - New data is classified based on the training set
- Unsupervised learning (clustering)
  - The class labels of the training data are unknown
  - Given a set of measurements, observations, etc., the aim is to establish the existence of classes or clusters in the data
Issues Regarding Classification and Prediction (1): Data Preparation

- Data cleaning
  - Preprocess data in order to reduce noise and handle missing values
- Relevance analysis (feature selection)
  - Remove irrelevant or redundant attributes
- Data transformation
  - Generalize and/or normalize data
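As one concrete example of the transformation step, here is a sketch of min-max normalization; this particular method is an assumption for illustration, as the slides do not prescribe one.

```python
def min_max_normalize(values, new_min=0.0, new_max=1.0):
    """Rescale numeric values linearly into [new_min, new_max]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [new_min] * len(values)   # degenerate column: all values equal
    scale = (new_max - new_min) / (hi - lo)
    return [new_min + (v - lo) * scale for v in values]

print(min_max_normalize([20, 30, 40]))   # -> [0.0, 0.5, 1.0]
```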
Issues Regarding Classification and Prediction (2): Evaluating Classification Methods

- Predictive accuracy
- Speed and scalability
  - time to construct the model
  - time to use the model
- Robustness
  - handling noise and missing values
- Scalability
  - efficiency in disk-resident databases
- Interpretability
  - understanding and insight provided by the model
- Goodness of rules
  - decision tree size
  - compactness of classification rules
Training Dataset

- This follows an example from Quinlan's ID3: the 14-tuple buys_computer data, with 9 "yes" and 5 "no" samples.

Output: A Decision Tree for "buys_computer"

age?
  <=30   -> student?
              no  -> no
              yes -> yes
  31..40 -> yes
  >40    -> credit_rating?
              excellent -> no
              fair      -> yes
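The buys_computer tree can be written as a small nested structure and applied directly; this dict encoding is one possible representation, not something the slides specify.

```python
# Internal nodes are (attribute, {branch value: subtree}); leaves are labels.
tree = ("age", {
    "<=30":   ("student", {"no": "no", "yes": "yes"}),
    "31..40": "yes",
    ">40":    ("credit_rating", {"excellent": "no", "fair": "yes"}),
})

def classify(node, tup):
    """Follow the branch matching the tuple's value until a leaf is reached."""
    while not isinstance(node, str):
        attr, branches = node
        node = branches[tup[attr]]
    return node

print(classify(tree, {"age": "<=30", "student": "yes"}))        # -> yes
print(classify(tree, {"age": ">40", "credit_rating": "fair"}))  # -> yes
```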
Algorithm for Decision Tree Induction

- Basic algorithm (a greedy algorithm)
  - The tree is constructed in a top-down, recursive, divide-and-conquer manner
  - At the start, all the training examples are at the root
  - Attributes are categorical (if continuous-valued, they are discretized in advance)
  - Examples are partitioned recursively based on selected attributes
  - Test attributes are selected on the basis of a heuristic or statistical measure (e.g., information gain)
- Conditions for stopping partitioning
  - All samples for a given node belong to the same class
  - There are no remaining attributes for further partitioning (majority voting is employed to label the leaf)
  - There are no samples left

Attribute Selection Measure: Information Gain (ID3/C4.5)

- Select the attribute with the highest information gain
- S contains s_i tuples of class C_i, for i = 1, ..., m
- Information required to classify any arbitrary tuple:
  I(s_1, ..., s_m) = - sum_{i=1..m} (s_i / s) log2(s_i / s)
- Entropy of attribute A with values {a_1, a_2, ..., a_v}:
  E(A) = sum_{j=1..v} ((s_1j + ... + s_mj) / s) * I(s_1j, ..., s_mj)
- Information gained by branching on attribute A:
  Gain(A) = I(s_1, ..., s_m) - E(A)

Attribute Selection by Information Gain Computation

- Class P: buys_computer = "yes" (9 samples); Class N: buys_computer = "no" (5 samples)
- I(p, n) = I(9, 5) = 0.940
- Compute the entropy for age:
  E(age) = (5/14) I(2,3) + (4/14) I(4,0) + (5/14) I(3,2) = 0.694
  (the (5/14) I(2,3) term means "age <= 30" has 5 out of 14 samples, with 2 yeses and 3 nos)
- Hence Gain(age) = I(9,5) - E(age) = 0.246
- Similarly, Gain(income) = 0.029, Gain(student) = 0.151, Gain(credit_rating) = 0.048
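The information-gain computation can be checked with a few lines of code (a sketch; each attribute partition is passed as a tuple of per-class counts):

```python
from math import log2

def info(counts):
    """I(s1,...,sm) = -sum (si/s) log2(si/s), the expected information."""
    s = sum(counts)
    return -sum(c / s * log2(c / s) for c in counts if c)

def entropy(partitions):
    """E(A) = sum (|Sj|/|S|) * I(Sj), the information still needed after splitting on A."""
    s = sum(sum(p) for p in partitions)
    return sum(sum(p) / s * info(p) for p in partitions)

i_pn = info((9, 5))                        # I(9,5)
e_age = entropy([(2, 3), (4, 0), (3, 2)])  # the three age partitions
print(round(i_pn, 3))    # -> 0.94
print(round(e_age, 3))   # -> 0.694
print(i_pn - e_age)      # Gain(age), about 0.247 (0.246 when truncated as on the slide)
```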
Other Attribute Selection Measures

- Gini index (CART, IBM IntelligentMiner)
  - All attributes are assumed continuous-valued
  - Assume there exist several possible split values for each attribute
  - May need other tools, such as clustering, to get the possible split values
  - Can be modified for categorical attributes

Gini Index (IBM IntelligentMiner)

- If a data set T contains examples from n classes, the gini index gini(T) is defined as
  gini(T) = 1 - sum_{j=1..n} p_j^2
  where p_j is the relative frequency of class j in T.
- If T is split into two subsets T1 and T2 with sizes N1 and N2 respectively, the gini index of the split data is defined as
  gini_split(T) = (N1/N) gini(T1) + (N2/N) gini(T2)
- The attribute providing the smallest gini_split(T) is chosen to split the node (this requires enumerating all possible splitting points for each attribute).
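The two Gini formulas translate directly to code (a sketch; class distributions are given as count lists):

```python
def gini(counts):
    """gini(T) = 1 - sum_j p_j^2, where p_j is the relative class frequency in T."""
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts)

def gini_split(counts1, counts2):
    """gini_split(T) = (N1/N) gini(T1) + (N2/N) gini(T2) for a binary split."""
    n1, n2 = sum(counts1), sum(counts2)
    n = n1 + n2
    return n1 / n * gini(counts1) + n2 / n * gini(counts2)

print(gini([9, 5]))                 # impurity of the full 9-yes/5-no set
print(gini_split([2, 3], [7, 2]))   # one hypothetical binary split
```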
Extracting Classification Rules from Trees

- Represent the knowledge in the form of IF-THEN rules
- One rule is created for each path from the root to a leaf
- Each attribute-value pair along a path forms a conjunct
- The leaf node holds the class prediction
- Rules are easier for humans to understand
- Example:
  IF age = "<=30" AND student = "no" THEN buys_computer = "no"
  IF age = ">40" AND credit_rating = "excellent" THEN buys_computer = "no"
  IF age = ">40" AND credit_rating = "fair" THEN buys_computer = "yes"

Avoid Overfitting in Classification

- Overfitting: an induced tree may overfit the training data
  - Too many branches, some of which may reflect anomalies due to noise or outliers
  - Poor accuracy for unseen samples
- Two approaches to avoid overfitting
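The path-to-rule procedure (one IF-THEN rule per root-to-leaf path) can be sketched as a recursive walk over a nested-dict tree, an assumed encoding in which internal nodes are (attribute, branches) pairs and leaves are class labels:

```python
def extract_rules(node, path=(), class_attr="buys_computer"):
    """Emit one IF-THEN rule per root-to-leaf path."""
    if isinstance(node, str):                    # leaf: finish one rule
        conds = " AND ".join('%s = "%s"' % (a, v) for a, v in path)
        return ['IF %s THEN %s = "%s"' % (conds, class_attr, node)]
    attr, branches = node
    rules = []
    for value, subtree in branches.items():      # one conjunct per branch taken
        rules += extract_rules(subtree, path + ((attr, value),), class_attr)
    return rules

tree = ("age", {
    "<=30":   ("student", {"no": "no", "yes": "yes"}),
    "31..40": "yes",
    ">40":    ("credit_rating", {"excellent": "no", "fair": "yes"}),
})
for rule in extract_rules(tree):
    print(rule)   # five rules, one per leaf
```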