Lesson 4: Data Classification and Prediction
"Still waters run deep." "Where there is life, there is hope."

Outline
- What is classification? What is prediction?
- Issues regarding classification and prediction
- Classification by decision tree induction
- Bayesian classification
- Prediction
- Summary
- References

I. Classification vs. Prediction
- Classification predicts categorical class labels (discrete or nominal): it constructs a model from the training set and the values (class labels) of a classifying attribute, then uses the model to classify new data.
- Prediction models continuous-valued functions, i.e., it predicts unknown or missing values.
- Typical applications: credit approval, target marketing, medical diagnosis, fraud detection.

Classification: A Two-Step Process
- Model construction: describing a set of predetermined classes.
  - Each tuple/sample is assumed to belong to a predefined class, as determined by the class label attribute.
  - The set of tuples used for model construction is the training set.
  - The model is represented as classification rules, decision trees, or mathematical formulae.
- Model usage: classifying future or unknown objects.
  - First estimate the accuracy of the model: the known label of each test sample is compared with the model's classification, and the accuracy rate is the percentage of test-set samples correctly classified by the model. The test set must be independent of the training set; otherwise over-fitting will occur.
  - If the accuracy is acceptable, use the model to classify data tuples whose class labels are not known.

Classification Process (1): Model Construction
[Figure: training data fed to a classification algorithm, which produces a classifier (model), e.g. the rule IF rank = "professor" OR years > 6 THEN tenured = "yes".]

Classification Process (2): Use the Model in Prediction
[Figure: the classifier is first checked against testing data, then applied to unseen data, e.g. (Jeff, Professor, 4) → Tenured?]
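To make the two-step process concrete, here is a minimal sketch assuming scikit-learn is available; the tiny encoded data set is a made-up stand-in for the (rank, years) → tenured example in the figures, not data from the slides.

```python
# Minimal sketch of the two-step classification process, assuming
# scikit-learn. The data set is a made-up (rank, years) -> tenured example.
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# rank encoded as 0=assistant, 1=associate, 2=professor; then years
X = [[2, 7], [0, 3], [1, 7], [2, 2], [0, 6], [1, 3], [2, 5], [0, 2]]
y = ["yes", "no", "yes", "yes", "no", "no", "yes", "no"]  # tenured?

# Step 1: model construction, on the training set only.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)
model = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

# Step 2a: estimate accuracy on the independent test set.
print("accuracy:", accuracy_score(y_test, model.predict(X_test)))

# Step 2b: if acceptable, classify unseen tuples, e.g. (Jeff, Professor, 4).
print("Tenured?", model.predict([[2, 4]])[0])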
Supervised vs. Unsupervised Learning
- Supervised learning (classification)
  - Supervision: the training data (observations, measurements, etc.) are accompanied by labels indicating the class of each observation.
  - New data is classified based on the training set.
- Unsupervised learning (clustering)
  - The class labels of the training data are unknown.
  - Given a set of measurements, observations, etc., the aim is to establish the existence of classes or clusters in the data.

II. Issues Regarding Classification and Prediction (1): Data Preparation
- Data cleaning: preprocess data to reduce noise and handle missing values.
- Relevance analysis (feature selection): remove irrelevant or redundant attributes.
- Data transformation: generalize and/or normalize data.

Issues Regarding Classification and Prediction (2): Evaluating Classification Methods
- Accuracy: classifier accuracy and predictor accuracy.
- Speed and scalability: time to construct the model (training time) and time to use the model (classification/prediction time).
- Robustness: handling noise and missing values.
- Scalability: efficiency on disk-resident databases.
- Interpretability: understanding and insight provided by the model.
- Other measures: e.g., goodness of rules, such as decision tree size or compactness of classification rules.
III. Decision Tree Induction: Training Dataset
This follows an example from Quinlan's ID3 (Playing Tennis).
[Table: the 14-tuple "buys_computer" training set with attributes age, income, student, and credit_rating.]

Output: A Decision Tree for "buys_computer"
[Figure: root node tests age?; the "<=30" branch leads to a student? test (no → no, yes → yes), the "31..40" branch leads directly to yes, and the ">40" branch leads to a credit_rating? test (excellent → no, fair → yes).]

Algorithm for Decision Tree Induction
- Basic algorithm (a greedy algorithm; see the sketch after this list):
  - The tree is constructed in a top-down, recursive, divide-and-conquer manner.
  - At the start, all the training examples are at the root.
  - Attributes are categorical (continuous-valued attributes are discretized in advance).
  - Examples are partitioned recursively based on selected attributes.
  - Test attributes are selected on the basis of a heuristic or statistical measure (e.g., information gain).
- Conditions for stopping partitioning:
  - All samples for a given node belong to the same class.
  - There are no remaining attributes for further partitioning; majority voting is employed to classify the leaf.
  - There are no samples left.
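A compact recursive skeleton of this greedy algorithm, assuming categorical attributes and rows represented as Python dicts (a sketch, not the full ID3; the attribute-scoring function `gain` is passed in, and one matching the slides' information gain is sketched in the next section):

```python
# Top-down, recursive, divide-and-conquer tree construction with the
# three stopping conditions from the slide. Rows are dicts; `gain` is
# any scoring function gain(rows, attr, label) -> float.
from collections import Counter

def majority_class(rows, label):
    return Counter(r[label] for r in rows).most_common(1)[0][0]

def build_tree(rows, attrs, label, gain):
    classes = {r[label] for r in rows}
    if len(classes) == 1:            # stop: all samples in one class
        return classes.pop()
    if not attrs:                    # stop: no attributes left -> majority vote
        return majority_class(rows, label)
    best = max(attrs, key=lambda a: gain(rows, a, label))
    node = {"attr": best, "branches": {},
            "default": majority_class(rows, label)}  # for unseen values
    for value in {r[best] for r in rows}:
        subset = [r for r in rows if r[best] == value]
        # Branching only on observed values keeps every subset non-empty,
        # which covers the "no samples left" stopping condition.
        node["branches"][value] = build_tree(
            subset, [a for a in attrs if a != best], label, gain)
    return node
```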
Attribute Selection Measure: Information Gain (ID3/C4.5)
- Select the attribute with the highest information gain.
- Let S contain s_i tuples of class C_i, for i = 1, ..., m.
- The expected information needed to classify an arbitrary tuple:
  I(s_1, \ldots, s_m) = -\sum_{i=1}^{m} p_i \log_2 p_i, where p_i = s_i / s.
- The entropy of attribute A with values {a_1, a_2, \ldots, a_v}:
  E(A) = \sum_{j=1}^{v} \frac{s_{1j} + \cdots + s_{mj}}{s} \, I(s_{1j}, \ldots, s_{mj}).
- The information gained by branching on attribute A:
  Gain(A) = I(s_1, \ldots, s_m) - E(A).

Attribute Selection by Information Gain Computation
- Class P: buys_computer = "yes" (9 tuples); class N: buys_computer = "no" (5 tuples).
- I(p, n) = I(9, 5) = 0.940.
- Compute the entropy for age: 5/14 I(2,3) means the branch age = "<=30" holds 5 of the 14 samples, with 2 yes's and 3 no's, so
  E(age) = 5/14 I(2,3) + 4/14 I(4,0) + 5/14 I(3,2) = 0.694
  and Gain(age) = I(9,5) - E(age) = 0.246.
- For a continuous-valued attribute, each candidate age split-point is scored the same way and the best one is chosen.
- These values are reproduced in the sketch below.
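A pure-Python sketch of the formulas above; the `gain` function also matches the signature expected by the `build_tree` skeleton earlier.

```python
# I(s1,...,sm), E(A) and Gain(A) for rows given as dicts.
from collections import Counter
from math import log2

def info(counts):
    """I(s1,...,sm) = -sum_i p_i log2 p_i, with p_i = s_i / s."""
    s = sum(counts)
    return -sum(c / s * log2(c / s) for c in counts if c)

def gain(rows, attr, label):
    """Gain(A) = I(s1,...,sm) - E(A)."""
    total = info(list(Counter(r[label] for r in rows).values()))
    e = 0.0
    for value in {r[attr] for r in rows}:
        part = [r[label] for r in rows if r[attr] == value]
        e += len(part) / len(rows) * info(list(Counter(part).values()))
    return total - e

print(round(info([9, 5]), 3))           # 0.94, the I(9,5) above
# age splits the 14 tuples into <=30 (2 yes, 3 no),
# 31..40 (4 yes, 0 no) and >40 (3 yes, 2 no):
e_age = 5/14 * info([2, 3]) + 4/14 * info([4, 0]) + 5/14 * info([3, 2])
print(round(info([9, 5]) - e_age, 3))   # ~0.247; the slide's 0.246
                                        # rounds 0.940 - 0.694 first
```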
Extracting Classification Rules from Trees
- Represent the knowledge in the form of IF-THEN rules.
- One rule is created for each path from the root to a leaf.
- Each attribute-value pair along a path forms a conjunction.
- The leaf node holds the class prediction.
- Rules are easier for humans to understand.
- Example (from the buys_computer tree; see the sketch below):
  IF age = "<=30" AND student = "no" THEN buys_computer = "no"
  IF age = ">40" AND credit_rating = "excellent" THEN buys_computer = "no"
  IF age = ">40" AND credit_rating = "fair" THEN buys_computer = "yes"
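Emitting one rule per root-to-leaf path is a plain tree traversal. A sketch over the dict-shaped tree returned by the `build_tree` skeleton above (that shape is an assumption of these sketches, not something fixed by the slides):

```python
# One IF-THEN rule per root-to-leaf path; each (attribute, value)
# pair along the path becomes one conjunct of the antecedent.
def extract_rules(node, label, path=()):
    if not isinstance(node, dict):       # leaf: node is the class value
        conds = " AND ".join(f'{a} = "{v}"' for a, v in path) or "TRUE"
        return [f'IF {conds} THEN {label} = "{node}"']
    rules = []
    for value, child in node["branches"].items():
        rules += extract_rules(child, label, path + ((node["attr"], value),))
    return rules
```

Applied to the buys_computer tree, this yields exactly the kind of rules listed above, e.g. IF age = "<=30" AND student = "no" THEN buys_computer = "no".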
Avoid Overfitting in Classification
- Overfitting: an induced tree may overfit the training data.
  - Too many branches, some reflecting anomalies due to noise or outliers.
  - Poor accuracy for unseen samples.
- Two approaches to avoid overfitting:
  - Prepruning: halt tree construction early; do not split a node if the split would make the goodness measure fall below a threshold. It is difficult to choose an appropriate threshold.
  - Postpruning: remove branches from a "fully grown" tree to obtain a sequence of progressively pruned trees, then use a data set different from the training data to decide which is the "best pruned tree".

Approaches to Determine the Final Tree Size
- Separate training (2/3) and testing (1/3) sets.
- Use cross-validation (see the sketch below).
- Use all the data for training, but apply a statistical test (e.g., chi-square) to estimate whether expanding or pruning a node would improve the distribution over the entire data.
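One concrete way to apply the cross-validation option: score a sequence of increasingly deep trees and keep the size that validates best. A sketch assuming scikit-learn; the data set and the depth grid are arbitrary illustrative choices.

```python
# Choose the final tree size by 5-fold cross-validation over a
# sequence of maximum depths (a stand-in for the pruned-tree sequence).
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
scores = {
    depth: cross_val_score(
        DecisionTreeClassifier(max_depth=depth, random_state=0),
        X, y, cv=5).mean()
    for depth in range(1, 11)
}
best = max(scores, key=scores.get)
print(f"best depth: {best}, cross-validated accuracy: {scores[best]:.3f}")
```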
Enhancements to Basic Decision Tree Induction
- Allow for continuous-valued attributes: dynamically define new discrete-valued attributes that partition the continuous attribute values into a discrete set of intervals (see the split-point sketch below).
- Handle missing attribute values: assign the most common value of the attribute, or assign a probability to each of the possible values.
- Attribute construction: create new attributes based on existing ones that are sparsely represented; this reduces fragmentation, repetition, and replication.
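For the continuous-valued case, "dynamically defining a new discrete-valued attribute" typically means picking a binary split-point by the same information-gain criterion. A self-contained sketch (the midpoint-candidate strategy is one common choice, not prescribed by the slides):

```python
# Best binary split-point for a continuous attribute: candidates are
# midpoints between consecutive distinct sorted values, each scored
# by information gain.
from collections import Counter
from math import log2

def info(counts):
    s = sum(counts)
    return -sum(c / s * log2(c / s) for c in counts if c)

def best_split_point(values, labels):
    pairs = sorted(zip(values, labels))
    total = info(list(Counter(labels).values()))
    best_gain, best_t = 0.0, None
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue                     # equal values: no threshold here
        t = (pairs[i - 1][0] + pairs[i][0]) / 2
        left = [l for v, l in pairs if v <= t]
        right = [l for v, l in pairs if v > t]
        e = (len(left) * info(list(Counter(left).values())) +
             len(right) * info(list(Counter(right).values()))) / len(pairs)
        if total - e > best_gain:
            best_gain, best_t = total - e, t
    return best_t, best_gain

# e.g. best_split_point([25, 32, 38, 45, 51], ["no", "yes", "yes", "yes", "no"])
```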
Classification in Large Databases
- Classification is a classical problem extensively studied by statisticians and machine learning researchers.
- Scalability: classifying data sets with millions of examples and hundreds of attributes at reasonable speed.
- Why decision tree induction in data mining?
  - Relatively faster learning speed than other classification methods.
  - Convertible to simple, easy-to-understand classification rules.
  - Can use SQL queries for accessing databases.
  - Comparable classification accuracy with other methods.

Scalable Decision Tree Induction Methods
- SLIQ (EDBT'96, Mehta et al.): builds an index for each attribute; only the class list and the current attribute list reside in memory.
- SPRINT (VLDB'96, J. Shafer et al.): constructs an attribute-list data structure.
- PUBLIC (VLDB'98, Rastogi & Shim): integrates tree splitting and tree pruning, stopping tree growth earlier.
- RainForest (VLDB'98, Gehrke, Ramakrishnan & Ganti): separates the scalability aspects from the criteria that determine the quality of the tree; builds an AVC-list (attribute, value, class label).

Presentation of Classification Results
[Figure: visualization of a decision tree in SGI/MineSet 3.0.]
[Figure: interactive visual mining by Perception-Based Classification (PBC).]

IV. Bayesian Classification: Why?
- Probabilistic learning: calculate explicit probabilities for hypotheses; among the most practical approaches to certain types of learning problems.