数据挖掘5学习.pptx
2023/3/171What is concept description?What is concept description?Generate descriptions for characterization and comparison of the datathe simplest kind of descriptive data miningsometimes called class description when the concept to be described refers to a class of objectsCharacterization:provide a concise and succinct summarization of the given collection of dataComparison(discrimination):provide descriptions comparing two or more collections of data第1页/共41页2023/3/172Data generalizationData generalization both characterization and discrimination are based on data generalization and summarizationData generalization a process which abstracts a large set of task-relevant data in a database from a relatively low conceptual level to higher conceptual levels Data generalization approaches:data cube approach attribute-oriented induction approach第2页/共41页2023/3/173Data cube approachData cube approach The data for analysis are stored in a multidimensional database,or data cube generalization and specialization can be performed on a data cube by roll-up and drill-down this is not an approach for concept description,only for data generalization Limitations:most commercial data cube implementations confine the types of dimensions to simple nonnumeric data and of measures to simple aggregated numeric values concept hierarchies can be automatically generated from numeric data to form numeric dimensions,however,this is a result of recent data mining research and is not available in most commercial systemscannot tell which dimensions should be used and what levels should the generalization reach第3页/共41页2023/3/174Is OLAP enough?Is OLAP enough?OLAPrestricted to certain kinds of attributes and measure typesuser-controlled processConcept descriptioncan handle complex data types of the attributes and their aggregationsa more automated process第4页/共41页2023/3/175Attribute-oriented inductionAttribute-oriented inductionproposed in 1989Y.Cai,N.Cercone,and J.Han,KDD Workshop at IJCAI-89in its initial proposal,AOI is a relational database query-oriented,generalization-based,online data analysis techniquenow data cube and offline precomputation can also be usedcan be used for both characterization and discriminationgeneral idea:collect the task-relevant dataperform generalization by attribute removal or attribute generalizationapply aggregation by merging identical,generalized tuples and accumulating their respective countsinteractive presentation with users第5页/共41页2023/3/176Sketch of AOISketch of AOIData focusing the specification of task-relevant data,whose result is the initial relationData generalization attribute removal if there is a large set of distinct values for an attribute,but either(1)there is no generalization operator on the attribute,or(2)its higher level concepts are expressed in terms of other attributes attribute generalization if there is a large set of distinct values for an attribute,and there exists a set of generalization operators on the attributePresentation第6页/共41页2023/3/177How to control generalization?How to control generalization?the control of how high an attribute should be generalized is quite subjective the control of this process is called attribute generalization controlattribute generalization threshold controlif the number of distinct values in an attribute is greater than the threshold,then further attribute removal or generalization should be performedeither set a generalization threshold for all the attributes,or set one threshold for each attributetypically ranging from 2 to 8generalized relation threshold controlif the number of distinct tuples in a generalized relation is greater than the threshold,then further generalization should be performedset a threshold for the size of the generalized relation typically ranging from 10 to 30第7页/共41页2023/3/178Example of AOIExample of AOIInitial relation Prime generalized relation第8页/共41页2023/3/179RDBMS implementation of AOIRDBMS implementation of AOIInitialRelquery processing of task-relevant data,deriving the initial relationPreGenbased on the analysis of the number of distinct values in each attribute,determine generalization plan for each attribute:removal?or how high to generalize?PrimeGenbased on the PreGen plan,perform generalization to the appropriate level to derive a“prime generalized relation”,accumulating the countsPresentationuser interaction:(1)adjust levels by drilling,(2)pivoting,(3)mapping into rules,cross tabs,visualization presentations第9页/共41页2023/3/1710Data cube implementation of AOI(I)Construct a data cube on-the-fly if either the task-relevant data set is too specific to match any predefined data cube,or it is not very large benefitfacilitate efficient drill-down analysis costincrease response time because of the computation of data cube balancecompute a cube-structured“subprime”relation in which each dimension of the generalized relation is a few levels deeper than the level of the prime relation第10页/共41页2023/3/1711Data cube implementation of AOI(II)Use a predefined data cube if the granularity of the task-relevant data can match that of the predefined data cube and the set of task-relevant data is quite large benefit facilitate attribute analysis,attribute-oriented induction,slicing and dicing,drill-down,and roll-up cost the cost of cube computation,and the nontrivial storage overhead第11页/共41页2023/3/1712Presentation of characterization(I)Generalized relationsales in 1997第12页/共41页2023/3/1713Presentation of characterization(II)Crosstabsales in 1997第13页/共41页2023/3/1714Presentation of characterization(III)Bar chartsales in 1997第14页/共41页2023/3/1715Presentation of characterization(IV)Pie chartsales in 1997第15页/共41页2023/3/1716Presentation of characterization(V)3-D cube第16页/共41页2023/3/1717Presentation of characterization(VI)Quantitative characteristic rule X,item(X)=“computer”(location(X)=“Asia”)t:25.00%location(X)=“Europe”)t:30.00%(location(X)=“North_America”)t:45.00%a logic rule that is associated with quantitative information is called a quantitative rulethe general form of a quantitative characteristic rule is:X,target_class(X)condition1(X)t:w1conditionn(X)t:wnwhere t-weight describes the typicality of each disjunct in the rulecharacteristic rule is necessary condition of the target class第17页/共41页2023/3/1718Why attribute relevance analysis?it is difficult for users to determine which dimensions should be includedattribute relevance analysis is used tofilter out statistically irrelevant or weakly relevant attributesretain or even rank the most relevant attributesclass characterization which includes the analysis of attribute/dimension relevance is called analytical characterizationclass comparison which includes such analysis is called analytical comparison even within the same dimension,different levels of concepts may have different powers for distinguishing a class from others第18页/共41页2023/3/1719How to perform attribute relevanceanalysis?Intuitively,an attribute is considered highly relevant to a given class if it is likely that the values of the attribute may be used to distinguish the class from othersData collectioncollect data for both target class and contrasting classfor class characterization,the contrasting class is taken to be the set of comparable data,i.e.data sharing similar attributes,of other classes in the databaseAnalytical generalizationperform attribute removal and attribute generalization based on the set of provided attribute analytical thresholdRelevant analysissort and then select the most relevant attributes第19页/共41页2023/3/1720Relevance measures The general idea behind attribute relevance analysis is to compute some measure which is used to quantify the relevance of an attribute with respect to a given classpopular measures include:information gaingain ratiogini indexuncertaintycorrelation coefficients第20页/共41页2023/3/1721Information gain measureS:training setSi:training instances of class Ci(i=1,m)aj:values of attribute A(j=1,v)the information needed to correctly classify the training set issuppose attribute A is selected to partition the training set into the subsets S1,S2,Sv,then the entropy of A,i.e.the information needed to classify all the instances in those subsets iswhere Sij is the instances of class Ci that are covered by Sjthen the information gain of selecting A isthe bigger the information gain,the more relevant the attribute A第21页/共41页2023/3/1722Example of information gain(I)Target class:Graduate students(=120)Contrasting class:Undergraduate students(=130)gendermajorbirth_countryage_rangegpacountMFMFMFScience Science Engineering Science Science Engineering Canada Foreign Foreign Foreign Canada Canada 20-25 25-30 25-30 25-30 20-25 20-25 Very_good ExcellentExcellent Excellent Excellent Excellent 16221825 2118gendermajorbirth_countryage_rangegpacountMFMFMFScience Business Business Science Engineering Engineering Foreign Canada Canada Canada Foreign Canada20 20 2020-2520-2520 Very_goodFairFairFair Very_good Excellent 18202224 2224第22页/共41页2023/3/1723Example of information gain(I)Target class:16221825 2118countGSGSGSGSGSGSVery_good ExcellentExcellent Excellent Excellent Excellent 20-25 25-30 25-30 25-30 20-25 20-25 Canada Foreign Foreign Foreign Canada Canada Science Science Engineering Science Science Engineering MFMFMFclassgpaage_rangebirth_countrymajorgender18202224 2224SSSSSSVery_goodFairFairFair Very_good Excellent 20 20 2020-2520-2520 Foreign Canada Canada Canada Foreign CanadaScience Business Business Science Engineering Engineering MFMFMF第23页/共41页2023/3/1724Example of information gain(II)the information needed to correctly classify the training set is suppose attribute major is selected to partition the training setfor major=“Science”:S11=84,S21=42for major=“Engineer”:S12=36,S22=46for major=“Engineer”:S13=0,S23=42 then the entropy of major is第24页/共41页2023/3/1725Example of information gain(III)then the information gain of major is now suppose we use an attribute relevance threshold of 0.1:gender and birth_country are removed as weakly relevant attributesmajor,gpa,and age_range are kept as strong relevant attributes we can also get the information gain of other attributes:第25页/共41页2023/3/1726Mining class comparisonsmost of the techniques developed for characterization can also be used in comparisondata generalization(including attribute removal and attribute generalization)should be performed synchronously among all the classes paring sales in China on Nov.9 with sales in USA in year 2000 is almost meaningless however,the user can over-write such an synchronous comparison with his own choices e.g.the user may want to compare sales in Shanghai with sales in Vietnam第26页/共41页2023/3/1727How to mine comparisons?Data collectioncollect data for both target class and contrasting classRelevance analysisuse attribute relevance analysis for analytical class comparisonSynchronous generalizationcontrolled by user specified dimension thresholdsDrill-down,roll-up and other OLAP operationsadjust the level of abstraction for the resulted descriptionPresentationas the same forms as that for characterization,except the rule form第27页/共41页2023/3/1728Example of mining comparisonsPrime generalized relation for the target class:Graduate studentsPrime generalized relation for the contrasting class:Undergraduate studentsbirth_countryage_rangegpacount%CanadaCanadaCanadaOther20-2525-30Over_30 Over_30 Good Good Very_good Excellent 5.532.32 5.86 4.68 birth_countryage_rangegpacount%CanadaCanadaCanadaOther15-2015-20 25-30 Over_30 Fair Good Good Excellent5.53 4.535.02 0.68第28页/共41页2023/3/1729Quantitative discriminant rulethe general form of a quantitative discriminant rule is:X,Target_class(X)condition1(X)d:w1conditionn(X)d:wnwhere d-weight describes the discriminability of each disjunct in the rulediscriminant rule is sufficient condition of the target classexample:statusbirth_countryage_rangegpacountGraduateCanada25-30Good90UndergraduateCanada 25-30 Good210 X,graduate_student(X)birth_country(X)=“Canada”age_range(X)=“25-30”gpa(X)=“good”d:30%第29页/共41页2023/3/1730Characteristic vs.discriminant ruleCharacteristic rulenecessary condition of the target classt_weight1+t_weightn=100Discriminant rulesufficient condition of the target classcondport1 d_weight1+condportn d_weightn=tclass_port where condporti is the portion of the instances covered by the i-th antecedents of the rule,tclass_port is the portion of the instances belong to the target class第30页/共41页2023/3/1731Quantitative description ruleQuantitative description rulequantitative description rule can also be expressed as a crosstabexample:X,target_class(X)condtion1(X)t:w1,d:w1condtionn(X)t:wn,d:wn X,Europe(X)(item(X)=“TV”)t:25%,d:40%(item(X)=“computer”)t:75%,d:30第31页/共41页2023/3/1732Descriptive statistical measuresDescriptive statistical measuresMotivation:to better understand the data:central tendency,dispersiondispersion:the degree to which numeric data tend to spreadmeasures for central tendency:meanmedianmodemidrangemeasures for dispersion:quartilesvariancestandard deviation第32页/共41页2023/3/1733Measures for central tendencyMeasures for central tendencyMedianmiddle value if odd number of values,or average of the middle two values otherwise(holistic)Modevalue that occurs most frequently in the data set(holistic)unimodal,bimodal,trimodal,multimodal,no modeMidrangethe average of the min and max values(algebraic)Mean weighted arithmetic mean(algebraic)第33页/共41页2023/3/1734Measures for dispersionMeasures for dispersionQuartilesQuartiles(holistic):Q1(25th percentile),Q3(75th percentile)Inter-quartile range(holistic):IQR=Q3-Q1 Five number summary(holistic):min,Q1,median,Q3,max Outlier(holistic):a value higher or lower than 1.5IQRVariance (algebraic)standard deviation the square root of s2(algebraic)第34页/共41页2023/3/1735Graph displays of statistical descriptions(I)Boxplotthe ends of the box are at the quartiles the box length is IQRthe median is marked by a line within the boxwhiskers outside the box extend to the min and max第35页/共41页2023/3/1736Graph displays of statistical descriptions(II)Histogrambar chart is a special case of histogram,where each rectangle is of uniform width第36页/共41页2023/3/1737Graph displays of statistical descriptions(III)Quantile plotfor data xi sorted in increasing order,fi indicates that approximately 100 fi%of the data are smaller than or equal to the value xi第37页/共41页2023/3/1738Graph displays of statistical descriptions(IV)Q-Q plot(Quantile-Quantile plot)the quantiles of one univariate distribution against the corresponding quantiles of another one第38页/共41页2023/3/1739Incremental and parallel mining ofIncremental and parallel mining ofconcept descriptionconcept description Given the huge amounts of data in a database,it is highly preferred to update data mining results incrementally rather than mining from scratch on each database updatesIncremental miningrevision based on newly added data DBgeneralize DB to the same level of abstraction in the generalized relatio