Basic Idea of the CLIQUE Algorithm
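- CLIQUE uses a density-based approach.
- A cluster is a region in which the density of points is higher than in the adjacent regions.
- The data space is partitioned into grid units, and the number of points that fall into a unit is taken as the density of that unit. Given a threshold value, a unit is called dense when the number of points in it exceeds that value. A cluster is then defined as a set of connected dense units.

Basic concepts
- Let $A = \{A_1, A_2, \dots, A_d\}$ be a set of $d$ domains. Then $S = A_1 \times A_2 \times \dots \times A_d$ is a $d$-dimensional space, and $A_1, A_2, \dots, A_d$ are the dimensions (attributes) of $S$.
- The input to the algorithm is a set of points in this $d$-dimensional space, $V = \{v_1, v_2, \dots, v_m\}$, where $v_i = (v_{i1}, v_{i2}, \dots, v_{id})$ and the $j$-th component satisfies $v_{ij} \in A_j$.
- An input parameter $\xi$ divides every dimension of $S$ into $\xi$ intervals of equal length, which partitions the whole space into a finite number of non-overlapping rectangular units. Each unit can be written as $\{u_1, u_2, \dots, u_d\}$, where $u_i = [l_i, h_i)$ is a left-closed, right-open interval.
- A point $v = (v_1, v_2, \dots, v_d)$ falls into a unit $u = \{u_1, u_2, \dots, u_d\}$ if and only if $l_i \le v_i < h_i$ for every $u_i$. The selectivity of a unit is the fraction of all data points that fall into it, and a unit is dense if its selectivity is greater than $\tau$. The density threshold $\tau$ is the other input parameter.
- For any subspace of $S$, for example $Sub = A_{t_1} \times A_{t_2} \times \dots \times A_{t_k}$ (with $k < d$, and $t_i < t_j$ whenever $i < j$), units, selectivity, and the other concepts are defined in the same way.
- A cluster is a maximal set of connected dense units in a $k$-dimensional space.
- Two $k$-dimensional units $u_1, u_2$ are connected if and only if (1) they share a common face, or (2) there is another unit $u_3$ such that both $u_1$ and $u_2$ are connected to $u_3$.
- Two units $u_1 = \{r_{t_1}, r_{t_2}, \dots, r_{t_k}\}$ and $u_2 = \{r'_{t_1}, r'_{t_2}, \dots, r'_{t_k}\}$ share a common face if there are $k-1$ dimensions (say $A_{t_1}, A_{t_2}, \dots, A_{t_{k-1}}$) such that $r_{t_j} = r'_{t_j}$ for $j = 1, 2, \dots, k-1$, and in the remaining dimension $A_{t_k}$ either $h_{t_k} = l'_{t_k}$ or $h'_{t_k} = l_{t_k}$.
- A region is an axis-parallel rectangular set, that is, every side is parallel to a coordinate axis. Such a region is made up of units and has a regular shape, so it can be expressed as an intersection of intervals.
- A region $R$ is contained in a cluster $C$ if and only if $R \cap C = R$. Such an $R$ is maximal if and only if no proper superset $R'$ of $R$ is also contained in $C$.
- A minimal description of a cluster $C$ is a set $\mathcal{R}$ of maximal regions that exactly covers $C$ and is non-redundant, that is, no proper subset of $\mathcal{R}$ covers $C$.

Example
- d-dimensional space
- Number of intervals $\xi$
- Unit
- Selectivity of a unit
- Density threshold $\tau$
- Dense unit
- Cluster
- Region
- Maximal region
- Minimal description of a cluster

Example: subspace
[figure not preserved]

Problem statement
- Given a set of data points and the input parameters $\xi$ and $\tau$, find clusters in all subspaces of the original data space and present a minimal description of each cluster in the form of a DNF expression.

The CLIQUE algorithm
- Identification of subspaces that contain clusters
- Identification of clusters
- Generation of minimal descriptions for the clusters

Step 1: Identify subspaces that contain clusters
- A bottom-up algorithm finds the dense units.
  - It determines the 1-dimensional dense units by making a pass over the data.
  - Having determined the (k-1)-dimensional dense units, it determines the candidate k-dimensional units using the candidate generation procedure.
- MDL-based pruning
  - Decides which subspaces (and the corresponding dense units) are interesting.
  - MDL: Minimal Description Length

Candidate generation procedure
- Input: $D_{k-1}$, the set of all (k-1)-dimensional dense units
- Output: a superset of the set of all k-dimensional dense units
- Algorithm: [figure not preserved]

MDL-based pruning
- Coverage of subspace $S_j$: the total number of points that fall inside the dense units of $S_j$.
- Sort the subspaces in descending order of their coverage.
- Divide the sorted list of subspaces into two sets: the selected set $I$ and the pruned set $P$.
- How to arrive at the cut point: the code length is minimized to determine the optimal cut point $i$.

Step 2: Identify clusters
- Input: a set of dense units $D$, all in the same k-dimensional space $S$
- Output: a partition of $D$ into $D^1, \dots, D^q$, such that all units in $D^i$ are connected and no two units $u^i \in D^i$, $u^j \in D^j$ with $i \ne j$ are connected. Each such partition is a cluster.
- Method: depth-first search algorithm
  - Start with some unit $u$ in $D$, assign it the first cluster number, and find all the units it is connected to.
  - If there are still units in $D$ that have not yet been visited, pick one and repeat the procedure.

Step 3: Generate minimal cluster descriptions
- Input: disjoint sets of connected k-dimensional units in the same subspace; each such set is a cluster
- Output: a concise description for each cluster
- Method:
  - Covering with maximal regions
  - Minimal cover

Concept: cover of a cluster
- For a cluster $C$ in a k-dimensional subspace $S$, a set $W$ of regions in the same subspace $S$ is a cover of $C$ if every region $R \in W$ is contained in $C$, and each unit in $C$ is contained in at least one of the regions in $W$.

1. Covering with maximal regions
- Input: a set $C$ of connected dense units in the same k-dimensional space $S$
- Output: a set $W$ of maximal regions such that $W$ is a cover of $C$
- Method: greedy growth algorithm

Greedy growth algorithm
- Begin with an arbitrary dense unit $u_1 \in C$ and greedily grow a maximal region $R_1$ that covers $u_1$. Add $R_1$ to $W$.
- Find another unit $u_2 \in C$ that is not yet covered by any maximal region in $W$. Greedily grow a maximal region $R_2$ that covers $u_2$. Add $R_2$ to $W$.
- Repeat this procedure until all units in $C$ are covered by some maximal region in $W$.

Obtaining a maximal region covering a dense unit $u$
- Start with $u$ and grow it along dimension $a_1$ as much as possible in both directions (to the left and to the right of the unit), using connected dense units contained in $C$.
- Grow the resulting region along $a_2$.
- Repeat for all the dimensions, yielding a maximal region covering $u$.

2. Minimal cover
- Input: a cover for each cluster
- Output: a minimal cover (minimality is defined in terms of the number of maximal regions required to cover the cluster)
- Method:
  - Remove from the cover the smallest (in number of units) maximal region that is redundant.
  - Repeat the procedure until no maximal region can be removed.

Algorithm summary
Step 1: According to the value of delta, divide every dimension of the original data table into equal intervals; save the interval definitions of each dimension to the "Interval_Define" table.
Step 2: Set n = 1; at this point all units are candidate dense units.
Step 3: Scan the original data table and count the data points that fall into each candidate dense unit in the n-dimensional subspaces.
Step 4: According to the value of select thresh, find the dense units in the n-dimensional subspaces.
Step 5: Prune the subspaces with the MDL-based algorithm.
Step 6: From the set of dense units in the n-dimensional subspaces, derive the set of candidate dense units in the (n+1)-dimensional subspaces; if that set is not empty, set n = n + 1 and go to Step 3.
Step 7: Use the depth-first search algorithm to find the clusters in the n-dimensional subspaces.
Step 8: Use the greedy growth algorithm to find the set of maximal regions covering each cluster.
Step 9: Compute the minimal cover of each cluster.
Step 10: Save the clustering information to the "Minning_Result_XB" table.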
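
The bottom-up search for dense units (Step 1 of the algorithm, Steps 1 to 6 of the summary) can be sketched in Python as follows. This is an illustrative sketch under stated assumptions, not the slides' implementation: the names `find_dense_units` and `generate_candidates` are hypothetical, the grid bounds are assumed to be the per-dimension minimum and maximum of the data, the join-and-prune form of candidate generation is one common way to realise the procedure whose body is missing from the slide, and MDL-based pruning is left out.

```python
from collections import Counter
from itertools import combinations


def generate_candidates(dense):
    """Join (k-1)-dimensional dense units into candidate k-dimensional units.

    A unit is a tuple of (dimension, interval index) pairs sorted by dimension.
    Two units are joined when they agree on all but their last pair; a candidate
    is kept only if every (k-1)-dimensional projection of it is dense.
    """
    candidates = set()
    for u1 in dense:
        for u2 in dense:
            if u1[:-1] == u2[:-1] and u1[-1][0] < u2[-1][0]:
                cand = u1 + (u2[-1],)
                if all(sub in dense for sub in combinations(cand, len(cand) - 1)):
                    candidates.add(cand)
    return candidates


def find_dense_units(points, xi, tau):
    """Return {k: set of k-dimensional dense units} for all k that have any."""
    n, d = len(points), len(points[0])
    # Assumption: the grid spans the observed min..max of each dimension.
    lo = [min(p[j] for p in points) for j in range(d)]
    hi = [max(p[j] for p in points) for j in range(d)]

    def interval(p, j):
        width = (hi[j] - lo[j]) / xi or 1.0  # avoid a zero width
        return min(int((p[j] - lo[j]) / width), xi - 1)  # clamp the maximum

    cells = [[interval(p, j) for j in range(d)] for p in points]

    # Level 1: every interval of every dimension is a candidate.
    candidates = {((j, i),) for j in range(d) for i in range(xi)}
    levels, k = {}, 1
    while candidates:
        counts = Counter()
        for cell in cells:  # one pass over the data per level
            for unit in candidates:
                if all(cell[j] == i for j, i in unit):
                    counts[unit] += 1
        dense = {u for u in candidates if counts[u] / n > tau}
        if not dense:
            break
        levels[k] = dense
        candidates = generate_candidates(dense)
        k += 1
    return levels


points = [(0.0, 0.0), (0.1, 0.1), (0.2, 0.2), (0.15, 0.05), (1.0, 1.0), (0.9, 0.1)]
levels = find_dense_units(points, xi=2, tau=0.5)
```

Here `levels[1]` holds the dense intervals of each single dimension and `levels[2]` the dense units of the two-dimensional space. The pruning inside `generate_candidates` relies on the fact that a unit can only be dense if all of its lower-dimensional projections are dense.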
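
The cluster identification step (Step 2) amounts to finding connected components among the dense units of one subspace. A minimal sketch, assuming each unit is represented as a tuple of interval indices (one per dimension of the subspace) and that two units share a common face when they differ by exactly one interval in a single dimension; `find_clusters` is a hypothetical name, and an explicit stack is used in place of recursion:

```python
def find_clusters(dense_units):
    """Label connected dense units with cluster numbers using depth-first search.

    dense_units: iterable of tuples of interval indices, all in the same subspace.
    Returns a dict mapping each unit to its cluster number (1, 2, ...).
    """
    dense = set(dense_units)
    labels = {}
    cluster_id = 0
    for start in sorted(dense):  # sorted only to make the numbering repeatable
        if start in labels:
            continue
        cluster_id += 1
        labels[start] = cluster_id
        stack = [start]
        while stack:
            unit = stack.pop()
            # Neighbours across a common face: +/- 1 in exactly one dimension.
            for dim in range(len(unit)):
                for step in (-1, 1):
                    nb = unit[:dim] + (unit[dim] + step,) + unit[dim + 1:]
                    if nb in dense and nb not in labels:
                        labels[nb] = cluster_id
                        stack.append(nb)
    return labels


labels = find_clusters({(0, 0), (0, 1), (1, 1), (3, 3), (3, 4), (5, 0)})
```

In this example the units `(0, 0)`, `(0, 1)`, `(1, 1)` form one cluster, `(3, 3)` and `(3, 4)` a second, and `(5, 0)` a third on its own.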
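
The description step (Step 3) can be sketched in the same style: greedy growth produces a cover of maximal regions, and redundant regions are then removed, smallest first. This is a sketch under assumptions, not a definitive implementation: a region is represented as a tuple of inclusive `(low, high)` interval-index ranges, one per dimension, and the names `units_in`, `grow`, `greedy_cover`, and `minimal_cover` are hypothetical.

```python
from itertools import product


def units_in(region):
    """All units (tuples of interval indices) inside an axis-parallel region."""
    return set(product(*(range(lo, hi + 1) for lo, hi in region)))


def grow(unit, cluster):
    """Grow a maximal region around `unit`, one dimension at a time."""
    region = [(i, i) for i in unit]
    for dim in range(len(unit)):
        lo, hi = region[dim]
        # Extend to the left while the enlarged region stays inside the cluster.
        while units_in(region[:dim] + [(lo - 1, hi)] + region[dim + 1:]) <= cluster:
            lo -= 1
        # Extend to the right in the same way.
        while units_in(region[:dim] + [(lo, hi + 1)] + region[dim + 1:]) <= cluster:
            hi += 1
        region[dim] = (lo, hi)
    return tuple(region)


def greedy_cover(cluster):
    """Cover the cluster with maximal regions grown from uncovered units."""
    cluster = set(cluster)
    cover, covered = [], set()
    for unit in sorted(cluster):
        if unit not in covered:
            region = grow(unit, cluster)
            cover.append(region)
            covered |= units_in(region)
    return cover


def minimal_cover(cover, cluster):
    """Repeatedly drop the smallest region whose units are covered by the rest."""
    cover, cluster = list(cover), set(cluster)
    removed = True
    while removed:
        removed = False
        for region in sorted(cover, key=lambda r: len(units_in(r))):
            rest = [r for r in cover if r != region]
            if set().union(*(units_in(r) for r in rest)) >= cluster:
                cover, removed = rest, True
                break
    return cover


l_shape = {(0, 0), (1, 0), (2, 0), (0, 1), (0, 2)}
l_cover = greedy_cover(l_shape)

s_shape = {(0, 0), (1, 0), (1, 1), (2, 1)}
bottom, middle, top = ((0, 1), (0, 0)), ((1, 1), (0, 1)), ((1, 2), (1, 1))
reduced = minimal_cover([bottom, middle, top], s_shape)
```

For the L-shaped cluster the greedy cover consists of the horizontal and the vertical bar. For the S-shaped cluster, the `middle` region is contained in the union of `bottom` and `top`, so `minimal_cover` removes it. Each remaining region translates directly into one conjunction of interval conditions in the DNF description.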