Graduation Thesis Foreign-Literature Translation: Data Mining, Cluster Analysis


College of Electrical and Information Engineering
Foreign Literature Translation

English title: Data mining-clustering
Translated title: 数据挖掘聚类分析 (Data Mining: Cluster Analysis)
Major: Automation
Name: *
Class and student ID: *
Advisor: *
Source of the translated text: Data Mining, by Ian H. Witten and Eibe Frank

Clustering

5.1 INTRODUCTION

Clustering is similar to classification in that data are grouped. However, unlike classification, the groups are not predefined. Instead, the grouping is accomplished by finding similarities between data according to characteristics found in the actual data. The groups are called clusters. Some authors view clustering as a special type of classification. In this text, however, we follow a more conventional view in that the two are different. Many definitions for clusters have been proposed:

- Set of like elements. Elements from different clusters are not alike.
- The distance between points in a cluster is less than the distance between a point in the cluster and any point outside it.

A term similar to clustering is database segmentation, where like tuples (records) in a database are grouped together. This is done to partition or segment the database into components that then give the user a more general view of the data. In this text, we do not differentiate between segmentation and clustering. A simple example of clustering is found in Example 5.1. This example illustrates the fact that determining how to do the clustering is not straightforward. As illustrated in Figure 5.1, a given set of data may be clustered on different attributes. Here a group of homes in a geographic area is shown. The first type of clustering is based on the location of the home. Homes that are geographically close to each other are clustered together. In the second clustering, homes are grouped based on the size of the house.

Clustering has been used in many application domains, including biology, medicine, anthropology, marketing, and economics. Clustering applications include plant and animal classification, disease classification, image processing, pattern recognition, and document retrieval. One of the first domains in which clustering was used was biological taxonomy. Recent uses include examining Web log data
to detect usage patterns.

When clustering is applied to a real-world database, many interesting problems occur:

- Outlier handling is difficult. Here the elements do not naturally fall into any cluster. They can be viewed as solitary clusters. However, if a clustering algorithm attempts to find larger clusters, these outliers will be forced to be placed in some cluster. This process may result in the creation of poor clusters by combining two existing clusters and leaving the outlier in its own cluster.
- Dynamic data in the database implies that cluster membership may change over time.
- Interpreting the semantic meaning of each cluster may be difficult. With classification, the labeling of the classes is known ahead of time. However, with clustering, this may not be the case. Thus, when the clustering process finishes creating a set of clusters, the exact meaning of each cluster may not be obvious. Here is where a domain expert is needed to assign a label or interpretation for each cluster.
- There is no one correct answer to a clustering problem. In fact, many answers may be found. The exact number of clusters required is not easy to determine. Again, a domain expert may be required. For example, suppose we have a set of data about plants that have been collected during a field trip. Without any prior knowledge of plant classification, if we attempt to divide this set of data into similar groupings, it would not be clear how many groups should be created.
- Another related issue is what data should be used for clustering. Unlike learning during a classification process, where there is some a priori knowledge concerning what the attributes of each classification should be, in clustering we have no supervised learning to aid the process. Indeed, clustering can be viewed as
similar to unsupervised learning.

We can then summarize some basic features of clustering (as opposed to classification):

- The (best) number of clusters is not known.
- There may not be any a priori knowledge concerning the clusters.
- Cluster results are dynamic.

The clustering problem is stated as shown in Definition 5.1. Here we assume that the number of clusters to be created is an input value, $k$. The actual content (and interpretation) of each cluster, $K_j$, $1 \le j \le k$, is determined as a result of the function definition. Without loss of generality, we will view the result of solving a clustering problem as the creation of a set of clusters: $K = \{K_1, K_2, \ldots, K_k\}$.

DEFINITION 5.1. Given a database $D = \{t_1, t_2, \ldots, t_n\}$ of tuples and an integer value $k$, the clustering problem is to define a mapping $f: D \to \{1, \ldots, k\}$ where each $t_i$ is assigned to one cluster $K_j$, $1 \le j \le k$. A cluster, $K_j$, contains precisely those tuples mapped to it; that is, $K_j = \{t_i \mid f(t_i) = j,\ 1 \le i \le n\}$ and $t_i \in D$.

A classification of the different types of clustering algorithms is shown in Figure 5.2. Clustering algorithms themselves may be viewed as hierarchical or partitional. With hierarchical clustering, a nested set of clusters is created. Each level in the hierarchy has a separate set of clusters. At the lowest level, each item
is in its own unique cluster. At the highest level, all items belong to the same cluster. With hierarchical clustering, the desired number of clusters is not input. With partitional clustering, the algorithm creates only one set of clusters. These approaches use the desired number of clusters to drive how the final set is created.

Traditional clustering algorithms tend to be targeted to small numeric databases that fit into memory. There are, however, more recent clustering algorithms that look at categorical data and are targeted to larger, perhaps dynamic, databases. Algorithms targeted to larger databases may adapt to memory constraints by either sampling the database or using data structures that can be compressed or pruned to fit into memory regardless of the size of the database.

Clustering algorithms may also differ based on whether they produce overlapping or nonoverlapping clusters. Even though we consider only nonoverlapping clusters, it is possible to place an item in multiple clusters. In turn, nonoverlapping clusters can be viewed as extrinsic or intrinsic. Extrinsic techniques use labeling of the items to assist in the classification process. These algorithms are the traditional supervised learning classification algorithms in which a special input training set is used. Intrinsic algorithms do not use any a priori category labels, but depend only on the adjacency matrix containing the distance between objects. All algorithms we examine in this chapter fall into
the intrinsic class.

The types of clustering algorithms can be further classified based on the implementation technique used. Hierarchical algorithms can be categorized as agglomerative or divisive. "Agglomerative" implies that the clusters are created in a bottom-up fashion, while divisive algorithms work in a top-down fashion. Although both hierarchical and partitional algorithms could be described using the agglomerative vs. divisive label, it typically is more associated with hierarchical algorithms. Another descriptive tag indicates whether each individual element is handled one by one, serial (sometimes called incremental), or whether all items are examined together, simultaneous. If a specific tuple is viewed as having attribute values for all attributes in the schema, then clustering algorithms could differ as to how the attribute values are examined. As is usually done with decision tree classification techniques, some algorithms examine attribute values one at a time, monothetic. Polythetic algorithms consider all attribute values at one time. Finally, clustering algorithms can be labeled based on the mathematical formulation given to the algorithm: graph theoretic or matrix algebra. In this chapter we generally use the graph approach and describe the input to the clustering algorithm as an adjacency matrix labeled with distance measures.

We discuss many clustering algorithms in the following sections. This is only a representative subset of the many algorithms that
have been proposed in the literature. Before looking at these algorithms, we first examine possible similarity measures and examine the impact of outliers.

5.2 SIMILARITY AND DISTANCE MEASURES

There are many desirable properties for the clusters created by a solution to a specific clustering problem. The most important one is that a tuple within one cluster is more like tuples within that cluster than it is similar to tuples outside it. As with classification, then, we assume the definition of a similarity measure, $sim(t_i, t_l)$, defined between any two tuples, $t_i, t_l \in D$. This provides a stricter, alternative clustering definition, as found in Definition 5.2. Unless otherwise stated, we use the first definition rather than the second. Keep in mind that the similarity relationship stated within the second definition is a desirable, although not always obtainable, property.

A distance measure, $dis(t_i, t_j)$,
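as opposed to similarity, is often used in clustering. The clustering problem then has the desirable property that, given a cluster $K_j$, $\forall t_{jl}, t_{jm} \in K_j$ and $t_i \notin K_j$, $dis(t_{jl}, t_{jm}) \le dis(t_{jl}, t_i)$.

Some clustering algorithms look only at numeric data, usually assuming metric data points. Metric attributes satisfy the triangular inequality. The cluster can then be described by using several characteristic values. Given a cluster, $K_m$, of $N$ points $\{t_{m1}, t_{m2}, \ldots, t_{mN}\}$, we make the following definitions [ZRL96]:

$$\text{centroid} = C_m = \frac{\sum_{i=1}^{N} t_{mi}}{N}$$

$$\text{radius} = R_m = \sqrt{\frac{\sum_{i=1}^{N} (t_{mi} - C_m)^2}{N}}$$

$$\text{diameter} = D_m = \sqrt{\frac{\sum_{i=1}^{N} \sum_{j=1}^{N} (t_{mi} - t_{mj})^2}{N(N-1)}}$$

Here the centroid is the "middle" of the cluster; it need not be an actual point in the cluster. Some clustering algorithms alternatively assume that the cluster is represented by one centrally located object in the cluster called a medoid. The radius is the square root of the average mean squared distance from any point in the cluster to the centroid, and the diameter is the square root of the average mean squared distance between all pairs of points in the cluster. We use the notation $M_m$ to indicate the medoid for cluster $K_m$.

Many clustering algorithms require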
that the distance between clusters (rather than elements) be determined. This is not an easy task given that there are many interpretations for distance between clusters. Given clusters $K_i$ and $K_j$, there are several standard alternatives to calculate the distance between clusters. A representative list is:

- Single link: Smallest distance between an element in one cluster and an element in the other. We thus have $dis(K_i, K_j) = \min(dis(t_{il}, t_{jm}))$ $\forall t_{il} \in K_i$ and $\forall t_{jm} \in K_j$.
- Complete link: Largest distance between an element in one cluster and an element in the other. We thus have $dis(K_i, K_j) = \max(dis(t_{il}, t_{jm}))$ $\forall t_{il} \in K_i$ and $\forall t_{jm} \in K_j$.
- Average: Average distance between an element in one cluster and an element in the other. We thus have $dis(K_i, K_j) = \text{mean}(dis(t_{il}, t_{jm}))$ $\forall t_{il} \in K_i$ and $\forall t_{jm} \in K_j$.
- Centroid: If clusters have a representative centroid, then the centroid distance is defined as the distance between the centroids. We thus have $dis(K_i, K_j) = dis(C_i, C_j)$, where $C_i$ is the centroid for $K_i$ and similarly for $C_j$.
- Medoid: Using a medoid to represent each cluster, the distance between the clusters can be defined by the distance between the medoids: $dis(K_i, K_j) = dis(M_i, M_j)$.

5.3 OUTLIERS

As mentioned earlier, outliers are sample points with values much different from those of the remaining set of data. Outliers may represent errors in the data (perhaps a malfunctioning sensor recorded an incorrect data value) or could be correct data values that are simply much different from the remaining data. A person who is 2.5 meters tall is much taller than most people. In analyzing the height of individuals, this value probably would be viewed as an outlier.

Some clustering techniques do not perform well in the presence of outliers. This problem is illustrated in Figure 5.3. Here if three clusters are found (solid line), the outlier will occur in a cluster by itself. However, if two clusters are found (dashed line), the two (obviously) different sets of data will be placed in one cluster because they are closer together than the outlier. This problem is complicated by the fact that many clustering algorithms actually have as input the number of desired clusters to be found.

Clustering algorithms may actually find and remove outliers to ensure that they perform better. However, care must be taken in actually removing outliers. For example, suppose that the data mining problem is to predict flooding. Extremely high water level values occur very infrequently, and when compared with the normal water level values may seem to be outliers. However, removing these values may not allow the data mining algorithms to work effectively because there would be no data that showed that floods ever actually occurred.

Outlier detection, or outlier mining, is the process of identifying outliers in a set of data. Clustering, or other data mining, algorithms may then choose to remove or treat these values differently. Some outlier detection techniques are based on statistical techniques. These usually assume that the set of data follows a known distribution and that outliers can be detected by well-known tests such as discordancy tests. However, these tests are not very
realistic for real-world data because real-world data values may not follow well-defined data distributions. Also, most of these tests assume a single attribute value, and many attributes are involved in real-world datasets. Alternative detection techniques may be based on distance measures.

Cluster Analysis

5.1 INTRODUCTION

Cluster analysis is similar to classification in that data are grouped. Unlike classification, however, the groups are not determined in advance. Instead, the grouping is achieved by finding similarities among the data according to characteristics found in the actual data. These groups are called clusters. Some authors regard cluster analysis as a special type of classification. In this text, however, we follow the more conventional view that the two are different. Many definitions of a cluster have been proposed:

- A set of like elements. Elements from different clusters are not alike.
- The distance between points in a cluster is less than the distance between a point in the cluster and any point outside it.

A term similar to clustering is database segmentation, in which like tuples (records) in a database are grouped together. This is done to partition or segment the database into components that then give the user a more general view of the data. In this text, we do not differentiate between segmentation and clustering. A simple example of clustering is found in Example 5.1. This example shows that deciding how to do the clustering is not straightforward. As illustrated in Figure 5.1, a given set of data may be clustered on different attributes. Here a group of homes in a geographic area is shown. The first type of clustering is ...
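
The cluster descriptors and inter-cluster distances of Section 5.2 can be sketched in a few lines of Python. This is a minimal illustration, not an implementation from the source: it assumes Euclidean distance for dis, treats the medoid as the member with the smallest total distance to the other members (one common reading of "centrally located"), and all function names are hypothetical.

```python
import math
from itertools import product


def dis(a, b):
    """Distance between two numeric tuples (Euclidean is an assumption)."""
    return math.dist(a, b)


def centroid(cluster):
    """C_m: coordinate-wise mean; need not be an actual point of the cluster."""
    n = len(cluster)
    return tuple(sum(coords) / n for coords in zip(*cluster))


def radius(cluster):
    """R_m: square root of the mean squared distance to the centroid."""
    c = centroid(cluster)
    return math.sqrt(sum(dis(t, c) ** 2 for t in cluster) / len(cluster))


def diameter(cluster):
    """D_m: square root of the mean squared distance over all ordered
    pairs of distinct points; needs at least two points."""
    n = len(cluster)
    total = sum(dis(a, b) ** 2 for a in cluster for b in cluster)
    return math.sqrt(total / (n * (n - 1)))


def medoid(cluster):
    """M_m: member with the smallest total distance to the others (assumed)."""
    return min(cluster, key=lambda t: sum(dis(t, u) for u in cluster))


def single_link(ki, kj):
    """Smallest distance between an element of ki and an element of kj."""
    return min(dis(a, b) for a, b in product(ki, kj))


def complete_link(ki, kj):
    """Largest distance between an element of ki and an element of kj."""
    return max(dis(a, b) for a, b in product(ki, kj))


def average_link(ki, kj):
    """Average distance over all cross-cluster pairs."""
    return sum(dis(a, b) for a, b in product(ki, kj)) / (len(ki) * len(kj))


def centroid_distance(ki, kj):
    """Distance between the two centroids."""
    return dis(centroid(ki), centroid(kj))


def medoid_distance(ki, kj):
    """Distance between the two medoids."""
    return dis(medoid(ki), medoid(kj))
```

For two square clusters such as `[(0, 0), (0, 2), (2, 0), (2, 2)]` and `[(5, 0), (5, 2), (7, 0), (7, 2)]`, the single-link distance is the gap between the nearest edges while the centroid distance is the gap between the centers, which shows how the choice of measure changes what "close" means for the same pair of clusters.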

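Section 5.3 mentions that outlier detection may be based on distance measures without giving a procedure. One simple criterion, assumed here purely for illustration, flags a point when fewer than `min_neighbors` other points lie within distance `d` of it. The function name, both parameters, and the sample heights are hypothetical; only the 2.5 m value echoes the example in the text.

```python
import math


def distance_outliers(points, d, min_neighbors):
    """Return the points that have fewer than min_neighbors other points
    within distance d (an assumed criterion, not taken from the source)."""
    outliers = []
    for i, p in enumerate(points):
        neighbors = sum(
            1
            for j, q in enumerate(points)
            if j != i and math.dist(p, q) <= d
        )
        if neighbors < min_neighbors:
            outliers.append(p)
    return outliers


# Heights in meters; the last value mirrors the 2.5 m person in Section 5.3.
heights = [(1.62,), (1.70,), (1.75,), (1.68,), (1.80,), (2.50,)]
print(distance_outliers(heights, d=0.15, min_neighbors=1))  # [(2.5,)]
```

As the flooding example in Section 5.3 warns, a point flagged this way is only a candidate: whether to remove it depends on the mining task, and the thresholds `d` and `min_neighbors` have to be chosen with the data in mind.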
Source link: https://www.taowenge.com/p-29917877.html