Graduation Thesis Foreign Literature Translation: Data Mining, Cluster Analysis
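School of Electrical and Information Engineering
Foreign Literature Translation
English title: Data mining-clustering
Translated title (Chinese): Data Mining Cluster Analysis
Major: Automation
Name: *
Class and student ID: *
Advisor: *
Source: Data mining, by Ian H. Witten and Eibe Frank

Clustering

5.1 INTRODUCTION

Clustering is similar to classification in that data are grouped. However, unlike classification, the groups are not predefined. Instead,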
the grouping is accomplished by finding similarities between data according to characteristics found in the actual data. The groups are called clusters. Some authors view clustering as a special type of classification. In this text, however, we follow a more conventional view in that the two are different. Many definitions for clusters have been proposed:
• Set of like elements. Elements from different clusters are not alike.
• The distance between points in a cluster is less than the distance between a point in the cluster and any point outside it.

A term similar to clustering is database segmentation, where like tuples (records) in a database are grouped together. This is done to partition or segment the database into components that then give the user a more general view of the data. In this text, we do not differentiate between segmentation and clustering. A simple example of clustering is found in Example 5.1. This example illustrates the fact that determining how to do the clustering is not straightforward. As illustrated in Figure 5.1, a given set of data may be clustered on different attributes. Here a group of homes in a geographic area is shown. The first type of clustering is based on the location of the home. Homes that are geographically close to each other are clustered together. In the second clustering, homes are grouped based on the size of the house.

Clustering has been used in many application domains, including biology, medicine, anthropology,
marketing, and economics. Clustering applications include plant and animal classification, disease classification, image processing, pattern recognition, and document retrieval. One of the first domains in which clustering was used was biological taxonomy. Recent uses include examining Web log data
to detect usage patterns.

When clustering is applied to a real-world database, many interesting problems occur:
• Outlier handling is difficult. Here the elements do not naturally fall into any cluster. They can be viewed as solitary clusters. However, if a clustering algorithm attempts to find larger clusters, these outliers will be forced to be placed in some cluster. This process may result in the creation of poor clusters by combining two existing clusters and leaving the outlier in its own cluster.
• Dynamic data in the database implies that cluster membership may change over time.
• Interpreting the semantic meaning of each cluster may be difficult. With classification, the labeling of the classes is known ahead of time. However, with clustering, this may not be the case. Thus, when the clustering process finishes creating a set of clusters, the exact meaning of each cluster may not be obvious. Here is where a domain expert is needed to assign a label or interpretation for each cluster.
• There is no one correct answer to a clustering problem. In fact, many answers may be found. The exact number of clusters required is not easy to determine. Again, a domain expert may be required. For example, suppose we have a set of data about plants that have been collected during a field trip. Without any prior knowledge of plant classification, if we attempt to divide this set of data into similar groupings, it would not be clear how many groups should be created.
• Another related issue is what data should be used for clustering. Unlike learning during a classification process, where there is some a priori knowledge concerning what the attributes of each classification should be, in clustering we have no supervised learning to aid the process. Indeed, clustering can be viewed as
similar to unsupervised learning.

We can then summarize some basic features of clustering (as opposed to classification):
• The (best) number of clusters is not known.
• There may not be any a priori knowledge concerning the clusters.
• Cluster results are dynamic.

The clustering problem is stated as shown in Definition 5.1. Here we assume that the number of clusters to be created is an input value, k. The actual content (and interpretation) of each cluster is determined as a result of the function definition. Without loss of generality, we will view that the result of solving a clustering problem
is that a set of clusters is created: $K = \{K_1, K_2, \ldots, K_k\}$.

DEFINITION 5.1. Given a database $D = \{t_1, t_2, \ldots, t_n\}$ of tuples and an integer value $k$, the clustering problem is to define a mapping $f: D \to \{1, \ldots, k\}$ where each $t_i$ is assigned to one cluster $K_j$, $1 \le j \le k$. A cluster, $K_j$, contains precisely those tuples mapped to it; that is, $K_j = \{t_i \mid f(t_i) = j,\ 1 \le i \le n\}$ and $t_i \in D$.

A classification of the different types of clustering algorithms is shown in Figure 5.2. Clustering algorithms themselves may be viewed as hierarchical or partitional. With hierarchical clustering, a nested set of clusters is created. Each level in the hierarchy has a separate set of clusters. At the lowest level, each item
is in its own unique cluster. At the highest level, all items belong to the same cluster. With hierarchical clustering, the desired number of clusters is not input. With partitional clustering, the algorithm creates only one set of clusters. These approaches use the desired number of clusters to drive how the final set is created. Traditional clustering algorithms tend to be targeted to small numeric databases that fit into memory. There are, however, more recent clustering algorithms that look at categorical data and are targeted to larger, perhaps dynamic, databases. Algorithms targeted to larger databases may adapt to memory constraints by either sampling the database or using data structures, which can be compressed or pruned to fit into memory regardless of the size of the database.

Clustering algorithms may also differ based on whether they produce overlapping or nonoverlapping clusters. Even though we consider only nonoverlapping clusters, it is possible to place an item in multiple clusters. In turn, nonoverlapping clusters can be viewed as extrinsic or intrinsic. Extrinsic techniques use labeling of the items to assist in the classification process. These algorithms are the
traditional classification supervised learning algorithms in which a special input training set is used. Intrinsic algorithms do not use any a priori category labels, but depend only on the adjacency matrix containing the distance between objects. All algorithms we examine in this chapter fall into
the intrinsic class.

The types of clustering algorithms can be further classified based on the implementation technique used. Hierarchical algorithms can be categorized as agglomerative or divisive. "Agglomerative" implies that the clusters are created in a bottom-up fashion, while divisive algorithms work in a top-down fashion. Although both hierarchical and partitional algorithms could be described using the agglomerative vs. divisive label, it typically is more associated with hierarchical algorithms. Another descriptive tag indicates whether each individual element is handled one by one,
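
The mapping view of Definition 5.1 and the homes of Example 5.1 can be tied together with a small Python sketch. It is not taken from the source text: the four homes, their coordinates and sizes, the two seed centers per attribute, and the helper names `mapping_by_nearest` and `clusters_from_mapping` are all assumptions made for illustration. Assigning each tuple to its nearest seed is just one simple way to obtain a mapping f; it is not presented here as the chapter's algorithm.

```python
from math import dist  # Euclidean distance, available in Python 3.8+

# Hypothetical homes: (x, y) location in km and floor area in square metres.
homes = {
    "h1": {"loc": (0.0, 0.0), "size": 80},
    "h2": {"loc": (0.5, 0.2), "size": 250},
    "h3": {"loc": (9.0, 9.5), "size": 90},
    "h4": {"loc": (9.4, 9.1), "size": 240},
}


def mapping_by_nearest(points, centers):
    """Return f: tuple id -> cluster index j in 1..k (index of nearest center)."""
    f = {}
    for tid, p in points.items():
        distances = [dist(p, c) for c in centers]
        f[tid] = distances.index(min(distances)) + 1
    return f


def clusters_from_mapping(f, k):
    """Build K_j = {t_i | f(t_i) = j} for j = 1..k."""
    return {j: {tid for tid, label in f.items() if label == j}
            for j in range(1, k + 1)}


k = 2  # the number of clusters is an input, as in Definition 5.1

# Clustering on location: assumed seed centers near each neighbourhood.
by_location = mapping_by_nearest(
    {t: h["loc"] for t, h in homes.items()}, [(0.0, 0.0), (9.0, 9.0)]
)
# Clustering on size: assumed seed centers for small and large homes.
by_size = mapping_by_nearest(
    {t: (h["size"],) for t, h in homes.items()}, [(85,), (245,)]
)

location_clusters = clusters_from_mapping(by_location, k)
size_clusters = clusters_from_mapping(by_size, k)

print(location_clusters)
print(size_clusters)
```

The same four tuples end up in different clusters depending on the attribute chosen, which is the point the homes example makes: there is no single correct clustering, and the choice of attributes and of k is left to the analyst.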