Undergraduate Thesis (Design): Foreign Literature Translation

Data Mining

2 Usage scenarios

Data mining is widely used in a range of scientific disciplines and business
scenarios. Some noteworthy examples include findings in the areas of database management, machine learning, Bayesian inference, knowledge gain for expert systems, fuzzy logic, neural networks, and genetic algorithms. Examples in everyday business scenarios include database marketing for airlines and panel data research, as well as the creation of customized trade publications based on subscriber data for hundreds of different user groups. Frawley and Piatetsky-Shapiro (Frawley et al. 1991) offer a detailed overview of further areas of usage.

Gross margin analysis is another interesting field of research in data mining. With the help of modern cost accounting software, companies can perform multidimensional analysis on individual income items. Fig. 2 lists a few sample questions related to this topic. Due to the numerous reference objects (e.g. products, customers, sales channels, regions) and the resulting number of objects that need to be examined, controllers require methods that automatically identify data patterns. In this case, these patterns are a combination of attribute values (e.g. “DIY stores” and “power drills” in Fig. 1) as well as measures (e.g. gross margin). A company that
develops a data mining program must also consider the large volumes of data involved. Even in a midsize company, for example, it is common that several hundred thousand items flow into a monthly income statement.

Case-Based Reasoning (CBR) is one interesting example of how data mining and machine learning could work together. CBR components attempt to trace current questions to problems that have already been solved in the past. Help desks, which assist in clarifying the questions a customer has about purchased products, are one practical usage of this type of procedure. While some companies use help desks to support their telephone hotlines, others give their customers direct access through remote data transfer. Data mining can be very valuable in this context because it consolidates the information gathered in thousands of individual historical cases into key findings. The advantage
of this procedure is the shorter process of searching for precedents which can be used to answer the current customer's question.

3 Methods

There are many different types of methods to analyze and classify data. Some common methods include cluster analysis and Bayesian inference, as well as inductive learning. Cluster analysis can be used based on numerical measures as well as in the form of conceptual clustering. The structures of data mining systems are very different by nature. The following configuration, however, is very common:

- The analysis method, which identifies and analyzes patterns, forms the core of the system.
- The input can include components such as raw data, information from a data dictionary, knowledge of the usage scenario, or user entries to narrow the search process.
- The output encompasses the found measures, rules or information, which are presented to the user in an appropriate
form, incorporated into the system as new knowledge, or integrated into an expert system.

3.1 Cluster analysis

Whether in its traditional form or as conceptual clustering, cluster analysis attempts to divide or combine a set number of objects into groups based on the proximity that exists among these objects. The clusters are grouped so that there are large similarities among the objects of a class as well as large dissimilarities among the objects of different classes.

3.1.1 Traditional cluster analysis

Regardless of the scaling level of the object variables, there are multiple ways to measure the
similarity and difference of the proximity. Basic examples include the Euclidean distance (i.e. the square root of the total squared differences) and the Manhattan distance (i.e. the sum of the absolute differences of the individual variables). In general, we can examine metric, nominal as well as mixed data sets by varying the proximity measure. When objects have different types of attributes, for example, Kaufman and Rousseeuw recommend calculating a difference of 0 for the individual nominal attributes when the values are the same, and a difference of 1 when they are different. In the case of metric variables, we first need to establish the difference among the object values. To standardize them, we then divide them by the maximum difference. The result is a difference between 0 and 1. We then calculate the total difference between two object vectors as the sum of the individual differences (Kaufman and Rousseeuw 1990). We can use this type of measure (possibly extended by weights for the individual attributes) to cluster data sets in gross margin analysis. These contain nominal attributes (e.g. product, customer, region) as well as numerical measures (revenues or gross margin).

There is a general differentiation between the partitional and hierarchical classification methods. Simply put, partitional methods try to iteratively minimize the heterogeneity of a given initial allotment of objects into clusters. Hierarchical methods, which are practically significant, take a completely different approach. Initially, each object is located in its own cluster. The objects, however, are then combined successively so that only the smallest level of homogeneity is lost in each step. We can easily present the resulting hierarchy of nested clusters in a so-called dendrogram.

3.1.2 Conceptual clustering

As described above, traditional forms of cluster analysis can identify groups of similar objects but cannot describe these classes beyond a simple list of the individual objects. The objective of many usage scenarios, however, is to characterize the existing structures that are buried among the volumes of data. Instead of representing object classes through simply listing their objects, conceptual clusters intentionally describe them using terms which classify the individual objects through rules. A group of these rules forms a so-called concept. A basic example of a concept is a program that automatically and logically links individual attribute values. Advanced systems can even establish concepts and concept hierarchies with classification rules.

The different concepts in partitional methods of conceptual clustering compete with each other. Ultimately, we have to choose the clustering concept that best meets the performance criteria for a specific method. Some performance criteria include the simplicity of the concept (based on the number of attributes involved) or the discriminatory power (measured as the number of variables whose values do not overlap across the different object classes). Similar to traditional cluster analysis, there are also hierarchical techniques that form classification trees in a top-down approach. As described above, the best classification in terms of performance criteria will take place on each level of the tree. The process ends when no further improvement is possible from one tree level to the next.

3.3 Inductive learning

Let us assume there is a given set of objects (i.e. a training set) with known classes. Inductive learning attempts to define a rule that, based on its attributes, assigns a new object to one of the existing classes. A common approach is to visualize the learned rules as a decision tree. The leaves of the tree represent the classes, while the descending branches represent tests that each check one attribute value. Each possible outcome of a test receives its own branch, which in turn leads to another test or ends in a leaf. With the ID3 algorithm, a well-known example of this approach, we can build a tree in a few iterations starting from a subset of the training set, even for training sets with 10,000 objects and 50 attributes. ID3 then classifies the remaining objects of the training set; if some are classified incorrectly, the algorithm restarts with a training set extended by a portion of the incorrectly classified objects (Quinlan 1986). A bank, for example, could use such a procedure to build and maintain an expert system that checks the credit rating of individual customers. If a training set contains a large group of customers with high or low credit ratings, the algorithm can derive rules for assessing future loan applications, which bank employees can then process in the system.

4 Critical factors

The following section outlines some problems associated with data mining. In our opinion, these critical factors for success will form the foundation for future research and development.

4.1 Efficiency of algorithms

Regarding the efficiency of data mining
algorithms, we should consider the following aspects.

Calculation times are a key factor. If the calculation times of algorithms grow faster than linearly in the square of the number of data records to be searched, we can assume that they will not be suitable for larger applications. We can improve calculation times by limiting the search area through user input, or by reducing the searched data volume through targeted (e.g. user-based) selection and compression. Recent developments show that the calculation time of algorithms will become less relevant due to technical advances (e.g. faster processors, parallel computers).

The algorithms must be robust enough to deal with incomplete and/or flawed data. The problem here is that flawed data produces noticeable patterns. If a sales region had accidentally forgotten to plan revenues for a series of articles, the system should diagnose extremely high budget-actual variances. The system, however, should not present these types of statements as part of the normal analysis results, but rather detect them in a plausibility check and report the incomplete sections in a separate report.
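The plausibility check described above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the record fields (`planned`, `actual`, `article`) and the variance threshold are invented for the example.

```python
# Hypothetical sketch of a plausibility check (section 4.1): records with
# implausible budget-actual variances are diverted into a separate
# exceptions list instead of the normal analysis results.
# Field names and the threshold are invented for illustration.

def split_plausible(records, max_ratio=10.0):
    """Divide records into (normal, suspicious) lists.

    A record is suspicious when planned revenue is missing/zero while
    actual revenue is not, or when the budget-actual ratio is extreme.
    """
    normal, suspicious = [], []
    for rec in records:
        planned = rec.get("planned", 0.0)
        actual = rec.get("actual", 0.0)
        if planned == 0.0 and actual != 0.0:
            suspicious.append(rec)      # plan was probably forgotten
        elif planned != 0.0 and abs(actual / planned) > max_ratio:
            suspicious.append(rec)      # extreme variance: report separately
        else:
            normal.append(rec)
    return normal, suspicious

records = [
    {"article": "A-100", "planned": 5000.0, "actual": 5400.0},
    {"article": "A-200", "planned": 0.0, "actual": 8200.0},    # forgotten plan
    {"article": "A-300", "planned": 100.0, "actual": 4000.0},  # extreme ratio
]
normal, suspicious = split_plausible(records)
print([r["article"] for r in normal])      # ['A-100']
print([r["article"] for r in suspicious])  # ['A-200', 'A-300']
```

In a real system the suspicious records would feed the separate incompleteness report mentioned in the text, while only the normal records enter pattern analysis.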
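The mixed-attribute proximity measure of section 3.1.1 (0/1 differences for nominal attributes, metric differences standardized by the maximum difference, then summed) and the successive merging of hierarchical methods can also be sketched compactly. The sample objects and attribute names are invented, and single linkage is chosen here as one concrete merging criterion; the paper itself does not prescribe a specific linkage.

```python
# Sketch of the mixed-attribute dissimilarity from section 3.1.1:
# nominal attributes contribute 0 (equal) or 1 (different); metric
# attributes contribute their absolute difference divided by the
# maximum difference observed in the data set. Sample data is invented.

objects = [
    {"product": "power drill", "channel": "DIY store", "margin": 120.0},
    {"product": "power drill", "channel": "online",    "margin": 110.0},
    {"product": "garden hose", "channel": "DIY store", "margin": 15.0},
]
NOMINAL = ("product", "channel")
METRIC = ("margin",)

# maximum observed difference per metric attribute (for standardization)
max_diff = {a: max(o[a] for o in objects) - min(o[a] for o in objects)
            for a in METRIC}

def dissimilarity(x, y):
    """Total difference between two object vectors: the sum of the
    individual differences, each scaled into [0, 1]."""
    d = sum(0.0 if x[a] == y[a] else 1.0 for a in NOMINAL)
    d += sum(abs(x[a] - y[a]) / max_diff[a] for a in METRIC if max_diff[a] > 0)
    return d

# Simple agglomerative (single-linkage) clustering: every object starts
# in its own cluster; the closest pair of clusters is merged until one
# cluster remains, mirroring the successive combination in section 3.1.1.
clusters = [[i] for i in range(len(objects))]
while len(clusters) > 1:
    i, j = min(
        ((a, b) for a in range(len(clusters)) for b in range(a + 1, len(clusters))),
        key=lambda ab: min(dissimilarity(objects[p], objects[q])
                           for p in clusters[ab[0]] for q in clusters[ab[1]]),
    )
    print("merge", clusters[i], "+", clusters[j])
    clusters[i] = clusters[i] + clusters[j]
    del clusters[j]
```

Run as-is, the two objects sharing the same product are merged first, then the remaining object joins them; recording the merge order and distances would yield the dendrogram mentioned in the text.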