"Data Analysis" PPT Slides (《数据分析》PPT课件.ppt)
A Comparison of Approaches to Large-Scale Data Analysis

Authors

Author 1: Andrew Pavlo, Brown University
1. MapReduce and parallel DBMSs: friends or foes?
2. A comparison of approaches to large-scale data analysis
3. H-Store: a high-performance, distributed main memory transaction processing system
4. The NMI build & test laboratory: continuous integration framework for distributed computing software
5. Smoother transitions between breadth-first-spanning-tree-based drawings
His main work compares Hadoop (MapReduce) with parallel database management systems for the analysis of large-scale data sets.

Author 2: Erik Paulson, University of Wisconsin
1. MapReduce and parallel DBMSs: friends or foes?
2. A comparison of approaches to large-scale data analysis
3. Clustera: an integrated computation and data management system
Like the first author, he mainly works on comparing Hadoop (MapReduce) with parallel database management systems for the analysis of large-scale data sets.

Author 3: Alexander Rasin, Brown University
1. CORADD: correlation aware database designer for materialized views and indexes
2. MapReduce and parallel DBMSs: friends or foes?
3. HadoopDB: an architectural hybrid of MapReduce and DBMS technologies for analytical workloads
4. Correlation maps: a compressed access method for exploiting soft functional dependencies
5. A comparison of approaches to large-scale data analysis
6. H-Store: a high-performance, distributed main memory transaction processing system
Building on this paper, the author went on to design HadoopDB, a system that combines MapReduce with a parallel database management system.

Abstract

There is currently considerable enthusiasm around the MapReduce (MR) paradigm for large-scale data analysis. Although the basic control flow of this framework has existed in parallel SQL database management systems (DBMS) for over 20 years, some have called MR a dramatically new computing model. In this paper, we describe and compare both paradigms. Furthermore, we evaluate both kinds of systems in terms of performance and development complexity. To this end, we define a benchmark consisting of a collection of tasks that we have run on an open source version of MR as well as on two parallel DBMSs. For each task, we measure each system's performance for various degrees of parallelism on a cluster of 100 nodes. Our results reveal some interesting trade-offs. Although the process to load data into and tune the execution of parallel DBMSs took much longer than the MR system, the observed performance of these DBMSs was strikingly better. We speculate about the causes of the dramatic performance difference and consider implementation concepts that future systems should take from both kinds of architectures.

1 Introduction

The main aim of the paper is to show how to weigh the trade-offs among Hadoop, DBMS-X, and Vertica and choose between them.
Section 2 introduces the two approaches to large-scale data analysis: MapReduce and parallel database management systems.
Section 3 describes the system architectures, including supported data formats, indexing, and programming models.
Section 4 presents the benchmark, which runs several tasks on a 100-node cluster to test MapReduce, DBMS-X, and Vertica.
On whether a 100-node test is representative: eBay's Teradata configuration uses 72 nodes (two quad-core CPUs, 32 GB of RAM, and 104 300 GB disks) to manage its relational data; Fox Interactive