An Effective Merge Strategy Based Hierarchy for Improving Small File Problem on HDFS

Zhipeng Gao1, Yinghao Qin1, Kun Niu2
1State Key Laboratory of Networking and Switching Technology, Beijing University of Posts and Telecommunications, Beijing 100876, China
2School of Software Engineering, Beijing University of Posts and Telecommunications, Beijing 100876, China
gaozhipeng, qyhqyh 123123163, niukun

Abstract: The Hadoop Distributed File System (HDFS) is designed for reliable, low-cost storage and management of very large files. Because the HDFS architecture relies on one master (NameNode) to handle metadata for many slaves (DataNodes), the NameNode often becomes a bottleneck, especially when handling a large number of small files. A common solution to this problem is to merge many small files into one big file. However, HDFS does not consider the correlation between the files stored on it, which makes it hard to apply an efficient prefetching mechanism. To solve the massive small file problem and to improve the efficiency of accessing small files, this paper defines the Logic File Name (LFN) and proposes the Small file Merge Strategy Based on LFN (SMSBL). SMSBL offers a new perspective on hierarchy: based on the hierarchy of the underlying file system, it effectively improves the correlation of the small files placed in the same HDFS block, so that HDFS with SMSBL and a prefetching mechanism performs well under massive small file workloads. A system efficiency analysis model is established, and experimental results demonstrate that SMSBL can solve the small file problem in HDFS and achieves an appreciably high hit rate when prefetching files.

Keywords: HDFS; small files; merging and prefetching

Introduction
With the rapid development of Internet services, the amount of data grows exponentially, and cloud computing has become increasingly popular as the next infrastructure for hosting data and deploying software and services [1]. As data from web applications explode and grow bigger, it is hard for traditional systems to handle this situation, so new projects and systems are needed. Distributed file systems have been introduced and developed for cloud computing and big data. The three most famous examples are the Google File System (GFS) [2], the Hadoop Distributed File System (HDFS) and Amazon Simple Storage Service (S3). Among them, HDFS is an open-source software framework influenced by GFS; it is widely used in many fields such as science, biology, climatology, astronomy, finance, the Internet, geography, etc. [3].

A small file is a file whose size is less than the HDFS block size [4]. HDFS is designed for storing large files; it is inefficient when storing a large number of small files, suffering from high memory usage and unacceptable access cost [5].

The rest of the paper is organized as follows. Section II discusses the background and related work. Section III describes SMSBL, our proposed approach to improving HDFS. Experiments are conducted in Section IV. Section V concludes the paper and outlines future work.
1 Background

HDFS

Internet giants, such as Facebook, Twitter and Yahoo, use HDFS as their basic distributed data storage environment. The design of the Hadoop Distributed File System (HDFS) is fully inspired by the Google File System (GFS); both are master/slave architectures. Earlier versions of an HDFS cluster have only one master node, called the NameNode, which holds and manages the metadata, including the distributed file system namespace, file descriptions, file-to-block mappings, data block allocations, access regulations and so on [6]. The High Availability (HA) version of an HDFS cluster has more than one NameNode, an active NameNode and a standby NameNode, to enhance the reliability of HDFS and the efficiency of cluster operation [7]. Along with the continuous improvement of Hadoop and HDFS, they play more and more important roles in the era of big data.
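The division of labor in this master/slave architecture is easy to see in a minimal client read: the client contacts the NameNode only for metadata (which blocks, on which DataNodes), then streams the file bytes from the DataNodes. The following is a small sketch using the stock Hadoop FileSystem API; the path argument is a placeholder, and fs.defaultFS is assumed to point at the cluster's NameNode.

import java.io.InputStream;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

// Minimal HDFS read: the client asks the NameNode for the file's
// block locations (metadata), then streams the blocks from DataNodes.
public class HdfsRead {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // fs.defaultFS must point at the NameNode, e.g. hdfs://namenode:9000
        try (FileSystem fs = FileSystem.get(conf);
             InputStream in = fs.open(new Path(args[0]))) {
            IOUtils.copyBytes(in, System.out, 4096, false);
        }
    }
}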
The NameNode is not only in charge of responding to clients that request access to files on the HDFS cluster, but also manages a huge amount of metadata. Whatever the size of a file, its metadata consumes 250 bytes, and each of its blocks, with the default three replicas, consumes 368 bytes in the memory of the NameNode [8]. When many small files are stored in HDFS, the memory of the NameNode comes under high pressure, and maintaining this huge volume of metadata becomes inefficient [9]. When the size of the relevant metadata exceeds the memory of the NameNode, or frequent access to small files exceeds the I/O capacity of the cluster, HDFS may shut down abnormally [10]. These situations arise because the original HDFS was designed for huge files transferred at large scale [11].
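To see why this matters at scale, the per-object figures above support a quick back-of-envelope estimate. The sketch below assumes one block per small file and the 250-byte and 368-byte costs cited in [8]; the file count is an arbitrary example, and actual per-object overhead varies by Hadoop version.

// Back-of-envelope NameNode heap estimate using the per-object costs
// cited above: ~250 bytes of file metadata plus ~368 bytes for one
// block with the default three replicas. Illustrative only.
public class NameNodeMemoryEstimate {
    static final long FILE_META_BYTES  = 250;
    static final long BLOCK_META_BYTES = 368;

    public static void main(String[] args) {
        long files = 100_000_000L; // 100 million small files, one block each
        long bytes = files * (FILE_META_BYTES + BLOCK_META_BYTES);
        System.out.printf("%d small files -> ~%.1f GB of NameNode heap%n",
                files, bytes / 1e9);
    }
}

At 100 million files this already demands roughly 62 GB of NameNode heap for metadata alone, before any request handling, which is why merging small files is the common mitigation.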
With the coming of the big data era, Hadoop and HDFS have to deal with many complicated kinds of data that they may not be good at. The HDFS architecture is shown in Fig. 1.

Related work

Present studies on handling the small file problem mainly concentrate on two approaches.

The first is improving the architecture of distributed file systems. Xuhui Liu's approach merges small files into big ones and builds an index for each small file; the files are stored consecutively according to their geographic locations, and a hash index is used [12]. But this approach applies exclusively to WebGIS. Chandrasekar S. proposed the Extended Hadoop Distributed File System (EHDFS), designed and implemented so that a large number of small files can be merged into a single combined file; it also provides a framework for prefetching the metadata of a specified number of files [13]. Chatuporn Vorapongkitpun proposed a mechanism based on Hadoop Archive (HAR), called New Hadoop Archive (NHAR), to reduce the memory utilization for metadata and enhance the efficiency of accessing small files in HDFS. Dipayan Dev designed Hadoop Archive Plus (HAR+), a modification of the existing HAR that uses a hashtable in its architecture with SHA-256 as the key; HAR+ is designed to provide more reliability, and it can also provide auto-scaling of metadata [14]. But improving the architectural design in this way is very complex and costly in resources.
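Several of the systems above share the same core mechanic: append small files into one combined HDFS file and keep a per-file (offset, length) index, so that a read can seek directly to its bytes instead of burdening the NameNode with one metadata entry per tiny file. The sketch below shows this generic mechanic with the standard Hadoop FileSystem API; it is not the exact EHDFS or HAR implementation, and persisting the returned index is left out.

import java.util.LinkedHashMap;
import java.util.Map;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

// Generic merge-with-index sketch: append each small file to one
// combined HDFS file and remember (offset, length) per file, so a
// later read can seek straight to its bytes.
public class SmallFileMerger {
    /** Maps original file name -> [offset, length] in the combined file. */
    public static Map<String, long[]> merge(FileSystem fs, Path[] smallFiles,
                                            Path combined) throws Exception {
        Map<String, long[]> index = new LinkedHashMap<>();
        try (FSDataOutputStream out = fs.create(combined)) {
            for (Path p : smallFiles) {
                long offset = out.getPos();
                try (FSDataInputStream in = fs.open(p)) {
                    IOUtils.copyBytes(in, out, 4096, false);
                }
                index.put(p.getName(), new long[]{offset, out.getPos() - offset});
            }
        }
        return index; // a real system would persist this index, e.g. as a side file
    }
}

This is why one combined file replaces N NameNode metadata entries with one; the architectural proposals above differ mainly in where the index lives and how it is kept consistent.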
The other approach is optimizing the strategy for merging small files and using a cache to prefetch files. Du Zhonghui presented a balanced small file merging algorithm that optimizes the volume distribution of the merged large files, effectively reducing the number of HDFS data blocks. Zhang Chunming proposed an approach called HIFM (Hierarchy Index File Merging), in which the correlations between small files and the directory structure are considered to assist in merging small files into large ones and then generating a hierarchical index [15]. Zhang Chunming's proposal is good, but it lacks universality.

1.1 Our contribution

Through the study of the above approaches, we propose a new approach with better performance in both local correlation and universality. In our new approach we define the Logic File Name (LFN) and adjust the order of the fields in the LFN to match the environment in which it is used. Our approach, named the Small file Merge Strategy Based on LFN (SMSBL), can be used in most hierarchical file systems and can elevate the hit rate of prefetching files.
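This excerpt does not spell out the exact LFN format, so the following is a hypothetical sketch under one plausible reading: an LFN is obtained by reordering the fields of a file's hierarchical path, and small files are merged in sorted LFN order so that files likely to be accessed together land adjacently in the same combined file and HDFS block. The field order {1, 0, 2, 3} (year before subject area) and the example paths are assumptions for illustration, not the paper's published format.

import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Hypothetical LFN sketch: build an LFN by reordering the fields of a
// file's hierarchical path, then sort files by LFN so correlated files
// become neighbors in the merged file.
public class LfnSort {
    /** e.g. "/biology/2016/exp7/a.dat" with order {1,0,2,3}
     *  -> "2016/biology/exp7/a.dat" */
    static String toLfn(String path, int[] fieldOrder) {
        String[] fields = path.replaceAll("^/", "").split("/");
        StringBuilder lfn = new StringBuilder();
        for (int i : fieldOrder) {
            if (i < fields.length) {
                if (lfn.length() > 0) lfn.append('/');
                lfn.append(fields[i]);
            }
        }
        return lfn.toString();
    }

    public static void main(String[] args) {
        int[] order = {1, 0, 2, 3}; // assumed field priority: year first
        List<String> files = Arrays.asList(
                "/biology/2016/exp7/a.dat",
                "/climate/2016/station3/b.dat",
                "/biology/2015/exp7/c.dat");
        files.sort(Comparator.comparing(p -> toLfn(p, order)));
        files.forEach(System.out::println); // merge in this order
    }
}

With such an ordering, when a client reads one 2016 file, the other 2016 files sit adjacent in the merged file, so a block-level prefetch naturally pulls in good candidates; raising that hit rate is the effect SMSBL aims for.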