Research of Access Optimization of Small Files on Basis of B+ Tree on Hadoop



Abstract Hadoop, the open-source software for reliable, scalable, distributed computing used to process and store extremely large data sets, was originally designed to store large numbers of large files. When dealing with massive numbers of small files, this results in considerable wastage of DataNode storage space and increased memory consumption on the NameNode. To address these shortcomings, this paper puts forward an optimized small-file access scheme for the Hadoop platform based on a B+ tree index, which speeds up small-file location through the file index and thereby improves the access efficiency of small files. The effectiveness of the proposed scheme is experimentally validated.

Keywords Hadoop · HDFS · B+ tree · Small files · Sequence file
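
To make the merge-and-index idea from the abstract concrete, the following is a minimal, hypothetical Java sketch, not the authors' implementation: small files are appended to a Hadoop SequenceFile while a sorted map records each file's byte offset, and a java.util.TreeMap stands in for the disk-resident B+ tree index proposed in the paper. The class and method names (SmallFileMerger, merge, read) are illustrative assumptions.

import java.io.IOException;
import java.util.TreeMap;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class SmallFileMerger {

    // Append every file under smallFileDir to one SequenceFile and record
    // each file's starting byte offset in a sorted index (B+ tree stand-in).
    public static TreeMap<String, Long> merge(Configuration conf, Path smallFileDir,
                                              Path mergedFile) throws IOException {
        TreeMap<String, Long> index = new TreeMap<>();
        FileSystem fs = FileSystem.get(conf);
        try (SequenceFile.Writer writer = SequenceFile.createWriter(conf,
                SequenceFile.Writer.file(mergedFile),
                SequenceFile.Writer.keyClass(Text.class),
                SequenceFile.Writer.valueClass(BytesWritable.class))) {
            for (FileStatus status : fs.listStatus(smallFileDir)) {
                if (status.isDirectory()) {
                    continue;
                }
                // Offset of the record that is about to be written.
                index.put(status.getPath().getName(), writer.getLength());
                byte[] content = new byte[(int) status.getLen()];
                try (FSDataInputStream in = fs.open(status.getPath())) {
                    IOUtils.readFully(in, content, 0, content.length);
                }
                writer.append(new Text(status.getPath().getName()),
                              new BytesWritable(content));
            }
        }
        return index;
    }

    // Locate one small file through the index and read only that record,
    // instead of scanning the whole merged file.
    public static byte[] read(Configuration conf, Path mergedFile,
                              TreeMap<String, Long> index, String fileName) throws IOException {
        Long offset = index.get(fileName);   // O(log n) lookup in the sorted index
        if (offset == null) {
            return null;                     // file was never merged
        }
        try (SequenceFile.Reader reader = new SequenceFile.Reader(conf,
                SequenceFile.Reader.file(mergedFile))) {
            reader.seek(offset);             // jump directly to the record
            Text key = new Text();
            BytesWritable value = new BytesWritable();
            reader.next(key, value);
            return value.copyBytes();
        }
    }
}

Recording the offset before each append mirrors the way Hadoop's own MapFile pairs a data SequenceFile with a sorted key-to-position index; organizing that index as a B+ tree, as the paper proposes, keeps lookups logarithmic even when the index itself is too large to hold in memory.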

1 Introduction

In recent years, with the continuous advance of information science and technology, "big data" has gradually become a focus of attention for both industry and academia [1, 2]. Faced with this explosive growth of Internet data [3], Google published papers in 2003 that first proposed GFS, MapReduce, and other distributed data-processing technologies for handling such massive data [4, 5]. Thereafter, the Apache Foundation developed the distributed system Hadoop. Since the Hadoop platform was originally intended to store large numbers of large files, it wastes a great deal of DataNode storage space and consumes increasing NameNode memory when dealing with massive numbers of small files. Therefore, improving the ability of HDFS to handle small files has attracted more and more attention.

At present, there is a great deal of research on Hadoop small-file access methods, both in academia and in the applications of the major Internet companies. Mackey et al. [6] put forward a scheme that assigns a quota to each client in the HDFS file system and verified its effectiveness. Building on that work, Vorapongkitipun and Nupairoj [7] proposed a new Hadoop archive (HAR) scheme to address the problem that the metadata of small files occupies NameNode memory, improving both the NameNode's metadata memory utilization and the access efficiency of small files. In China, many researchers and Internet companies, such as Taobao and Tencent, have put forward their own solutions to the small-file access problem. Changtong [8] merges small files into one large file and builds a hash index for each merged file to improve the efficiency of small-file access. Focusing on the problem that huge numbers of small files impo