Research of Access Optimization of Small Files on Basis of B+ Tree on Hadoop



Abstract Hadoop, the open-source software for reliable, scalable, distributed computing used to process and store extremely large data sets, was originally designed to store large numbers of large files. When dealing with massive numbers of small files, this results in considerable wastage of DataNode storage space and increased memory consumption on the NameNode. To address these shortcomings, this paper puts forward an optimized small-file access scheme for the Hadoop platform based on a B+ tree index, which speeds up small-file location through the file index and thereby improves the access efficiency of small files. The effectiveness of the proposed scheme is experimentally validated.

Keywords Hadoop · HDFS · B+ tree · Small files · Sequence file
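
To make the merge-and-index idea from the abstract concrete, the following is a minimal, hypothetical Java sketch, not the authors' implementation: small files are appended to a Hadoop SequenceFile while a sorted map records each file's byte offset, and a java.util.TreeMap stands in for the disk-resident B+ tree index proposed in the paper. The class and method names (SmallFileMerger, merge, read) are illustrative assumptions.

import java.io.IOException;
import java.util.TreeMap;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class SmallFileMerger {

    // Append every file under smallFileDir to one SequenceFile and record
    // each file's starting byte offset in a sorted index (B+ tree stand-in).
    public static TreeMap<String, Long> merge(Configuration conf, Path smallFileDir,
                                              Path mergedFile) throws IOException {
        TreeMap<String, Long> index = new TreeMap<>();
        FileSystem fs = FileSystem.get(conf);
        try (SequenceFile.Writer writer = SequenceFile.createWriter(conf,
                SequenceFile.Writer.file(mergedFile),
                SequenceFile.Writer.keyClass(Text.class),
                SequenceFile.Writer.valueClass(BytesWritable.class))) {
            for (FileStatus status : fs.listStatus(smallFileDir)) {
                if (status.isDirectory()) {
                    continue;
                }
                // Offset of the record that is about to be written.
                index.put(status.getPath().getName(), writer.getLength());
                byte[] content = new byte[(int) status.getLen()];
                try (FSDataInputStream in = fs.open(status.getPath())) {
                    IOUtils.readFully(in, content, 0, content.length);
                }
                writer.append(new Text(status.getPath().getName()),
                              new BytesWritable(content));
            }
        }
        return index;
    }

    // Locate one small file through the index and read only that record,
    // instead of scanning the whole merged file.
    public static byte[] read(Configuration conf, Path mergedFile,
                              TreeMap<String, Long> index, String fileName) throws IOException {
        Long offset = index.get(fileName);   // O(log n) lookup in the sorted index
        if (offset == null) {
            return null;                     // file was never merged
        }
        try (SequenceFile.Reader reader = new SequenceFile.Reader(conf,
                SequenceFile.Reader.file(mergedFile))) {
            reader.seek(offset);             // jump directly to the record
            Text key = new Text();
            BytesWritable value = new BytesWritable();
            reader.next(key, value);
            return value.copyBytes();
        }
    }
}

Recording the offset before each append mirrors the way Hadoop's own MapFile pairs a data SequenceFile with a sorted key-to-position index; organizing that index as a B+ tree, as the paper proposes, keeps lookups logarithmic even when the index itself is too large to hold in memory.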

1 Introduction

In recent years, with the continuous advance of information science and technology, "big data" has gradually become a focus of attention for both industry and academia [1, 2]. Faced with this explosive growth of Internet data [3], Google published papers in 2003 that first proposed GFS, MapReduce, and other distributed data-processing technologies for handling such massive data [4, 5]. Thereafter, the Apache Foundation developed the distributed system Hadoop. Since the Hadoop platform was originally intended to store large numbers of large files, it wastes a great deal of DataNode storage space and consumes increasing NameNode memory when dealing with massive numbers of small files. Therefore, improving the ability of HDFS to handle small files has attracted more and more attention.

At present, there is a great deal of research on Hadoop small-file access methods, both in academia and in the applications of the major Internet companies. Mackey et al. [6] put forward a scheme that assigns a quota to each client in the HDFS file system and verified its effectiveness. Building on that work, Vorapongkitipun and Nupairoj [7] proposed a new Hadoop archive (HAR) scheme to address the problem that the metadata of small files occupies NameNode memory, improving both the NameNode's metadata memory utilization and the access efficiency of small files. In China, many researchers and Internet companies, such as Taobao and Tencent, have put forward their own solutions to the small-file access problem. Changtong [8] merges small files into one large file and builds a hash index for each merged file to improve the efficiency of small-file access. Focusing on the problem that huge numbers of small files impo