Sandbox security model for Hadoop file system


Open Access

RESEARCH

Gousiya Begum1,4*, S. Zahoor Ul Huq2 and A. P. Siva Kumar3

*Correspondence: [email protected]
1 Department of Computer Science and Engineering, Mahatma Gandhi Institute of Technology, Gandipet, Hyderabad, India
Full list of author information is available at the end of the article

Abstract

Extensive use of Internet-based applications in day-to-day life has led to the generation of huge amounts of data every minute. Apart from humans, data is generated by machines such as sensors, satellites and CCTV. This huge collection of heterogeneous data is often referred to as Big Data, and it can be processed to draw useful insights. Apache Hadoop has emerged as a widely used open source software framework for Big Data processing; it runs as a cluster of cooperating computers that enables distributed parallel processing. The Hadoop Distributed File System (HDFS) stores data as blocks that are replicated and spread across different nodes. HDFS uses AES-based cryptographic techniques at the block level, which are transparent and end-to-end in nature. Cryptography protects the data blocks from unauthorized access, but a legitimate user can still harm the data. One example is the execution of a malicious MapReduce jar file by a legitimate user, which can harm the data in HDFS. We developed a mechanism in which every MapReduce jar is tested by our sandbox security model to ensure that it is not malicious; suspicious jar files are not allowed to process the data in HDFS. This feature is not present in the existing Apache Hadoop framework, and our work is available on GitHub for consideration and inclusion in future versions of Apache Hadoop.

Keywords: HDFS, MapReduce, Fsimage, Hadoop, Kerberos

Introduction

Apache Hadoop has emerged as a widely used open source framework for Big Data processing. Big Data processing is used in healthcare, social media, banking, insurance, good governance, stock markets, retail and supply chain, e-commerce, education and scientific research to gain deep insights into the data and their associations, and to make better decisions [1]. Apache Hadoop addresses the two major challenges of Big Data, namely storage and processing. Data is stored in Hadoop using HDFS and processed using MapReduce programming. Apache Hadoop is a cluster of cooperating computers. The anatomy of a Hadoop cluster is shown in Fig. 1. The number of Data nodes can vary from cluster to cluster, but every Hadoop cluster must contain a Name node, a Resource Manager and a Secondary Name node. In Hadoop, files are stored using HDFS.
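To make the storage side concrete, the following minimal Python sketch shows how a file can be split into fixed-size blocks, each of which is copied to several Data nodes. It assumes the stock HDFS defaults of 128 MB blocks and a replication factor of 3. The function names `block_count` and `place_replicas`, the node names, and the round-robin placement are illustrative assumptions only; real HDFS uses a rack-aware placement policy decided by the Name node.

```python
BLOCK_SIZE = 128 * 1024 * 1024  # assumed default block size (128 MB)
REPLICATION = 3                 # assumed default replication factor


def block_count(file_size, block_size=BLOCK_SIZE):
    """Return how many blocks are needed to hold file_size bytes."""
    if file_size <= 0:
        return 0
    # Integer ceiling division: the last block may be only partly filled.
    return (file_size + block_size - 1) // block_size


def place_replicas(file_size, data_nodes,
                   replication=REPLICATION, block_size=BLOCK_SIZE):
    """Map each block index to the Data nodes that hold its replicas.

    Hypothetical round-robin placement, used here only to show that
    every block ends up on `replication` distinct nodes.
    """
    if replication > len(data_nodes):
        raise ValueError("need at least as many Data nodes as replicas")
    placement = {}
    for block in range(block_count(file_size, block_size)):
        placement[block] = [
            data_nodes[(block + r) % len(data_nodes)]
            for r in range(replication)
        ]
    return placement


if __name__ == "__main__":
    nodes = ["dn1", "dn2", "dn3", "dn4"]  # hypothetical Data node names
    for block, holders in place_replicas(300 * 1024 * 1024, nodes).items():
        print(block, holders)
```

With four Data nodes, a 300 MB file occupies three blocks, and losing any single node still leaves two copies of every block. Replication protects availability, not integrity against a job that a legitimate user is allowed to run.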

© The Author(s) 2020. This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this ar