An Architecture for Data Warehousing in Big Data Environments



Abstract. Recent advances in Information Technologies have increased the capacity to collect and store data, and the term Big Data is often used to describe this trend. In this context, many challenges need to be addressed, Data Warehousing being one of them. The main purpose of this work is to propose an architecture for Data Warehousing in Big Data that takes as input a data source stored in a traditional Data Warehouse and transforms it into a Data Warehouse in Hive. Before the architecture was proposed and implemented, a benchmark was conducted to verify the processing times of Hive and Impala and to understand how these technologies could be integrated in an architecture where Hive plays the role of the Data Warehouse and Impala is the driving force for the analysis and visualization of data. The proposed architecture was then implemented using tools such as the Hadoop ecosystem, Talend and Tableau, and validated using a data set with more than 100 million records, obtaining satisfactory results in terms of processing times.

Keywords: Big data · Data warehouse · NoSQL · Hadoop · Hive · Impala

B. Martinho and M.Y. Santos

© IFIP International Federation for Information Processing 2016. Published by Springer International Publishing AG 2016. All Rights Reserved. A.M. Tjoa et al. (Eds.): CONFENIS 2016, LNBIP 268, pp. 237–250, 2016. DOI: 10.1007/978-3-319-49944-4_18

1 Introduction

Nowadays, the high competitiveness between organizations forces them to invest more and more in technology. This need usually stems from frequent changes in business trends and in customers' habits [1]. Data Warehouse and On-line Analytical Processing (OLAP) technologies have followed this evolution to the present day [1], a Data Warehouse being a database that supports analytical processing and assists the decision-making process [2]. These systems are usually implemented in relational databases, which may not be able to store and process large volumes of data [3].

With recent technological advances, organizations are collecting more and more data, of different types and formats and at different speeds. When used and analyzed properly, these data have enormous potential, enabling organizations to completely change their business systems for better results [4]. Realizing the potential of this information, in an increasingly digital world, requires not only new data analysis algorithms but also a new generation of systems and distributed computing environments able to deal with the sharp increase in the volume of data and its lack of structure [5]. The challenge is to enhance the value of these data, which are sometimes in completely different formats [6]. Given the large amounts of data and the need to analyze them, the role of Data Warehousing in the context of Big Data needs to be rethought, Big Data being the ability to collect, store and process large volumes of data [4]. Big Data refers mainly to the massive amounts of unstructured data produ