The implementation of data storage and analytics platform for big data lake of electricity usage with spark


Chao‑Tung Yang1,2,3 · Tzu‑Yang Chen1 · Endah Kristiani4,5 · Shyhtsun Felix Wu6

Accepted: 29 October 2020
© Springer Science+Business Media, LLC, part of Springer Nature 2020

Abstract

Smart meters generate a large number of electricity usage records day by day. Traditional architectures may not properly handle such increasingly dynamic data, which calls for flexibility. Effective storage and analytics require an efficient architecture that can accommodate much greater data volumes and varieties. In this paper, we propose an architecture for data storage and analytics in a big data lake of electricity usage using Spark. Apache Sqoop was used to migrate historical data from an existing system to Apache Hive for processing. Apache Kafka was used as the input source for Spark to stream data to Apache HBase, ensuring the integrity of the streaming data. To integrate the data, we apply the Data Lake principle on top of Hive and HBase, with Apache Impala and Apache Phoenix used as the respective search engines for Hive and HBase. This work also analyzes electricity usage and power failures with Apache Spark. All visualizations in this project are presented in Apache Superset. Moreover, a usage prediction comparison is presented using the Holt-Winters algorithm.

Keywords  Big data · Data lake · Data storage · Data visualization · Electricity data
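To make the usage prediction step concrete, the following is a minimal sketch of additive Holt-Winters (triple exponential) smoothing in plain Python. It is an illustration only, not the implementation used in this work: the function name `holt_winters_additive`, the smoothing parameters `alpha`, `beta` and `gamma`, the choice of the additive variant, and the initialization from the first two seasons are all assumptions made for this example.

```python
from typing import List, Sequence


def holt_winters_additive(
    series: Sequence[float],
    season_length: int,
    alpha: float,
    beta: float,
    gamma: float,
    horizon: int,
) -> List[float]:
    """Forecast `horizon` steps ahead with additive Holt-Winters smoothing.

    Hypothetical helper for illustration; parameter names are assumptions.
    """
    m = season_length
    n = len(series)
    if m < 1 or n < 2 * m:
        raise ValueError("need at least two full seasons of data")

    # Initial trend: average per-step change between the first two seasons.
    first_mean = sum(series[:m]) / m
    trend = sum(series[m + i] - series[i] for i in range(m)) / (m * m)

    # Initial level, placed at the end of the first season (t = m - 1).
    mid = (m - 1) / 2
    level = first_mean + trend * mid

    # Initial seasonal offsets: first-season values with the trend line removed.
    seasonal = [series[i] - (first_mean + trend * (i - mid)) for i in range(m)]

    # Smoothing pass over the remaining observations.
    for t in range(m, n):
        y = series[t]
        s = seasonal[t % m]
        prev_level = level
        level = alpha * (y - s) + (1 - alpha) * (prev_level + trend)
        trend = beta * (level - prev_level) + (1 - beta) * trend
        seasonal[t % m] = gamma * (y - level) + (1 - gamma) * s

    # h-step-ahead forecast: extrapolate the trend and reuse the seasonal offset.
    return [
        level + h * trend + seasonal[(n - 1 + h) % m]
        for h in range(1, horizon + 1)
    ]
```

For hourly meter readings with a daily cycle, one might call `holt_winters_additive(readings, season_length=24, alpha=0.5, beta=0.1, gamma=0.1, horizon=24)`, where `readings` and the parameter values are placeholders. In practice the smoothing parameters would be fitted to the data rather than fixed by hand.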

This document is the result of a research project funded by the Ministry of Science and Technology (MOST), Taiwan R.O.C., under Grant Numbers 109-2221-E-029-020-, 109-2621-M-029-002- and 109-2119-M-029-001-A.

* Chao‑Tung Yang [email protected]
Extended author information available on the last page of the article

1 Introduction

The rapid growth of Big Data [14] brings significant benefits, such as gains in productivity and efficiency. Organizations that can analyze data in depth can extract a great deal of potentially useful information and make decisions more appropriately, more clearly and more quickly. Companies have historically dealt with growing information and software demands by importing data from transactional databases into data warehouses [16] and business intelligence systems [5]. As science and technology evolve, the amount of data has continued to increase and the types of data have become more complex [17, 24]. The growing use of the Internet of Things [29] and rising internet speeds have also generated huge amounts of data. In this situation, the conventional data storage architecture is not adequate for searching and analyzing large data sets. Speed must improve while data confidentiality is maintained. For this reason, we concluded that the traditional architecture would not be sufficient and that a new design would contribute to the chall