High-performance IoT streaming data prediction system using Spark: a case study of air pollution


S.I. : GREEN AND HUMAN INFORMATION TECHNOLOGY 2019

Ho-Yong Jin1 • Eun-Sung Jung1 • Duckki Lee2

Received: 5 May 2019 / Accepted: 10 December 2019
© Springer-Verlag London Ltd., part of Springer Nature 2019

Abstract Internet-of-Things (IoT) devices are becoming prevalent, and some of them, such as sensors, generate continuous time-series data, i.e., streaming data. These IoT streaming data are one source of Big Data, and they require careful consideration for efficient data processing and analysis. Deep learning is emerging as a solution to IoT streaming data analytics. However, a persistent problem in deep learning is that training neural networks takes a long time. In this paper, we propose a high-performance IoT streaming data prediction system that improves the learning speed and predicts in real time. We show the efficacy of the system through a case study of air pollution. The experimental results show that the modified LSTM autoencoder model outperforms a generic LSTM model. We noticed that achieving the best performance requires optimizing many parameters, including the learning rate, the number of epochs, the memory cell size, the input timestep size, and the number of features/predictors. In that regard, we show that high-performance data learning/prediction frameworks (e.g., Spark, Dist-Keras, and Hadoop) are essential to rapidly fine-tune a model for training and testing before real deployment of the model as data accumulate.

Keywords Long Short-Term Memory (LSTM) · Distributed deep learning · Distributed Keras (Dist-Keras) · Apache Spark

1 Introduction

Internet-of-Things (IoT) devices are becoming prevalent, and some of them, such as sensors, generate continuous time-series data, i.e., streaming data. These IoT streaming data are one source of Big Data [1], and they require careful consideration for efficient data processing and analysis. In this paper, we present how distributed systems can be used for such purposes.

Eun-Sung Jung (corresponding author) · Ho-Yong Jin · Duckki Lee

1 Department of Software and Communications Engineering, Hongik University, Sejong, South Korea

2 Department of Smart Software, Yonam Institute of Technology, Jinju, South Korea

The system uses a distributed deep learning framework called Distributed Keras (Dist-Keras) [2] and Long Short-Term Memory (LSTM) [3] units, which are suitable for time-series data prediction. Dist-Keras is a distributed deep learning framework built on top of Apache Spark and Keras, with a focus on "state-of-the-art" distributed optimization algorithms. Most of the distributed optimizers Dist-Keras provides are based on data-parallel methods. A data-parallel method is a learning paradigm in which multiple replicas of a single model are used to optimize a single objective. Spark Streaming is an extension of the core Spark API that enables scalable, high-throughput,
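The data-parallel paradigm described above can be illustrated with a minimal sketch. It does not use the Dist-Keras API and is not one of its optimizers: it is a simplified, synchronous variant written in plain NumPy, in which simulated workers each hold a replica of the same linear model, compute a gradient on their own data shard, and the averaged gradient updates the shared weights. The function names, the toy regression data, and the hyperparameters (`n_workers`, `lr`, `epochs`) are all assumptions made for illustration.

```python
import numpy as np


def local_gradient(w, X, y):
    """Mean-squared-error gradient computed by one replica on its own shard."""
    residual = X @ w - y
    return 2.0 * X.T @ residual / len(y)


def data_parallel_sgd(X, y, n_workers=4, lr=0.1, epochs=200):
    """Synchronous data-parallel gradient descent (illustrative only).

    Every simulated worker sees the same weights `w`, computes a gradient
    on its own shard, and the averaged gradient is applied to the shared
    model, so all replicas optimize one objective.
    """
    X_shards = np.array_split(X, n_workers)
    y_shards = np.array_split(y, n_workers)
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        grads = [local_gradient(w, Xs, ys)
                 for Xs, ys in zip(X_shards, y_shards)]
        w -= lr * np.mean(grads, axis=0)  # aggregate step on the shared model
    return w


# Hypothetical toy data: a noise-free linear target.
rng = np.random.default_rng(0)
X = rng.normal(size=(400, 3))
true_w = np.array([1.5, -2.0, 0.5])
y = X @ true_w

w = data_parallel_sgd(X, y)
print(np.round(w, 3))
```

Because the shards here are of equal size, averaging the per-shard gradients reproduces the full-batch gradient, so the replicas jointly minimize the same loss that a single model would. In a real cluster the per-shard gradients would be computed on separate machines, and practical distributed optimizers often relax the synchronization used in this sketch.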
