Breakthroughs on Cross-Cutting Data Management, Data Analytics, and Applied Data Science


Silvia Chiusano1 · Tania Cerquitelli2 · Robert Wrembel3 · Daniele Quercia4,5

Accepted: 16 November 2020
© Springer Science+Business Media, LLC, part of Springer Nature 2020

1 Interuniversity Department of Regional and Urban Studies and Planning, Politecnico di Torino, Turin, Italy
2 Department of Control and Computer Engineering, Politecnico di Torino, Turin, Italy
3 Faculty of Computing and Telecommunications, Poznan University of Technology, Poznan, Poland
4 King’s College, London, UK
5 Nokia Bell Labs, London, UK

1 Introduction - Emerging Data-Centered Ecosystems

In recent years, topics such as (advanced) data analytics and data science have steadily gained popularity in both research and business (Cerquitelli et al. 2020; Romero et al. 2020). Data analytics refers to the process of analyzing data by means of On-Line Analytical Processing (OLAP) techniques and Machine Learning (ML) algorithms, which are implemented in multiple procedural and declarative languages. Data science refers to the techniques applied across the whole workflow (a.k.a. the data processing pipeline), from preparing data for analysis to the analysis itself. The workflow typically includes the following tasks: data acquisition,
transformation, integration, cleaning, pre-processing, labeling (for ML), data analysis, and sophisticated visualizations, as reported in the recent Gartner report (Gartner 2020).

Presently, huge and heterogeneous volumes of data are continuously generated by humans and machines. These data are commonly referred to as big data (Ceravolo et al. 2018; Mauro et al. 2015). They are characterized mainly by: (1) the heterogeneity of data models and structures, from fully structured to unstructured, (2) a very high speed of creation, and (3) exceptionally large volumes. Using modern data integration and storage architectures (including polystores and data lakes), based on MapReduce as a processing model, distributed file systems as storage, in-memory data processing engines, and NoSQL data models, big data can be efficiently collected, integrated, stored, managed, and analyzed for novel and more interesting data-driven applications.

Non-traditional domains of growing relevance, such as bioinformatics (Liu et al. 2009), social networking (Zadeh et al. 2019), mobile computing (Deng et al. 2020), sensor applications (Strous and Cerf 2019), smart cities (Kar et al. 2019), and gaming (Clua et al. 2018), are generating increasing quantities of data that are complex in content, heterogeneous in format, and often on the order of terabytes in size. These novel domains result in new ecosystems that include business areas such as human resources, business processes, processes of data and information, IoT, and mobile equipment, which definitely impact societies (Gupta et al.