Data Storage

The data store layer plays a vital role in a data analytics solution. It stores data coming from a stream layer, or from various applications through an ETL process, for further processing. This chapter discusses the role the data storage layer plays in data analytics and the various storage options available on Microsoft Azure.

Figure 5-1.  Data storage layer

Data Store

This is the transition layer in a data analytics solution. In fact, this is the stage where the transformation journey of the data begins. As shown in Figure 5-1, under the Ingest phase, data comes in from event streams for batch-mode analysis, or from different operational stores through Azure Data Factory (ADF) for further transformation.
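As a minimal sketch of the event-stream path, the Python snippet below pushes a few JSON readings into an Azure Event Hub with the azure-eventhub SDK. The connection string, hub name, and sample payloads are illustrative assumptions, not values from this chapter.

    import json
    from azure.eventhub import EventHubProducerClient, EventData

    # Placeholder connection details; replace with your own namespace values.
    CONN_STR = "Endpoint=sb://<namespace>.servicebus.windows.net/;SharedAccessKeyName=<name>;SharedAccessKey=<key>"
    HUB_NAME = "telemetry"  # hypothetical event hub name

    producer = EventHubProducerClient.from_connection_string(CONN_STR, eventhub_name=HUB_NAME)

    with producer:
        batch = producer.create_batch()
        for reading in [{"sensor": "s1", "temp": 21.4}, {"sensor": "s2", "temp": 19.8}]:
            # Each reading is serialized to JSON and added to the outgoing batch.
            batch.add(EventData(json.dumps(reading)))
        producer.send_batch(batch)

From the event hub, the stream can then be captured into Azure Blob Storage for the batch-mode analysis described above.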


Figure 5-2.  Storage layer in data analytics architecture (Data Sources → Ingest → Store → Prep & Train → Model & Serve)
To understand this phase better, let's discuss how having a storage layer saves a lot of time and money. Earlier, in typical on-premises enterprise data warehouse scenarios, ETL was done. The following steps were generally taken:

1. Connect to all operational stores through ETL tools.
2. Build centralized operational data stores.
3. Understand KPIs and build data models.
4. Build a caching layer using OLAP platforms to precalculate KPIs.
5. Consume these KPIs to build dashboards.

For the structured data stores, the centralized operational data stores were the largest cost. That's because the size of the database instance would be huge, and high-end compute machines were required to process the data. This needed a big upfront investment; that is, both software licenses and hardware had to be bought.

However, in modern data warehouse scenarios, ELT is done instead of ETL. The primary reason is that public cloud platforms provide a pay-as-you-go buying option: the data is brought onto data storage first, and then this entire data set is processed using on-demand compute nodes of the preferred technology. In Figure 5-1:

1. Streaming data is landing on Azure Blob Storage for batch processing.
2. Data from structured and unstructured applications is landing on Azure Blob Storage using ADF.
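To make step 2 concrete, here is a minimal sketch of landing a raw extract on Azure Blob Storage with the azure-storage-blob Python SDK. In practice an ADF copy activity would move this data; the account, container, and blob path below are illustrative assumptions.

    from azure.storage.blob import BlobServiceClient

    # Placeholder connection string for the storage account.
    CONN_STR = "DefaultEndpointsProtocol=https;AccountName=<account>;AccountKey=<key>;EndpointSuffix=core.windows.net"

    service = BlobServiceClient.from_connection_string(CONN_STR)

    # Hypothetical landing container and date-partitioned blob path.
    blob = service.get_blob_client(container="raw-landing",
                                   blob="sales/2020/10/01/orders.csv")

    with open("orders.csv", "rb") as data:
        # The file is stored as-is; under ELT, transformation is deferred.
        blob.upload_blob(data, overwrite=True)

Note that the data lands unchanged: under ELT, transformation waits until on-demand compute picks it up.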

In this case, the data is brought onto disks, which is a plain storage method and doesn't need any investment in software licenses or compute. Under the Prep & Train phase, data processing services such as Apache Spark in Azure Synapse Analytics can then transform this data on demand.
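As an illustration of that Prep & Train step, the following PySpark sketch, assuming a Synapse or Databricks notebook where the spark session already exists, reads the landed CSV, computes a simple aggregate, and writes a curated copy back to the lake. All paths and column names are hypothetical.

    from pyspark.sql import functions as F

    # Read the raw extract landed in the storage layer (path is illustrative).
    raw = (spark.read
           .option("header", "true")
           .option("inferSchema", "true")
           .csv("abfss://raw-landing@<account>.dfs.core.windows.net/sales/2020/10/01/"))

    # Transform on on-demand compute: total order amount per customer.
    summary = raw.groupBy("customer_id").agg(F.sum("amount").alias("total_amount"))

    # Write the curated result back to the lake for the Model & Serve phase.
    (summary.write
     .mode("overwrite")
     .parquet("abfss://curated@<account>.dfs.core.windows.net/sales-summary/"))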