Data Storage
The data store layer plays a vital role in a data analytics solution. It holds data arriving from the stream layer, as well as data brought in from various applications through an ETL process, for further processing. This chapter discusses the role the data storage layer plays in data analytics and the various storage options available on Microsoft Azure.
Figure 5-1. Data storage layer
Data Store

This is the transition layer in a data analytics solution. In fact, this is the stage where the transformation journey of the data begins. As shown in Figure 5-1 under the Ingest phase, data is coming from event streams for batch-mode analysis, or from different operational stores through Azure Data Factory (ADF), for further transformation.
© Harsh Chawla and Pankaj Khattar 2020 H. Chawla and P. Khattar, Data Lake Analytics on Microsoft Azure, https://doi.org/10.1007/978-1-4842-6252-8_5
[Figure 5-2 depicts the data analytics architecture as a pipeline of phases: data sources (sensors and IoT, logs, media, and files as unstructured data; business/custom apps as structured data) feed the Ingest phase (Azure IoT Hub/Event Hub, Apache Kafka, Azure Data Factory, PolyBase); the Store phase uses Azure Data Lake; the Prep & Train phase uses Azure Databricks (including SparkR) and Azure Machine Learning Services, with model management and model deployment on Azure Kubernetes Service; the Model & Serve phase uses Azure SQL Data Warehouse, Azure Analysis Services, and Power BI, with Cosmos DB serving real-time apps.]
Figure 5-2. Storage layer in data analytics architecture

To understand this phase better, let's discuss how having a storage layer saves a lot of time and money. Earlier, in typical on-premises enterprise data warehouse scenarios, ETL was done. The following steps were generally taken:

1. Connect to all operational stores through ETL tools.
2. Build centralized operational data stores.
3. Understand KPIs and build data models.
4. Build a caching layer using OLAP platforms to precalculate KPIs.
5. Consume these KPIs to build dashboards.

For structured data stores, the centralized operational data stores were the largest cost. That's because the database instance would be huge, and high-end compute machines were required to process the data. This needed a big upfront investment; that is, both software licenses and hardware had to be bought.

However, in modern data warehouse scenarios, ELT is done instead of ETL. The primary reason is that public cloud platforms provide a pay-as-you-go purchasing option: the data is first brought onto data storage, and then this entire data set is processed using on-demand compute nodes of the preferred technology. In Figure 5-1:

1. Streaming data lands on Azure Blob Storage for batch processing.
2. Data from structured and unstructured applications lands on Azure Blob Storage via ADF.
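The ELT ordering described above — land raw data first, transform later with on-demand compute — can be sketched in miniature. The following is a hypothetical Python illustration only: a local directory stands in for Azure Blob Storage, and the zone names (`raw`, `curated`) and file names are invented for this sketch. In a real pipeline, ADF would land the files and a cloud compute service would run the transformation.

```python
import csv
import json
from pathlib import Path


def land_raw(storage_root: Path, source_name: str, payload: str) -> Path:
    """ELT step 1 (Extract + Load): copy source data into the storage
    layer unmodified. No parsing, no cleanup -- land it as-is."""
    raw_zone = storage_root / "raw" / source_name
    raw_zone.mkdir(parents=True, exist_ok=True)
    target = raw_zone / "extract.csv"
    target.write_text(payload)
    return target


def transform_on_demand(raw_file: Path, storage_root: Path) -> Path:
    """ELT step 2 (Transform): later, on-demand compute reads the raw
    file and writes a curated version (here: CSV rows -> JSON records)."""
    curated_zone = storage_root / "curated"
    curated_zone.mkdir(parents=True, exist_ok=True)
    with raw_file.open() as f:
        records = list(csv.DictReader(f))
    out = curated_zone / (raw_file.stem + ".json")
    out.write_text(json.dumps(records))
    return out
```

The point of the split is cost: the landing step needs only cheap storage, while the (expensive) transform step can run on compute that is paid for only while it runs.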
In this case, the data is brought onto disks, which is a plain storage method and doesn't need any investment in software licenses or compute. Under the Prep & Train phase, data is processed using services such as Apache Spark in Azure Synapse Analytics.
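As a toy stand-in for the kind of batch job the prep-and-train phase runs, the sketch below scans JSON-lines files that have landed in the storage layer and aggregates a simple per-device event count. Everything here is illustrative: the field name `device_id`, the `.jsonl` layout, and the metric are assumptions, and a real workload would express the same group-and-count with Apache Spark APIs over data in Azure Blob Storage or Azure Data Lake rather than plain Python.

```python
import json
from collections import defaultdict
from pathlib import Path
from typing import Dict


def aggregate_events(landing_dir: Path) -> Dict[str, int]:
    """Batch step: read every landed JSON-lines file and count events
    per device_id -- the group-by/count a Spark job would run at scale."""
    counts: Dict[str, int] = defaultdict(int)
    for part in sorted(landing_dir.glob("*.jsonl")):
        with part.open() as f:
            for line in f:
                if not line.strip():
                    continue  # skip blank lines in landed files
                event = json.loads(line)
                counts[event["device_id"]] += 1
    return dict(counts)
```

Because the raw files stay in storage, this aggregation can be re-run at any time with different logic, which is exactly the flexibility the ELT pattern buys.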