Data Storage
The data store layer plays a vital role in a data analytics solution. This layer stores data coming from a stream layer, or from various applications through an ETL process, for further processing. This chapter discusses what role the data storage layer plays in data analytics and the various storage options available on Microsoft Azure.
 
 Figure 5-1.  Data storage layer
 
Data Store

This is the transition layer in a data analytics solution; in fact, this is the stage where the transformation journey of the data starts. As shown in Figure 5-1 under the Ingest phase, data arrives either from event streams for batch-mode analysis or from different operational stores through Azure Data Factory (ADF) for further transformation.
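To make the event-stream ingest path concrete, here is a minimal sketch of publishing events to Azure Event Hub with the azure-eventhub Python SDK. The connection string, hub name, and sensor payloads are illustrative assumptions, not values from the text.

    # Minimal sketch: push sensor readings onto the stream layer (Azure Event Hub).
    from azure.eventhub import EventHubProducerClient, EventData

    CONN_STR = "<event-hub-namespace-connection-string>"  # assumption: supplied by your namespace
    EVENT_HUB_NAME = "telemetry"                          # hypothetical hub name

    producer = EventHubProducerClient.from_connection_string(
        conn_str=CONN_STR, eventhub_name=EVENT_HUB_NAME
    )

    with producer:
        # Batch events so a single send stays within the hub's size limits.
        batch = producer.create_batch()
        for reading in ({"sensor": "s1", "temp": 21.4}, {"sensor": "s2", "temp": 19.8}):
            batch.add(EventData(str(reading)))
        producer.send_batch(batch)

From here, the events can be captured to storage for the batch-mode analysis described above.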
 
 
[Figure 5-2 depicts the analytics pipeline as five phases: Data Sources (sensors and IoT, logs, media, files, business/custom apps, real-time apps), Ingest (Azure IoT Hub/Event Hub, Apache Kafka, Azure Data Factory, PolyBase), Store (Azure Data Lake), Prep & Train (Azure Databricks with SparkR, Azure Machine Learning Services for model management and deployment, Azure Kubernetes Service), and Model & Serve (Azure SQL Data Warehouse, Azure Analysis Services, Cosmos DB, Power BI).]
Figure 5-2.  Storage layer in data analytics architecture

To understand this phase better, let's discuss how having a storage layer saves a lot of time and money. Earlier, in a typical on-premises enterprise data warehouse scenario, ETL was done. The following steps were generally taken:

1. Connect to all operational stores through ETL tools.
2. Build centralized operational data stores.
3. Understand KPIs and build data models.
4. Build a caching layer using OLAP platforms to precalculate KPIs.
5. Consume these KPIs to build dashboards.

For the structured data stores, the centralized operational data stores were the largest cost. That's because the size of the database instance would be huge, and high-end compute machines were required to process the data. This needed a big upfront investment; that is, both software licenses and hardware had to be bought.

However, in modern data warehouse scenarios, ELT is done instead of ETL. The primary reason is that public cloud platforms provide a pay-as-you-go buying option: the data is first brought onto data storage, and then this entire data set is processed using on-demand compute nodes of the preferred technology. In Figure 5-1:

1. Streaming data lands on Azure Blob Storage for batch processing.
2. Data from structured and unstructured applications lands on Azure Blob Storage using ADF, as sketched after this list.
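For illustration, here is a minimal sketch of that landing step using the azure-storage-blob Python SDK. The container name, blob path, local file, and connection string are hypothetical; in practice an ADF copy activity would typically perform this landing rather than hand-written code.

    # Minimal sketch: land an operational-store extract, as-is, in a "raw" container.
    from azure.storage.blob import BlobServiceClient

    CONN_STR = "<storage-account-connection-string>"  # assumption: from your storage account

    service = BlobServiceClient.from_connection_string(CONN_STR)
    blob = service.get_blob_client(container="raw", blob="sales/2020/10/orders.csv")

    # No transformation here; with ELT, the "T" is deferred to on-demand compute.
    with open("orders.csv", "rb") as data:
        blob.upload_blob(data, overwrite=True)

Because the file is stored untouched, the only cost at this stage is inexpensive storage; compute is paid for only when processing actually runs.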
 
In this case, the data is brought onto disks, which is a plain storage method and doesn't need any investment in software licenses or compute. Under the Prep & Train phase, data processing services such as Apache Spark in Azure Synapse Analytics or Azure Databricks then process this data using on-demand compute.
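A minimal PySpark sketch of that Prep & Train step follows, assuming an on-demand Spark pool (Synapse or Databricks) reads the raw files landed earlier, transforms them, and writes a curated copy back to the lake. The abfss:// paths, container names, and column names are illustrative assumptions.

    # Minimal sketch: ELT transformation on on-demand Spark compute.
    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("elt-curate-orders").getOrCreate()

    # Load the raw extract exactly as it was landed on the storage layer.
    raw = spark.read.option("header", "true").csv(
        "abfss://raw@<storage-account>.dfs.core.windows.net/sales/2020/10/"
    )

    # Transform happens here, after the load -- the "T" in ELT.
    curated = (
        raw.withColumn(
            "order_total",
            F.col("quantity").cast("int") * F.col("unit_price").cast("double"),
        )
        .filter(F.col("order_total").isNotNull())
    )

    # Write the curated data back to the lake for the Model & Serve phase.
    curated.write.mode("overwrite").parquet(
        "abfss://curated@<storage-account>.dfs.core.windows.net/sales/2020/10/"
    )

When the job finishes, the Spark pool can be released, so compute is billed only for the duration of the processing run.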