Specialized Machine Learning Topics

This chapter presents some technical details about data formats, streaming, optimization of computation, and distributed deployment of optimized learning algorithms. Chapter 22 provides additional optimization details. We show format conversion and working with XML, SQL, JSON, CSV, SAS and other data objects. In addition, we illustrate SQL server queries, describe protocols for managing, classifying and predicting outcomes from data streams, and demonstrate strategies for optimization, improvement of computational performance, parallel (MPI) and graphics (GPU) computing. The Internet of Things (IoT) leads to a paradigm shift of scientific inference, from static data interrogated in a batch or distributed environment to on-demand service-based Cloud computing. Here, we will demonstrate how to work with specialized data, data streams, and SQL databases, as well as develop and assess on-the-fly data modeling, classification, prediction and forecasting methods. Important examples to keep in mind throughout this chapter include high-frequency data delivered in real time in hospital ICUs (e.g., microsecond electroencephalography signals, EEGs), dynamically changing stock market data (e.g., Dow Jones Industrial Average Index, DJI), and weather patterns. We will present (1) format conversion of XML, SQL, JSON, CSV, SAS and other data objects, (2) visualization of bioinformatics and network data, (3) protocols for managing, classifying and predicting outcomes from data streams, (4) strategies for optimization, improvement of computational performance, parallel (MPI) and graphics (GPU) computing, and (5) processing of very large datasets.

16.1 Working with Specialized Data and Databases

Unlike the case studies we saw in the previous chapters, real-world data are not always nicely formatted, e.g., as CSV files. We must collect, arrange, wrangle, and harmonize scattered information to generate computable data objects that can be further processed by various techniques. Data wrangling and preprocessing may take over 80% of the time researchers spend interrogating complex multi-source data archives. The following procedures will enhance your skills in collecting and handling heterogeneous real-world data. Multiple examples of handling long-and-wide data, messy and tidy data, and data cleaning strategies can be found in the JSS article "Tidy Data" by Hadley Wickham.

© Ivo D. Dinov 2018. I. D. Dinov, Data Science and Predictive Analytics, https://doi.org/10.1007/978-3-319-72347-1_16

16.1.1 Data Format Conversion

The R package rio imports and exports various types of file formats, e.g., tab-separated (.tsv), comma-separated (.csv), JSON (.json), Stata (.dta), SPSS (.sav and .por), Microsoft Excel (.xls and .xlsx), Weka (.arff), and SAS (.sas7bdat and .xpt). rio provides three important functions: import(), export(), and convert(). They are intuitive, easy to understand, and efficient to execute. Take Stata (.dta) files as an example. First