Data: The Fuel for Machine Learning

Machine learning is all about data. This chapter will explore the many aspects of data with the goal of meeting the following objectives:

• Review the data explosion and three megatrends that are making this machine learning revolution possible.
• Introduce the importance of data and of reprogramming yourself to think like a data scientist.
• Review different categories of data.
• Review various data formats, including CSV, ARFF, and JSON.
• Use the OpenOffice Calc program to prepare CSV data.
• Find and use publicly available data.
• Introduce techniques for creating your own data.
• Introduce preprocessing techniques to enhance the quality of your data.
• Visualize data with JavaScript (Project).
• Implement data visualization for Android (Project).

© Mark Wickham 2018 M. Wickham, Practical Java Machine Learning, https://doi.org/10.1007/978-1-4842-3951-3_2

Chapter 2

2.1 Megatrends

Why is the ML revolution happening now? It is not the first time. In Chapter 1, I reviewed the previous AI booms and subsequent winter periods. How do we know if this time it is for real? Three transformational megatrends are responsible for the movement.

Three megatrends have paved the way for the machine learning revolution we are now experiencing:

1) Explosion of data
2) Access to highly scalable computing resources
3) Advancement in algorithms

It is worth diving a little deeper into each of these megatrends.

Explosion of Data

You have probably seen those crazy statistics about the amount of data created on a daily basis. There is a widely quoted statistic from IBM that states that 90% of all data on the Internet today was created since 2016. Large amounts of data certainly existed prior to 2016, so the study confirms what we already knew: people and devices today are pumping out huge amounts of data at an unprecedented rate. IBM stated that more than 2.5 exabytes (2.5 billion gigabytes) of data is generated every day.

How much data is actually out there, and what are the sources of the data? It is hard to know with any degree of certainty. The data can be broken down into the following categories:

• Internet social media
• Internet non-social media
• Mobile device data
• Sensor data
• Public data
• Government data
• Private data
• Synthetic data


Table 2-1 attempts to provide some insight into each category.

Table 2-1. Data Categories

Data Category | Observation
Internet data | There are 3.8 billion global desktop Internet users. In 2017, users watched 4 million YouTube videos per minute. There were 5 billion daily Google searches in 2017.
Social media data | There are 655 million tweets per day. There are 1 million new social media accounts per day. There are 2 billion active Facebook users. 67 million Instagram posts are added each day.
Mobile device data | 22 billion text messages were sent each day in 2017. There are 3.5 billion
