Data: The Fuel for Machine Learning
Machine learning is all about data. This chapter will explore the many aspects of data with the goal of meeting the following objectives:
• Review the data explosion and three megatrends that are making this machine learning revolution possible.
• Introduce the importance of data and reprogramming yourself to think like a data scientist.
• Review different categories of data.
• Review various formats of unstructured data, including CSV, ARFF, and JSON.
• Use the OpenOffice Calc program to prepare CSV data.
• Find and use publicly available data.
• Introduce techniques for creating your own data.
• Introduce preprocessing techniques to enhance the quality of your data.
• Visualize data with JavaScript (Project).
• Implement data visualization for Android (Project).
© Mark Wickham 2018 M. Wickham, Practical Java Machine Learning, https://doi.org/10.1007/978-1-4842-3951-3_2
2.1 Megatrends

Why is the ML revolution happening now? It is not the first time. In Chapter 1, I reviewed the previous AI booms and subsequent winter periods. How do we know if this time it is for real? Three transformational megatrends are responsible for the movement.

Three megatrends have paved the way for the machine learning revolution we are now experiencing:

1) Explosion of data
2) Access to highly scalable computing resources
3) Advancement in algorithms

It is worth diving a little deeper into each of these megatrends.
Explosion of Data

You have probably seen those crazy statistics about the amount of data created on a daily basis. There is a widely quoted statistic from IBM that states that 90% of all data on the Internet today was created since 2016. Large amounts of data certainly existed prior to 2016, so the study confirms what we already knew: people and devices today are pumping out huge amounts of data at an unprecedented rate. IBM stated that more than 2.5 exabytes (2.5 billion gigabytes) of data is generated every day.

How much data is actually out there, and what are the sources of the data? It is hard to know with any degree of certainty. The data can be broken down into the following categories:
• Internet social media
• Internet non-social media
• Mobile device data
• Sensor data
• Public data
• Government data
• Private data
• Synthetic data
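To put the daily volume quoted earlier in perspective, the unit arithmetic can be sketched in a few lines of Java. This is only an illustration: the class name DataVolume and its method names are hypothetical, and the sketch assumes decimal (SI) units, where 1 exabyte equals 1 billion gigabytes, which matches the conversion given in the text.

```java
public class DataVolume {
    // Assumption: SI (decimal) units, so 1 exabyte = 1,000,000,000 gigabytes.
    static final double GIGABYTES_PER_EXABYTE = 1_000_000_000.0;
    static final int SECONDS_PER_DAY = 24 * 60 * 60;

    // Convert a volume in exabytes to gigabytes.
    public static double exabytesToGigabytes(double exabytes) {
        return exabytes * GIGABYTES_PER_EXABYTE;
    }

    // Convert a daily volume in exabytes to an average rate in gigabytes per second.
    public static double gigabytesPerSecond(double exabytesPerDay) {
        return exabytesToGigabytes(exabytesPerDay) / SECONDS_PER_DAY;
    }

    public static void main(String[] args) {
        double dailyExabytes = 2.5; // daily figure attributed to IBM in the text
        System.out.printf("%.1f EB/day = %.0f GB/day%n",
                dailyExabytes, exabytesToGigabytes(dailyExabytes));
        System.out.printf("Average rate: %.0f GB per second%n",
                gigabytesPerSecond(dailyExabytes));
    }
}
```

Under these assumptions, 2.5 exabytes per day works out to an average of roughly 29,000 gigabytes generated every second.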
Table 2-1 attempts to provide some insight into each category.
Table 2-1. Data Categories

Data Category | Observation
Internet data | There are 3.8 billion desktop global Internet users. In 2017, users watched 4 million YouTube videos per minute. There were 5 billion daily Google searches in 2017.
Social media data | There are 655 million tweets per day. There are 1 million new social media accounts per day. There are 2 billion active Facebook users. 67 million Instagram posts are added each day.
Mobile device data | 22 billion text messages were sent each day in 2017. There are 3.5 billion