Big Data

If you look at many leading database vendors' web sites, you will see that we are in the Big Data era. We explore what this actually means and, using a tutorial, review one of the key concepts of this era, that of MapReduce.


What the reader will learn:
  • that Big Data is not just about data volumes
  • that analysing the data involved is the key to the value of Big Data
  • how to use tools like Hadoop to explore large data collections and generate information from data
  • that the structured data traditionally stored in an RDBMS is not the only valuable data source
  • that a data scientist needs to understand both statistical concepts and the business they are working for

6.1 What Is Big Data?

1 Terabyte = 1024 Gigabytes
1 Petabyte = 1024 Terabytes
1 Exabyte = 1024 Petabytes
1 Zettabyte = 1024 Exabytes

And what does a zettabyte of information look like? According to a Cisco blog (http://blogs.cisco.com/news/the-dawn-of-thezettabyte-era-infographic/), a zettabyte is equivalent to about 250 billion DVDs, which would take one individual a very long time to watch. And the DVD is a good measure, since Cisco go on to predict that by 2015 the majority of global Internet traffic (61 percent) will be in some form of video. So we do mean BIG!

EMC2 suggest that 1.8 Zettabytes is the amount of data estimated to have been created in 2011. Their site has a growth ticker on it, allowing you to see the amount of data created since January 2011 (http://uk.emc.com/leadership/programs/digitaluniverse.htm). Of course these are estimates, but they help us get a feel for the scale involved. They go on to suggest that the world's information is more than doubling every two years.
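To get a feel for the arithmetic behind these comparisons, the short Python sketch below converts one zettabyte (using the binary prefixes listed above) into single-layer DVDs. The 4.7 GB DVD capacity is our own assumption for illustration; it is not a figure taken from the Cisco blog.

  # Rough scale check: how many single-layer DVDs would hold one zettabyte?
  ZETTABYTE_BYTES = 1024 ** 7       # 1 ZB in bytes, using the binary prefixes above
  DVD_BYTES = 4.7 * 10 ** 9         # assumed capacity of a single-layer DVD (4.7 GB)

  dvds = ZETTABYTE_BYTES / DVD_BYTES
  print(f"1 zettabyte is roughly {dvds / 1e9:.0f} billion DVDs")

Running this prints a figure of roughly 251 billion DVDs, consistent with the "about 250 billion" quoted by Cisco.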


At the beginning of IBM's guide to what Big Data is (http://www01.ibm.com/software/data/bigdata), they say: Every day, we create 2.5 quintillion bytes of data—so much that 90 % of the data in the world today has been created in the last two years alone. This data comes from everywhere: sensors used to gather climate information, posts to social media sites, digital pictures and videos, purchase transaction records, and cell phone GPS signals to name a few. This data is big data.

But the most obvious trap to fall into is to believe that Big Data, a new term, is only about large volumes of data. Roger Magoulas from O'Reilly Media is credited with the first use of the term 'Big Data' in the way we have come to understand it, in 2005. But as a distinct, well-defined topic, it is younger even than that. Springer's Very Large Databases (VLDB) Journal, however, has been in existence since 1992. It examines information system architectures, the impact of technological advancements on information systems, and the development of novel database applications.

Whilst early hard disk drives were relatively small, mainframes had been dealing with large volumes of data since the 1950s. So handling large amounts of data isn't new, although the scale has doubtless increased in the last few years. Perhaps it isn't really the actual size that matters, but whether or not we can meaningfully use and interact with the data?