Introduction to Data Mining for the Life Sciences

One of the major challenges for the scientific community, a challenge that has been seen in many business disciplines, is the exponential increase in data being generated by new experimental techniques and research. A single microarray experiment, for exa


Rob Sullivan
Cincinnati, OH, USA

ISBN 978-1-58829-942-0
e-ISBN 978-1-59745-290-8
DOI 10.1007/978-1-59745-290-8
Springer New York Dordrecht Heidelberg London
Library of Congress Control Number: 2011941596
© Springer Science+Business Media, LLC 2012

All rights reserved. This work may not be translated or copied in whole or in part without the written permission of the publisher (Humana Press, c/o Springer Science+Business Media, LLC, 233 Spring Street, New York, NY 10013, USA), except for brief excerpts in connection with reviews or scholarly analysis. Use in connection with any form of information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed is forbidden. The use in this publication of trade names, trademarks, service marks, and similar terms, even if they are not identified as such, is not to be taken as an expression of opinion as to whether or not they are subject to proprietary rights.

Printed on acid-free paper

Humana Press is part of Springer Science+Business Media (www.springer.com)

To my wife, without whose support, encouragement, love, and caffeine, none of this would have been possible.

Preface

A search for the word “zettabyte” will return a page that predicts that we will enter the zettabyte age around 2015. To store this amount of data on DVDs would require over 215 billion disks. The search itself (on September 20, 2011) returned 775,000 results, many relevant, many irrelevant, and many duplicates. The challenge is to elicit knowledge from all this data.

Scientific endeavors are constantly generating more and more data. Each new generation of instruments typically delivers more sensitive results, which in turn typically means that more data is generated. New instrumentation, techniques, and understanding allow us to consider approaches such as genome-wide association studies (GWAS) that were beyond our ability just a few years ago. Again, the challenge is to elicit knowledge from all this data.

But as we continue to generate this ever-increasing amount of data, we would also like to know what relationships and patterns exist within the data. This, in essence, is the goal of data mining: find the patterns within the data. This is what this book is about. Is there some quantity X that is related to some other quantity Y in a way that isn’t obvious to us? If so, what could those relationships tell us? Is there something novel, something new, that these patterns tell us? Can it advance our knowledge?

There is no obvious end in sight to the increasing generation of data. To the contrary, as tools, techniques, and instrumentation continue to become smaller, cheaper, and thus more available, data will likely continue to be generated in ever-increasing volumes. It is for this reason that automated approaches to processing data, un