Exploratory Data Analysis

Exploratory data analysis is the process by which a person manipulates data with the goal of learning about general patterns or tendencies and finding specific occurrences that deviate from the general patterns. The themes of Revelation, Resistance, Resid

  • PDF / 243,536 Bytes
  • 19 Pages / 439.37 x 666.14 pts Page_size
  • 95 Downloads / 211 Views

DOWNLOAD

REPORT


Exploratory Data Analysis

5.1 Introduction Exploratory data analysis is the process by which a person manipulates data with the goal of learning about general patterns or tendencies and finding specific occurrences that deviate from the general patterns. Much like a detective explores a crime scene, collects evidence and draws conclusions, a statistician explores data using graphical displays and suitable summaries to draw conclusions about the main message of the data. John Tukey and other statisticians have devised a collection of methods helpful in exploring data. Although the specific data analysis techniques are useful, exploratory data analysis is more than the methods – it represents an attitude or philosophy about how data should be explored. Tukey makes a clear distinction between confirmatory data analysis, where one is primarily interested in drawing inferential conclusions, and exploratory methods, where one is placing few assumptions on the distributional shape of the data and simply looking for interesting patterns. Good references on exploratory methods are Tukey [47] and Hoaglin et al. [22]. There are four general themes of exploratory data analysis, namely Revelation, Resistance, Residuals, and Reexpression, collectively called the four R’s. There is a focus on revelation, the use of suitable graphical displays in looking for patterns in data. It is desirable to use resistant methods – these methods are relatively insensitive to extreme observations that deviate from the general patterns. When we fit simple models such as a line, often the main message is not the fitted line, but rather the residuals, the deviations of the data from the line. By looking at residuals, we often learn about data patterns that are difficult to see by the initial data displays. Last, in many situations, it can be difficult to see patterns due to the particular measuring scale of the data. Often there is a need to reexpress or change the scale of the data. Well-chosen reexpressions, such as a log or square root, make it easier

J. Albert and M. Rizzo, R by Example, Use R, DOI 10.1007/978-1-4614-1365-3__5, © Springer Science+Business Media, LLC 2012

133

134

5 Exploratory Data Analysis

to see general patterns and find suitable data summaries. In the following example, we illustrate each of the four “R themes” in exploratory work.

5.2 Meet the Data Example 5.1 (Ratings of colleges). It can be difficult for an American high school student to choose a college. To help in this college selection process, U.S. News and World Report (http: //www.usnews.com) prepares a yearly guide America’s Best Colleges. The 2009 guide ranks all of the colleges in the United States with respect to a number of different criteria. The dataset college.txt contains data in the guide collected from a group of “National Universities.” These are schools in the United States that offer a range of degrees both at the undergraduate and graduate levels. The following variables are collected for each college: a. School – the name of the college b. Tier – the rank o