Categorical data

R commands are introduced for organizing, summarizing, and displaying categorical data.




3.1 Introduction

In this chapter, we introduce R commands for organizing, summarizing, and displaying categorical data. We will see that categorical data are conveniently expressed by a special R object called a factor. The table function is useful in constructing frequency tables, and the plot and barplot functions are useful in displaying tabulated output. The chi-square goodness-of-fit test for assessing whether a vector of counts follows a specified discrete distribution is implemented in the chisq.test function. The cut function is helpful in converting a numerical variable into a categorical variable using a vector of dividing values. The table function with several variables can be used to construct a two-way frequency table, and the prop.table function can be used to compute conditional proportions to explore the association pattern in the table. Side-by-side and segmented bar charts of conditional probabilities are constructed by the barplot function. The hypothesis of independence in a two-way table can be tested by the chisq.test function. A special graphical display, mosaicplot, can be used to display the counts in a two-way frequency table and, in addition, show the pattern of residuals from a fit of independence.

3.1.1 Tabulating and plotting categorical data

Example 3.1 (Flipping a coin). To begin, suppose we flip a coin 20 times and observe the sequence H, T, H, H, T, H, H, T, H, H, T, T, H, T, T, T, H, H, H, T. We are interested in tabulating these outcomes, finding the proportions of heads and tails, and graphing the proportions.

J. Albert and M. Rizzo, R by Example, Use R, DOI 10.1007/978-1-4614-1365-3__3, © Springer Science+Business Media, LLC 2012


A convenient way of entering these data in the R console is with the scan function. The what="character" argument indicates that character-type data will be entered. By default, this function assumes that “white space” separates the individual entries. We finish entering the outcomes by pressing the Enter key on a blank line. The character data are placed in the vector tosses.

> tosses = scan(what="character")
1: H T H H T H H T H H T T
13: H T T T
17: H H H T
21:
Read 20 items
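The interactive session above cannot be replayed from a script, so a non-interactive variant may be handy. The following is a minimal sketch, assuming the same 20 outcomes are supplied as a single string through the text argument of scan; the variable name flips is hypothetical and does not appear in the original text.

```r
# Hypothetical helper string holding the 20 outcomes from Example 3.1
flips = "H T H H T H H T H H T T H T T T H H H T"

# scan() splits the string on white space, just as it does for typed input;
# quiet = TRUE suppresses the "Read 20 items" message
tosses = scan(text = flips, what = "character", quiet = TRUE)

length(tosses)   # number of outcomes read
head(tosses)     # first few outcomes
```

The resulting tosses vector should be interchangeable with the one entered at the console.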

We can tabulate these coin-flipping data using the table function. The output is a table of frequencies of the different outcomes, H and T.

> table(tosses)
tosses
 H  T
11  9

We see that 11 heads and 9 tails were flipped. To summarize these counts, one typically computes proportions or relative frequencies. One can obtain these proportions by simply dividing the table frequencies by the number of flips using the length function.

> table(tosses) / length(tosses)
tosses
   H    T
0.55 0.45
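As a cross-check on the division above, the prop.table function named in the chapter introduction should give the same relative frequencies for a one-way table. This is a minimal sketch rather than the authors' own code; the names counts and props are assumptions introduced for illustration, and the data are re-entered so the block runs on its own.

```r
# Re-create the coin flips from Example 3.1 without interactive input
tosses = scan(text = "H T H H T H H T H H T T H T T T H H H T",
              what = "character", quiet = TRUE)

counts = table(tosses)        # frequencies of H and T
props  = prop.table(counts)   # each count divided by the total count

print(props)

# Compare with the manual calculation used in the text
all.equal(as.vector(props), as.vector(counts / length(tosses)))
```

With no margin argument, prop.table divides every cell by the grand total, which is why it agrees with dividing by length(tosses) here.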

There are several ways of displaying these data. First, we save the relative frequency output in the variable prop.tosses:

> prop.tosses = table(tosses) / length(tosses)

Using the plot method, we obtain a line graph displayed in Figure 3.1(a).

> plot(prop.tosses)
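For readers who want to reproduce the display from a script, the sketch below draws the proportions with both plot and barplot in a two-panel layout. The axis labels, panel titles, and the use of a null PDF device (so that no file or window is created) are assumptions made for illustration, not settings taken from the book.

```r
# Re-create the proportions from Example 3.1
tosses = scan(text = "H T H H T H H T H H T T H T T T H H H T",
              what = "character", quiet = TRUE)
prop.tosses = table(tosses) / length(tosses)

pdf(NULL)                      # null device: nothing is written to disk
par(mfrow = c(1, 2))           # two panels side by side

# For a one-way table, plot() draws one vertical line per category
plot(prop.tosses, ylab = "Proportion", main = "plot")

# barplot() draws one bar per category from the same table
barplot(prop.tosses, ylab = "Proportion", main = "barplot")

dev.off()                      # close the device
```

Replacing pdf(NULL) with a call such as pdf("tosses.pdf") (a hypothetical file name) would save the two panels to a file instead.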

Alternately, one can display the proportions by a bar graph using the barplot function shown in Fig