STAT 220 Lab 3 - Exploring Categorical Data

1. The CDC Data

The Behavioral Risk Factor Surveillance System (BRFSS) is an annual telephone survey of 350,000 people in the United States. As its name implies, the BRFSS is designed to identify risk factors in the adult population and report emerging health trends. For example, respondents are asked about their diet and weekly physical activity, their HIV/AIDS status, possible tobacco use, and even their level of healthcare coverage. The BRFSS Web site (http://www.cdc.gov/brfss) contains a complete description of the survey, including the research questions that motivate the study and many interesting results derived from the data.

We will focus on a random sample of 20,000 people from the BRFSS survey conducted in 2000. While there are over 200 variables in this data set, we will work with a small subset.

You can load the data using read.tabl() as follows.

cdc = read.table("http://www.stat.uchicago.edu/~yibi/s220/labs/data/cdc.dat", header=TRUE)

If you have trouble setting working directory and loading data file, don’t hesitate to ask TA for help.

To view the names of the variables, type the command

str(cdc)

This returns the names genhlth, exerany, hlthplan, smoke100, height, weight, wtdesire, age, and gender. Each one of these variables corresponds to a question that was asked in the survey. For example, for genhlth, respondents were asked to evaluate their general health, responding either excellent, very good, good, fair or poor. The exerany variable indicates whether the respondent exercised in the past month (y) or did not (n). Likewise, hlthplan indicates whether the respondent had some form of health coverage (y) or did not (n). The smoke100 variable indicates whether the respondent had smoked at least 100 cigarettes in her lifetime. The other variables record the respondent’s height in inches, weight in pounds as well as their desired weight, wtdesire, age in years, and gender.

How many cases are there in this data set? How many variables? For each variable, identify its variable type (e.g. numerical–continuous, numerical– discrete, categorical–ordinal, categorical–nominal).

You could also look at all of the data frame at once by typing its name into the console, but that might be unwise here. We know cdc has 20,000 rows, so viewing the entire data set would mean flooding your screen. It’s better to take small peeks at the data with head, tail or the “subset” technique that we will show in a moment.

2. Summarizing and Visualizing Categorical Data

2.1 Tables

While it makes sense to describe a numerical variable like weight in terms of summary statistics like mean or sd, for categorical variables, we would just make a table of counts showing the counts of observations falling in each category of the variable The R function xtabs() can do the counting for us. For example, the 3 lines below produce the table of counts for the 3 variables: smoke, smoke100, and genhlth.

xtabs(~smoke100, data=cdc)
## smoke100
##     n     y 
## 10559  9441
xtabs(~gender, data=cdc)
## gender
##     f     m 
## 10431  9569
xtabs(~genhlth, data=cdc)
## genhlth
## excellent      fair      good      poor very good 
##      4657      2019      5675       677      6972

One can convert the table of counts to proportions by the function prop.table.

prop.table(xtabs(~smoke100, data=cdc))
## smoke100
##       n       y 
## 0.52795 0.47205

The xtabs command can be used to tabulate any number of variables you provide. For example, to cross tabulate the two variable smoke100 and genhlth, we could type

xtabs(~ smoke100 + genhlth, data=cdc)
##         genhlth
## smoke100 excellent fair good poor very good
##        n      2879  911 2782  229      3758
##        y      1778 1108 2893  448      3214

Notice genhlth is ordinal with five categories in the order:

“excellent” > “very good” > “good” > “fair” > “poor”.

However, in the two-way table above, we can see R orders the five categories alphabetically: “excellent”, “fair”, “good”, “poor”. “very good”. We can specify the order of categories using the ordered command.

cdc$genhlth = ordered(cdc$genhlth, levels=c("excellent", "very good", "good", "fair", "poor"))
xtabs(~ smoke100 + genhlth, data=cdc)
##         genhlth
## smoke100 excellent very good good fair poor
##        n      2879      3758 2782  911  229
##        y      1778      3214 2893 1108  448

Now the 5 categories are in the correct order.

The addmargins() command can add the marginal totals for the two-way table.

addmargins(xtabs(~ smoke100 + genhlth, data=cdc))
##         genhlth
## smoke100 excellent very good  good  fair  poor   Sum
##      n        2879      3758  2782   911   229 10559
##      y        1778      3214  2893  1108   448  9441
##      Sum      4657      6972  5675  2019   677 20000

The prop.table() command can produce

the overall proportions = (cell counts)/(total number of cases),
the row proportions = (cell counts)/(row total), and
the column proportions = (cell counts)/(column total).

respectively as follows:

prop.table(xtabs(~ smoke100 + genhlth, data=cdc)) # overall proportions
##         genhlth
## smoke100 excellent very good    good    fair    poor
##        n   0.14395   0.18790 0.13910 0.04555 0.01145
##        y   0.08890   0.16070 0.14465 0.05540 0.02240
prop.table(xtabs(~ smoke100 + genhlth, data=cdc),1) # row proportions
##         genhlth
## smoke100  excellent  very good       good       fair       poor
##        n 0.27265840 0.35590492 0.26347192 0.08627711 0.02168766
##        y 0.18832751 0.34043004 0.30642940 0.11736045 0.04745260
prop.table(xtabs(~ smoke100 + genhlth, data=cdc),2) # column proportions
##         genhlth
## smoke100 excellent very good      good      fair      poor
##        n 0.6182091 0.5390132 0.4902203 0.4512135 0.3382570
##        y 0.3817909 0.4609868 0.5097797 0.5487865 0.6617430

The xtabs() command can also cross-tabulate 3 categorical variables and produce 3-way tables like

xtabs(~ smoke100 + genhlth + gender, data=cdc)
## , , gender = f
## 
##         genhlth
## smoke100 excellent very good good fair poor
##        n      1522      2081 1661  585  163
##        y       837      1509 1292  550  231
## 
## , , gender = m
## 
##         genhlth
## smoke100 excellent very good good fair poor
##        n      1357      1677 1121  326   66
##        y       941      1705 1601  558  217

2.2 Bar Plots (or Bar Charts)

The ggplot command for making bar plots is geom_bar(). Note that bar heights = count in each category.

library(ggplot2)
ggplot(cdc, aes(x=smoke100))+geom_bar()

To make a segmented bar plot with bars represent smoke100 and segments represent genhlth, we need to set smoke100 as the x-variable and genhlth as the fill-variable.

ggplot(cdc, aes(fill=genhlth, x=smoke100)) + geom_bar()

To standardize the segmented bar plot above (i.e., the bar heights standardized to 1), just add position ="fill" within geom_bar().

ggplot(cdc, aes(fill=genhlth, x=smoke100)) + 
  geom_bar(position="fill")

One can change the title of the legend from genhlth to “Self-Rated General Health”.

ggplot(cdc, aes(fill=genhlth, x=smoke100)) + 
  geom_bar(position="fill") +
  labs(fill='Self-Rated\nGeneral Health')

2.3 Bar Plots for Tabular Data

Say only the tabular data is available but not the case-by-case raw data, like the Titanic data we see in class. The only data we have is the table below. We don’t have the variable values for each passenger on the ship.

Class  Died Survived
  1st   122      203
  2nd   167      118
  3rd   528      178
  Crew  673      212

We need to enter the tabular data in R ourselves. First we create a 4x2 matrix. Note that the values in the table are entered by column, and we specify there are 4 rows in the matrix. Then R automatically knows that there are 2 columns since there are 8 entries in the matrix.

titanic = matrix(c(122, 167, 528, 673, 203, 118, 178, 212), nrow=4)
titanic
##      [,1] [,2]
## [1,]  122  203
## [2,]  167  118
## [3,]  528  178
## [4,]  673  212

Next we specify the names of the row and the column variables and their categories.

dimnames(titanic) = list(Class=c("1st","2nd","3rd","Crew"),  
                         Survival=c("Died","Survived"))
titanic
##       Survival
## Class  Died Survived
##   1st   122      203
##   2nd   167      118
##   3rd   528      178
##   Crew  673      212

Next, we convert the matrix to a table using as.table() and the convert the table to a data frame using as.data.frame(). We need to convert the table to a data frame since ggplot only takes data frame but not table.

titanic = as.table(titanic)
titanic.df = as.data.frame(titanic)
titanic.df
##   Class Survival Freq
## 1   1st     Died  122
## 2   2nd     Died  167
## 3   3rd     Died  528
## 4  Crew     Died  673
## 5   1st Survived  203
## 6   2nd Survived  118
## 7   3rd Survived  178
## 8  Crew Survived  212

Note that the titanic.df data frame includes a column Freq = the counts in the table.

Now we can make the bar plot in ggplot as follows. Note that we need to specify weight=Freq.

ggplot(titanic.df, aes(x=Class, weight=Freq, fill =Survival)) +
  geom_bar()

ggplot(titanic.df, aes(x=Class, weight=Freq, fill =Survival)) +
  geom_bar()

To make the standardize the segmented bar plot, just add position ="fill" within geom_bar().

ggplot(titanic.df, aes(x=Class, weight=Freq, fill =Survival)) +
  geom_bar(position="fill")

Compute the relative frequency distribution for gender and exerany. How many males are in the sample? What proportion of the sample reports being in excellent health?

2.4 Mosaic Plots

There is not a specific function geom_XXX() in the ggplot library for making mosaic plots. Hence, we will just use the mosaicplot() function in base R. Plots produced by mosaicplot() don’t look as pretty as those by ggplot but it works.

Unlike ggplot that requires data input to be a data frame, mosaicplot() requires the data to be a table.

Recall the two-way table of smoke100 and genhlth is

xtabs(~ smoke100 + genhlth, data=cdc)
##         genhlth
## smoke100 excellent very good good fair poor
##        n      2879      3758 2782  911  229
##        y      1778      3214 2893 1108  448

To create a mosaic plot for the table above, just feed the table above to mosaicplot().

mosaicplot(xtabs(~ smoke100+genhlth, data=cdc), color=TRUE, main="")

As plots should always be properly labeled, the next line shows how to change the title, x-label and y-label of the mosaic plot. The las=1 option makes the labels perpendicular to the axes so that they don’t overlap.

mosaicplot(xtabs(~ smoke100+genhlth, data=cdc), 
           main="Smoking and General Health", 
           ylab="Self-Rated General Health", 
           xlab="Smoked 100+ Cigarettes Lifetime",
           las=2, color=TRUE)

For the titanic data, as we have entered the data as a table in R, we can feed the table to mosaicplot() to create the plot.

mosaicplot(titanic, main="", color=TRUE)

This lab was expanded for STAT 220 by Yibi Huang from a lab released from OpenIntro by Andrew Bray and Mine Çetinkaya-Rundel, which originally was based on a lab written by Mark Hansen of UCLA Statistics.

This lab can be shared or edited under a Creative Commons Attribution-ShareAlike 3.0 Unported licence.