STAT 220 Lab 2 - Exploring Categorical Data

In this lab, we will again make use of the lattice and tidyverse packages:

library(lattice)
library(tidyverse)

1. The CDC Data

The Behavioral Risk Factor Surveillance System (BRFSS) is an annual telephone survey of 350,000 people in the United States. As its name implies, the BRFSS is designed to identify risk factors in the adult population and report emerging health trends. For example, respondents are asked about their diet and weekly physical activity, their HIV/AIDS status, possible tobacco use, and even their level of healthcare coverage. The BRFSS Web site (http://www.cdc.gov/brfss) contains a complete description of the survey, including the research questions that motivate the study and many interesting results derived from the data.

We will focus on a random sample of 20,000 people from the BRFSS survey conducted in 2000. While there are over 200 variables in this data set, we will work with a small subset.

The data file can be downloaded at [https://www.openintro.org/book/statdata/cdc.csv]. Then set the working directory as in Lab01, and load the data using read.table as follows.

cdc = read.table("cdc.dat", header=TRUE)

If you have trouble setting working directory and loading data file, don't hesitate to ask TA for help. To view the names of the variables, type the command

str(cdc)

This returns the names genhlth, exerany, hlthplan, smoke100, height, weight, wtdesire, age, and gender. Each one of these variables corresponds to a question that was asked in the survey. For example, for genhlth, respondents were asked to evaluate their general health, responding either excellent, very good, good, fair or poor. The exerany variable indicates whether the respondent exercised in the past month (y) or did not (n). Likewise, hlthplan indicates whether the respondent had some form of health coverage (y) or did not (n). The smoke100 variable indicates whether the respondent had smoked at least 100 cigarettes in her lifetime. The other variables record the respondent's height in inches, weight in pounds as well as their desired weight, wtdesire, age in years, and gender.

How many cases are there in this data set? How many variables? For each variable, identify its variable type (e.g. numerical--continuous, numerical-- discrete, categorical--ordinal, categorical--nominal).

You could also look at all of the data frame at once by typing its name into the console, but that might be unwise here. We know cdc has 20,000 rows, so viewing the entire data set would mean flooding your screen. It's better to take small peeks at the data with head, tail or the subsetting techniques that you'll learn in a moment.

2. Summarizing and Visualizing Categorical Data

2.1 Tables

While it makes sense to describe a numerical variable like weight in terms of summary statistics like mean or sd, what about categorical data? We would instead consider the sample frequency or relative frequency distribution. The function tally does this for you by counting the number of times each kind of response was given. For example, to see the number of people who have smoked 100 cigarettes in their lifetime, type

table(cdc$smoke100)

## 
##     n     y 
## 10559  9441

One can convert the table of counts to proportions by the funstion prop.table.

prop.table(table(cdc$smoke100))

## 
##       n       y 
## 0.52795 0.47205

The table command can be used to tabulate any number of variables that you provide. For example, to cross tablulate the two variable smoke100 and genhlth, we could type

table(cdc$smoke100, cdc$genhlth)

##    
##     excellent fair good poor very good
##   n      2879  911 2782  229      3758
##   y      1778 1108 2893  448      3214

We can add marginal totals to the two-way table by the addmargins command.

addmargins(table(cdc$smoke100, cdc$genhlth))

##      
##       excellent  fair  good  poor very good   Sum
##   n        2879   911  2782   229      3758 10559
##   y        1778  1108  2893   448      3214  9441
##   Sum      4657  2019  5675   677      6972 20000

If one wants to view the proportions instead of counts, one can use:

prop.table(table(cdc$smoke100, cdc$genhlth))       # overall proportions

##    
##     excellent    fair    good    poor very good
##   n   0.14395 0.04555 0.13910 0.01145   0.18790
##   y   0.08890 0.05540 0.14465 0.02240   0.16070

prop.table(table(cdc$smoke100, cdc$genhlth),1)     # row proportions

##    
##      excellent       fair       good       poor  very good
##   n 0.27265840 0.08627711 0.26347192 0.02168766 0.35590492
##   y 0.18832751 0.11736045 0.30642940 0.04745260 0.34043004

prop.table(table(cdc$smoke100, cdc$genhlth),2)     # column proportions

##    
##     excellent      fair      good      poor very good
##   n 0.6182091 0.4512135 0.4902203 0.3382570 0.5390132
##   y 0.3817909 0.5487865 0.5097797 0.6617430 0.4609868

The first line gives the overall proportions(dividing the counts by the total number of cases), the second lines gives the row proportions (dividing the counts by the row totals), the last lines gives the column proportions (dividing the counts by the column totals), Likewise, three-way tables cross-tabulating 3 categorical variables can be created as.

table(cdc$smoke100, cdc$genhlth, cdc$gender)

## , ,  = f
## 
##    
##     excellent fair good poor very good
##   n      1522  585 1661  163      2081
##   y       837  550 1292  231      1509
## 
## , ,  = m
## 
##    
##     excellent fair good poor very good
##   n      1357  326 1121   66      1677
##   y       941  558 1601  217      1705

2.2 Bar Plots (or Bar Charts)

Next, we make a bar chart of the entries in the table by putting the table inside the barchart command.

barchart(table(cdc$smoke100), horizontal=FALSE)

Notice what we've done here! We've computed the table of cdc$smoke100 and then immediately applied the graphical function, barchart. This is an important idea: R commands can be nested. You could also break this into two steps by typing the following:

smoke = table(cdc$smoke100)
barchart(smoke, horizontal=FALSE)

Here, we've made a new object, a table, called smoke (the contents of which we can see by typing smoke into the console) and then used it in as the input for barchart. The special symbol = performs an assignment, taking the output of one line of code and saving it into an object in your workspace.
This is another important idea that we'll return to later.

To create a segmented bar chart representing a two-way table, we can create the two-way table first, and then barchart the table.

smoke.hlth = table(cdc$smoke100, cdc$genhlth)
barchart(smoke.hlth, horizontal=FALSE)

To get the standardized version of the barchart above (standardized the heights of all the bars), we can first convert the table to proportions and then make barchart it.

barchart(prop.table(smoke.hlth,1), horizontal=FALSE)

In the barcharts above, it's better to add a legend for the colors used. One can add a legend using the auto.key option.

barchart(prop.table(smoke.hlth,1), horizontal=FALSE, auto.key=TRUE)

The default location of the legend is on the top. To move the legend to the right, and to add an title to it, one can do the following

barchart(prop.table(smoke.hlth,1), horizontal=FALSE, 
         auto.key=list(space="right", title="General Health"))

From the above, we can see the size of the legend title is too big. We can adjust the size of the legend title as follows.

barchart(prop.table(smoke.hlth,1), horizontal=FALSE, 
         auto.key=list(space="right", title="General Health",cex.title=1.2))

Compute the relative frequency distribution for gender and exerany. How many males are in the sample? What proportion of the sample reports being in excellent health?

2.3 Mosaic Plots

To investigate whether exercise has any effect on what people feel about their general health, we may look at the two-way table

exer.hlth = table(cdc$exerany, cdc$genhlth)

To create a mosaic plot of this table, we would enter the following command.

mosaicplot(exer.hlth, color=TRUE)

As a general rule, plots should always be properly labelled. The next line shows how to change the title, x-label and y-label of the mosaic plot. The `las=1' option makes the labels perpendicular to the axes so that they don't overlap.

mosaicplot(exer.hlth, 
           main="Exercise and general health", 
           ylab="Self-Rated General Health", 
           xlab="Exercise in the Past Month",
           las=2, color=TRUE)

What does the mosaic plot reveal about smoking habits and general health status?

3.Subsetting

It's often useful to extract all individuals (cases) in a data set that have specific characteristics. We accomplish this through conditioning commands. First, consider expressions like

cdc$gender == "m"

cdc$age > 65

These commands produce a series of TRUE and FALSE values. There is one value for each respondent, where TRUE indicates that the person was male (via the first command) or older than 65 (second command).

Suppose we want to extract just the data for the men in the sample, or just for those over 30. We can use the R function subset to do that for us. For example, the command

mdata = subset(cdc, gender == "male")

will create a new data set called mdata that contains only the men from the cdc data set. In addition to finding it in your workspace alongside its dimensions, you can take a peek at the first several rows as usual

head(mdata)

This new data set contains all the same variables but just under half the rows. It is also possible to tell R to keep only specific variables, which is a topic we'll discuss in a future lab. For now, the important thing is that we can carve up the data based on values of one or more variables.

As an aside, you can use several of these conditions together with & and |. The & is read "and" so that

m_and_over65 = subset(cdc, gender == "male" & age > 65)

will give you the data for men over the age of 65. The | character is read "or" so that

m_or_over65 = subset(cdc, gender == "male" | age > 65)

will take people who are men or over the age of 65 (who are males eligible to Medicare). In principle, you may use as many "and" and "or" clauses as you like when forming a subset, and we might be interested in their self-rated health status and exercise activity.

exer.hlth.over65 = exer.hlth = table(m_or_over65$exerany, m_or_over65$genhlth)
mosaicplot(exer.hlth.over65, 
           main="Male Over 65", 
           ylab="Self-Rated General Health", 
           xlab="Exercise in the Past Month",
           las=2, 
           color=TRUE)

Create a new object called under23_and_smoke that contains all observations of respondents under the age of 23 that have smoked 100 cigarettes in their lifetime. Write the command you used to create the new object as the answer to this exercise. What percentage of them are males? What percentage of them exercise at least once in the past month?

4. Ending

We've found an interesting association between smoking and gender. We've also picked up essential computing tools -- summary statistics, subsetting, and plots -- that will serve us well throughout this course.

5. On Your Own

Let's consider a new variable: the difference between desired weight (wtdesire) and current weight (weight). Create this new variable by subtracting the two columns in the data frame and assigning them to a new object called wdiff.
What type of data is wdiff? If an observation wdiff is 0, what does this mean about the person's weight and desired weight. What if wdiff is positive or negative?
Describe the distribution of wdiff in terms of its center, shape, and spread, including any plots you use. What does this tell us about how people feel about their current weight?
Using numerical summaries and a side-by-side box plot, determine if men tend to view their weight differently than women.
Find the mean and standard deviation of weight and determine what proportion of the weights are within one standard deviation of the mean.

Disclaimer

This lab was expanded for STAT 220 by Yibi Huang from a lab released from OpenIntro by Andrew Bray and Mine Çetinkaya-Rundel, which originally was based on a lab written by Mark Hansen of UCLA Statistics. This lab can be shared or edited under a Creative Commons Attribution-ShareAlike 3.0 Unported licence.