The Behavioral Risk Factor Surveillance System (BRFSS) is an annual telephone survey of 350,000 people in the United States. As its name implies, the BRFSS is designed to identify risk factors in the adult population and report emerging health trends. For example, respondents are asked about their diet and weekly physical activity, their HIV/AIDS status, possible tobacco use, and even their level of healthcare coverage. The BRFSS Web site (http://www.cdc.gov/brfss) contains a complete description of the survey, including the research questions that motivate the study and many interesting results derived from the data.
We will focus on a random sample of 20,000 people from the BRFSS survey conducted in 2000. While there are over 200 variables in this data set, we will work with a small subset.
You can load the data using read.tabl()
as follows.
= read.table("http://www.stat.uchicago.edu/~yibi/s220/labs/data/cdc.dat", header=TRUE) cdc
If you have trouble setting working directory and loading data file, don’t hesitate to ask TA for help.
To view the names of the variables, type the command
str(cdc)
This returns the names genhlth
, exerany
,
hlthplan
, smoke100
, height
,
weight
, wtdesire
, age
, and
gender
. Each one of these variables corresponds to a
question that was asked in the survey. For example, for
genhlth
, respondents were asked to evaluate their general
health, responding either excellent, very good, good, fair or poor. The
exerany
variable indicates whether the respondent exercised
in the past month (y) or did not (n). Likewise, hlthplan
indicates whether the respondent had some form of health coverage (y) or
did not (n). The smoke100
variable indicates whether the
respondent had smoked at least 100 cigarettes in her lifetime. The other
variables record the respondent’s height
in inches,
weight
in pounds as well as their desired weight,
wtdesire
, age
in years, and
gender
.
You could also look at all of the data frame at once by
typing its name into the console, but that might be unwise here. We know
cdc
has 20,000 rows, so viewing the entire data set would
mean flooding your screen. It’s better to take small peeks at the data
with head
, tail
or the “subset” technique that
we will show in a moment.
While it makes sense to describe a numerical variable like
weight
in terms of summary statistics like mean or sd, for
categorical variables, we would just make a table of counts showing the
counts of observations falling in each category of the variable The R
function xtabs()
can do the counting for us. For example,
the 3 lines below produce the table of counts for the 3 variables:
smoke
, smoke100
, and genhlth
.
xtabs(~smoke100, data=cdc)
## smoke100
## n y
## 10559 9441
xtabs(~gender, data=cdc)
## gender
## f m
## 10431 9569
xtabs(~genhlth, data=cdc)
## genhlth
## excellent fair good poor very good
## 4657 2019 5675 677 6972
One can convert the table of counts to proportions by the function
prop.table.
prop.table(xtabs(~smoke100, data=cdc))
## smoke100
## n y
## 0.52795 0.47205
The xtabs
command can be used to tabulate any number of
variables you provide. For example, to cross tabulate the two variable
smoke100
and genhlth
, we could type
xtabs(~ smoke100 + genhlth, data=cdc)
## genhlth
## smoke100 excellent fair good poor very good
## n 2879 911 2782 229 3758
## y 1778 1108 2893 448 3214
Notice genhlth
is ordinal with five
categories in the order:
“excellent” > “very good” > “good” > “fair” > “poor”.
However, in the two-way table above, we can see R orders the five
categories alphabetically: “excellent”, “fair”, “good”, “poor”. “very
good”. We can specify the order of categories using the
ordered
command.
$genhlth = ordered(cdc$genhlth, levels=c("excellent", "very good", "good", "fair", "poor"))
cdcxtabs(~ smoke100 + genhlth, data=cdc)
## genhlth
## smoke100 excellent very good good fair poor
## n 2879 3758 2782 911 229
## y 1778 3214 2893 1108 448
Now the 5 categories are in the correct order.
The addmargins()
command can add the marginal totals for
the two-way table.
addmargins(xtabs(~ smoke100 + genhlth, data=cdc))
## genhlth
## smoke100 excellent very good good fair poor Sum
## n 2879 3758 2782 911 229 10559
## y 1778 3214 2893 1108 448 9441
## Sum 4657 6972 5675 2019 677 20000
The prop.table()
command can produce
respectively as follows:
prop.table(xtabs(~ smoke100 + genhlth, data=cdc)) # overall proportions
## genhlth
## smoke100 excellent very good good fair poor
## n 0.14395 0.18790 0.13910 0.04555 0.01145
## y 0.08890 0.16070 0.14465 0.05540 0.02240
prop.table(xtabs(~ smoke100 + genhlth, data=cdc),1) # row proportions
## genhlth
## smoke100 excellent very good good fair poor
## n 0.27265840 0.35590492 0.26347192 0.08627711 0.02168766
## y 0.18832751 0.34043004 0.30642940 0.11736045 0.04745260
prop.table(xtabs(~ smoke100 + genhlth, data=cdc),2) # column proportions
## genhlth
## smoke100 excellent very good good fair poor
## n 0.6182091 0.5390132 0.4902203 0.4512135 0.3382570
## y 0.3817909 0.4609868 0.5097797 0.5487865 0.6617430
The xtabs()
command can also cross-tabulate 3
categorical variables and produce 3-way tables like
xtabs(~ smoke100 + genhlth + gender, data=cdc)
## , , gender = f
##
## genhlth
## smoke100 excellent very good good fair poor
## n 1522 2081 1661 585 163
## y 837 1509 1292 550 231
##
## , , gender = m
##
## genhlth
## smoke100 excellent very good good fair poor
## n 1357 1677 1121 326 66
## y 941 1705 1601 558 217
The ggplot
command for making bar plots is
geom_bar()
. Note that bar heights = count in each
category.
library(ggplot2)
ggplot(cdc, aes(x=smoke100))+geom_bar()
To make a segmented bar plot with bars represent
smoke100
and segments represent genhlth
, we
need to set smoke100
as the x-variable and
genhlth
as the fill-variable.
ggplot(cdc, aes(fill=genhlth, x=smoke100)) + geom_bar()
To standardize the segmented bar plot above (i.e., the bar heights
standardized to 1), just add position ="fill"
within
geom_bar()
.
ggplot(cdc, aes(fill=genhlth, x=smoke100)) +
geom_bar(position="fill")
One can change the title of the legend from genhlth
to
“Self-Rated General Health”.
ggplot(cdc, aes(fill=genhlth, x=smoke100)) +
geom_bar(position="fill") +
labs(fill='Self-Rated\nGeneral Health')
Say only the tabular data is available but not the case-by-case raw data, like the Titanic data we see in class. The only data we have is the table below. We don’t have the variable values for each passenger on the ship.
Class Died Survived
1st 122 203
2nd 167 118
3rd 528 178
Crew 673 212
We need to enter the tabular data in R ourselves. First we create a 4x2 matrix. Note that the values in the table are entered by column, and we specify there are 4 rows in the matrix. Then R automatically knows that there are 2 columns since there are 8 entries in the matrix.
= matrix(c(122, 167, 528, 673, 203, 118, 178, 212), nrow=4)
titanic
titanic## [,1] [,2]
## [1,] 122 203
## [2,] 167 118
## [3,] 528 178
## [4,] 673 212
Next we specify the names of the row and the column variables and their categories.
dimnames(titanic) = list(Class=c("1st","2nd","3rd","Crew"),
Survival=c("Died","Survived"))
titanic## Survival
## Class Died Survived
## 1st 122 203
## 2nd 167 118
## 3rd 528 178
## Crew 673 212
Next, we convert the matrix to a table using as.table()
and the convert the table to a data frame using
as.data.frame()
. We need to convert the table to a data
frame since ggplot
only takes data frame but not table.
= as.table(titanic)
titanic = as.data.frame(titanic)
titanic.df
titanic.df## Class Survival Freq
## 1 1st Died 122
## 2 2nd Died 167
## 3 3rd Died 528
## 4 Crew Died 673
## 5 1st Survived 203
## 6 2nd Survived 118
## 7 3rd Survived 178
## 8 Crew Survived 212
Note that the titanic.df
data frame includes a column
Freq
= the counts in the table.
Now we can make the bar plot in ggplot
as follows. Note
that we need to specify weight=Freq
.
ggplot(titanic.df, aes(x=Class, weight=Freq, fill =Survival)) +
geom_bar()
ggplot(titanic.df, aes(x=Class, weight=Freq, fill =Survival)) +
geom_bar()
To make the standardize the segmented bar plot, just add
position ="fill"
within geom_bar()
.
ggplot(titanic.df, aes(x=Class, weight=Freq, fill =Survival)) +
geom_bar(position="fill")
gender
and exerany
. How many males are in the sample? What
proportion of the sample reports being in excellent health?There is not a specific function geom_XXX()
in the
ggplot
library for making mosaic plots. Hence, we will just
use the mosaicplot()
function in base R. Plots produced by
mosaicplot()
don’t look as pretty as those by
ggplot
but it works.
Unlike ggplot
that requires data input to be a data
frame, mosaicplot()
requires the data to be a table.
Recall the two-way table of smoke100
and
genhlth
is
xtabs(~ smoke100 + genhlth, data=cdc)
## genhlth
## smoke100 excellent very good good fair poor
## n 2879 3758 2782 911 229
## y 1778 3214 2893 1108 448
To create a mosaic plot for the table above, just feed the table
above to mosaicplot()
.
mosaicplot(xtabs(~ smoke100+genhlth, data=cdc), color=TRUE, main="")
As plots should always be properly labeled, the next line shows how
to change the title, x-label and y-label of the mosaic plot. The
las=1
option makes the labels perpendicular to the axes so
that they don’t overlap.
mosaicplot(xtabs(~ smoke100+genhlth, data=cdc),
main="Smoking and General Health",
ylab="Self-Rated General Health",
xlab="Smoked 100+ Cigarettes Lifetime",
las=2, color=TRUE)
For the titanic data, as we have entered the data as a table in R, we
can feed the table to mosaicplot()
to create the plot.
mosaicplot(titanic, main="", color=TRUE)
This lab was expanded for STAT 220 by Yibi Huang from a lab released from OpenIntro by Andrew Bray and Mine Çetinkaya-Rundel, which originally was based on a lab written by Mark Hansen of UCLA Statistics.
This lab can be shared or edited under a Creative Commons Attribution-ShareAlike 3.0 Unported licence.
s