In this lab, we will again make use of the lattice
and tidyverse
packages:
library(lattice)
library(tidyverse)
The Behavioral Risk Factor Surveillance System (BRFSS) is an annual telephone survey of 350,000 people in the United States. As its name implies, the BRFSS is designed to identify risk factors in the adult population and report emerging health trends. For example, respondents are asked about their diet and weekly physical activity, their HIV/AIDS status, possible tobacco use, and even their level of healthcare coverage. The BRFSS Web site (http://www.cdc.gov/brfss) contains a complete description of the survey, including the research questions that motivate the study and many interesting results derived from the data.
We will focus on a random sample of 20,000 people from the BRFSS survey conducted in 2000. While there are over 200 variables in this data set, we will work with a small subset.
The data file can be downloaded at [https://www.openintro.org/book/statdata/cdc.csv]. Then set the working directory as in Lab01, and load the data using read.table
as follows.
cdc = read.table("cdc.dat", header=TRUE)
If you have trouble setting working directory and loading data file, don't hesitate to ask TA for help. To view the names of the variables, type the command
str(cdc)
This returns the names genhlth
, exerany
, hlthplan
, smoke100
, height
, weight
, wtdesire
, age
, and gender
. Each one of these variables corresponds to a question that was asked in the survey. For example, for genhlth
, respondents were asked to evaluate their general health, responding either excellent, very good, good, fair or poor. The exerany
variable indicates whether the respondent exercised in the past month (y) or did not (n). Likewise, hlthplan
indicates whether the respondent had some form of health coverage (y) or did not (n). The smoke100
variable indicates whether the respondent had smoked at least 100 cigarettes in her lifetime. The other variables record the respondent's height
in inches, weight
in pounds as well as their desired weight, wtdesire
, age
in years, and gender
.
You could also look at all of the data frame at once by typing its name into the console, but that might be unwise here. We know cdc
has 20,000 rows, so viewing the entire data set would mean flooding your screen. It's better to take small peeks at the data with head
, tail
or the subsetting techniques that you'll learn in a moment.
While it makes sense to describe a numerical variable like weight
in terms of summary statistics like mean or sd, what about categorical data? We would instead consider the sample frequency or relative frequency distribution. The function tally
does this for you by counting the number of times each kind of response was given. For example, to see the number of people who have smoked 100 cigarettes in their lifetime, type
table(cdc$smoke100)
##
## n y
## 10559 9441
One can convert the table of counts to proportions by the funstion prop.table.
prop.table(table(cdc$smoke100))
##
## n y
## 0.52795 0.47205
The table
command can be used to tabulate any number of variables that you provide. For example, to cross tablulate the two variable smoke100
and genhlth
, we could type
table(cdc$smoke100, cdc$genhlth)
##
## excellent fair good poor very good
## n 2879 911 2782 229 3758
## y 1778 1108 2893 448 3214
We can add marginal totals to the two-way table by the addmargins
command.
addmargins(table(cdc$smoke100, cdc$genhlth))
##
## excellent fair good poor very good Sum
## n 2879 911 2782 229 3758 10559
## y 1778 1108 2893 448 3214 9441
## Sum 4657 2019 5675 677 6972 20000
If one wants to view the proportions instead of counts, one can use:
prop.table(table(cdc$smoke100, cdc$genhlth)) # overall proportions
##
## excellent fair good poor very good
## n 0.14395 0.04555 0.13910 0.01145 0.18790
## y 0.08890 0.05540 0.14465 0.02240 0.16070
prop.table(table(cdc$smoke100, cdc$genhlth),1) # row proportions
##
## excellent fair good poor very good
## n 0.27265840 0.08627711 0.26347192 0.02168766 0.35590492
## y 0.18832751 0.11736045 0.30642940 0.04745260 0.34043004
prop.table(table(cdc$smoke100, cdc$genhlth),2) # column proportions
##
## excellent fair good poor very good
## n 0.6182091 0.4512135 0.4902203 0.3382570 0.5390132
## y 0.3817909 0.5487865 0.5097797 0.6617430 0.4609868
The first line gives the overall proportions(dividing the counts by the total number of cases), the second lines gives the row proportions (dividing the counts by the row totals), the last lines gives the column proportions (dividing the counts by the column totals), Likewise, three-way tables cross-tabulating 3 categorical variables can be created as.
table(cdc$smoke100, cdc$genhlth, cdc$gender)
## , , = f
##
##
## excellent fair good poor very good
## n 1522 585 1661 163 2081
## y 837 550 1292 231 1509
##
## , , = m
##
##
## excellent fair good poor very good
## n 1357 326 1121 66 1677
## y 941 558 1601 217 1705
Next, we make a bar chart of the entries in the table by putting the table inside the barchart
command.
barchart(table(cdc$smoke100), horizontal=FALSE)
Notice what we've done here! We've computed the table of cdc$smoke100
and then immediately applied the graphical function, barchart
. This is an important idea: R commands can be nested. You could also break this into two steps by typing the following:
smoke = table(cdc$smoke100)
barchart(smoke, horizontal=FALSE)
Here, we've made a new object, a table, called smoke
(the contents of which we can see by typing smoke
into the console) and then used it in as the input for barchart
. The special symbol =
performs an assignment, taking the output of one line of code and saving it into an object in your workspace.
This is another important idea that we'll return to later.
To create a segmented bar chart representing a two-way table, we can create the two-way table first, and then barchart
the table.
smoke.hlth = table(cdc$smoke100, cdc$genhlth)
barchart(smoke.hlth, horizontal=FALSE)
To get the standardized version of the barchart above (standardized the heights of all the bars), we can first convert the table to proportions and then make barchart
it.
barchart(prop.table(smoke.hlth,1), horizontal=FALSE)
In the barcharts above, it's better to add a legend for the colors used. One can add a legend using the auto.key
option.
barchart(prop.table(smoke.hlth,1), horizontal=FALSE, auto.key=TRUE)
The default location of the legend is on the top. To move the legend to the right, and to add an title to it, one can do the following
barchart(prop.table(smoke.hlth,1), horizontal=FALSE,
auto.key=list(space="right", title="General Health"))
From the above, we can see the size of the legend title is too big. We can adjust the size of the legend title as follows.
barchart(prop.table(smoke.hlth,1), horizontal=FALSE,
auto.key=list(space="right", title="General Health",cex.title=1.2))
gender
and exerany
. How many males are in the sample? What proportion of the sample reports being in excellent health?To investigate whether exercise has any effect on what people feel about their general health, we may look at the two-way table
exer.hlth = table(cdc$exerany, cdc$genhlth)
To create a mosaic plot of this table, we would enter the following command.
mosaicplot(exer.hlth, color=TRUE)
As a general rule, plots should always be properly labelled. The next line shows how to change the title, x-label and y-label of the mosaic plot. The `las=1' option makes the labels perpendicular to the axes so that they don't overlap.
mosaicplot(exer.hlth,
main="Exercise and general health",
ylab="Self-Rated General Health",
xlab="Exercise in the Past Month",
las=2, color=TRUE)
It's often useful to extract all individuals (cases) in a data set that have specific characteristics. We accomplish this through conditioning commands. First, consider expressions like
cdc$gender == "m"
or
cdc$age > 65
These commands produce a series of TRUE
and FALSE
values. There is one value for each respondent, where TRUE
indicates that the person was male (via the first command) or older than 65 (second command).
Suppose we want to extract just the data for the men in the sample, or just for those over 30. We can use the R function subset
to do that for us. For example, the command
mdata = subset(cdc, gender == "male")
will create a new data set called mdata
that contains only the men from the cdc
data set. In addition to finding it in your workspace alongside its dimensions, you can take a peek at the first several rows as usual
head(mdata)
This new data set contains all the same variables but just under half the rows. It is also possible to tell R to keep only specific variables, which is a topic we'll discuss in a future lab. For now, the important thing is that we can carve up the data based on values of one or more variables.
As an aside, you can use several of these conditions together with &
and |
. The &
is read "and" so that
m_and_over65 = subset(cdc, gender == "male" & age > 65)
will give you the data for men over the age of 65. The |
character is read "or" so that
m_or_over65 = subset(cdc, gender == "male" | age > 65)
will take people who are men or over the age of 65 (who are males eligible to Medicare). In principle, you may use as many "and" and "or" clauses as you like when forming a subset, and we might be interested in their self-rated health status and exercise activity.
exer.hlth.over65 = exer.hlth = table(m_or_over65$exerany, m_or_over65$genhlth)
mosaicplot(exer.hlth.over65,
main="Male Over 65",
ylab="Self-Rated General Health",
xlab="Exercise in the Past Month",
las=2,
color=TRUE)
under23_and_smoke
that contains all observations of respondents under the age of 23 that have smoked 100 cigarettes in their lifetime. Write the command you used to create the new object as the answer to this exercise. What percentage of them are males? What percentage of them exercise at least once in the past month?We've found an interesting association between smoking and gender. We've also picked up essential computing tools -- summary statistics, subsetting, and plots -- that will serve us well throughout this course.
Let's consider a new variable: the difference between desired weight (wtdesire
) and current weight (weight
). Create this new variable by subtracting the two columns in the data frame and assigning them to a new object called wdiff
.
What type of data is wdiff
? If an observation wdiff
is 0, what does this mean about the person's weight and desired weight. What if wdiff
is positive or negative?
Describe the distribution of wdiff
in terms of its center, shape, and spread, including any plots you use. What does this tell us about how people feel about their current weight?
Using numerical summaries and a side-by-side box plot, determine if men tend to view their weight differently than women.
Find the mean and standard deviation of weight
and determine what proportion of the weights are within one standard deviation of the mean.
This lab was expanded for STAT 220 by Yibi Huang from a lab released from OpenIntro by Andrew Bray and Mine Çetinkaya-Rundel, which originally was based on a lab written by Mark Hansen of UCLA Statistics. This lab can be shared or edited under a Creative Commons Attribution-ShareAlike 3.0 Unported licence.