In Lab 5, we’ll investigate the probability distribution that is most central to statistics: the normal distribution. If we are confident that our data are nearly normal, that opens the door to many powerful statistical methods. Here we’ll use the graphical tools of R to assess the normality of our data and also learn how to generate random numbers from a normal distribution.
This week we’ll be working with measurements of body dimensions. This
data set contains measurements from 247 men and 260 women, most of whom
were considered healthy young adults. The data file can be downloaded at
http://www.stat.uchicago.edu/~yibi/s220/labs/data/bdims.txt.
Then set the working directory as in Lab01, and load the data using
read.table
as follows.
= read.table("bdims.txt", header=TRUE) bdims
Let’s take a quick peek at the first few rows of the data.
head(bdims)
## bia.di bii.di bit.di che.de che.di elb.di wri.di kne.di ank.di sho.gi che.gi
## 1 42.9 26.0 31.5 17.7 28.0 13.1 10.4 18.8 14.1 106.2 89.5
## 2 43.7 28.5 33.5 16.9 30.8 14.0 11.8 20.6 15.1 110.5 97.0
## 3 40.1 28.2 33.3 20.9 31.7 13.9 10.9 19.7 14.1 115.1 97.5
## 4 44.3 29.9 34.0 18.4 28.2 13.9 11.2 20.9 15.0 104.5 97.0
## 5 42.5 29.9 34.0 21.5 29.4 15.2 11.6 20.7 14.9 107.5 97.5
## 6 43.3 27.0 31.5 19.6 31.3 14.0 11.5 18.8 13.9 119.8 99.9
## wai.gi nav.gi hip.gi thi.gi bic.gi for.gi kne.gi cal.gi ank.gi wri.gi age
## 1 71.5 74.5 93.5 51.5 32.5 26.0 34.5 36.5 23.5 16.5 21
## 2 79.0 86.5 94.8 51.5 34.4 28.0 36.5 37.5 24.5 17.0 23
## 3 83.2 82.9 95.0 57.3 33.4 28.8 37.0 37.3 21.9 16.9 28
## 4 77.8 78.8 94.0 53.0 31.0 26.2 37.0 34.8 23.0 16.6 23
## 5 80.0 82.5 98.5 55.4 32.0 28.4 37.7 38.6 24.4 18.0 22
## 6 82.5 80.1 95.3 57.5 33.0 28.0 36.6 36.1 23.5 16.9 21
## wgt hgt sex
## 1 65.6 174.0 1
## 2 71.8 175.3 1
## 3 80.7 193.5 1
## 4 72.6 186.5 1
## 5 78.8 187.2 1
## 6 74.8 181.5 1
You’ll see that for every observation we have 25 measurements, many
of which are either diameters or girths. A key to the variable names can
be found at http://www.openintro.org/stat/data/bdims.php,
but we’ll be focusing on just three variables to get started: weight in
kg (wgt
), height in cm (hgt
), and
sex
(1
indicates male, 0
indicates female).
Since males and females tend to have different body dimensions, it will be useful to create two additional data sets: one with only men and another with only women.
= subset(bdims, sex == 1)
mdims = subset(bdims, sex == 0) fdims
In your description of the distributions, did you use words like bell-shaped or normal? It’s tempting to say so when faced with a unimodal symmetric distribution.
To see how accurate that description is, we can plot a normal distribution curve on top of a histogram to see how closely the data follow a normal distribution. This normal curve should have the same mean and standard deviation as the data. We’ll be working with women’s heights, so let’s store them as a separate object and then calculate some statistics that will be referenced later.
= mean(mdims$hgt)
mhgtmean = sd(mdims$hgt) mhgtsd
Next we make a density histogram to use as the backdrop and use the
fit
argument to overlay a normal probability curve. The
difference between a frequency histogram and a density histogram is that
while in a frequency histogram the heights of the bars add up
to the total number of observations, in a density histogram the
areas of the bars add up to 1. The area of each bar can be
calculated as simply the height times the width of the bar.
Using a density histogram allows us to properly overlay a normal
distribution curve over the histogram since the curve is a normal
probability density function. Frequency and density histograms both
display the same exact shape; they only differ in their \(y\)-axis. You can verify this by comparing
the frequency histogram you constructed earlier and the density
histogram created by the commands below.
library(ggplot2)
= function(x, binwidth){
histnorm = data.frame(x)
df ggplot(df, aes(x=x)) +
geom_histogram(aes(y=..density..),
binwidth=binwidth, color="white") +
stat_function(fun = dnorm,
args = list(mean = mean(df$x), sd = sd(df$x)),
lwd = 1, col = 'blue')
}
histnorm(mdims$hgt, binwidth=2.5)
Notice that adding the fit
argument to
histogram
overlays a normal distribution on plot. One can
adjust the binwidth by specifying the binwidth
value.
Why should we care about whether or not a variable is normally
distributed? It turns out that statisticians know a lot about the normal
distribution.
Once we decide that a random variable is approximately normal, we can
answer all sorts of questions about that variable related to
probability. Take, for example, the question of, “What proportion of
males in the sample is taller than 6 feet (about 182 cm)?” (The study
that published this data set is clear to point out that the sample was
not random and therefore inference to a general population is not
suggested. We do so here only as an exercise.)
If we assume that men’s heights are normally distributed (a very
close approximation is also okay), we can find this proportion by
calculating a \(Z\) score and
consulting a \(Z\) table (also called a
normal probability table).
In R, this is done in one step with the function pnorm
.
1- pnorm(q = 182, mean = mhgtmean, sd = mhgtsd)
## [1] 0.2768345
Note that the function pnorm
gives the area under the
normal curve below a given value, q
, the
lower-tail probability \(P(x<q)\).
Since we’re interested in the probability that someone is
taller than 182 cm, we have to take one minus that
probability.
The actual proportion of males in the sample that are taller than 182 cm can be found as follows:
with(mdims, sum(hgt > 182) / length(hgt))
## [1] 0.2631579
The command sum(hgt > 182)
gives the number of
observations in the data set with hgt
value greater than
182 , and length(hgt)
gives the number of observations in
the dataset.
Likewise, if we want to know what percentage of males in the sample are shorter than 5’5 feets (= 165.1 cm), we can compute it with normal approximation and exactly as follows.
pnorm(q = 165.1, mean = mhgtmean, sd = mhgtsd)
## [1] 0.03917845
with(mdims, sum(hgt < 165.1) / length(hgt))
## [1] 0.02834008
Although the probabilities are not exactly the same, they are reasonably close. The closer your distribution is to being normal, the closer the proportions computed based on normal approximation are to the actual proportions.
The \(n\)th percentile of a variable is the value that \(n\) percent of the data values are below it and \(100-n\) percent are above it.
Suppose we want to find the 90th percentile of male heights for this
data set. That is, we want to find the height such that 90% of the males
in the data set are shorter than it and 10% are taller. If we assume
that male heights are (nearly) normally distributed, then we can find
the 90th percentile using the R command qnorm
qnorm(0.9, mean = mhgtmean, sd = mhgtsd)
## [1] 186.9515
The R quantile
gives the actual percentiles of the data
(w/o assumming normality).
quantile(mdims$hgt, prob=0.9)
## 90%
## 188
Just like normal probabilities, The closer the distribution is to being normal, the closer the percentiles computed based on normal approximations are to the actual percentile of the data.
Using normal approximation, find the percentage of females that are taller than 5’7’’ (170.2 cm) and find the actual percentage based on the data.
Find the 25 percentile of females’ height in the sample using normal approximation and find the percentile based on the data.
Make a histogram of age of females and overlay a normal curve that best fit the histogram. Does the age of females seems to be normally distributed?
Find the exact percentage of females in the sample who were 25 or younger and then by normal approximation. Do you expected these two percentages to be close?
This lab was expanded for STAT 220 by Yibi Huang from a lab released from OpenIntro by Andrew Bray and Mine Çetinkaya-Rundel, which originally was based on a lab written by Mark Hansen of UCLA Statistics.
This lab can be shared or edited under a Creative Commons Attribution-ShareAlike 3.0 Unported licence.