Lab 5: Normal Distributions

In Lab 5, we’ll investigate the probability distribution that is most central to statistics: the normal distribution. If we are confident that our data are nearly normal, that opens the door to many powerful statistical methods. Here we’ll use the graphical tools of R to assess the normality of our data and also learn how to generate random numbers from a normal distribution.

The Data

This week we’ll be working with measurements of body dimensions. This data set contains measurements from 247 men and 260 women, most of whom were considered healthy young adults. The data file can be downloaded at http://www.stat.uchicago.edu/~yibi/s220/labs/data/bdims.txt. Then set the working directory as in Lab01, and load the data using read.table as follows.

bdims = read.table("bdims.txt", header=TRUE)

Let’s take a quick peek at the first few rows of the data.

head(bdims)

##   bia.di bii.di bit.di che.de che.di elb.di wri.di kne.di ank.di sho.gi che.gi
## 1   42.9   26.0   31.5   17.7   28.0   13.1   10.4   18.8   14.1  106.2   89.5
## 2   43.7   28.5   33.5   16.9   30.8   14.0   11.8   20.6   15.1  110.5   97.0
## 3   40.1   28.2   33.3   20.9   31.7   13.9   10.9   19.7   14.1  115.1   97.5
## 4   44.3   29.9   34.0   18.4   28.2   13.9   11.2   20.9   15.0  104.5   97.0
## 5   42.5   29.9   34.0   21.5   29.4   15.2   11.6   20.7   14.9  107.5   97.5
## 6   43.3   27.0   31.5   19.6   31.3   14.0   11.5   18.8   13.9  119.8   99.9
##   wai.gi nav.gi hip.gi thi.gi bic.gi for.gi kne.gi cal.gi ank.gi wri.gi age
## 1   71.5   74.5   93.5   51.5   32.5   26.0   34.5   36.5   23.5   16.5  21
## 2   79.0   86.5   94.8   51.5   34.4   28.0   36.5   37.5   24.5   17.0  23
## 3   83.2   82.9   95.0   57.3   33.4   28.8   37.0   37.3   21.9   16.9  28
## 4   77.8   78.8   94.0   53.0   31.0   26.2   37.0   34.8   23.0   16.6  23
## 5   80.0   82.5   98.5   55.4   32.0   28.4   37.7   38.6   24.4   18.0  22
## 6   82.5   80.1   95.3   57.5   33.0   28.0   36.6   36.1   23.5   16.9  21
##    wgt   hgt sex
## 1 65.6 174.0   1
## 2 71.8 175.3   1
## 3 80.7 193.5   1
## 4 72.6 186.5   1
## 5 78.8 187.2   1
## 6 74.8 181.5   1

You’ll see that for every observation we have 25 measurements, many of which are either diameters or girths. A key to the variable names can be found at http://www.openintro.org/stat/data/bdims.php, but we’ll be focusing on just three variables to get started: weight in kg (wgt), height in cm (hgt), and sex (1 indicates male, 0 indicates female).

Since males and females tend to have different body dimensions, it will be useful to create two additional data sets: one with only men and another with only women.

mdims = subset(bdims, sex == 1)
fdims = subset(bdims, sex == 0)

Make a histogram of men’s heights and a histogram of women’s heights. How would you compare the various aspects of the two distributions?

The normal distribution

In your description of the distributions, did you use words like bell-shaped or normal? It’s tempting to say so when faced with a unimodal symmetric distribution.

To see how accurate that description is, we can plot a normal distribution curve on top of a histogram to see how closely the data follow a normal distribution. This normal curve should have the same mean and standard deviation as the data. We’ll be working with women’s heights, so let’s store them as a separate object and then calculate some statistics that will be referenced later.

mhgtmean = mean(mdims$hgt)
mhgtsd   = sd(mdims$hgt)

Next we make a density histogram to use as the backdrop and use the fit argument to overlay a normal probability curve. The difference between a frequency histogram and a density histogram is that while in a frequency histogram the heights of the bars add up to the total number of observations, in a density histogram the areas of the bars add up to 1. The area of each bar can be calculated as simply the height times the width of the bar. Using a density histogram allows us to properly overlay a normal distribution curve over the histogram since the curve is a normal probability density function. Frequency and density histograms both display the same exact shape; they only differ in their \(y\)-axis. You can verify this by comparing the frequency histogram you constructed earlier and the density histogram created by the commands below.

library(ggplot2)
histnorm = function(x, binwidth){
  df = data.frame(x)
  ggplot(df, aes(x=x)) + 
    geom_histogram(aes(y=..density..), 
                   binwidth=binwidth, color="white") +
    stat_function(fun = dnorm, 
                  args = list(mean = mean(df$x), sd = sd(df$x)), 
                  lwd = 1, col = 'blue')
}

histnorm(mdims$hgt, binwidth=2.5)

Notice that adding the fit argument to histogram overlays a normal distribution on plot. One can adjust the binwidth by specifying the binwidth value.

Based on the this plot, does it appear that the data follow a nearly normal distribution?

Normal probabilities

Why should we care about whether or not a variable is normally distributed? It turns out that statisticians know a lot about the normal distribution.
Once we decide that a random variable is approximately normal, we can answer all sorts of questions about that variable related to probability. Take, for example, the question of, “What proportion of males in the sample is taller than 6 feet (about 182 cm)?” (The study that published this data set is clear to point out that the sample was not random and therefore inference to a general population is not suggested. We do so here only as an exercise.)

If we assume that men’s heights are normally distributed (a very close approximation is also okay), we can find this proportion by calculating a \(Z\) score and consulting a \(Z\) table (also called a normal probability table).
In R, this is done in one step with the function pnorm.

1- pnorm(q = 182, mean = mhgtmean, sd = mhgtsd)

## [1] 0.2768345

Note that the function pnorm gives the area under the normal curve below a given value, q, the lower-tail probability \(P(x<q)\). Since we’re interested in the probability that someone is taller than 182 cm, we have to take one minus that probability.

The actual proportion of males in the sample that are taller than 182 cm can be found as follows:

with(mdims, sum(hgt > 182) / length(hgt))

## [1] 0.2631579

The command sum(hgt > 182) gives the number of observations in the data set with hgt value greater than 182 , and length(hgt) gives the number of observations in the dataset.

Likewise, if we want to know what percentage of males in the sample are shorter than 5’5 feets (= 165.1 cm), we can compute it with normal approximation and exactly as follows.

pnorm(q = 165.1, mean = mhgtmean, sd = mhgtsd)

## [1] 0.03917845

with(mdims, sum(hgt < 165.1) / length(hgt))

## [1] 0.02834008

Although the probabilities are not exactly the same, they are reasonably close. The closer your distribution is to being normal, the closer the proportions computed based on normal approximation are to the actual proportions.

Normal Percentiles

The \(n\)th percentile of a variable is the value that \(n\) percent of the data values are below it and \(100-n\) percent are above it.

Suppose we want to find the 90th percentile of male heights for this data set. That is, we want to find the height such that 90% of the males in the data set are shorter than it and 10% are taller. If we assume that male heights are (nearly) normally distributed, then we can find the 90th percentile using the R command qnorm

qnorm(0.9, mean = mhgtmean, sd = mhgtsd)

## [1] 186.9515

The R quantile gives the actual percentiles of the data (w/o assumming normality).

quantile(mdims$hgt, prob=0.9)

## 90% 
## 188

Just like normal probabilities, The closer the distribution is to being normal, the closer the percentiles computed based on normal approximations are to the actual percentile of the data.

On Your Own

Using normal approximation, find the percentage of females that are taller than 5’7’’ (170.2 cm) and find the actual percentage based on the data.
Find the 25 percentile of females’ height in the sample using normal approximation and find the percentile based on the data.
Make a histogram of age of females and overlay a normal curve that best fit the histogram. Does the age of females seems to be normally distributed?
Find the exact percentage of females in the sample who were 25 or younger and then by normal approximation. Do you expected these two percentages to be close?

This lab was expanded for STAT 220 by Yibi Huang from a lab released from OpenIntro by Andrew Bray and Mine Çetinkaya-Rundel, which originally was based on a lab written by Mark Hansen of UCLA Statistics.

This lab can be shared or edited under a Creative Commons Attribution-ShareAlike 3.0 Unported licence.