Lab 7: Foundations for statistical inference

The Data

This week we’ll be working with measurements of body dimensions. This data set contains measurements from 247 men and 260 women, most of whom were considered healthy young adults.

download.file("http://www.openintro.org/stat/data/bdims.RData", destfile = "bdims.RData")
load("bdims.RData")

Let’s take a quick peek at the first few rows of the data.

head(bdims)

You’ll see that for every observation we have 25 measurements, many of which are either diameters or girths. A key to the variable names can be found at http://www.openintro.org/stat/data/bdims.php, but we’ll be focusing on height in cm (hgt), and sex (1 indicates male, 0 indicates female).

Since males and females tend to have different body dimensions, it will be useful to create two additional data sets: one with only men and another with only women.

mdims = subset(bdims, sex == 1)
fdims = subset(bdims, sex == 0)

Comparing the z-confidence intervals and the t-confidence intervals

For the mean of a population $\mu$ , in Section 4.2 and Section 5.1, we introduced 3 kinds of confidence intervals.

z-CI with known population SD: $\overline{x}\pm z^* \sigma/\sqrt{n}$ ,
z-CI with unknown population SD: $\overline{x}\pm z^* s/\sqrt{n}$ ,
t-CI: $\overline{x}\pm t^* s/\sqrt{n}$

where $\sigma$ is the SD of the population, and $s$ is the SD of the sample.

The z-CI with known population SD cannot be constructed as the population SD $\sigma$ is usually unknown. The z-CI with unknown population SD replaces the unknown population SD $\sigma$ with the sample SD $s$ . This replacement is justifiable only when the sample size $n$ is large enough. For small samples, one can only use the t-CI. All 3 confidence intervals require a nearly normal population when the sample size is small.

In Lab 7, we will treat the 247 men as our population. The data set has information on many body dimension variables, but we’ll focus on the height of the men, represented by the variable hgt.

population = mdims$hgt

The true values of the mean $\mu$ and the SD $\sigma$ of the population are

mu = mean(population); mu

## [1] 177.7453

sigma = sd(population); sigma

## [1] 7.183629

Simulation

We’ll take 100 simple random samples, each of size $n=4$ , from the population (the 247 men), construct the 3 kinds of confidence intervals at 95% level for each of the 100 samples, and see how many of the intervals actually contain the population mean $\mu.$

For small samples of size 4, all 3 confidence intervals require the normality of the population distribution. Let’s hence exam the histogram of the 247 men’s heights using the histnorm() function we defined in Lab 6. First, we need to load the histnorm() function.

library(ggplot2)
histnorm = function(x, bins = 30, binwidth = NULL, xlab = deparse(substitute(x))){
  df = data.frame(x)
  ggplot(df, aes(x=x)) + 
    geom_histogram(aes(y=..density..), bins = bins,
                   binwidth=binwidth, color="white") +
    xlab(xlab) +
    stat_function(fun = dnorm, 
                  args = list(mean = mean(df$x), sd = sd(df$x)), 
                  lwd = 1, col = 'blue')
}

histnorm(population, binwidth=2.5)

As the histogram is roughly symmetric and bell-shaped, we can say the population is fairly normal and construct the intervals.

Here is the outline:

Obtain a random sample of size $n = 4$ .
Calculate and store mean and SD of the sample.
Repeat steps (1) and (2) 100 times.
Use these stored statistics to calculate many confidence intervals.

We write a simple R function meansd() below that will compute and report the mean and SD for a given sample. Please run the following codes in R.

meansd = function(mydata){
  # input mydata is a numerical vector or matrices
  result = c(mean(mydata), sd(mydata));
  names(result) = c("mean", "sd");
  result
}

To demonstrate the usage of meansd(), we first take a sample of 4 observations from the population and then use meansd() to find the mean and SD of the sample.

samp = sample(population, size = 4)
samp

## [1] 181.1 177.8 177.5 188.0

meansd(samp)

##       mean         sd 
## 181.100000   4.880574

We can use the replicate() function to repeat the procedure above 100 times

samp = as.data.frame(t(replicate(100,meansd(sample(population, size = 4)))))
head(samp)

##      mean        sd
## 1 178.025 13.981029
## 2 178.150  8.644652
## 3 176.500  4.080033
## 4 173.175  5.563797
## 5 180.350  9.256529
## 6 173.575  1.980530

In the output, We see the first 6 rows of 100 rows of summary statistics. The $i$ th row of samp stores the mean and sd of the $i$ th sample. The first column of samp is the sample mean for each of the 100 samples, the 2nd column is the sample SD.

Calculating the z-CI with known population SD $\sigma$ : $\overline{x}\pm 1.96 \sigma/\sqrt{n}$

In this simulation, the population SD $\sigma$ is known (sd(populaiton)=7.183629) so we can construct the interval $\overline{x}\pm 1.96 \sigma/\sqrt{n}$ and assess its performance. The following commands construct the confidence interval $\overline{x}\pm 1.96 \sigma/\sqrt{n}$ for each of the 100 samples, in which $\sigma=$ sd(populaiton)=7.183629 is the population SD.

samp = transform(samp, lower = mean - 1.96 * sigma/sqrt(4))
samp = transform(samp, upper = mean + 1.96 * sigma/sqrt(4))

The R codes above ask R to take the mean stored in samp, and compute the lower and upper bound of the confidence interval.

Lower bounds of these 100 confidence intervals are stored in lower, and the upper bounds are in upper. Let’s view the first 6 intervals.

head(samp)

##      mean        sd   lower   upper
## 1 178.025 13.981029 170.985 185.065
## 2 178.150  8.644652 171.110 185.190
## 3 176.500  4.080033 169.460 183.540
## 4 173.175  5.563797 166.135 180.215
## 5 180.350  9.256529 173.310 187.390
## 6 173.575  1.980530 166.535 180.615

We can plot the 100 confidence intervals and see how many of them contain the true population mean using the plot.ci function below. Please run the following codes in R.

library(ggplot2)
plot.ci = function (dat, mu){
    dat$index = 1:length(dat[,1])
    dat = transform(dat, containMU = upper > mu & lower < mu)
    ggplot(dat, aes(x=lower,y = index, xend=upper, yend=index, color=containMU)) +
    geom_segment(arrow=arrow(ends="both", angle=90, length = unit(0.1,"cm"))) + 
    xlab("mean") + 
    geom_vline(xintercept = mu)
}

And then we can plot our 100 confidence intervals using plot_ci as follows

plot.ci(samp, mu = mean(population))

In the above, we can see the 100 confidence intervals are plotted along with the true population mean marked by a vertical dash line. Those confidence intervals that miss the population are highlighted in red color.

For the 100 CIs constructed, we expect 5 of them to miss the true population mean $\mu$ as the confidence level is 95%. The actual number of the 100 intervals that miss $\mu$ may not be exactly 5, which most likely be between 0-10}, just like tossing a fair coin 100 times one might not get exactly 50 heads.

Note that the 100 intervals above are equal in length because their margin of error $1.96\sigma/\sqrt{n}$ do not vary from sample to sample.

Calculating the z-CI with unknown population SD: $\overline{x}\pm 1.96 s/\sqrt{n}$

Now we are going to construct the second confidence interval $\overline{x}\pm 1.96 s/\sqrt{n}$ for each of the 100 samples, in which $s$ is the sample SD of the sample.

samp = transform(samp, lower = mean - 1.96 * sd/sqrt(4))
samp = transform(samp, upper = mean + 1.96 * sd/sqrt(4))

We plot the 100 confidence intervals as follows.

plot.ci(samp, mu = mean(population))

We can see that obviously over 5% of the intervals above fail to include the true population mean $\mu= 177.7453.$ The actual confidence level of the $z$ -interval is lower than the nomial level 95% when the unknown $\sigma$ is replaced with the sample SD $s$ .

Observe that the 100 intervals above are NOT equal in length because the sample SD $s$ changes from sample to sample.

Observe that the samples with smaller SD $s$ are more likely to miss $\mu$ since their intervals tend to be shorter in length.

Calculating the $t$ -CI : $\overline{x}\pm t^* s/\sqrt{n}$

For a sample of size 4, df $= 4-1=3$ , we can find $t^*=3.18$ from the $t$ -table. The t-CI for each of the 100 samples is calculated and plotted as follows:

samp = transform(samp, lower = mean - 3.18 * sd/sqrt(4))
samp = transform(samp, upper = mean + 3.18 * sd/sqrt(4))

plot.ci(samp, mu = mean(population))

We can see that using the $t$ -interval, the actual confidence level goes back to the nominal level 95%.

On Your Own

Now use the 260 women in the data set as the population. Make a histogram for the height of the 260 women. Do you find any outlier or clear skewness? Find the population mean and SD of the height of the 260 women.
Take 100 samples of size 5 from the population of the 260 women. For each of the 100 samples, construct the 3 kinds of confidence intervals ( $\overline{x}\pm z^*\sigma/\sqrt{n}$ , $\overline{x}\pm z^*s/\sqrt{n}$ , and $\overline{x}\pm t^*s/\sqrt{n}$ , ) for the mean height of the 260 women at 90% level. Use the plot_ci function to plot the 3 kinds of confidence intervals. For each of the 3 kinds of intervals, calculate the proportion of intervals that include the true population mean. Are those percentages close to the nominal level 90%?

This lab was modified by Yibi Hung from a lab written for OpenIntro by Andrew Bray and Mine Çetinkaya-Rundel. This is a product of OpenIntro that is released under a Creative Commons Attribution-ShareAlike 3.0 Unported.

Lab 7: Foundations for statistical inference – Confidence intervals