This week we’ll be working with measurements of body dimensions. This data set contains measurements from 247 men and 260 women, most of whom were considered healthy young adults.
download.file("http://www.openintro.org/stat/data/bdims.RData", destfile = "bdims.RData")
load("bdims.RData")
Let’s take a quick peek at the first few rows of the data.
head(bdims)
You’ll see that for every observation we have 25 measurements, many
of which are either diameters or girths. A key to the variable names can
be found at http://www.openintro.org/stat/data/bdims.php,
but we’ll be focusing on height in cm (hgt
), and
sex
(1
indicates male, 0
indicates female).
Since males and females tend to have different body dimensions, it will be useful to create two additional data sets: one with only men and another with only women.
= subset(bdims, sex == 1)
mdims = subset(bdims, sex == 0) fdims
For the mean of a population \(\mu\), in Section 4.2 and Section 5.1, we introduced 3 kinds of confidence intervals.
where \(\sigma\) is the SD of the population, and \(s\) is the SD of the sample.
The z-CI with known population SD cannot be constructed as the population SD \(\sigma\) is usually unknown. The z-CI with unknown population SD replaces the unknown population SD \(\sigma\) with the sample SD \(s\). This replacement is justifiable only when the sample size \(n\) is large enough. For small samples, one can only use the t-CI. All 3 confidence intervals require a nearly normal population when the sample size is small.
In Lab 7, we will treat the 247 men as our population. The data set
has information on many body dimension variables, but we’ll focus on the
height of the men, represented by the variable hgt
.
= mdims$hgt population
The true values of the mean \(\mu\) and the SD \(\sigma\) of the population are
= mean(population); mu mu
## [1] 177.7453
= sd(population); sigma sigma
## [1] 7.183629
We’ll take 100 simple random samples, each of size \(n=4\), from the population (the 247 men), construct the 3 kinds of confidence intervals at 95% level for each of the 100 samples, and see how many of the intervals actually contain the population mean \(\mu.\)
For small samples of size 4, all 3 confidence intervals require the
normality of the population distribution. Let’s hence exam the histogram
of the 247 men’s heights using the histnorm()
function we
defined in Lab 6. First, we need to load the histnorm()
function.
library(ggplot2)
= function(x, bins = 30, binwidth = NULL, xlab = deparse(substitute(x))){
histnorm = data.frame(x)
df ggplot(df, aes(x=x)) +
geom_histogram(aes(y=..density..), bins = bins,
binwidth=binwidth, color="white") +
xlab(xlab) +
stat_function(fun = dnorm,
args = list(mean = mean(df$x), sd = sd(df$x)),
lwd = 1, col = 'blue')
}
histnorm(population, binwidth=2.5)
As the histogram is roughly symmetric and bell-shaped, we can say the population is fairly normal and construct the intervals.
Here is the outline:
We write a simple R function meansd()
below that will
compute and report the mean and SD for a given sample. Please run the
following codes in R.
= function(mydata){
meansd # input mydata is a numerical vector or matrices
= c(mean(mydata), sd(mydata));
result names(result) = c("mean", "sd");
result }
To demonstrate the usage of meansd()
, we first take a
sample of 4 observations from the population and then use
meansd()
to find the mean and SD of the sample.
= sample(population, size = 4)
samp samp
## [1] 181.1 177.8 177.5 188.0
meansd(samp)
## mean sd
## 181.100000 4.880574
We can use the replicate()
function to repeat the
procedure above 100 times
= as.data.frame(t(replicate(100,meansd(sample(population, size = 4)))))
samp head(samp)
## mean sd
## 1 178.025 13.981029
## 2 178.150 8.644652
## 3 176.500 4.080033
## 4 173.175 5.563797
## 5 180.350 9.256529
## 6 173.575 1.980530
In the output, We see the first 6 rows of 100 rows of summary
statistics. The \(i\)th row of
samp
stores the mean and sd of the \(i\)th sample. The first column of
samp
is the sample mean for each of the 100 samples, the
2nd column is the sample SD.
In this simulation, the population SD \(\sigma\) is known
(sd(populaiton)
=7.183629) so we can construct the interval
\(\overline{x}\pm 1.96
\sigma/\sqrt{n}\) and assess its performance. The following
commands construct the confidence interval \(\overline{x}\pm 1.96 \sigma/\sqrt{n}\) for
each of the 100 samples, in which \(\sigma=\)sd(populaiton)
=7.183629
is the population SD.
= transform(samp, lower = mean - 1.96 * sigma/sqrt(4))
samp = transform(samp, upper = mean + 1.96 * sigma/sqrt(4)) samp
The R codes above ask R to take the mean
stored in
samp
, and compute the lower and upper bound of the
confidence interval.
Lower bounds of these 100 confidence intervals are stored in
lower
, and the upper bounds are in upper
.
Let’s view the first 6 intervals.
head(samp)
## mean sd lower upper
## 1 178.025 13.981029 170.985 185.065
## 2 178.150 8.644652 171.110 185.190
## 3 176.500 4.080033 169.460 183.540
## 4 173.175 5.563797 166.135 180.215
## 5 180.350 9.256529 173.310 187.390
## 6 173.575 1.980530 166.535 180.615
We can plot the 100 confidence intervals and see how many of them
contain the true population mean using the plot.ci
function
below. Please run the following codes in R.
library(ggplot2)
= function (dat, mu){
plot.ci $index = 1:length(dat[,1])
dat= transform(dat, containMU = upper > mu & lower < mu)
dat ggplot(dat, aes(x=lower,y = index, xend=upper, yend=index, color=containMU)) +
geom_segment(arrow=arrow(ends="both", angle=90, length = unit(0.1,"cm"))) +
xlab("mean") +
geom_vline(xintercept = mu)
}
And then we can plot our 100 confidence intervals using
plot_ci
as follows
plot.ci(samp, mu = mean(population))
In the above, we can see the 100 confidence intervals are plotted along with the true population mean marked by a vertical dash line. Those confidence intervals that miss the population are highlighted in red color.
For the 100 CIs constructed, we expect 5 of them to miss the true population mean \(\mu\) as the confidence level is 95%. The actual number of the 100 intervals that miss \(\mu\) may not be exactly 5, which most likely be between 0-10}, just like tossing a fair coin 100 times one might not get exactly 50 heads.
Note that the 100 intervals above are equal in length because their margin of error \(1.96\sigma/\sqrt{n}\) do not vary from sample to sample.
Now we are going to construct the second confidence interval \(\overline{x}\pm 1.96 s/\sqrt{n}\) for each of the 100 samples, in which \(s\) is the sample SD of the sample.
= transform(samp, lower = mean - 1.96 * sd/sqrt(4))
samp = transform(samp, upper = mean + 1.96 * sd/sqrt(4)) samp
We plot the 100 confidence intervals as follows.
plot.ci(samp, mu = mean(population))
We can see that obviously over 5% of the intervals above fail to include the true population mean \(\mu= 177.7453.\) The actual confidence level of the \(z\)-interval is lower than the nomial level 95% when the unknown \(\sigma\) is replaced with the sample SD \(s\).
Observe that the 100 intervals above are NOT equal in length because the sample SD \(s\) changes from sample to sample.
Observe that the samples with smaller SD \(s\) are more likely to miss \(\mu\) since their intervals tend to be shorter in length.
For a sample of size 4, df \(= 4-1=3\), we can find \(t^*=3.18\) from the \(t\)-table. The t-CI for each of the 100 samples is calculated and plotted as follows:
= transform(samp, lower = mean - 3.18 * sd/sqrt(4))
samp = transform(samp, upper = mean + 3.18 * sd/sqrt(4)) samp
plot.ci(samp, mu = mean(population))
We can see that using the \(t\)-interval, the actual confidence level goes back to the nominal level 95%.
Now use the 260 women in the data set as the population. Make a histogram for the height of the 260 women. Do you find any outlier or clear skewness? Find the population mean and SD of the height of the 260 women.
Take 100 samples of size 5 from the population of the 260 women.
For each of the 100 samples, construct the 3 kinds of confidence
intervals ( \(\overline{x}\pm
z^*\sigma/\sqrt{n}\), \(\overline{x}\pm
z^*s/\sqrt{n}\), and \(\overline{x}\pm
t^*s/\sqrt{n}\), ) for the mean height of the 260 women at 90%
level. Use the plot_ci
function to plot the 3 kinds of
confidence intervals. For each of the 3 kinds of intervals, calculate
the proportion of intervals that include the true population mean. Are
those percentages close to the nominal level 90%?
This lab was modified by Yibi Hung from a lab written for OpenIntro by Andrew Bray and Mine Çetinkaya-Rundel. This is a product of OpenIntro that is released under a Creative Commons Attribution-ShareAlike 3.0 Unported.