---
title: 'Lab 6: Foundations for Statistical Inference - Sampling Distributions'
output:
  html_document:
    css: ./lab.css
    highlight: pygments
    theme: cerulean
    fig_width: 4
    fig_height: 3
    out_width: 400px
    dpi: 100
  pdf_document: default
---

```{r echo=F}
library(knitr)
knitr::opts_chunk$set(collapse=TRUE)
```

## The Ames Real Estate Data

Load the data and store the variables `Gr.Liv.Area` and `SalePrice` under the shorter names `area` and `price`.

```{r}
download.file("https://www.stat.uchicago.edu/~yibi/s220/labs/ames.RData", destfile = "ames.RData")
load("ames.RData")
area = ames$Gr.Liv.Area
price = ames$SalePrice
```

Summary statistics of `area`:

```{r}
summary(area)
sd(area)
```

Histogram of `area`:

```{r, fig.width=4.5, fig.height=3.5}
hist(area)
```

The code chunk below defines a new function `histnorm()` that makes a histogram of a variable and overlays the normal curve with the same mean and SD as the data. You can simply run the code below.

```{r histnorm}
library(ggplot2)
# Histogram on the density scale with a matching normal curve overlaid
histnorm = function(x, bins = 30, binwidth = NULL, xlab = deparse(substitute(x))){
  df = data.frame(x)
  ggplot(df, aes(x=x)) +
    # after_stat(density) replaces the deprecated ..density.. notation (ggplot2 >= 3.3.0)
    geom_histogram(aes(y=after_stat(density)), bins = bins, binwidth=binwidth, color="white") +
    xlab(xlab) +
    # Normal density with the same mean and SD as x
    stat_function(fun = dnorm, args = list(mean = mean(df$x), sd = sd(df$x)), lwd = 1, col = 'blue')
}
```

We can overlay this normal curve on the histogram of `area` using the `histnorm()` function we just defined.

```{r, fig.width=4.5, fig.height=2.5}
histnorm(area, bins=40)
```

1. Comment on the shape of this population distribution. Is it symmetric, left-skewed, or right-skewed?

## The Unknown Sampling Distribution

Randomly sample 50 values from the 2930 `area` values and store them in the variable `samp1`:

```{r}
samp1 = sample(area, 50)
```

2. Describe the distribution of this sample. How does it compare to the distribution of the population?

If we're interested in estimating the average living area in homes in Ames using the sample, our best single guess is the sample mean.

```{r}
mean(samp1)
```

3. Take a second sample, also of size 50, and call it `samp2`. How does the mean of `samp2` compare with the mean of `samp1`? Suppose we took two more samples, one of size 100 and one of size 1000. Which would you expect to provide a more accurate estimate of the population mean?

Computing a single sample mean is easy:

```{r}
mean(sample(area, 50))
```

We can repeat the process above (taking a sample of size 50 and computing the sample mean), say 5000 times, and collect the 5000 sample means using the `replicate()` function in R.

```{r}
sample_means50 = replicate(5000, mean(sample(area, 50)))
```

The `replicate()` function evaluates an expression repeatedly and collects the results in a vector. Here are the results of the first 6 repetitions.

```{r}
head(sample_means50)
```

Below is a histogram of `sample_means50`, which approximates the sampling distribution of the sample mean.

```{r}
histnorm(sample_means50, bins=40)
```

You should try a number of different `bins` or `binwidth` values until you get a histogram that best displays the shape of the distribution.

4. How many elements are there in `sample_means50`? Describe the sampling distribution, and be sure to specifically note its center. Would you expect the distribution to change if we instead collected 50,000 sample means?

## Interlude: The `replicate()` Function

5. To make sure you understand what the `replicate()` function does, try modifying the code to take only 100 sample means and store them in a vector named `sample_means_small`. Print the output to your screen (type `sample_means_small` into the console and press enter). How many elements are there in this object called `sample_means_small`? What does each element represent?

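As a rough illustration of how `replicate()` collects its results, here is a small sketch on a made-up vector rather than the Ames data. The names `toy` and `toy_means` are hypothetical and used only for this example.

```{r}
# Hypothetical toy population, for illustration only (not the Ames data)
toy = c(2, 4, 6, 8, 10)

set.seed(1)  # fix the random seed so the sketch is reproducible

# Evaluate the expression 3 times; each run draws 2 values from `toy`
# without replacement and returns their mean
toy_means = replicate(3, mean(sample(toy, 2)))

toy_means           # a plain numeric vector, one entry per repetition
length(toy_means)   # matches the first argument passed to replicate()
```

Each entry of `toy_means` comes from a fresh random draw, which is why the same expression can return different values across repetitions.
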
## Sample Size and the Sampling Distribution

Generate 5000 sample means for each of three sample sizes: 10, 50, and 100.

```{r}
sample_means10 = replicate(5000, mean(sample(area, 10)))
sample_means50 = replicate(5000, mean(sample(area, 50)))
sample_means100 = replicate(5000, mean(sample(area, 100)))
```

```{r}
histnorm(sample_means10, bins=40)
histnorm(sample_means50, bins=40)
histnorm(sample_means100, bins=40)
```

6. Make a histogram for each to examine the three sampling distributions. When the sample size is larger, what happens to the center, spread, and shape of the sampling distributions?

## Using the Central Limit Theorem (CLT)

By the CLT, when the sample size is $n=100$, the sampling distribution of the sample mean $\overline{X}$ is roughly normal with mean $\mu=1499.69$ and SD $\sigma/\sqrt{n}=505.5089/\sqrt{100}=50.55089$, where $\mu=1499.69$ sq. ft. is the **population** mean and $\sigma=505.5089$ sq. ft. is the **population** SD, found as follows.

```{r}
mean(area)
sd(area)
```

We can find $P(\overline{X}<1450)$ to be

```{r}
pnorm(1450, 1499.69, 505.5089/sqrt(100))
```

That is, when drawing a random sample of size 100, the sample mean will be below 1450 about 16% of the time.

The actual proportion of the 5000 sample means that are below 1450:

```{r}
table(sample_means100 < 1450)
table(sample_means100 < 1450)/5000
```

Is the proportion close to 16%?

* * *

## On Your Own

So far, we have only focused on estimating the mean living area in homes in Ames. Now you'll try to estimate the mean home price.

- Make a histogram of `price`, which shows the population distribution of the sale price of all 2930 homes in the data. Comment on the shape of the histogram.
- Find the mean and SD of `price`, which are the population mean and the population SD.
- Take a random sample of size 25 from `price`. Find the sample mean, and compare it with the population mean you found in the previous part.
- Since you have access to the population, simulate the sampling distribution for $\bar{x}_{price}$ by taking 5000 samples of size 25 from the population and computing 5000 sample means.
    Store these means in a vector called `sample_means25`. Make a histogram of the 5000 sample means.
- Repeat the previous part but change the sample size from 25 to 4. Store the 5000 sample means in a vector called `sample_means4`.
- Repeat the previous part again but change the sample size to 100. Store the 5000 sample means in a vector called `sample_means100`.
- Compare the center, spread, and shape of the 3 sampling distributions from the previous 3 parts. How do the center, spread, and shape of the sampling distributions change with the sample size?
- Theoretically, the mean and the SD of the sampling distribution are respectively $\mu$ and $\sigma/\sqrt{n}$, where $\mu$ and $\sigma$ are the population mean and the population SD found in the second part above, and $n$ is the sample size. Compute the means and SDs of the 5000 sample means as follows.

    ```{r}
    mean(sample_means4)
    mean(sample_means25)
    mean(sample_means100)
    sd(sample_means4)
    sd(sample_means25)
    sd(sample_means100)
    ```

    Are they close to their respective theoretical values $\mu$ and $\sigma/\sqrt{n}$?
- Use the CLT to find the (approximate) probability of getting a sample mean below $170,000 when the sample size is 100.
- Find the percentage of the 5000 sample means in `sample_means100` that are below $170,000. Is the percentage close to the probability computed in the previous part using the CLT?
- Use the CLT to find the (approximate) probability of getting a sample mean between $130,000 and $190,000 when the sample size is 4.
- Find the percentage of the 5000 sample means in `sample_means4` that are between $130,000 and $190,000.

    ```{r}
    table(sample_means4 < 190000 & sample_means4 > 130000)
    table(sample_means4 < 190000 & sample_means4 > 130000)/5000
    ```

    Is the percentage close to the probability computed in the previous part using the CLT? Does the CLT work well when the sample size is only 4?

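The CLT checks above all follow one pattern: compute the normal approximation from the population mean and SD, then compare it with the proportion observed among the simulated sample means. Here is a minimal sketch of that pattern on a simulated right-skewed population instead of `price`. The names `pop`, `n`, `cutoff`, and `sim_means`, and the choice of an exponential population, are assumptions made for this sketch only.

```{r}
set.seed(220)  # fix the random seed so the sketch is reproducible

# Hypothetical right-skewed population (exponential), standing in for `price`
pop = rexp(3000, rate = 1/100)
n = 25        # sample size
cutoff = 90   # threshold for P(sample mean < cutoff)

mu = mean(pop)    # population mean
sigma = sd(pop)   # population SD

# Simulated sampling distribution: 5000 sample means, each from a sample of size n
sim_means = replicate(5000, mean(sample(pop, n)))

# CLT approximation vs. proportion of simulated means below the cutoff
clt_prob = pnorm(cutoff, mean = mu, sd = sigma/sqrt(n))
sim_prob = mean(sim_means < cutoff)   # mean of a logical vector is a proportion
c(clt = clt_prob, simulated = sim_prob)

# Theoretical vs. simulated SD of the sample mean
c(theory = sigma/sqrt(n), simulated = sd(sim_means))
```

Taking `mean()` of a logical vector gives the same proportion as dividing the `TRUE` count from `table()` by the number of sample means.
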
This lab was expanded for STAT 220 by Yibi Huang from a lab released from OpenIntro by Andrew Bray and Mine Çetinkaya-Rundel, which originally was based on a lab written by Mark Hansen of UCLA Statistics. This lab can be shared or edited under a [Creative Commons Attribution-ShareAlike 3.0 Unported licence](http://creativecommons.org/licenses/by-sa/3.0).