---
title: 'Lab 6: Foundations for Statistical Inference - Sampling Distributions'
output:
  html_document:
    css: ./lab.css
    highlight: pygments
    theme: cerulean
    fig_width: 4
    fig_height: 3
    out_width: 400px
    dpi: 100
  pdf_document: default
---

```{r echo=F}
library(knitr)
knitr::opts_chunk$set(collapse=TRUE)
```

## The Ames Real Estate Data

Load the data and copy the variables
`Gr.Liv.Area` and `SalePrice` into the shorter names `area` and `price`.

```{r}
download.file("https://www.stat.uchicago.edu/~yibi/s220/labs/ames.RData", destfile = "ames.RData")
load("ames.RData")
area = ames$Gr.Liv.Area
price = ames$SalePrice
```

Summary statistics of `area`:

```{r}
summary(area)
sd(area)
```

Histogram of `area`:

```{r, fig.width=4.5, fig.height=3.5}
hist(area)
```


The code chunk below defines a new function `histnorm()`
that makes a histogram of a variable
and overlays the normal curve with the same mean and SD as the data.
You can simply run the code below.

```{r histnorm}
library(ggplot2)
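# Histogram of x on the density scale, with a normal curve overlaid
# that has the same mean and SD as x
histnorm = function(x, bins = 30, binwidth = NULL, xlab = deparse(substitute(x))){
  df = data.frame(x)
  ggplot(df, aes(x=x)) + 
    # after_stat(density) replaces the deprecated ..density.. notation
    # (requires ggplot2 3.3.0 or later)
    geom_histogram(aes(y=after_stat(density)), bins = bins,
                   binwidth=binwidth, color="white") +
    xlab(xlab) +
    stat_function(fun = dnorm, 
                  args = list(mean = mean(df$x), sd = sd(df$x)), 
                  lwd = 1, col = 'blue')
}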
```

We can overlay the closest normal curve to the histogram of `area` 
using the `histnorm` function we just defined.

```{r, fig.width=4.5, fig.height=2.5}
histnorm(area, bins=40) 
```

1.  Comment on the shape of this population distribution. 
Is it symmetric, left-skewed or right-skewed?


## The Unknown Sampling Distribution

Randomly sample 50 values from the 2930 `area` values
and store them in the variable `samp1`:

```{r}
samp1 = sample(area, 50)
```
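
By default, `sample()` draws without replacement, so no home can appear twice in a sample, and each call returns a different random sample. The sketch below illustrates both points on a small made-up vector (`toy_pop` is a hypothetical stand-in for `area`). The `set.seed()` calls are optional and are used here only to make the random draws reproducible.

```{r}
# Hypothetical toy population standing in for `area`
toy_pop = c(10, 20, 30, 40, 50, 60, 70, 80, 90, 100)

set.seed(220)                    # optional: fixes the random draws
toy_samp_a = sample(toy_pop, 5)  # 5 values drawn without replacement
set.seed(220)                    # same seed, so the same sample again
toy_samp_b = sample(toy_pop, 5)

toy_samp_a
anyDuplicated(toy_samp_a) == 0      # TRUE: no value is drawn twice
identical(toy_samp_a, toy_samp_b)   # TRUE: same seed, same sample
```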

2.  Describe the distribution of this sample. 
How does it compare to the distribution of the population?

If we're interested in estimating the average living area in homes in Ames 
using the sample, our best single guess is the sample mean.

```{r}
mean(samp1)
```


3.  Take a second sample, also of size 50, and call it `samp2`.  How does the 
    mean of `samp2` compare with the mean of `samp1`?  Suppose we took two 
    more samples, one of size 100 and one of size 1000. Which do you think 
    would provide a more accurate estimate of the population mean?


Computing a single sample mean is easy:

```{r}
mean(sample(area, 50))
```

We can repeat the process above (taking a sample of size 50 
and computing the sample mean), say 5000 times,
and get 5000 sample means using the `replicate()` function in R.

```{r}
sample_means50 = replicate(5000, mean(sample(area, 50))) 
```

The `replicate` function simply repeats a statement and 
collects the results as a vector. 
Here are the results of the first 6 repetitions.

```{r}
head(sample_means50)
```
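
One way to see what `replicate()` is doing is to compare it with an explicit `for` loop that fills a vector one sample mean at a time. The sketch below does this on a small made-up vector (`toy_pop`, `means_rep`, and `means_loop` are hypothetical names used only here). Resetting the seed before each version makes the two sequences of random draws identical.

```{r}
# Hypothetical toy population standing in for `area`
toy_pop = c(10, 20, 30, 40, 50, 60, 70, 80, 90, 100)

# replicate() version: 4 sample means from samples of size 3
set.seed(1)
means_rep = replicate(4, mean(sample(toy_pop, 3)))

# Equivalent for loop: fill a pre-allocated vector one mean at a time
set.seed(1)
means_loop = rep(NA, 4)
for (i in 1:4) {
  means_loop[i] = mean(sample(toy_pop, 3))
}

means_rep
all.equal(means_rep, means_loop)   # TRUE: both approaches give the same means
```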

Histogram of `sample_means50`,
which approximates the sampling distribution of the sample mean:

```{r}
histnorm(sample_means50, bins=40)
```

You should try a number of different `bins` (or `binwidth`) values
until you get a histogram that best displays the 
shape of the distribution. 

4.  How many elements are there in `sample_means50`?  Describe the sampling 
    distribution, and be sure to specifically note its center.  Would you 
    expect the distribution to change if we instead collected 50,000 sample 
    means?

## Interlude: The `replicate()` Function

5.  To make sure you understand what the `replicate` function does, try modifying the
    code to take only 100 sample means and put them in a vector named
    `sample_means_small`. Print the output to your screen (type 
    `sample_means_small` into the console and press enter). How many elements 
    are there in this object called `sample_means_small`? What does each 
    element represent?

## Sample Size and the Sampling Distribution


```{r}
sample_means10 = replicate(5000, mean(sample(area, 10)))
sample_means50 = replicate(5000, mean(sample(area, 50)))
sample_means100 = replicate(5000, mean(sample(area, 100)))
```

```{r}
histnorm(sample_means10, bins=40)
histnorm(sample_means50, bins=40)
histnorm(sample_means100, bins=40)
```

6. Examine the histograms of the three sampling distributions made above.
As the sample size increases,
what happens to the center, spread, and shape of the sampling distributions?
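
Spreads are easier to compare when the histograms share one x-axis. The sketch below shows one way to do that with base R graphics on a simulated population (`sim_pop` is a hypothetical stand-in, not the Ames data, and 2000 repetitions are used only to keep the chunk quick).

```{r, fig.width=4.5, fig.height=6}
# Hypothetical right-skewed population standing in for `area`
set.seed(220)
sim_pop = rgamma(3000, shape = 9, scale = 165)

# Sampling distributions for n = 10 and n = 100 (2000 repetitions each)
sim_means10  = replicate(2000, mean(sample(sim_pop, 10)))
sim_means100 = replicate(2000, mean(sample(sim_pop, 100)))

# One x-axis range that covers both sets of sample means
shared_range = range(sim_means10, sim_means100)

# Stack the histograms so their spreads can be compared directly
old_par = par(mfrow = c(2, 1))
hist(sim_means10,  breaks = 40, xlim = shared_range, main = "n = 10")
hist(sim_means100, breaks = 40, xlim = shared_range, main = "n = 100")
par(old_par)   # restore the previous plot layout
```

Since `histnorm()` returns a ggplot object, a similar effect should be possible there by adding `+ coord_cartesian(xlim = shared_range)` to each call.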

## Using the Central Limit Theorem (CLT)


By the CLT, for samples of size $n=100$, the sampling distribution of the sample mean $\overline{X}$ is roughly normal
with mean $\mu=1499.69$ and SD $\sigma/\sqrt{n}=505.5089/\sqrt{100}=50.55089$,
where $\mu=1499.69$ sq. ft. is the **population** mean and $\sigma=505.5089$ sq. ft. is the **population** SD, found as follows.

```{r}
mean(area)
sd(area)
```

We can find $P(\overline{X}<1450)$ to be 
```{r}
pnorm(1450, 1499.69, 505.5089/sqrt(100))
```
That is, when drawing a random sample of size 100, 
the sample mean will be below 1450 about 16% of the time.
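
The same probability can be obtained by standardizing first: convert 1450 to a z-score and look it up on the standard normal curve. The sketch below uses the values of $\mu$ and $\sigma$ quoted above. The names `mu`, `sigma`, `n`, `se`, and `z` are introduced here only for illustration.

```{r}
mu = 1499.69       # population mean of `area`, as quoted above
sigma = 505.5089   # population SD of `area`, as quoted above
n = 100            # sample size

se = sigma / sqrt(n)     # SD of the sampling distribution of the mean
z = (1450 - mu) / se     # z-score of 1450
z
pnorm(z)                 # standard normal probability below z
pnorm(1450, mu, se)      # same value, without standardizing by hand
```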

The actual proportion of the 5000 sample means that are below 1450:
```{r}
table(sample_means100 < 1450)
table(sample_means100 < 1450)/5000
```

Is the proportion close to 16%?
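
Because R treats `TRUE` as 1 and `FALSE` as 0, the mean of a logical vector is the proportion of `TRUE` values, which gives a proportion like the one above in a single step. A small illustration with made-up numbers (`toy_means` is hypothetical):

```{r}
# Hypothetical sample means, for illustration only
toy_means = c(1420, 1510, 1390, 1475, 1530, 1448, 1602, 1455)

toy_means < 1450          # logical vector: TRUE where the mean is below 1450
sum(toy_means < 1450)     # number of TRUEs
mean(toy_means < 1450)    # proportion of TRUEs
```

Applied to the simulation, `mean(sample_means100 < 1450)` would give the same proportion as the `table()` call divided by 5000.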

* * *
## On Your Own

So far, we have only focused on estimating the mean living area in homes in 
Ames.  Now you'll try to estimate the mean home price.

-   Make a histogram of `price`, which shows the population distribution of the sale price of all 2930 homes in the data. Comment on the shape of the histogram.
-   Find the mean and SD of `price`, 
which are the population mean and the population SD.
-   Take a random sample of size 25 from `price`. Find the sample mean, and compare it with the population mean you found in the previous part.
-   Since you have access to the population, simulate the sampling 
    distribution for $\bar{x}_{price}$ by taking 5000 samples from the 
    population of size 25 and computing 5000 sample means.  Store these means 
    in a vector called `sample_means25`. Make a histogram of the 5000 sample means. 
-   Repeat the previous part but change the sample size from 25 to 4. Store the 5000 sample means in a vector called `sample_means4`.
-   Repeat the previous part again but change the sample size to 100. Store the 5000 sample means in a vector called `sample_means100`.
-   Compare the center, spread, and shape of the three sampling distributions from the previous three parts.
How do the center, spread, and shape of the sampling distributions change with the sample size?
-   Theoretically, the mean and the SD of the sampling distribution are respectively
$\mu$ and $\sigma/\sqrt{n}$, where $\mu$ and
$\sigma$ are the population mean and the population SD found earlier (the second item in this list), and $n$ is the sample size. 

Compute the means and SDs for the 5000 sample means as follows.
```{r}
mean(sample_means4)
mean(sample_means25)
mean(sample_means100)
sd(sample_means4)
sd(sample_means25)
sd(sample_means100)
```
Are they close to their respective theoretical values
$\mu$ and $\sigma/\sqrt{n}$?
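
The theoretical SDs for several sample sizes can be computed in one vectorized step. The sketch below does so for a simulated population (`sim_price` is a hypothetical stand-in, not the Ames `price` data) and places $\sigma/\sqrt{n}$ next to the SD of simulated sample means. It uses 2000 repetitions only to keep the chunk quick.

```{r}
# Hypothetical right-skewed population standing in for `price`
set.seed(220)
sim_price = rlnorm(3000, meanlog = 12, sdlog = 0.4)

n_values = c(4, 25, 100)
theory_sd = sd(sim_price) / sqrt(n_values)   # sigma / sqrt(n) for each n

# SD of 2000 simulated sample means for each sample size
sim_sd = sapply(n_values, function(n) {
  sd(replicate(2000, mean(sample(sim_price, n))))
})

data.frame(n = n_values, theory_sd, sim_sd)
```

Because `sample()` draws without replacement from a finite population, the simulated SD tends to fall slightly below $\sigma/\sqrt{n}$ when $n$ is a noticeable fraction of the population size.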


-    Use the CLT to find the (approximate) 
probability of getting a sample mean below \$170,000
when the sample size is 100.
-    Find the percentage of the 5000 sample means in
`sample_means100` that are below \$170,000. 
Is the percentage close to the probability computed in the previous part using the CLT?
-    Use the CLT to find the (approximate) 
probability of getting a sample mean between \$130,000
and \$190,000 when the sample size is 4.
-    Find the percentage of the 5000 sample means in
`sample_means4` that are between \$130,000
and \$190,000.
```{r}
# Count, then proportion, of the sample means (n = 4) between 130000 and 190000
table(sample_means4 < 190000 & sample_means4 > 130000)
table(sample_means4 < 190000 & sample_means4 > 130000)/5000
```
Is the percentage close to the probability computed in the previous part using CLT?
Does the CLT work well when the sample size is only 4?
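
For an interval, the CLT probability is the difference of two `pnorm()` values: the probability below the upper limit minus the probability below the lower limit. A sketch with made-up numbers (`mu_demo`, `sigma_demo`, and `n_demo` are hypothetical, not the Ames values):

```{r}
mu_demo = 100      # hypothetical population mean
sigma_demo = 20    # hypothetical population SD
n_demo = 16        # hypothetical sample size

se_demo = sigma_demo / sqrt(n_demo)   # 20 / 4 = 5

# P(90 < sample mean < 110) = P(mean < 110) - P(mean < 90)
p_between = pnorm(110, mu_demo, se_demo) - pnorm(90, mu_demo, se_demo)
p_between
```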

<div id="license">
This lab was expanded for STAT 220 by Yibi Huang from a lab released by OpenIntro, written by Andrew Bray and Mine &Ccedil;etinkaya-Rundel, which was originally based on a lab written by Mark Hansen of UCLA Statistics.

This lab can be shared or edited under a 
[Creative Commons Attribution-ShareAlike 3.0 Unported licence](http://creativecommons.org/licenses/by-sa/3.0). 
</div>