The goal of this lab is to introduce you to R, RStudio, and R Markdown, which you’ll be using throughout the course both to learn the statistical concepts discussed in the textbook and also to analyze real data and come to informed conclusions. To clarify which is which:
This lab document is written in R Markdown. It allows you to integrate R code, its output, and explanatory text in a single document.
Download the installation file here. You will need to choose a location that is close to you (this affects the download speed). Please install the correct version for your operating system. You should agree to all of the installation defaults (unless you already know how to customize R).
RStudio provides a friendlier working environment for R.
Download RStudio at https://posit.co/download/rstudio-desktop/ (open a new tab and copy/paste the link). Select the Desktop Version installer for your operating system and then install it.
We recommend always starting up the RStudio program instead of working in R directly.
The RStudio interface consists of several windows (see Figure 1 below).
Bottom left: command window. Here you can type simple commands after the “>” prompt and R will then execute your command. This is the most important window, because this is where R actually does stuff.
Top left: script window. Collections of commands (scripts) can be edited and saved. If you don’t see this window, you can open it with [File] – [New File] – [R Script]. Just typing a command in the script window is not enough; it has to get into the command window before R executes it. If you want to run a line from the script window (or the whole script), you can click Run or press CTRL+ENTER to send it to the command window.
Top right: workspace / history window. In the workspace window you can see which data and values R has in its memory. You can view and edit the values by clicking on them. The history window shows what has been typed before.
Bottom right: files / plots / packages / help window. Here you can open files, view plots (also previous plots), install and load packages or use the help function.
You can change the size of the windows by dragging the grey bars between the windows.
Now that you have R and RStudio installed, let’s create your first R Markdown document:
You’ll see a new document with some example content.
Let’s walk through what you’re seeing:
At the top of the document, you’ll see something like this:
---
title: "My First R Markdown"
output: html_document
date: "2024-10-02"
---
This is called the YAML header. It specifies metadata about your document, including the title, output format, and date.
Right after the YAML header, you’ll see a code chunk that looks like this:
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```
This is a special setup chunk that sets options for the entire document. The include=FALSE option means this chunk won’t appear in the final document.
You can leave the setup chunk as is.
Below the setup chunk, you’ll see some example text:
## R Markdown
This is an R Markdown document. Markdown is a simple formatting syntax for authoring HTML, PDF, and MS Word documents. For more details on using R Markdown see <http://rmarkdown.rstudio.com>.
When you click the **Knit** button a document will be generated that includes both content as well as the output of any embedded R code chunks within the document. You can embed an R code chunk like this:
This shows you how to create headers (using ##), links, and bold text (using **) in Markdown.
You’ll see a code chunk that looks like this:
```{r cars}
summary(cars)
```
This is an example of how to embed R code in your document. When you knit the document, this code will be executed and the results will be included in the output.
You’ll see another code chunk that creates a plot:
```{r pressure, echo=FALSE}
plot(pressure)
```
This demonstrates how you can create and include plots in your R Markdown document. The echo=FALSE option means that the R code itself won’t be shown in the final document, only the resulting plot. The word ‘pressure’ is just a name that was made up for this chunk. You don’t need to name your chunks.
Now, let’s modify this document slightly:
Add the header ## My First Plot after the existing content, followed by a new code chunk, so that the end of your document looks like this:
## My First Plot
```{r first-plot}
plot(cars)
```
(You can create a new code chunk in several ways. One way is to type or paste in the three backticks followed by {r}, type your code, and then end the chunk with three more backticks. When you copy and paste code like what you see above, that already includes the three backticks for you.
You can also click in the document, then click the “Insert Chunk” button in the upper right corner of the script window, and choose the option “R” from the menu. This will insert a new chunk into your document, writing the backticks for you automatically.)
You will want to create a directory on your computer for all your work in this class, perhaps called “Stat 220”. Then, inside that directory, create a subdirectory called “Lab 0”. You are going to save your R Markdown file into that subdirectory.
Save your document by choosing File > Save, and selecting the “Lab 0” directory. Give the file whatever name you want, perhaps “Lab0”.
Before you render your document to PDF, you need to make sure you have a LaTeX distribution installed on your computer. You will only need to do this once on your computer.
R Markdown uses LaTeX to create PDF documents. We recommend using TinyTeX, a lightweight LaTeX distribution that’s easy to install and manage from within R.
To install TinyTeX, follow these steps. Because this is something you only need to do once, you should type these commands into the R console (bottom left panel) instead of into your R Markdown document. (This is because commands that you enter into the R Markdown document will get run every time you render your document into HTML or PDF, and you don’t want to run these commands over and over again!)
Type the following commands into the console, pressing Enter after each one:
install.packages('tinytex')
tinytex::install_tinytex()
You will see a lot of output as the installation proceeds. If you see an error message, don’t worry—just try again. Sometimes the installation can fail if there are network issues, but retrying usually works. This may take a few minutes to complete.
Note for Mac users: During the installation process, you may see a dialog box asking for permission to modify files. Click “Yes” or “Allow” to proceed with the installation.
Once it’s done, you’ll have TinyTeX installed and ready to use.
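If you would like to confirm that the installation worked, one possible check is sketched below. It assumes the tinytex package itself installed without errors, and like the installation commands it belongs in the console rather than in your R Markdown document.

```r
# Check that the tinytex R package is available before calling it.
if (requireNamespace("tinytex", quietly = TRUE)) {
  # is_tinytex() reports whether the LaTeX distribution in use is TinyTeX.
  tinytex_ready <- tinytex::is_tinytex()
} else {
  # The R package is missing, so TinyTeX cannot have been set up through it.
  tinytex_ready <- FALSE
}
tinytex_ready
```

A result of TRUE suggests you are ready to knit to PDF. If you see FALSE, try the installation commands again.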
Now, let’s convert (or “knit”) your R Markdown document to a PDF:
Congratulations! You’ve just created and knitted your first R Markdown document. Take a moment to look at the PDF and see how the markdown text and R code have been rendered.
Sometimes, you may encounter issues when trying to knit directly to PDF. In such cases, knitting to HTML and then printing to PDF can be a useful alternative. Here’s how to do it:
After knitting to HTML, you can open the file in an external browser for better viewing:
Once your HTML file is open in a web browser, you can easily create a PDF:
This method often results in a well-formatted PDF, even when direct PDF knitting encounters problems. It’s a good fallback option to ensure you can always produce a PDF version of your work.
Remember, for most purposes, the HTML output is perfectly suitable for viewing and sharing your work. Only convert to PDF when necessary for submission or if specifically required.
Installing the tidyverse and lattice Libraries
We’ll use the tidyverse and lattice packages in this course. Again, this is something that you only need to do once on your computer. Install them by typing the following code in your R console. (If you can’t find your R console, it may be hiding behind another tab in the bottom left corner of RStudio.) Press Enter after each line:
install.packages("tidyverse")
install.packages("lattice")
If you see a message that asks if you want to install from source when necessary, you should type “yes” and press Enter. On a Mac, you may see a dialog box asking if you’d like to install command line tools. You should click “yes” and wait for the installation to complete, and then restart RStudio.
You will know that the packages are installed when RStudio says something like “The downloaded binary packages are in” and then gives a long location, like /var/folders….
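If you are unsure whether both packages ended up installed, a quick console check along the following lines may help. This is only a sketch; the variable names course_packages and is_installed are made up for the illustration.

```r
# Names of the packages this course relies on.
course_packages <- c("tidyverse", "lattice")

# requireNamespace() returns TRUE when a package is installed and can be
# loaded, and FALSE otherwise. It does not attach the package the way
# library() does.
is_installed <- vapply(course_packages, requireNamespace,
                       FUN.VALUE = logical(1), quietly = TRUE)
is_installed
```

Any package that shows FALSE needs to be installed again with install.packages().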
Loading the tidyverse and lattice Libraries
Once you have installed the packages, you need to tell R that you’d like to use them. You’ll need to do this every time you start a new R Markdown document. You load the libraries by typing the following code in a code chunk early in your R Markdown document:
```{r load-packages, message=FALSE}
library(tidyverse)
library(lattice)
```
We include the message=FALSE option to suppress annoying messages that R sometimes gives when you load a package.
If you’ve made it this far, congratulations! Now is a good time to take a break!
The Arbuthnot data set we are going to work on today comes from Dr. John Arbuthnot, an 18th century physician, writer, and mathematician. He was interested in the ratio of newborn boys to newborn girls, so he gathered the baptism records for children born in London for every year from 1629 to 1710.
The data are available at [https://www.openintro.org/data/tab-delimited/arbuthnot.txt]. You can download this file by right clicking and selecting “Save Link As…”, or by clicking on the link and then choosing to save the file, giving it the name “arbuthnot.txt”. You should save this data file into the same directory as your R Markdown file, the “Lab 0” directory inside your “Stat 220” directory. That way, R will know where to find it.
From here on out, just for brevity, I won’t write the backticks at the beginnings and ends of chunks. You should still create chunks for your code in your R Markdown file. Run each chunk after you type it in, to make sure it works.
We load the data into R by using the read.table function.
arbuthnot <- read.table("arbuthnot.txt", header = TRUE)
The header = TRUE option tells R that the first row of the data file contains the names of the variables, rather than data.
If something went wrong, you’ll see an error message below the chunk.
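To see what the header option does, you can experiment with a tiny stand-in for the data file. The sketch below passes text directly to read.table through its text argument instead of reading a file. The string sample_text is made up for this illustration and holds only the first two years of the Arbuthnot data.

```r
# A stand-in for a tab-delimited file: one header row and two data rows.
sample_text <- "year\tboys\tgirls\n1629\t5218\t4683\n1630\t4858\t4457"

# With header = TRUE the first row supplies the column names.
with_header <- read.table(text = sample_text, header = TRUE)
names(with_header)
nrow(with_header)

# With header = FALSE the first row is treated as data, and R makes up
# the column names V1, V2, V3.
without_header <- read.table(text = sample_text, header = FALSE)
names(without_header)
nrow(without_header)
```

Notice that the second version has one extra row, because the variable names were read in as if they were data.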
You can also click on the “Environment” tab in RStudio to see the data frame you just created. It will have the name “arbuthnot”. If you click on the disclosure triangle next to it to expand it, you’ll see the individual variables that are part of the dataset.
The name in R for a dataset like this, which is a collection of variables observed on a set of individuals, is “data frame”.
Let’s take a look at the first few rows of the data:
head(arbuthnot)
##   year boys girls
## 1 1629 5218  4683
## 2 1630 4858  4457
## 3 1631 4422  4102
## 4 1632 4994  4590
## 5 1633 5158  4839
## 6 1634 5035  4820
We can see the dimensions of this data frame:
dim(arbuthnot)
## [1] 82 3
And the names of the columns:
names(arbuthnot)
## [1] "year" "boys" "girls"
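A few related commands give the same information in other forms. The sketch below builds a small data frame by hand from the first three rows shown above, so that it runs on its own. The name small is made up for this example, and you can substitute arbuthnot once your data are loaded.

```r
# A hand-typed data frame holding the first three rows of the data.
small <- data.frame(
  year  = c(1629, 1630, 1631),
  boys  = c(5218, 4858, 4422),
  girls = c(4683, 4457, 4102)
)

nrow(small)  # number of rows (observations)
ncol(small)  # number of columns (variables)
str(small)   # compact summary: each variable, its type, and its first values
```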
We can access a single variable of a data frame using the dollar-sign $ notation. For example, the command below
arbuthnot$boys
## [1] 5218 4858 4422 4994 5158 5035 5106 4917 4703 5359 5366 5518 5470 5460 4793
## [16] 4107 4047 3768 3796 3363 3079 2890 3231 3220 3196 3441 3655 3668 3396 3157
## [31] 3209 3724 4748 5216 5411 6041 5114 4678 5616 6073 6506 6278 6449 6443 6073
## [46] 6113 6058 6552 6423 6568 6247 6548 6822 6909 7577 7575 7484 7575 7737 7487
## [61] 7604 7909 7662 7602 7676 6985 7263 7632 8062 8426 7911 7578 8102 8031 7765
## [76] 6113 8366 7952 8379 8239 7840 7640
will return the boys variable in the data frame arbuthnot, which shows the number of boys baptized each year. Likewise arbuthnot$girls will return just the counts of girls baptized each year.
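The dollar sign is not the only way to pull out a variable by name. As a small illustration, using a made-up three-row data frame called small so the code runs on its own, double square brackets with the variable name in quotes return the same values.

```r
# A hand-typed data frame holding the first three rows of the data.
small <- data.frame(
  year  = c(1629, 1630, 1631),
  boys  = c(5218, 4858, 4422),
  girls = c(4683, 4457, 4102)
)

small$boys        # dollar-sign notation
small[["boys"]]   # the same variable, selected by its quoted name

# identical() confirms that the two results match exactly.
identical(small$boys, small[["boys"]])
```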
Notice that the way R has printed these data is different. When we looked at the data frame, each row appeared on its own line of the display. These data are no longer structured in a table with other variables, so they are displayed one right after another. Objects that print out in this way are called vectors; they represent a set of numbers. R has added numbers in [brackets] along the left side of the printout to indicate locations within the vector. For example, 5218 follows [1], indicating that 5218 is the first entry in the vector. The second line starts with [16], which means that the first number in the second line, 4107, is the 16th entry in the vector, and so on for the rest of the output.
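You can check a position in a vector yourself by putting the position in square brackets after the vector's name. The sketch below uses a short made-up vector, boys_start, containing the first six counts from the output above.

```r
# The first six entries of the boys variable, typed in by hand.
boys_start <- c(5218, 4858, 4422, 4994, 5158, 5035)

length(boys_start)  # how many entries the vector has
boys_start[1]       # the first entry
boys_start[6]       # the sixth entry
```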
To see the element in row 56 and column 3 (in this case, girls) of the table, you can type
arbuthnot[56,3]
To see the first 10 numbers in the third column, you can type
arbuthnot[1:10,3]
In this expression, we have asked just for rows in the range 1 through 10. R uses the : to create a range of values, so 1:10 expands to 1, 2, 3, 4, 5, 6, 7, 8, 9, 10. You can see this by entering
1:10
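The colon works for any whole-number range, not just ranges that start at 1. A few variations you might try in the console are sketched below. The variable names are made up for the example.

```r
one_to_ten <- 1:10        # counts up from 1 to 10
same_range <- seq_len(10) # another way to write 1 through 10
ten_to_one <- 10:1        # counts down when the first number is larger
middle     <- 3:7         # a range that does not start at 1

identical(one_to_ten, same_range)  # both give exactly the same vector
length(middle)                     # 3, 4, 5, 6, 7 is five values
```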
Finally, if you want all of the data for the first 10 years, type
arbuthnot[1:10,]
By leaving out an index or a range (we didn’t type anything between the comma and the square bracket), we get all the columns. When starting out in R, this is a bit counterintuitive. As a rule, we omit the column number to see all columns in a data frame. Similarly, if we leave out an index or range for the rows, we access all the observations, not just the 56th, or rows 1 through 10. Try the following to see the 3rd variable for all 82 years:
arbuthnot[,3]
Recall that column 3 represents the number of girls baptized in that year, so the command above should output the same list of numbers as arbuthnot$girls. We see the number of girls baptized for the 56th year by typing
arbuthnot$girls[56]
Similarly, for just the first 10 years
arbuthnot$girls[1:10]
The command above returns the same result as the arbuthnot[1:10,3] command. Both row-and-column notation and dollar-sign notation are widely used; which one you choose depends on your personal preference.
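If you want to convince yourself that the notations agree, you can ask R to compare the results. The sketch below uses a made-up three-row data frame called small so that it runs on its own. It also shows a third option, giving the column's name in quotes instead of its number.

```r
# A hand-typed data frame holding the first three rows of the data.
small <- data.frame(
  year  = c(1629, 1630, 1631),
  boys  = c(5218, 4858, 4422),
  girls = c(4683, 4457, 4102)
)

by_position <- small[1:2, 3]        # rows 1 to 2 of column 3
by_dollar   <- small$girls[1:2]     # first two entries of the girls variable
by_name     <- small[1:2, "girls"]  # rows 1 to 2 of the column named girls

# identical() returns TRUE when two objects match exactly.
identical(by_position, by_dollar)
identical(by_position, by_name)
```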
Now, let’s create a plot of the number of girls baptized per year:
qplot(year, girls, data = arbuthnot)
## Warning: `qplot()` was deprecated in ggplot2 3.4.0.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
How does that function, qplot, work? You can see the documentation for qplot by typing ?qplot in the console and then pressing Enter.
(You can also copy and paste a line of code into an AI chatbot like ChatGPT and ask it to explain the code.)
We can add lines to connect the data points:
qplot(year, girls, data = arbuthnot, geom = "line",
ylab = "Number of Baby Girls Baptized", xlab = "Year")
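Because the warning above says that qplot() is deprecated, you may eventually want to write the same plot with ggplot(), the main plotting function in the ggplot2 package (part of the tidyverse). The following is one possible equivalent, not the only one. It uses a small hand-typed data frame, first_years, containing the first six years of the data so that it runs on its own. Replace first_years with arbuthnot to plot the full data set.

```r
library(ggplot2)  # also loaded when you run library(tidyverse)

# The first six years of the girls counts, typed in by hand.
first_years <- data.frame(
  year  = 1629:1634,
  girls = c(4683, 4457, 4102, 4590, 4839, 4820)
)

# aes() says which variables go on which axis, geom_line() draws the
# connecting lines, and labs() sets the axis labels.
girls_plot <- ggplot(first_years, aes(x = year, y = girls)) +
  geom_line() +
  labs(x = "Year", y = "Number of Baby Girls Baptized")
girls_plot
```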
Now, suppose you are interested in the ratio of the number of boys to the number of girls baptized in 1629. To compute this, you could use the fact that R is really just a big calculator. You can type in mathematical expressions like
5218 / 4683
to see the boys/girls ratio in 1629. We could repeat this once for each year, but there is a faster way. If we take the ratio of the vector for baptisms for boys and girls, R will compute all ratios simultaneously.
arbuthnot$boys/arbuthnot$girls
You can also use with to avoid repeatedly typing the name of the data frame. with instructs R to interpret everything else from within the data frame that you specify.
with(arbuthnot, boys / girls)
What you will see are 82 numbers, each one representing the ratio we’re after. Take a look at a few of them and verify that they are right. We can then make a plot of the boy-to-girl ratio per year with the command
qplot(year, boys / girls, data=arbuthnot, geom = "line", ylab="Boys/Girls Ratio")
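To see how dividing one vector by another works, it can help to try it on just a few values. The sketch below uses two short made-up vectors, boys and girls, holding the counts for the first three years.

```r
# Counts for 1629, 1630, and 1631, typed in by hand.
boys  <- c(5218, 4858, 4422)
girls <- c(4683, 4457, 4102)

# R divides the first entry by the first entry, the second by the
# second, and so on, returning a vector of the same length.
ratio <- boys / girls
ratio

# The first ratio matches the calculation done by hand for 1629.
ratio[1] == 5218 / 4683
```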
Similarly to how we computed the ratio of boys to girls, we can compute the proportion of newborns that are boys in 1629 with
5218 / (4683+5218)
or we can act on the complete vectors with the expression
with(arbuthnot, boys/(boys + girls))
Note that with R as with your calculator, you need to be conscious of the order of operations. Here, we want to divide the number of boys by the total number of newborns, so we have to use parentheses. Without them, R will first do the division, then the addition, giving you something that is not a proportion.
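You can see the effect of the parentheses by running both versions with the 1629 counts. The variable names below are made up for the comparison.

```r
# Without parentheses, R divides first and then adds 5218 to the result.
without_parens <- 5218 / 4683 + 5218
without_parens  # about 5219.11, which cannot be a proportion

# With parentheses, R adds first and then divides by the total.
with_parens <- 5218 / (4683 + 5218)
with_parens     # about 0.527, a value between 0 and 1
```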
Now, we make a plot of the proportion of boys over time. What do you see?
qplot(year, boys/(boys + girls), data=arbuthnot, geom = "line", ylab="Proportion of Boys")
To end your R Markdown session, simply save your .Rmd file. The next time you open it, all your work will be there, ready for you to continue or to knit into a final document.
In the previous few pages, you recreated some of the displays and preliminary analysis of Arbuthnot’s baptism data. Your assignment involves repeating these steps, but for present-day birth records in the United States. The “present” data set can be downloaded at [https://www.openintro.org/data/tab-delimited/present.txt].
What years are included in this data set? What are the dimensions of the data frame and what are the variable or column names?
Make a plot that displays the boy-to-girl ratio for every year in the data set. What do you see? Does Arbuthnot’s observation about boys being born in greater proportion than girls hold up in the U.S.? Include the plot in your response.
In what year did we see the largest total number of births in the U.S.? You can refer to the help files or the R reference card http://cran.r-project.org/doc/contrib/Short-refcard.pdf to find helpful commands.