• Introduction
  • Before We Start
  • Creating Your First R Markdown Document
  • Creating a PDF (or HTML) document from your R Markdown file
  • Installing and using your first R packages
  • Analyzing Data
  • Ending
  • On Your Own

Introduction

The goal of this lab is to introduce you to R, RStudio, and R Markdown, which you’ll be using throughout the course both to learn the statistical concepts discussed in the textbook and also to analyze real data and come to informed conclusions. To clarify which is which:

  • R is the name of the programming language itself
  • RStudio is a convenient interface for R
  • R Markdown is a file format for making dynamic documents with R

This lab document is written in R Markdown. It allows you to integrate R code, its output, and explanatory text in a single document.

Before We Start

Install R

Download the installation file here. You will need to choose a location that is close to you (this affects the download speed). Please install the correct version for your operating system. You should agree to all of the installation defaults (unless you already know how to customize R).

Install RStudio

RStudio provides a friendlier working environment for R.

Download RStudio Desktop at https://posit.co/download/rstudio-desktop/ (open a new tab and copy/paste the link). Select the installer for your operating system and then install it.

We recommend always starting up the RStudio program instead of working in R directly.

RStudio Layout

The RStudio interface consists of several windows (see Figure 1 below).

Figure 1
  • Bottom left: command window (the R console). Here you can type simple commands after the “>” prompt and R will then execute them. This is the most important window, because this is where R actually runs your commands.

  • Top left: script window. Collections of commands (scripts) can be edited and saved here. If you don’t see this window, you can open it with File > New File > R Script. Just typing a command in the script window is not enough; it has to be sent to the command window before R executes it. If you want to run a line from the script window (or the whole script), you can click Run or press CTRL+ENTER (Cmd+Return on a Mac) to send it to the command window.

  • Top right: workspace / history window. In the workspace window you can see which data and values R has in its memory. You can view and edit the values by clicking on them. The history window shows what has been typed before.

  • Bottom right: files / plots / packages / help window. Here you can open files, view plots (also previous plots), install and load packages or use the help function.

You can change the size of the windows by dragging the grey bars between the windows.
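As a small illustration of the command window described above, these are the kinds of commands you can type after the “>” prompt, one at a time. The name x is just a made-up example; any valid name would work.

```r
# Simple arithmetic: R prints the result right away
1 + 2

# Store a value under a name (x is an arbitrary, made-up name)
x <- 5

# Use the stored value in a later command
x * 2
```

After the second line runs, x should also show up in the workspace window in the top right.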

Creating Your First R Markdown Document

Now that you have R and RStudio installed, let’s create your first R Markdown document:

  1. Open RStudio
  2. Click on File > New File > R Markdown
  3. In the dialog box that appears, enter a title for your document (e.g., “My First R Markdown”)
  4. Keep the default output format as HTML
  5. Click “OK”

You’ll see a new document with some example content.

Parts of an R Markdown Document

Let’s walk through what you’re seeing:

YAML Header

At the top of the document, you’ll see something like this:

---
title: "My First R Markdown"
output: html_document
date: "2024-10-02"
---

This is called the YAML header. It specifies metadata about your document, including the title, output format, and date.

Setup Chunk

Right after the YAML header, you’ll see a code chunk that looks like this:

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```

This is a special setup chunk that sets options for the entire document. The include=FALSE option means this chunk won’t appear in the final document.

You can leave the setup chunk as is.
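If you later want different defaults for the whole document, the same function accepts other chunk options. The line below is only a sketch of one possible setup, assuming you want to show code but hide package messages everywhere; it is not something this lab requires.

```r
# Document-wide defaults for every chunk that follows.
# echo = TRUE     : show the R code in the knitted output
# message = FALSE : hide informational messages (for example, from loading packages)
knitr::opts_chunk$set(echo = TRUE, message = FALSE)
```

Options given in an individual chunk header still override these defaults for that chunk.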

Text and Markdown Formatting

Below the setup chunk, you’ll see some example text:

## R Markdown

This is an R Markdown document. Markdown is a simple formatting syntax for authoring HTML, PDF, and MS Word documents. For more details on using R Markdown see <http://rmarkdown.rstudio.com>.

When you click the **Knit** button a document will be generated that includes both content as well as the output of any embedded R code chunks within the document. You can embed an R code chunk like this:

This shows you how to create headers (using ##), links, and bold text (using **) in Markdown.

Code Chunks

You’ll see a code chunk that looks like this:

```{r cars}
summary(cars)
```

This is an example of how to embed R code in your document. When you knit the document, this code will be executed and the results will be included in the output.

Plots

You’ll see another code chunk that creates a plot:

```{r pressure, echo=FALSE}
plot(pressure)
```

This demonstrates how you can create and include plots in your R Markdown document. The echo=FALSE option means that the R code itself won’t be shown in the final document, only the resulting plot. The word ‘pressure’ is just a name that was made up for this chunk. You don’t need to name your chunks.

Now, let’s modify this document slightly:

  1. Replace the title with “Lab 0: My First R Markdown Document”
  2. Add your name as the author in the YAML header
  3. Change the date to today’s date
  4. Add a new section by typing ## My First Plot after the existing content
  5. Below this new heading, add a code chunk, so that the new section looks like this:
   ## My First Plot
   ```{r first-plot}
   plot(cars)
   ```

(You can create a new code chunk in several ways. One way is to type or paste in the three backticks followed by {r}, type your code, and then end the chunk with three more backticks. When you copy and paste code like what you see above, that already includes the three backticks for you.

You can also click in the document, then click the “Insert Chunk” button in the upper right corner of the script window, and choose the option “R” from the menu. This will insert a new chunk into your document, writing the backticks for you automatically.)

Saving your document

You will want to create a directory on your computer for all your work in this class, perhaps called “Stat 220”. Then, inside that directory, create a subdirectory called “Lab 0”. You are going to save your R Markdown file into that subdirectory.

Save your document by choosing File > Save, and selecting the “Lab 0” directory. Give the file whatever name you want, perhaps “Lab0”.

Creating a PDF (or HTML) document from your R Markdown file

Installing TinyTeX for PDF Output

Before you render your document to PDF, you need to make sure you have a LaTeX distribution installed on your computer. You will only need to do this once on your computer.

R Markdown uses LaTeX to create PDF documents. We recommend using TinyTeX, a lightweight LaTeX distribution that’s easy to install and manage from within R.

To install TinyTeX, follow these steps. Because this is something you only need to do once, you should type these commands into the R console (bottom left panel) instead of into your R Markdown document. (This is because commands that you enter into the R Markdown document will get run every time you render your document into HTML or PDF, and you don’t want to run these commands over and over again!)

Type the following commands into the console, pressing Enter after each one:

install.packages('tinytex')

tinytex::install_tinytex()

You will see a lot of output as the installation proceeds, and it may take a few minutes to complete. If you see an error message, don’t worry; just try again. The installation sometimes fails because of network issues, and retrying usually works.

Note for Mac users: During the installation process, you may see a dialog box asking for permission to modify files. Click “Yes” or “Allow” to proceed with the installation.

Once it’s done, you’ll have TinyTeX installed and ready to use.
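If you want to confirm that the installation worked, one option is to ask the tinytex package directly from the console. This is a minimal sketch; the variable name installed is made up for illustration.

```r
# Check whether the tinytex package is available, and if so,
# whether the LaTeX distribution R finds is TinyTeX.
if (requireNamespace("tinytex", quietly = TRUE)) {
  installed <- tinytex::is_tinytex()
} else {
  installed <- FALSE
}

# TRUE means TinyTeX is installed and is the LaTeX distribution in use
installed
```

If this prints FALSE, try restarting RStudio before repeating the installation commands.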

Knitting to PDF

Now, let’s convert (or “knit”) your R Markdown document to a PDF:

  1. Look for the “Knit” button at the top of the RStudio script editor
  2. Click on the small arrow next to “Knit” and select “Knit to PDF”
  3. RStudio will process your document and create a PDF output. It should save the PDF in the same directory as your R Markdown file.
  4. Once complete, the PDF will open in a new window

Congratulations! You’ve just created and knitted your first R Markdown document. Take a moment to look at the PDF and see how the markdown text and R code have been rendered.
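The Knit button has a console equivalent, which can be handy if the button is not responding. The sketch below assumes your file is named Lab0.Rmd and sits in R’s current working directory; change the name to match whatever you chose when saving.

```r
# Hypothetical file name: replace with the name you gave your document
rmd_file <- "Lab0.Rmd"

if (file.exists(rmd_file)) {
  # Render the document to PDF, as "Knit to PDF" does
  rmarkdown::render(rmd_file, output_format = "pdf_document")
} else {
  message("File not found in the working directory: ", rmd_file)
}
```

Because this is a one-off command, type it into the console rather than putting it in the R Markdown document itself.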

Knitting to HTML and Printing to PDF

Sometimes, you may encounter issues when trying to knit directly to PDF. In such cases, knitting to HTML and then printing to PDF can be a useful alternative. Here’s how to do it:

Knitting to HTML

  1. In RStudio, click on the “Knit” button at the top of the editor.
  2. Select “Knit to HTML” from the dropdown menu.
  3. RStudio will process your document and create an HTML file.

Opening the HTML File in an External Browser

After knitting to HTML, you can open the file in an external browser for better viewing:

  1. In the “Viewer” pane (usually in the bottom right of RStudio), you’ll see your rendered HTML document.
  2. Look for a small “Show in new window” icon in the top left corner of the Viewer pane (it looks like a pop-out window).
  3. Click this icon, and your default web browser will open the HTML file.

Printing to PDF from the Browser

Once your HTML file is open in a web browser, you can easily create a PDF:

  1. In your web browser, go to File > Print (or use the keyboard shortcut Ctrl+P on Windows/Linux or Cmd+P on Mac).
  2. In the print dialog, look for an option to “Save as PDF” or “Print to PDF”.
  3. Choose a location to save your PDF file and click “Save” or “Print”.

This method often results in a well-formatted PDF, even when direct PDF knitting encounters problems. It’s a good fallback option to ensure you can always produce a PDF version of your work.

Remember, for most purposes, the HTML output is perfectly suitable for viewing and sharing your work. Only convert to PDF when necessary for submission or if specifically required.

Installing and using your first R packages

Installing the tidyverse and lattice Libraries

We’ll use the tidyverse and lattice packages in this course. Again, this is something that you only need to do once on your computer. Install them by typing the following code in your R console. (If you can’t find your R console, it may be hiding behind another tab in the bottom left corner of RStudio.) Press Enter after each line:

install.packages("tidyverse")

install.packages("lattice")

If you see a message that asks if you want to install from source when necessary, you should type “yes” and press Enter. On a Mac, you may see a dialog box asking if you’d like to install command line tools. You should click “yes” and wait for the installation to complete, and then restart RStudio.

You will know that the packages are installed when RStudio says something like “The downloaded binary packages are in” and then gives a long location, like /var/folders….
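If you are unsure whether the installation succeeded, you can ask R which of the two packages it can find. This is a small sketch to run in the console; the variable names needed and is_installed are made up for illustration.

```r
# The two packages this course uses
needed <- c("tidyverse", "lattice")

# For each package, check whether R can find it (TRUE) or not (FALSE)
is_installed <- vapply(needed, requireNamespace, logical(1), quietly = TRUE)
is_installed

# Names of any packages that still need to be installed
needed[!is_installed]
```

If the last line prints character(0), both packages are available.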

Loading the tidyverse and lattice Libraries

Once you have installed the packages, you need to tell R that you’d like to use them. You’ll need to do this every time you start a new R Markdown document. You load the libraries by typing the following code in a code chunk early in your R Markdown document:

```{r load-packages, message=FALSE}
library(tidyverse)
library(lattice)
```

We include the message=FALSE option to suppress annoying messages that R sometimes gives when you load a package.

If you’ve made it this far, congratulations! Now is a good time to take a break!

Analyzing Data

Downloading a dataset

The Arbuthnot data set we are going to work on today comes from Dr. John Arbuthnot, an 18th-century physician, writer, and mathematician. He was interested in the ratio of newborn boys to newborn girls, so he gathered the baptism records for children born in London for every year from 1629 to 1710.

The data are available at https://www.openintro.org/data/tab-delimited/arbuthnot.txt. You can download this file by right-clicking the link and selecting “Save Link As…”, or by clicking on the link and then choosing to save the file, giving it the name “arbuthnot.txt”. You should save this data file into the same directory as your R Markdown file, the “Lab 0” directory inside your “Stat 220” directory. That way, R will know where to find it.

Loading the Data to R from the file

From here on out, just for brevity, I won’t write the backticks at the beginnings and ends of chunks. You should still create chunks for your code in your R Markdown file. Run each chunk after you type it in, to make sure it works.

We load the data into R by using the read.table function.

arbuthnot <- read.table("arbuthnot.txt", header = TRUE)

The header = TRUE option tells R that the first row of the data file contains the names of the variables, rather than data.

If something went wrong, you’ll see an error message below the chunk.

You can also click on the “Environment” tab in RStudio to see the data frame you just created. It will have the name “arbuthnot”. If you click on the disclosure triangle next to it to expand it, you’ll see the individual variables that are part of the dataset.

The name in R for a dataset like this, which is a collection of variables observed on a set of individuals, is “data frame”.
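To see what header = TRUE changes, here is a small self-contained sketch. It writes a tiny tab-delimited file to a temporary location (using the first three rows of the Arbuthnot data) and reads it back both ways. The names tmp, with_header, and without_header are made up for this example.

```r
# Write a tiny tab-delimited file to a temporary location
tmp <- tempfile(fileext = ".txt")
writeLines(c("year\tboys\tgirls",
             "1629\t5218\t4683",
             "1630\t4858\t4457",
             "1631\t4422\t4102"), tmp)

# header = TRUE: the first row supplies the variable names
with_header <- read.table(tmp, header = TRUE)
names(with_header)
nrow(with_header)      # 3 rows of data

# header = FALSE: the first row is treated as data,
# and R invents the names V1, V2, V3
without_header <- read.table(tmp, header = FALSE)
names(without_header)
nrow(without_header)   # 4 rows, including the row of names
```

With header = FALSE the columns also end up as text rather than numbers, because the words in the first row are mixed in with the counts.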

A First Look at Arbuthnot’s Data

Let’s take a look at the first few rows of the data:

head(arbuthnot)
##   year boys girls
## 1 1629 5218  4683
## 2 1630 4858  4457
## 3 1631 4422  4102
## 4 1632 4994  4590
## 5 1633 5158  4839
## 6 1634 5035  4820

We can see the dimensions of this data frame:

dim(arbuthnot)
## [1] 82  3

And the names of the columns:

names(arbuthnot)
## [1] "year"  "boys"  "girls"

We can access a single variable of a data frame using the dollar-sign $ notation. For example, the command below

arbuthnot$boys
##  [1] 5218 4858 4422 4994 5158 5035 5106 4917 4703 5359 5366 5518 5470 5460 4793
## [16] 4107 4047 3768 3796 3363 3079 2890 3231 3220 3196 3441 3655 3668 3396 3157
## [31] 3209 3724 4748 5216 5411 6041 5114 4678 5616 6073 6506 6278 6449 6443 6073
## [46] 6113 6058 6552 6423 6568 6247 6548 6822 6909 7577 7575 7484 7575 7737 7487
## [61] 7604 7909 7662 7602 7676 6985 7263 7632 8062 8426 7911 7578 8102 8031 7765
## [76] 6113 8366 7952 8379 8239 7840 7640

will return the boys variable in the data frame arbuthnot, which shows the number of boys baptized each year. Likewise arbuthnot$girls will return just the counts of girls baptized each year.

Notice that R prints these data differently. When we looked at the data frame, each row appeared on its own line. These data are no longer structured in a table with other variables, so they are displayed one right after another. Objects that print out in this way are called vectors; they represent a set of numbers. R has added numbers in [brackets] along the left side of the printout to indicate locations within the vector. For example, 5218 follows [1], indicating that 5218 is the first entry in the vector. The second line starts with [16], which means that the first number on that line, 4107, is the 16th entry in the vector, and so on for the rest of the output.
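You can see the same bracketed position labels with any vector that is too long to fit on one line. As a quick sketch, the vector below holds the years covered by the data, 1629 through 1710; the name years is made up for this example.

```r
# A vector of the 82 years covered by Arbuthnot's records
years <- 1629:1710

# Printing it wraps across several lines, each starting with a [position] label
years

# How many entries the vector has
length(years)

# The entries at positions 1 and 16
years[1]
years[16]
```

Where the lines break depends on how wide your console is, so the labels you see may differ from the ones printed in this document.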

Accessing elements of a data frame

To see the element in row 56 and column 3 (in this case, girls) of the table, you can type

arbuthnot[56,3]

To see the first 10 numbers in the third column, you can type

arbuthnot[1:10,3]

In this expression, we have asked just for rows in the range 1 through 10. R uses the : to create a range of values, so 1:10 expands to 1, 2, 3, 4, 5, 6, 7, 8, 9, 10. You can see this by entering

1:10

Finally, if you want all of the data for the first 10 years, type

arbuthnot[1:10,]

By leaving out an index or a range for the columns (we didn’t type anything between the comma and the closing square bracket), we get all the columns. When starting out in R, this is a bit counterintuitive. As a rule, we omit the column number to see all columns in a data frame. Similarly, if we leave out the index or range for the rows, we access all the observations, not just the 56th, or rows 1 through 10. Try the following to see the 3rd variable for all 82 years:

arbuthnot[,3]

Recall that column 3 represents the number of girls baptized in that year, so the command above should output the same list of numbers as arbuthnot$girls. We see the number of girls baptized for the 56th year by typing

arbuthnot$girls[56]

Similarly, for just the first 10 years

arbuthnot$girls[1:10]

The command above returns the same result as the arbuthnot[1:10,3] command. Both row-and-column notation and dollar-sign notation are widely used; which one you choose depends on your personal preference.
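The sketch below repeats these ideas on a small data frame built from the first six rows of the Arbuthnot data, so you can try it without the file. It also shows that a column can be picked by name inside the square brackets. The name arb6 is made up for this example.

```r
# First six rows of the Arbuthnot data, typed in by hand
arb6 <- data.frame(
  year  = 1629:1634,
  boys  = c(5218, 4858, 4422, 4994, 5158, 5035),
  girls = c(4683, 4457, 4102, 4590, 4839, 4820)
)

# Three ways to get the number of girls in row 2
arb6[2, 3]          # by row and column number
arb6[2, "girls"]    # by row number and column name
arb6$girls[2]       # dollar-sign notation, then position in the vector

# Rows 1 through 3, with only the year and girls columns
arb6[1:3, c("year", "girls")]
```

Selecting columns by name is often easier to read than remembering that girls happens to be column 3.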

Plotting the data

Now, let’s create a plot of the number of girls baptized per year:

qplot(year, girls, data = arbuthnot)
## Warning: `qplot()` was deprecated in ggplot2 3.4.0.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.

How does that function, qplot, work? You can see the documentation for qplot by typing ?qplot in the console and then pressing Enter.

(You can also copy and paste a line of code into an AI chatbot like ChatGPT and ask it to explain the code.)

We can add lines to connect the data points:

qplot(year, girls, data = arbuthnot, geom = "line", 
      ylab = "Number of Baby Girls Baptized", xlab = "Year")
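Because qplot() is deprecated, you may also want to know the ggplot() form of the same plot. The following is a sketch of one equivalent, not the method this lab requires. To keep it self-contained it uses only the first six rows of the data (stored in a made-up data frame called arb6); with the full data you would pass arbuthnot instead.

```r
library(ggplot2)

# First six rows of the Arbuthnot data, typed in by hand
arb6 <- data.frame(
  year  = 1629:1634,
  boys  = c(5218, 4858, 4422, 4994, 5158, 5035),
  girls = c(4683, 4457, 4102, 4590, 4839, 4820)
)

# Map year to the x axis and girls to the y axis, then draw a line
p <- ggplot(arb6, aes(x = year, y = girls)) +
  geom_line() +
  labs(x = "Year", y = "Number of Baby Girls Baptized")

# Printing the object displays the plot
p
```

The ggplot2 package is loaded automatically by library(tidyverse), so in your lab document you would not need the separate library(ggplot2) line.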

Further Exploration of Arbuthnot’s Data

Now, suppose you are interested in the ratio of the number of boys to the number of girls baptized in 1629. To compute this, you could use the fact that R is really just a big calculator. You can type in mathematical expressions like

5218 / 4683

to see the boys/girls ratio in 1629. We could repeat this once for each year, but there is a faster way. If we take the ratio of the vector for baptisms for boys and girls, R will compute all ratios simultaneously.

arbuthnot$boys/arbuthnot$girls

You can also use with to avoid repeatedly typing the name of the data frame. with instructs R to interpret everything else from within the data frame that you specify.

with(arbuthnot, boys/ girls)

What you will see are 82 numbers, each one representing the ratio we’re after. Take a look at a few of them and verify that they are right. We can then make a plot of the boy-to-girl ratio per year with the command

qplot(year, boys / girls, data=arbuthnot, geom = "line", ylab="Boys/Girls Ratio")
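To check the element-by-element division on a small scale, the sketch below uses only the first six rows of the data, stored in a made-up data frame called arb6, and compares the first result with the 1629 calculation done by hand.

```r
# First six rows of the Arbuthnot data, typed in by hand
arb6 <- data.frame(
  year  = 1629:1634,
  boys  = c(5218, 4858, 4422, 4994, 5158, 5035),
  girls = c(4683, 4457, 4102, 4590, 4839, 4820)
)

# One division per row: each boys count is divided by the girls count in the same row
ratio <- with(arb6, boys / girls)
ratio

# The first entry matches the hand calculation for 1629
all.equal(ratio[1], 5218 / 4683)

# Is the ratio above 1 in every one of these six years?
all(ratio > 1)
```

R lines up the two vectors position by position, which is why one short expression replaces a separate calculation for each year.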

Similarly to how we computed the ratio of boys to girls, we can compute the proportion of newborns that were boys in 1629 with

5218 / (4683+5218)

or we can act on the complete vectors with the expression

with(arbuthnot, boys/(boys + girls))

Note that with R, as with your calculator, you need to be conscious of the order of operations. Here, we want to divide the number of boys by the total number of newborns, so we have to use parentheses. Without them, R will first do the division, then the addition, giving you something that is not a proportion.
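To see the difference the parentheses make, you can compare the two versions for 1629 side by side. The variable names here are made up for illustration.

```r
boys_1629  <- 5218
girls_1629 <- 4683

# With parentheses: boys divided by the total, a value between 0 and 1
with_parens <- boys_1629 / (girls_1629 + boys_1629)
with_parens

# Without parentheses: R divides first, then adds 5218 to the result
without_parens <- boys_1629 / girls_1629 + boys_1629
without_parens
```

The second result is larger than 5218, so it clearly cannot be a proportion.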

Now, we make a plot of the proportion of boys over time. What do you see?

qplot(year, boys/(boys + girls), data=arbuthnot, geom = "line", ylab="Proportion of Boys")

Ending

To end your R Markdown session, simply save your .Rmd file. The next time you open it, all your work will be there, ready for you to continue or to knit into a final document.


On Your Own

In the previous few pages, you recreated some of the displays and preliminary analysis of Arbuthnot’s baptism data. Your assignment involves repeating these steps, but for present-day birth records in the United States. The “present” data set can be downloaded at https://www.openintro.org/data/tab-delimited/present.txt

  • What years are included in this data set? What are the dimensions of the data frame and what are the variable or column names?

  • Make a plot that displays the boy-to-girl ratio for every year in the data set. What do you see? Does Arbuthnot’s observation about boys being born in greater proportion than girls hold up in the U.S.? Include the plot in your response.

  • In what year did we see the largest total number of births in the U.S.? You can refer to the help files or the R reference card http://cran.r-project.org/doc/contrib/Short-refcard.pdf to find helpful commands.