Basic R Tutorial

Where do I get R

There are versions for Linux, Windows, and Macintosh. All of them are free and can be downloaded from
http://www.r-project.org
by following the download instructions there.

Invoking R

On a Windows system, a properly installed R usually has a shortcut icon on the desktop, and/or you can find it under [Start]-[Programs]-[R menu]. If not, search for the executable file rgui.exe and run it by double-clicking it in the search results window.

To quit from R, type

> q()

R can be used as a calculator

R is an interactive computing environment which you will use for data analysis. It can also be used as a calculator to perform simple tasks.

> 4*6
[1] 24
> (56/3 + 8.1*1.2)/10
[1] 2.838667
> log(1000)
[1] 6.907755
> sin(1.2)
[1] 0.9320391
> pi
[1] 3.141593
> pi/3
[1] 1.047198
> exp(1)
[1] 2.718282

Getting Help

You can type any of

> help(log)
> ?log
> help(sin)
> ?sin

to display the help file for the log, sin (or any other) command. The forms help(log) and ?log are equivalent.
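
When you do not know the exact name of a command, a few related functions from the standard R distribution may help. The sketch below uses apropos() to list matching names; the search terms are only illustrative, and the two commented-out calls are meant to be tried interactively.

```r
# List objects on the search path whose names match "read.t"
hits <- apropos("read.t")
print(hits)            # should include "read.table"

# Run the examples from the help page of a function (interactive use)
# example(seq)

# Search the installed help pages by keyword (interactive use)
# help.search("standard deviation")
```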

Creating Variables and Vectors

There are two assignment operators in R: <- and =. For simple assignments typed at the prompt, they can be used interchangeably.
> x <- 12           # storing a scalar value
> x
[1] 12
> x = 12            # "=" works just like "<-"
> x
[1] 12
> # This is a comment.
> # The sign # marks the rest of the line as a comment.
> # All input to the right of # will be ignored.
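
The two operators do differ in one situation worth knowing: inside a function call, = names an argument, while <- performs an assignment. A minimal sketch, assuming a fresh session; the helper half() and the variable name value are made up for illustration.

```r
half <- function(value) value / 2     # hypothetical helper for illustration

r1 <- half(value = 10)                # "=" names the argument; no variable is created
before <- exists("value", inherits = FALSE)
before                                # FALSE in a fresh session

r2 <- half(value <- 10)               # "<-" creates the variable value, then passes it
after <- exists("value", inherits = FALSE)
after                                 # TRUE
value                                 # 10
```

Both calls return 5; only the second one leaves a new variable behind.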

Use the generic function "c( , , ,...)" to combine values into a vector.

> a = c(4,2,3,1,5) # creating a vector with five elements 4,2,3,1,5
> a
[1] 4 2 3 1 5
> b = c("Chicago", "New York", "Seattle") # creating a character vector of three elements
> b
[1] "Chicago"  "New York" "Seattle"
> y = 1:20         # creating a vector with the sequence of integers 1 through 20
> y
 [1]  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20
> y = 5:9          # creating a vector with the sequence of integers 5 through 9
> y
[1] 5 6 7 8 9
> z = seq(1,2,0.1)  # using the seq command to specify a sequence starting at 1, ending
>                   # at 2, with a step size of 0.1
>                   # syntax: seq(first value, last value, step size)
> z
 [1] 1.0 1.1 1.2 1.3 1.4 1.5 1.6 1.7 1.8 1.9 2.0
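
Two other common ways to build vectors are sketched below: seq() with a fixed number of points instead of a step size, and rep() for repeated patterns. The particular values are arbitrary examples.

```r
# Five equally spaced values from 0 to 1 (length.out fixes the count, not the step)
s <- seq(0, 1, length.out = 5)
s                        # 0.00 0.25 0.50 0.75 1.00

# Repeat a pattern three times
r <- rep(c(1, 2), times = 3)
r                        # 1 2 1 2 1 2

# length() reports how many elements a vector has
n <- length(seq(1, 2, 0.1))
n                        # 11
```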

Extracting Elements in a Vector

R has no separate scalar type: a single number such as 12 is simply a vector with one element.

Use [ ] to extract subsets

> z = seq(1,2,0.1)
> z
 [1] 1.0 1.1 1.2 1.3 1.4 1.5 1.6 1.7 1.8 1.9 2.0
> z[5]              # returns the 5th element of z
[1] 1.4

We can put a vector of indices in [ ] to extract more than one element in a vector

> z[c(2,5)]         # returns the 2nd and 5th element of z
[1] 1.1 1.4
> z[c(1,3,8)]       # returns the 1st, 3rd, and 8th element of z
[1] 1.0 1.2 1.7
> z[3:6]            # returns the 3rd to 6th element of z
[1] 1.2 1.3 1.4 1.5

Positive indices select elements; negative indices drop elements.

> z[-5]             # returns all but the 5th element of z
 [1] 1.0 1.1 1.2 1.3 1.5 1.6 1.7 1.8 1.9 2.0
> z[-c(2,5)]        # returns all but the 2nd and 5th elements of z
[1] 1.0 1.2 1.3 1.5 1.6 1.7 1.8 1.9 2.0
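
Besides positions, a vector can be indexed by a logical condition, which keeps the elements where the condition is TRUE. A short sketch using the same values as the vector a created earlier:

```r
a <- c(4, 2, 3, 1, 5)

keep <- a > 2              # TRUE FALSE TRUE FALSE TRUE
big  <- a[keep]            # elements greater than 2
big                        # 4 3 5

pos <- which(a > 2)        # positions of those elements
pos                        # 1 3 5

mid <- a[a > 2 & a < 5]    # combine conditions with & (and)
mid                        # 4 3
```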

Data Import

Suppose we want to load the data set ta01_010.txt into R. It is convenient to store all the datasets of a project in one working directory. On a Windows system, to change the current working directory, go to [File]-[Change dir...].

A small window will pop up, in which you can choose any folder you want as the working directory.
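
The working directory can also be inspected and changed from the command line with getwd() and setwd(). The sketch below switches to R's temporary folder only so that it runs on any machine; in practice you would give the path of your own project folder.

```r
old_dir <- getwd()      # remember the current working directory
print(old_dir)

setwd(tempdir())        # switch to R's per-session temporary folder
print(getwd())
print(list.files())     # list the files in the new working directory

setwd(old_dir)          # switch back to where we started
```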

The function read.table reads data from an external ASCII (plain text) file into R and stores it in a data frame. For example, if we want to import the data in file ta01_010.txt, we can type

> Cars = read.table("ta01_010.txt", header=T)
Error in file(file, "r") : unable to open connection
In addition: Warning message:
cannot open file 'ta01_010.txt', reason 'No such file or directory' in: file(file, "r") 

Oops! We got an error and a warning message, and failed to import the data. This is because the command read.table searches for files in the working directory by default. There are three solutions:

  1. copying or moving the ta01_010.txt file to the working directory
  2. changing the working directory to the folder storing the datasets
  3. giving the full path of the file in the command
    # if the file ta01_010.txt is stored in "E:\Data\Ch01\ta01_010.txt"
    # (inside an R string, write / or \\ instead of a single \, which starts an escape)
    > Cars = read.table("E:/Data/Ch01/ta01_010.txt", header=T)

After any one of these fixes, the import succeeds:

> Cars = read.table("ta01_010.txt", header=T)
> Cars
   Type City Hwy
1     T   17  24
2     T   20  28
3     T   20  28
4     T   17  25
5     T   18  25
6     T   12  20
7     T   11  16
8     T   10  16
9     T   17  23
10    T   60  66
11    T    9  15
12    T    9  13
13    T   15  22
14    T   12  17
15    T   22  28
16    T   16  23
17    T   13  19
18    T   20  26
19    T   20  29
20    T   15  23
21    T   26  32
22    M   12  19
23    M   21  29
24    M   19  27
25    M   19  28
26    M   16  23
27    M   18  26
28    M   16  23
29    M   18  23
30    M   25  32
31    M   23  31
32    M   20  29
33    M   18  26
34    M   14  22
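
To try read.table without the course data file, the following self-contained sketch writes a few rows of the data above to a temporary file and reads them back. The file name cars_demo.txt is hypothetical.

```r
# Write a tiny whitespace-separated file with a header row
demo_file <- file.path(tempdir(), "cars_demo.txt")   # hypothetical file name
writeLines(c("Type City Hwy",
             "T 17 24",
             "T 20 28",
             "M 12 19"), demo_file)

file.exists(demo_file)                   # check that the file is there before reading
demo <- read.table(demo_file, header = TRUE)
str(demo)                                # one line per column: name, type, first values
dim(demo)                                # 3 rows, 3 columns
```

Checking with file.exists() first gives a clearer answer than the "unable to open connection" error when a path is wrong.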

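Now the dataset is imported and stored in a data frame called "Cars". Note that in current versions of R, object names may contain letters, digits, periods, and underscores, and must start with a letter (or a period not followed by a digit). Very old versions of R did not allow the underscore sign "_" in names, which is one reason periods are often used to separate words in names. Also, R is case sensitive, so HW1 and hw1 are distinct R names.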

If the data file has variable names in the first row, we should include the argument header=T in the command. Otherwise, we will get

> Cars = read.table("ta01_010.txt") 
> Cars
     V1   V2  V3
1  Type City Hwy
2     T   17  24
3     T   20  28
4     T   20  28
5     T   17  25
6     T   18  25
7     T   12  20
8     T   11  16
9     T   10  16
....

R will assign the variable names V1, V2, ... automatically, and treat the row of variable names as the first record in the dataset.

Sometimes the variable names aren't included in the file

1 0.2 115 90 1 3 68 42 yes
2 0.7 193 90 3 1 61 48 yes
3 0.2  58 90 1 3 63 40 yes
4 0.2   5 80 2 3 65 75 yes
5 0.2 8.5 90 1 2 64 30 yes

and you want to supply them

> psa = read.table("psa.txt", col.names=c("ptid","nadirpsa", "pretxpsa", "ps","bss","grade","age", "obstime","inrem"))

or

> psa = read.table("psa.txt")
> names(psa) = c("ptid","nadirpsa","pretxpsa", "ps", "bss","grade","age","obstime","inrem")

names is a function that accesses the variable names of a data frame. For example, the command

> names(Cars)
[1] "Type" "City" "Hwy"
will give the variable names of the data frame Cars.

Sometimes columns are separated by commas, as in a CSV file created by Excel

Ozone,Solar.R,Wind,Temp,Month,Day
41,190,7.4,67,5,1
36,118,8,72,5,2
12,149,12.6,74,5,3
18,313,11.5,62,5,4
NA,NA,14.3,56,5,5

Then include the argument sep=","

> ozone = read.table("ozone.csv", header=TRUE, sep=",")
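
For comma-separated files, base R also provides read.csv, a variant of read.table whose defaults already include header=TRUE and sep=",". A minimal sketch on a temporary copy of a few of the rows shown above; the file name ozone_demo.csv is hypothetical.

```r
csv_file <- file.path(tempdir(), "ozone_demo.csv")   # hypothetical file name
writeLines(c("Ozone,Solar.R,Wind,Temp,Month,Day",
             "41,190,7.4,67,5,1",
             "36,118,8,72,5,2",
             "NA,NA,14.3,56,5,5"), csv_file)

oz1 <- read.table(csv_file, header = TRUE, sep = ",")
oz2 <- read.csv(csv_file)        # header and separator are the defaults here
same <- all.equal(oz1, oz2)      # compare the two data frames
same
```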

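Some arguments of a command are optional. For example, sep was not needed in the first read.table call because its default (columns separated by white space) was appropriate for that file. Use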

> help(read.table)

to view a complete description of the command and all the arguments.

In R, NA is the code for missing data. Think of it as "Don't Know". It propagates through computations: e.g. 1+NA gives NA, and NA*2 gives NA.
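
Because NA propagates, summaries of data with missing values are NA unless you ask for the missing values to be dropped. A short sketch with made-up values; is.na() and the na.rm argument are part of base R.

```r
ozone_values <- c(41, 36, 12, 18, NA)

missing <- is.na(ozone_values)             # FALSE FALSE FALSE FALSE TRUE
n_missing <- sum(missing)                  # 1 missing value

m_all  <- mean(ozone_values)               # NA, because one value is unknown
m_known <- mean(ozone_values, na.rm = TRUE) # 26.75, the mean of the four known values
```

Use is.na(x) rather than x == NA to test for missing values, since comparing with NA again gives NA.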

Listing and Removing objects

ls() lists all the objects in the workspace, including the vectors and datasets created so far. Objects that will not be used any more can be removed from memory with rm().

> ls()                        # list all objects 
[1] "a"    "b"    "Cars" "x"    "y"    "z"  
> x                           # display the value of x
[1] 12
> rm(x,y)
> ls()
[1] "a"    "b"    "Cars" "z"  # Variable "x" and "y" are removed
> x                           # "x" is removed and no longer available
Error: object "x" not found   

Saving Workspace -- You can continue an unfinished project

To quit R, type q().

> q()
A dialog window will pop up, asking whether to save the workspace.

By selecting [Yes], two hidden files, .RData and .Rhistory, will be created in the working directory. .Rhistory is a text file storing the history of commands that the user keyed in; it can be viewed with any text editor. .RData stores all the objects created by the user, including datasets and variables. You can continue an unfinished analysis by double-clicking the .RData file: all datasets, variables, and the history of commands will be reloaded.

By saving the workspace, you can continue an unfinished project next time.

Some Caution about Saving Workspace

Be aware that if you save the workspace from time to time, many datasets and variables accumulate in it, which takes a lot of memory and slows down the computer. So when reloading an old workspace, be sure to remove unwanted datasets and variables from it (use rm()).

When R exports datasets or images, it will store the datasets/images in the working directory, too.

Saving a new workspace overwrites the old one in the same directory. We might want to keep the workspace of a certain project and come back to it later. To solve this problem, and to avoid the burden of continually cleaning old workspaces, a better idea is to create a new working directory for every project/homework and work in it for that project only. Then the workspace, history, and output images of different projects are saved in different folders and never get mixed up. One can reload the workspace of a certain project by changing the working directory to the corresponding folder.
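
Selected objects can also be saved and restored explicitly with save() and load(), which avoids carrying the whole workspace around. A minimal sketch; the object scores and the file name project1.RData are made up for illustration.

```r
scores <- c(88, 92, 79)                                  # made-up data
rdata_file <- file.path(tempdir(), "project1.RData")     # hypothetical file name

save(scores, file = rdata_file)     # store only the listed objects
rm(scores)                          # remove the object from the workspace
gone <- !exists("scores", inherits = FALSE)
gone                                # TRUE: scores is no longer available

load(rdata_file)                    # bring the saved objects back
scores                              # 88 92 79
```

save.image() does the same for the entire workspace, which is what answering [Yes] on quitting does.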

Viewing/Editing Data

If you prefer working in a spreadsheet as in Excel, use
> fix(Cars)
to open the data frame Cars.

A spreadsheet will pop up. You can edit entries in the spreadsheet. However, the interface is not as convenient as Excel, and there are easier ways to view and edit data.

One can view the values of a single variable in a data frame. The syntax is dataframename$variablename, e.g.

> Cars$Hwy        # The variable "Hwy" in the data Cars
 [1] 24 28 28 25 25 20 16 16 23 66 15 13 22 17 28 23 19 26 29 23 32 19 29 27 28
[26] 23 26 23 23 32 31 29 26 22

A data frame can be thought of as a two-dimensional array, though different columns may be of different types: for example, City and Hwy are numeric, and Type is categorical (stored as character or factor, depending on how the file was read). One can view and edit entries by giving their row and column indices.

> Cars[2,3]       # viewing the entry in the 2nd row and 3rd column
[1] 28
> Cars[2,3] = 21  # change the (2, 3) entry in the data frame Cars
> Cars[2,3] 
[1] 21
> Cars[2,3] = 28  # restore the original value before continuing

Indices can be used more flexibly: leaving the row or the column index empty selects all rows or all columns.

> Cars[2,]        # the 2nd row of the data, no constraint on column
  Type City Hwy
2    T   20  28
> Cars[,3]        # the 3rd column of the data, no constraint on row
 [1] 24 28 28 25 25 20 16 16 23 66 15 13 22 17 28 23 19 26 29 23 32 19 29 27 28
[26] 23 26 23 23 32 31 29 26 22
> Cars$Hwy        # Or using the variable name "Hwy" of 3rd column
 [1] 24 28 28 25 25 20 16 16 23 66 15 13 22 17 28 23 19 26 29 23 32 19 29 27 28
[26] 23 26 23 23 32 31 29 26 22
Vectors of positive integers can be used as indices.
> Cars[2:5,2:3]   # the 2nd to 5th row in the 2nd to 3rd column
  City Hwy
2   20  28
3   20  28
4   17  25
5   18  25
> Cars[c(2,5,8),2:3] # the 2nd, 5th, 8th row in the 2nd, 3rd column
  City Hwy
2   20  28
5   18  25
8   10  16
Vectors of negative integers can be used as indices, too.
> Cars[-(1:20),]    # drop the first 20 rows
   Type City Hwy
21    T   26  32
22    M   12  19
23    M   21  29
24    M   19  27
25    M   19  28
26    M   16  23
27    M   18  26
28    M   16  23
29    M   18  23
30    M   25  32
31    M   23  31
32    M   20  29
33    M   18  26
34    M   14  22
Logical expressions can be used as indices, too.
> Cars[Cars$Type == "M",] # all the minicompact cars
   Type City Hwy
22    M   12  19
23    M   21  29
24    M   19  27
25    M   19  28
26    M   16  23
27    M   18  26
28    M   16  23
29    M   18  23
30    M   25  32
31    M   23  31
32    M   20  29
33    M   18  26
34    M   14  22
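
Logical conditions can be combined, and columns can be selected by name rather than by number. The sketch below uses a small made-up data frame, cars_demo, with a few rows copied from the Cars data so that it runs on its own; subset() is a base R convenience function for the same kind of selection.

```r
cars_demo <- data.frame(Type = c("T", "T", "M", "M"),
                        City = c(17, 20, 21, 12),
                        Hwy  = c(24, 28, 29, 19))

# Rows with City mileage above 18, keeping two columns by name
high_city <- cars_demo[cars_demo$City > 18, c("City", "Hwy")]

# The same selection with subset(), which looks up column names inside the data frame
high_city2 <- subset(cars_demo, City > 18, select = c(City, Hwy))

# Combine conditions with & (and) or | (or)
m_high <- cars_demo[cars_demo$Type == "M" & cars_demo$City > 18, ]

high_city
m_high
```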

Operations on Vectors

Functions and operators applied to a vector act on each element of the vector.

> x = 1:5
> x
[1] 1 2 3 4 5
> 3*x                          # multiplying each element by 3
[1]  3  6  9 12 15
> x/2                          # dividing each element by 2
[1] 0.5 1.0 1.5 2.0 2.5
> x^2                          # square of each element
[1]  1  4  9 16 25
> pi^x                         # returning pi, pi^2, pi^3, pi^4, pi^5
[1]   3.141593   9.869604  31.006277  97.409091 306.019685
> exp(x)
[1]   2.718282   7.389056  20.085537  54.598150 148.413159

For two vectors of the same length, functions/operators will be applied element by element.

> x = 1:5
> x
[1] 1 2 3 4 5
> y = c(3,1,4,5,2)
> x*y                   # multiplication element by element
[1]  3  2 12 20 10
> x/y                   # division element by element
[1] 0.3333333 2.0000000 0.7500000 0.8000000 2.5000000
> x^x
[1]    1    4   27  256 3125
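
When the two vectors differ in length, R reuses ("recycles") the shorter one, and some functions reduce a whole vector to a single value instead of working element by element. A short sketch with arbitrary numbers:

```r
x <- 1:6
recycled <- x + c(10, 20)   # the shorter vector is reused: 10 20 10 20 10 20
recycled                    # 11 22 13 24 15 26

# Functions that summarise a whole vector return a single value
total <- sum(x)             # 21
avg   <- mean(x)            # 3.5
spread <- max(x) - min(x)   # 5
```

R gives a warning when the longer length is not a multiple of the shorter one, which usually signals a mistake.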

Creating New Variables

If we are interested in the difference of highway and city gas mileage of cars, we might want to create a new variable,
> Cars$diff = Cars$Hwy - Cars$City
> Cars
   Type City Hwy diff
1     T   17  24    7
2     T   20  28    8
3     T   20  28    8
4     T   17  25    8
5     T   18  25    7
6     T   12  20    8
7     T   11  16    5
8     T   10  16    6
9     T   17  23    6
....
Now the data frame has a new variable, diff.
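
Several new variables can be created in one step with the base R function transform(). A minimal sketch on a made-up three-row data frame, cars_demo, using values copied from the Cars data; the column name ratio is an illustrative addition.

```r
cars_demo <- data.frame(City = c(17, 20, 12), Hwy = c(24, 28, 20))

# transform() returns a copy with new columns computed from existing ones
cars_demo <- transform(cars_demo,
                       diff  = Hwy - City,
                       ratio = Hwy / City)

cars_demo$diff       # 7 8 8
names(cars_demo)     # "City" "Hwy" "diff" "ratio"
```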

Plots

> plot(Cars$City)            # plot of City mileage vs index number
> plot(Cars$City, Cars$Hwy)  # Scatter plot of Hwy mileage (y) against City mileage (x)
> stem(Cars$City)            # stem-and-leaf plot of City mileage
  The decimal point is 1 digit(s) to the right of the |

  0 | 99
  1 | 012223455666777888899
  2 | 0000012356
  3 | 
  4 | 
  5 | 
  6 | 0
> hist(Cars$City)            # histogram of City mileage
> qqnorm(Cars$City)          # Normal QQ-plot of City mileage
> boxplot(Cars$City)           # Boxplot
> boxplot(Cars$City, Cars$Hwy) # Parallel Boxplot
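
Plots usually benefit from axis labels and a title, and they can be written to a file instead of the screen. The sketch below assumes a small made-up data frame, cars_demo, with rows copied from the Cars data, and a hypothetical output file name; pdf(), dev.off(), and the formula form of boxplot() are part of base R.

```r
cars_demo <- data.frame(Type = c("T", "T", "T", "M", "M", "M"),
                        City = c(17, 20, 18, 12, 21, 19),
                        Hwy  = c(24, 28, 25, 19, 29, 27))

plot_file <- file.path(tempdir(), "cars_plots.pdf")   # hypothetical file name
pdf(plot_file)                                        # open a PDF graphics device

# Scatter plot with axis labels and a title
plot(cars_demo$City, cars_demo$Hwy,
     xlab = "City mileage", ylab = "Highway mileage",
     main = "Highway vs City mileage")

# Parallel boxplots, one box per car type
boxplot(City ~ Type, data = cars_demo, ylab = "City mileage")

invisible(dev.off())                                  # close the device, writing the file
```

Each plotting command between pdf() and dev.off() becomes one page of the file.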

Attach and Detach

In the plotting commands above, we cannot simply use

> plot(City)
Error in plot(City) : object "City" not found
R cannot see City because it is a variable under the data frame Cars. To let R see City, we need to refer to Cars first. But we might be tired of typing Cars$City all the time. To make life easier, we can type
> attach(Cars)
Then R can see all the variables under Cars. Now we can use
> plot(City, Hwy)
> # 4 boxplots of City and Highway mileage for 2 types of cars
> boxplot(City[Type == "T"], Hwy[Type == "T"], City[Type == "M"], Hwy[Type == "M"]) 
without referring to Cars. The downside of using attach is that, if the workspace already contains a variable also called City,
> City = 1:5
> attach(Cars)

        The following object(s) are masked _by_ .GlobalEnv :

         City 
> City
[1] 1 2 3 4 5
the name City keeps referring to the workspace variable rather than to the column Cars$City, because objects in the workspace mask attached variables of the same name. To avoid this kind of confusion, be sure to detach a data frame when you are done with it
> detach(Cars)
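
An alternative that avoids masking altogether is the base R function with(), which makes the columns of a data frame visible for a single expression only. A minimal sketch on a made-up data frame, cars_demo, with rows copied from the Cars data:

```r
cars_demo <- data.frame(Type = c("T", "T", "M", "M"),
                        City = c(17, 20, 12, 21),
                        Hwy  = c(24, 28, 19, 29))

# with() evaluates the expression using the columns of cars_demo,
# without changing what names mean in the rest of the session
mean_diff <- with(cars_demo, mean(Hwy - City))
mean_diff                                    # 7.5

m_city <- with(cars_demo, City[Type == "M"])
m_city                                       # 12 21
```

Plotting commands work the same way, for example with(cars_demo, plot(City, Hwy)).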