Some define Statistics as the field that focuses on turning information into knowledge.
The first step in that process is to summarize and describe the raw information - the data.
In this lab, we will show you how to use R to get numerical summaries of data and how to display data graphically.

1. The Diamonds Dataset

The diamonds dataset is a built-in data set in the ggplot2 library. We can access it using the data() function after loading the ggplot2 library.

If you haven’t installed ggplot2, please run the next line. Otherwise, just skip it.

install.packages("ggplot2")  

Let’s load the ggplot2 library.

library(ggplot2)

Then the diamonds data set can be loaded with the command

data(diamonds)

The variables in the diamonds data set are price, carat, cut, color, and clarity, and five physical measurements, depth, table, x, y and z, as shown in Figure 1 below.

Figure 1

The dimension of the diamonds dataset is

dim(diamonds)
## [1] 53940    10

from which we can see the data contains 53940 rows and 10 variables.

To view the names and types of the variables, type the command

str(diamonds)
## tibble [53,940 x 10] (S3: tbl_df/tbl/data.frame)
##  $ carat  : num [1:53940] 0.23 0.21 0.23 0.29 0.31 0.24 0.24 0.26 0.22 0.23 ...
##  $ cut    : Ord.factor w/ 5 levels "Fair"<"Good"<..: 5 4 2 4 2 3 3 3 1 3 ...
##  $ color  : Ord.factor w/ 7 levels "D"<"E"<"F"<"G"<..: 2 2 2 6 7 7 6 5 2 5 ...
##  $ clarity: Ord.factor w/ 8 levels "I1"<"SI2"<"SI1"<..: 2 3 5 4 2 6 7 3 4 5 ...
##  $ depth  : num [1:53940] 61.5 59.8 56.9 62.4 63.3 62.8 62.3 61.9 65.1 59.4 ...
##  $ table  : num [1:53940] 55 61 65 58 58 57 57 55 61 61 ...
##  $ price  : int [1:53940] 326 326 327 334 335 336 336 337 337 338 ...
##  $ x      : num [1:53940] 3.95 3.89 4.05 4.2 4.34 3.94 3.95 4.07 3.87 4 ...
##  $ y      : num [1:53940] 3.98 3.84 4.07 4.23 4.35 3.96 3.98 4.11 3.78 4.05 ...
##  $ z      : num [1:53940] 2.43 2.31 2.31 2.63 2.75 2.48 2.47 2.53 2.49 2.39 ...

Here we can see that carat, depth, table, price and x, y, z are numerical variables, and cut, color, and clarity are ordinal categorical variables. Specifically, price is an integer-valued variable.
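If only the variable names or the levels of a factor are needed, a shorter alternative to str() may be handy. The following is a minimal sketch using the base R functions names(), sapply(), and levels(); it assumes ggplot2 is installed so that diamonds is available, and the object names (var_names, var_classes, cut_levels) are chosen here only for illustration.

```r
library(ggplot2)   # provides the diamonds data set

# Variable names only
var_names <- names(diamonds)
var_names

# First class of each variable; ordered factors report "ordered"
var_classes <- sapply(diamonds, function(v) class(v)[1])
var_classes

# Levels of an ordinal variable, listed from lowest to highest
cut_levels <- levels(diamonds$cut)
cut_levels
```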

2. Numerical Summary of Data

To calculate the mean, median, SD, variance, five-number summary, IQR, minimum, maximum of the price variable in the diamonds dataset, type

mean(diamonds$price)
median(diamonds$price)
sd(diamonds$price)
var(diamonds$price)
fivenum(diamonds$price)
IQR(diamonds$price)
min(diamonds$price)
max(diamonds$price)
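Typing eight separate commands can get tedious. As a rough sketch (the object names price and price_summary are made up for this example), several of the summaries above can be collected in one named vector, and summary() reports the minimum, quartiles, mean, and maximum in a single call.

```r
library(ggplot2)   # provides the diamonds data set

price <- diamonds$price

# Collect several summaries in one named vector
price_summary <- c(
  mean   = mean(price),
  median = median(price),
  sd     = sd(price),
  IQR    = IQR(price)
)
round(price_summary, 1)

# The variance is the square of the standard deviation
all.equal(var(price), sd(price)^2)

# summary() reports min, quartiles, mean, and max in one call
summary(price)
```

Note that fivenum() and summary() may report slightly different quartiles, since they use different conventions for computing them.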

2.1 The aggregate() Function for Finding Data Summary “By Group”

Rather than finding data summaries over the entire data set, one might be more interested in summarizing data by group. The aggregate() function in R splits the data into subsets, computes summary statistics for each, and returns the result in a convenient form. The aggregate() function accepts R’s formula syntax, the same notation used for specifying models. This involves the use of a tilde (~), which can be read as “is a function of”: the variable to summarize goes on the left and the grouping variable on the right. For example, one can find the mean price of diamonds by quality of cut:

aggregate(price ~ cut , data=diamonds, mean)
##         cut    price
## 1      Fair 4358.758
## 2      Good 3928.864
## 3 Very Good 3981.760
## 4   Premium 4584.258
## 5     Ideal 3457.542

Surprisingly, diamonds with a higher-quality cut are not necessarily more expensive. For example, the mean price of diamonds with the best cut (Ideal) is $3457.5, lower than that of the worst cut (Fair), $4358.8. This is because we didn’t take the weight (carat) of the diamonds into account: diamonds with an Ideal cut tend to be smaller than diamonds with a Fair cut.

aggregate(carat ~ cut , data=diamonds, mean)
##         cut     carat
## 1      Fair 1.0461366
## 2      Good 0.8491847
## 3 Very Good 0.8063814
## 4   Premium 0.8919549
## 5     Ideal 0.7028370
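To see the two group summaries side by side, aggregate() also accepts cbind() on the left-hand side of the formula, which summarizes several variables in one call. A minimal sketch (the name by_cut is just for illustration):

```r
library(ggplot2)   # provides the diamonds data set

# Mean price and mean carat for each cut, in one table
by_cut <- aggregate(cbind(price, carat) ~ cut, data = diamonds, FUN = mean)
by_cut
```

Reading across a row then shows directly that the cut with the lowest mean price (Ideal) is also the one with the lowest mean carat.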

We can also group by two variables. For example, we can find the mean carat of diamonds grouped by cut and clarity:

aggregate(carat ~ cut + clarity , data=diamonds, mean)
##          cut clarity     carat
## 1       Fair      I1 1.3610000
## 2       Good      I1 1.2030208
## 3  Very Good      I1 1.2819048
## 4    Premium      I1 1.2870244
## 5      Ideal      I1 1.2226712
## 6       Fair     SI2 1.2038412
## 7       Good     SI2 1.0352266
## 8  Very Good     SI2 1.0643381
## 9    Premium     SI2 1.1441607
## 10     Ideal     SI2 1.0079253
## 11      Fair     SI1 0.9646324
## 12      Good     SI1 0.8303974
## 13 Very Good     SI1 0.8459784
## 14   Premium     SI1 0.9086014
## 15     Ideal     SI1 0.8018076
## 16      Fair     VS2 0.8852490
## 17      Good     VS2 0.8507873
## 18 Very Good     VS2 0.8111810
## 19   Premium     VS2 0.8337742
## 20     Ideal     VS2 0.6705660
## 21      Fair     VS1 0.8798235
## 22      Good     VS1 0.7576852
## 23 Very Good     VS1 0.7333070
## 24   Premium     VS1 0.7933082
## 25     Ideal     VS1 0.6747144
## 26      Fair    VVS2 0.6915942
## 27      Good    VVS2 0.6149301
## 28 Very Good    VVS2 0.5663887
## 29   Premium    VVS2 0.6547241
## 30     Ideal    VVS2 0.5862126
## 31      Fair    VVS1 0.6647059
## 32      Good    VVS1 0.5023118
## 33 Very Good    VVS1 0.4945881
## 34   Premium    VVS1 0.5348214
## 35     Ideal    VVS1 0.4959599
## 36      Fair      IF 0.4744444
## 37      Good      IF 0.6163380
## 38 Very Good      IF 0.6187687
## 39   Premium      IF 0.6034783
## 40     Ideal      IF 0.4550413

The tilde (~) syntax also works for median(), sd(), var(), min(), max(), sum(), IQR(), etc.

aggregate(price ~ cut , data=diamonds, median)
aggregate(price ~ cut , data=diamonds, sd)
aggregate(price ~ cut , data=diamonds, var)
aggregate(price ~ cut , data=diamonds, min)
aggregate(price ~ cut , data=diamonds, max)
aggregate(price ~ cut , data=diamonds, IQR)
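The summary function does not have to be one of the built-in ones listed above. As a sketch, extra arguments placed after the function are passed on to it, and a user-defined function can be supplied as well (the names p90_by_cut and range_by_cut are made up for this example).

```r
library(ggplot2)   # provides the diamonds data set

# Extra arguments after FUN are forwarded to it:
# here, the 90th percentile of price within each cut
p90_by_cut <- aggregate(price ~ cut, data = diamonds,
                        FUN = quantile, probs = 0.9)
p90_by_cut

# A user-defined function works too:
# the spread (max minus min) of price within each cut
range_by_cut <- aggregate(price ~ cut, data = diamonds,
                          FUN = function(p) max(p) - min(p))
range_by_cut
```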

3. Graphical Display of Data

The ggplot2 library is a powerful R library for making fancy plots and visualizing data. Since its release in 2007, ggplot2 has been adopted worldwide, has largely displaced the built-in R plotting functions plot(), hist(), and boxplot(), and has become one of the dominant tools for data visualization.

We hence choose to teach students in STAT 220 using ggplot2. The code for ggplot2 looks longer than that for built-in R plotting, but it is far more versatile. It is surely worth the little extra effort to learn ggplot2.

library(ggplot2) 

All ggplot() code begins like the following

ggplot(dataframename, aes(x=...,)) + geom_XXX()

The first thing one needs to provide to ggplot() is the name of the data frame. Then one needs to specify aes(), shorthand for “aesthetics”, which gives the variables used for the plot and their roles: the x-variable, the y-variable, the variable that specifies the color or shape of the points/lines/shades, and so on. The next thing to specify is the type of plot to make, by adding a geom such as geom_histogram() for a histogram, geom_boxplot() for a boxplot, geom_point() for a scatter plot, and so on.

For example, to make a histogram and a boxplot of the carat variable in the diamonds data:

ggplot(diamonds, aes(x=carat)) + geom_histogram()

ggplot(diamonds, aes(x=carat)) + geom_boxplot()

You only need to specify an x-variable, aes(x=carat), for the histogram and the boxplot.

To make a scatter plot of price against carat, we need to specify both the x- and the y-variable, aes(x=carat, y=price).

ggplot(diamonds, aes(x=carat, y=price)) + geom_point()
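It may help to see that the two parts of a ggplot() command are separate pieces that are added together. In the sketch below (the names p and p_points are chosen only for illustration), the first part is stored in an object and the geom is added afterwards.

```r
library(ggplot2)

# The first part sets the data and the roles of the variables;
# no points are drawn until a geom is added
p <- ggplot(diamonds, aes(x = carat, y = price))

# Adding a geom decides how the data are drawn
p_points <- p + geom_point()

# Printing the object (or typing its name at the console) displays the plot
print(p_points)
```

Storing the common part in an object makes it easy to try several geoms on the same data without retyping the first part.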

In the following, we go into a bit more detail about histograms, boxplots, and scatter plots.

3.1 Histograms

You can adjust the bin width of the histogram.

ggplot(diamonds, aes(x=carat)) + geom_histogram(binwidth=0.1)

ggplot(diamonds, aes(x=carat)) + geom_histogram(binwidth=0.02)
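Instead of a bin width, geom_histogram() also accepts a number of bins through its bins argument. The sketch below shows both options and the rough relation between them: the number of bins is about the range of the data divided by the bin width. The names carat_range, n_bins, h_width, and h_bins are made up for this example.

```r
library(ggplot2)

# Approximate number of bins implied by a bin width of 0.1
carat_range <- diff(range(diamonds$carat))
n_bins <- ceiling(carat_range / 0.1)
n_bins

# Two ways to control the resolution of the histogram
h_width <- ggplot(diamonds, aes(x = carat)) + geom_histogram(binwidth = 0.1)
h_bins  <- ggplot(diamonds, aes(x = carat)) + geom_histogram(bins = n_bins)

h_width
h_bins
```

The two plots need not be identical, since the bin edges may be placed differently, but they should show a similar level of detail.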

3.2 Boxplots

You can change the orientation of the boxplot from horizontal to vertical by changing price from an x-variable to a y-variable.

ggplot(diamonds, aes(x=price)) + geom_boxplot()
ggplot(diamonds, aes(y=price)) + geom_boxplot()

Side-by-Side Boxplots

We can use a side-by-side boxplot to examine the relationship between a categorical variable and a numerical variable. For example, we compare the prices of diamonds with different clarity.

ggplot(diamonds, aes(x=price, y=clarity)) + geom_boxplot()

If we flip price and clarity, the boxplots become vertical.

ggplot(diamonds, aes(x=clarity, y=price)) + geom_boxplot()

It might seem surprising that diamonds with better clarity (IF, VVS1) have lower prices than those with lower clarity. This is because we didn’t adjust for carat: larger diamonds are more valuable and are also more likely to have defects or impurities. If we take diamonds of similar size (e.g., 0.7 to 1 carat) and make a side-by-side boxplot of price by clarity, then diamonds with better clarity generally have higher prices.

ggplot(subset(diamonds, carat >= 0.7 & carat < 1), 
       aes(x=clarity, y=price)) + 
  geom_boxplot()

The portion subset(diamonds, carat >= 0.7 & carat < 1) in the R code above asks R to use only the subset of the data with carat at least 0.7 and less than 1 to make the plot, rather than the entire data set.

You can adjust the range of carat and see if the same relationship persists.
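One way to try several carat ranges without retyping the whole command is to wrap the plot in a small function. The following is only a sketch: price_by_clarity_plot() is a hypothetical helper written for this example, not part of ggplot2, and the carat ranges are arbitrary.

```r
library(ggplot2)

# Hypothetical helper: boxplots of price by clarity for diamonds
# with lo <= carat < hi
price_by_clarity_plot <- function(lo, hi) {
  d <- subset(diamonds, carat >= lo & carat < hi)
  ggplot(d, aes(x = clarity, y = price)) +
    geom_boxplot() +
    ggtitle(paste0("Carat in [", lo, ", ", hi, ")"))
}

price_by_clarity_plot(0.3, 0.5)
price_by_clarity_plot(1.0, 1.5)

# The same comparison in numbers: median price by clarity in one carat range
aggregate(price ~ clarity,
          data = subset(diamonds, carat >= 1 & carat < 1.5), median)
```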

3.3 Scatterplots

In addition to a plain x-y scatter plot, one can make a coded scatter plot, using the color of dots to represent the clarity of diamonds, by specifying color=clarity inside aes().

ggplot(diamonds, aes(x=carat, y=price, color=clarity)) + geom_point()

The variables can also be transformed. Here is a coded scatterplot between carat and price, both log-transformed, with the clarity of diamonds represented by the color of dots.

ggplot(diamonds, aes(x=log(carat), y=log(price), color=clarity)) + 
  geom_point()

From the above we can see that for diamonds of the same carat, those with better clarity are more valuable.

In addition to color, one can also use the shape or size of points to represent the third variable, e.g.,

ggplot(diamonds, aes(x=log(carat), y=log(price), shape=clarity)) + geom_point()
ggplot(diamonds, aes(x=log(carat), y=log(price), size=clarity)) + geom_point()

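Nonetheless, for the diamonds data, using the shape or size of points to represent the clarity is not as clear as using color. ggplot2 also issues warnings here: shapes and sizes are not advised for this kind of variable, and because the default shape palette holds only 6 shapes while clarity has 8 levels, the points for the remaining levels are not drawn.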

How can one see the relationship among four variables: price, carat, clarity, and color? We can use the shape and color of dots to represent the clarity and the color of diamonds.

ggplot(diamonds, aes(x=log(carat), y=log(price), shape=clarity, color=color)) + geom_point()

However, it’s hard to identify the shapes of the points and see the effect of clarity from the plot above, as most of the points overlap. It’s better to plot the data “by group” using the facet feature of ggplot2. See the next section.

3.4 Facet — Plotting Data “By Group”

Just as the aggregate() function can find data summaries “by group”, one can split the data by a grouping variable and make separate histograms, box plots, or scatter plots for each group by adding facet_wrap() or facet_grid() to the ggplot() call.

For example, to see the effect of color on price after accounting for carat and clarity, we can split the data by clarity and make separate scatter plots with log(price) as the y-variable, log(carat) as the x-variable, and the color of diamonds as the color of points.

To split the data by clarity, we need to add facet_wrap(~clarity). Note that in color=color in the R code below, the first color refers to the color of the dots in the plot and the second color refers to the variable color in the diamonds data.

ggplot(diamonds, aes(x=log(carat),y=log(price),color=color)) + 
  geom_point(size = 0.1)+
  facet_wrap(~clarity)

To inspect the effect of color on the price of diamonds after accounting for clarity and carat, we need to focus on diamonds of the same clarity and carat and see whether the price changes with the color of the diamonds. In the plots above, all the points in the same sub-plot are at the same clarity level, so focusing on a single sub-plot holds clarity fixed. Furthermore, points in a sub-plot with the same x-value are diamonds with the same carat and the same clarity. As the blue dots (color = D) generally have higher y-values (price) than the yellow dots (color = J), it appears that, for diamonds with the same carat and clarity, the price increases from color J (yellow dots, least valuable) to color D (blue dots, most valuable).
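The visual impression can be checked with a numerical summary from Section 2. As a rough sketch, the code below holds clarity fixed at one level and restricts carat to a narrow band, then finds the median price for each color. The clarity level (VS2) and the carat band (1 to 1.1) are arbitrary choices made for this example, and the names d and median_by_color are illustrative.

```r
library(ggplot2)

# Hold clarity fixed and keep carat within a narrow band
# (both choices are arbitrary, for illustration only)
d <- subset(diamonds, clarity == "VS2" & carat >= 1 & carat < 1.1)

# Median price for each color within this slice
median_by_color <- aggregate(price ~ color, data = d, FUN = median)
median_by_color
```

Repeating this with other clarity levels and carat bands shows whether the pattern seen in the plot holds more generally.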

Similarly, we can examine the effect of cut on the price of diamonds after accounting for carat and clarity using the plot below. The effect of cut on price is weaker than that of color, as points of different colors are not as well separated, though we can see that Fair-cut diamonds (blue dots) tend to have lower prices.

ggplot(diamonds, aes(x=log(carat),y=log(price),color=cut)) + 
  geom_point(size = 0.1)+
  facet_wrap(~clarity)

We can even facet over two variables using the facet_grid() command. The plot below splits the diamonds by clarity and color. That is, points in the same subplot are of the same clarity and color. In each subplot, the color of the dots represents the cut of the diamonds. Though the yellow, green, and blue dots are not completely separated, we can see that, among dots with the same x-value (carat), blue dots (cut = “Fair”) tend to have low prices (low y-values) and yellow dots (cut = “Ideal”) tend to have high prices (high y-values). We can hence conclude that, for diamonds with the same carat, clarity, and color, those with better cut tend to have higher prices than those with worse cut.

ggplot(diamonds, aes(x=log(carat),y=log(price),color=cut)) + 
  geom_point(size = 0.2)+
  facet_grid(clarity~color)+
  theme(legend.position="top")
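With many panels, it can be hard to tell which strip label belongs to which variable. One option, sketched below, is the labeller argument of facet_grid() with ggplot2's label_both, which prints the variable name together with its value on each strip. The object name p_grid is chosen only for illustration.

```r
library(ggplot2)

p_grid <- ggplot(diamonds, aes(x = log(carat), y = log(price), color = cut)) +
  geom_point(size = 0.2) +
  # label_both shows the variable name and its value on each strip
  facet_grid(clarity ~ color, labeller = label_both) +
  theme(legend.position = "top")
p_grid
```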

We can also split the diamonds by the quality of cut and make a separate histogram of log(price) for each level of cut by adding facet_wrap(~cut).

ggplot(diamonds, aes(x=log(price))) + 
  geom_histogram(binwidth=0.1) +
  facet_wrap(~cut)

It’s usually better to stack the five histograms vertically so that they share the same horizontal scale and are easy to compare. You can do so by specifying nrow=5 within facet_wrap(), which arranges the 5 plots in 5 rows.

ggplot(diamonds, aes(x=log(price))) + 
  geom_histogram(binwidth=0.1) +
  facet_wrap(~cut, nrow=5)
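Because the cut groups differ a lot in size (Fair is much smaller than Ideal), the histograms for the small groups can look flat when all panels share one count axis. A possible variation, sketched below with the illustrative name p_stack, keeps the shared horizontal scale but lets each panel use its own vertical scale through the scales argument of facet_wrap().

```r
library(ggplot2)

p_stack <- ggplot(diamonds, aes(x = log(price))) +
  geom_histogram(binwidth = 0.1) +
  # ncol = 1 gives the same single-column layout as nrow = 5 here;
  # scales = "free_y" lets each panel use its own count axis
  facet_wrap(~cut, ncol = 1, scales = "free_y")
p_stack
```

With free vertical scales, the shapes of the distributions are easier to compare, but the heights of bars in different panels are no longer directly comparable.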

Similarly, we can use facet_wrap() for box plots. For example, the following shows the distribution of carat by the color and clarity of diamonds. This allows us to view the relationship among three variables (carat, clarity, and color) at the same time.

ggplot(diamonds, aes(x=color,y=carat)) + 
  geom_boxplot() +
  facet_wrap(~clarity, nrow=2)