Some define Statistics as the field that focuses on turning
information into knowledge.
The first step in that process is to summarize and describe the raw
information - the data.
In this lab, we will show you how to use R to compute numerical summaries of
data and how to visualize data with graphs.
The diamonds dataset is a built-in data set in the ggplot2 library. We can access it using the data() function after loading the ggplot2 library.
If you haven’t installed ggplot2, please run the next line. Otherwise, just skip it.
install.packages("ggplot2")
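If you are not sure whether ggplot2 is already installed, the check can be left to R. The following is a minimal sketch using requireNamespace() from base R, which returns TRUE only when the package can be loaded; installation is attempted only when it is missing.

```r
# Install ggplot2 only when it is not already available.
# requireNamespace() checks for the package without attaching it.
if (!requireNamespace("ggplot2", quietly = TRUE)) {
  install.packages("ggplot2")
}
```

This way the same script can be re-run without reinstalling the package each time.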
Let’s load the ggplot2 library.
library(ggplot2)
Then the diamonds data set can be loaded with the following command.
data(diamonds)
The variables in the diamonds data set are
price: price in US dollars
carat: weight of the diamond
cut: quality of the cut (Fair, Good, Very Good, Premium, Ideal)
color: diamond color, from J (worst) to D (best)
clarity: a measurement of how clear the diamond is, from I1 (worst), SI2, SI1, VS2, VS1, VVS2, VVS1, to IF (best)
and five physical measurements, depth, table, x, y and z, as shown in Figure 1 below.
The dimension of the diamonds dataset is
dim(diamonds)
## [1] 53940 10
from which we can see the data contains 53940 rows and 10 variables.
To view the names and types of the variables, type the command
str(diamonds)
## tibble [53,940 x 10] (S3: tbl_df/tbl/data.frame)
## $ carat : num [1:53940] 0.23 0.21 0.23 0.29 0.31 0.24 0.24 0.26 0.22 0.23 ...
## $ cut : Ord.factor w/ 5 levels "Fair"<"Good"<..: 5 4 2 4 2 3 3 3 1 3 ...
## $ color : Ord.factor w/ 7 levels "D"<"E"<"F"<"G"<..: 2 2 2 6 7 7 6 5 2 5 ...
## $ clarity: Ord.factor w/ 8 levels "I1"<"SI2"<"SI1"<..: 2 3 5 4 2 6 7 3 4 5 ...
## $ depth : num [1:53940] 61.5 59.8 56.9 62.4 63.3 62.8 62.3 61.9 65.1 59.4 ...
## $ table : num [1:53940] 55 61 65 58 58 57 57 55 61 61 ...
## $ price : int [1:53940] 326 326 327 334 335 336 336 337 337 338 ...
## $ x : num [1:53940] 3.95 3.89 4.05 4.2 4.34 3.94 3.95 4.07 3.87 4 ...
## $ y : num [1:53940] 3.98 3.84 4.07 4.23 4.35 3.96 3.98 4.11 3.78 4.05 ...
## $ z : num [1:53940] 2.43 2.31 2.31 2.63 2.75 2.48 2.47 2.53 2.49 2.39 ...
Here we can see that carat, depth, table, price, x, y and z are numerical variables, and cut, color, and clarity are ordinal categorical variables. Specifically, price is an integer-valued variable.
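The same classification can be read off programmatically rather than by eye. The sketch below uses only base R helpers (sapply(), is.numeric(), is.ordered(), levels()); the object names var_classes, num_vars and ord_vars are illustrative and not part of the lab.

```r
library(ggplot2)
data(diamonds)

# First class of every column ("ordered" for ordinal factors)
var_classes <- sapply(diamonds, function(col) class(col)[1])
print(var_classes)

# Names of the numerical and of the ordinal categorical variables
num_vars <- names(diamonds)[sapply(diamonds, is.numeric)]
ord_vars <- names(diamonds)[sapply(diamonds, is.ordered)]
print(num_vars)
print(ord_vars)

# The ordering of an ordinal variable is stored in its levels
print(levels(diamonds$clarity))
```

Checking levels() is a quick way to confirm which end of an ordinal scale R treats as lowest.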
To calculate the mean, median, SD, variance, five-number summary, IQR, minimum, and maximum of the price variable in the diamonds dataset, type
mean(diamonds$price)
median(diamonds$price)
sd(diamonds$price)
var(diamonds$price)
fivenum(diamonds$price)
IQR(diamonds$price)
min(diamonds$price)
max(diamonds$price)
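Several of these summaries are related to each other, and it can help to verify that in R. The sketch below, with illustrative object names, checks that the SD is the square root of the variance and that the IQR is the distance between the first and third quartiles returned by quantile(). Note that fivenum() is based on Tukey's hinges, which may differ slightly from the quartiles that quantile() and summary() report.

```r
library(ggplot2)
data(diamonds)
price <- diamonds$price

# Minimum, quartiles, median, mean and maximum in one call
print(summary(price))

# IQR computed by hand from the first and third quartiles
q <- quantile(price, probs = c(0.25, 0.75), names = FALSE)
iqr_by_hand <- q[2] - q[1]
print(all.equal(iqr_by_hand, IQR(price)))

# The SD is the square root of the variance
print(all.equal(sd(price), sqrt(var(price))))
```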
aggregate() Function for Finding Data Summary “By Group”
Rather than finding data summaries over the entire data set, one might be more interested in summarizing data by group.
The aggregate() function in R splits the data into subsets, computes summary statistics for each, and returns the result in a convenient form. The aggregate() function accepts R’s formula (modeling) syntax. This involves the use of a tilde (~), which can be read as “is a function of”: the variable to summarize goes on the left of the tilde and the grouping variable on the right. For example, one can find the mean price of diamonds by quality of cut
aggregate(price ~ cut , data=diamonds, mean)
## cut price
## 1 Fair 4358.758
## 2 Good 3928.864
## 3 Very Good 3981.760
## 4 Premium 4584.258
## 5 Ideal 3457.542
Surprisingly, higher quality cut diamonds are not necessarily more expensive (e.g., the mean price of diamonds with the best cut (Ideal) is $3457.5, lower than that of the worst cut (Fair), $4358.8). This is because we didn’t take the weight (carat) of diamonds into account. Diamonds with Ideal cut tend to be smaller than diamonds with Fair cut.
aggregate(carat ~ cut , data=diamonds, mean)
## cut carat
## 1 Fair 1.0461366
## 2 Good 0.8491847
## 3 Very Good 0.8063814
## 4 Premium 0.8919549
## 5 Ideal 0.7028370
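The mean price and the mean carat by cut can also be put side by side in one table, which makes the comparison easier. This sketch relies on the documented ability of aggregate() to take cbind() of several variables on the left of the tilde; the object name by_cut and the column price_per_carat are illustrative, and the ratio of the two means is only a rough indicator, not a proper adjustment for weight.

```r
library(ggplot2)
data(diamonds)

# Mean price and mean carat for each cut in a single call
by_cut <- aggregate(cbind(price, carat) ~ cut, data = diamonds, FUN = mean)

# Rough indicator: mean price divided by mean carat (illustrative only)
by_cut$price_per_carat <- by_cut$price / by_cut$carat
print(by_cut)
```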
We can also find the mean carat of diamonds grouped by cut and clarity
aggregate(carat ~ cut + clarity , data=diamonds, mean)
## cut clarity carat
## 1 Fair I1 1.3610000
## 2 Good I1 1.2030208
## 3 Very Good I1 1.2819048
## 4 Premium I1 1.2870244
## 5 Ideal I1 1.2226712
## 6 Fair SI2 1.2038412
## 7 Good SI2 1.0352266
## 8 Very Good SI2 1.0643381
## 9 Premium SI2 1.1441607
## 10 Ideal SI2 1.0079253
## 11 Fair SI1 0.9646324
## 12 Good SI1 0.8303974
## 13 Very Good SI1 0.8459784
## 14 Premium SI1 0.9086014
## 15 Ideal SI1 0.8018076
## 16 Fair VS2 0.8852490
## 17 Good VS2 0.8507873
## 18 Very Good VS2 0.8111810
## 19 Premium VS2 0.8337742
## 20 Ideal VS2 0.6705660
## 21 Fair VS1 0.8798235
## 22 Good VS1 0.7576852
## 23 Very Good VS1 0.7333070
## 24 Premium VS1 0.7933082
## 25 Ideal VS1 0.6747144
## 26 Fair VVS2 0.6915942
## 27 Good VVS2 0.6149301
## 28 Very Good VVS2 0.5663887
## 29 Premium VVS2 0.6547241
## 30 Ideal VVS2 0.5862126
## 31 Fair VVS1 0.6647059
## 32 Good VVS1 0.5023118
## 33 Very Good VVS1 0.4945881
## 34 Premium VVS1 0.5348214
## 35 Ideal VVS1 0.4959599
## 36 Fair IF 0.4744444
## 37 Good IF 0.6163380
## 38 Very Good IF 0.6187687
## 39 Premium IF 0.6034783
## 40 Ideal IF 0.4550413
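A 40-row listing is hard to scan. As an alternative sketch, base R's tapply() with two grouping factors returns the same group means arranged as a two-way table, with cut in the rows and clarity in the columns. The object name carat_table is illustrative.

```r
library(ggplot2)
data(diamonds)

# Mean carat for every combination of cut (rows) and clarity (columns)
carat_table <- with(diamonds,
                    tapply(carat, list(cut = cut, clarity = clarity), mean))
print(round(carat_table, 2))
```

Reading along a row shows how the mean carat changes with clarity for a fixed cut.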
The tilde (~) syntax in aggregate() also works with median(), sd(), var(), min(), max(), sum(), IQR(), etc.,
aggregate(price ~ cut , data=diamonds, median)
aggregate(price ~ cut , data=diamonds, sd)
aggregate(price ~ cut , data=diamonds, var)
aggregate(price ~ cut , data=diamonds, min)
aggregate(price ~ cut , data=diamonds, max)
aggregate(price ~ cut , data=diamonds, IQR)
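The summary function passed to aggregate() does not have to be a built-in one. As a sketch, one can supply a small function of one's own; the two statistics below (the 90th percentile and the coefficient of variation) are chosen only for illustration, and the object names pct90 and cv are hypothetical.

```r
library(ggplot2)
data(diamonds)

# 90th percentile of price within each cut
pct90 <- aggregate(price ~ cut, data = diamonds,
                   FUN = function(p) quantile(p, probs = 0.9, names = FALSE))
print(pct90)

# Coefficient of variation (SD divided by mean) of price within each cut
cv <- aggregate(price ~ cut, data = diamonds,
                FUN = function(p) sd(p) / mean(p))
print(cv)
```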
The ggplot2 library is a powerful R library for making polished plots and visualizing data. Since its release in 2007, ggplot2 has been adopted worldwide, in many settings replacing the built-in R functions plot(), hist(), and boxplot() for making plots, and has become one of the dominant tools for data visualization.
We hence choose to teach students in STAT 220 using ggplot2.
The code for ggplot2 looks longer than that for built-in R plotting, but it’s far more versatile. It is surely worth the little extra effort to learn ggplot2.
library(ggplot2)
All ggplot() code begins like the following
ggplot(dataframename, aes(x=...,)) + geom_XXX()
The first thing one needs to provide to ggplot() is the name of the data frame. Then one needs to specify aes(), shorthand for “aesthetics”, which are the variables used for the plot and their roles: the x-variable, the y-variable, the variable that specifies the color or shape of the points/lines/shades, and so on. The next thing to specify is the type of plot to make: geom_histogram() for a histogram, geom_boxplot() for a boxplot, geom_point() for a scatter plot, and so on.
For example, to make a histogram and a boxplot for the carat variable in the diamonds data,
ggplot(diamonds, aes(x=carat)) + geom_histogram()
ggplot(diamonds, aes(x=carat)) + geom_boxplot()
You only need to specify an x-variable, aes(x=carat), for the histogram and the boxplot.
To make a scatter plot of the price of a diamond against its carat, we need to specify both the x- and the y-variable, aes(x=carat, y=price).
ggplot(diamonds, aes(x=carat, y=price)) + geom_point()
In the following, we go into a bit more detail about histograms, boxplots, and scatter plots.
You can adjust the bin width of the histogram.
ggplot(diamonds, aes(x=carat)) + geom_histogram(binwidth=0.1)
ggplot(diamonds, aes(x=carat)) + geom_histogram(binwidth=0.02)
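Instead of the width of each bin, one can fix the number of bins with the bins argument of geom_histogram(). When neither is given, ggplot2 falls back to 30 bins and prints a message suggesting a better choice. The value 50 below is arbitrary and only for illustration.

```r
library(ggplot2)
data(diamonds)

# Histogram of carat with the number of bins fixed at 50
p_bins <- ggplot(diamonds, aes(x = carat)) +
  geom_histogram(bins = 50)
print(p_bins)
```

Trying a few values of bins or binwidth is worthwhile, since features such as the spikes at round carat values only show up with narrow bins.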
You can change the orientation of the boxplot from horizontal to vertical by changing price from an x-variable to a y-variable.
ggplot(diamonds, aes(x=price)) + geom_boxplot()
ggplot(diamonds, aes(y=price)) + geom_boxplot()
We can use a side-by-side boxplot to examine the relationship between a categorical variable and a numerical variable. For example, we compare the prices of diamonds with different clarity.
ggplot(diamonds, aes(x=price, y=clarity)) + geom_boxplot()
If we swap price and clarity, the boxplots become vertical.
ggplot(diamonds, aes(x=clarity, y=price)) + geom_boxplot()
It might seem surprising that diamonds with better clarity (IF, VVS1) have lower prices than those with lower clarity. This is because we didn’t adjust for carat: larger diamonds are more valuable and are also more likely to have defects or impurities. If we take diamonds of similar size (e.g., 0.7 to 1 carat) and make a side-by-side boxplot of price by clarity, then diamonds with better clarity generally have higher prices.
ggplot(subset(diamonds, carat >= 0.7 & carat < 1),
       aes(x=clarity, y=price)) +
  geom_boxplot()
The portion subset(diamonds, carat >= 0.7 & carat < 1) in the R code above asks R to use only the subset of the data with carat at least 0.7 and less than 1 to make the plot, rather than the entire data.
You can adjust the range of carat and see if the same relationship persists.
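As one way to try this, the sketch below repeats the boxplot for diamonds between 1 and 1.5 carats and adds a numerical check with aggregate(). The range and the object names big and p_big are arbitrary choices for illustration; no particular result is claimed here.

```r
library(ggplot2)
data(diamonds)

# Keep only diamonds with carat at least 1 and less than 1.5
big <- subset(diamonds, carat >= 1 & carat < 1.5)

# Side-by-side boxplots of price by clarity for this subset
p_big <- ggplot(big, aes(x = clarity, y = price)) +
  geom_boxplot()
print(p_big)

# Median price by clarity for the same subset
print(aggregate(price ~ clarity, data = big, FUN = median))
```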
In addition to a plain x-y scatter plot, one can make a coded scatter plot, using the color of dots to represent the clarity of diamonds, by specifying color=clarity inside aes().
ggplot(diamonds, aes(x=carat, y=price, color=clarity)) + geom_point()
The variables can also be transformed. Here is a coded scatterplot between carat and price, both log-transformed, with the clarity of diamonds represented by the color of dots.
ggplot(diamonds, aes(x=log(carat), y=log(price), color=clarity)) +
  geom_point()
From the above we can see that for diamonds of the same carat, those with better clarity are more valuable.
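An alternative to transforming the variables inside aes() is to transform the axes. The sketch below uses scale_x_log10() and scale_y_log10() from ggplot2; these use base-10 logarithms rather than the natural logarithm of log(), so the point cloud has the same shape, but the axis labels stay in the original units (carats and dollars).

```r
library(ggplot2)
data(diamonds)

# Same coded scatter plot, with log-scaled axes instead of log-transformed variables
p_log <- ggplot(diamonds, aes(x = carat, y = price, color = clarity)) +
  geom_point() +
  scale_x_log10() +
  scale_y_log10()
print(p_log)
```

Keeping the original units on the axes can make the plot easier to read for someone who does not think in log-dollars.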
In addition to color, one can also use the shape or size of points to represent a third variable, e.g.,
ggplot(diamonds, aes(x=log(carat), y=log(price), shape=clarity)) + geom_point()
ggplot(diamonds, aes(x=log(carat), y=log(price), size=clarity)) + geom_point()
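Nonetheless, for the diamonds data, using the shape or size of points to represent the clarity is not as clear as using color. In addition, the default shape palette of ggplot2 handles at most 6 discrete values, so with the 8 levels of clarity ggplot2 issues a warning and leaves out the points of two levels.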
Can one see the relationship between four variables, price, carat, clarity and color, in one plot? We can use the shape and color of dots to represent the clarity and the color of diamonds.
ggplot(diamonds, aes(x=log(carat), y=log(price), shape=clarity, color=color)) + geom_point()
However, it’s hard to identify the shapes of points and see the effects of clarity from the plot above, as most of the points overlap. It’s better to plot the data “by group” using the facet feature of ggplot2. See the next section.
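Separately from faceting, a partial remedy for overlapping points is to make them smaller and semi-transparent through the size and alpha arguments of geom_point(). The values below are arbitrary and only a sketch; this does not solve the problem of telling shapes apart, so the plot uses color alone.

```r
library(ggplot2)
data(diamonds)

# Small, semi-transparent points reduce the effect of overplotting
p_alpha <- ggplot(diamonds, aes(x = log(carat), y = log(price), color = color)) +
  geom_point(alpha = 0.2, size = 0.5)
print(p_alpha)
```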
Just as the aggregate() function can find data summaries “by group”, one can split data by a grouping variable and make separate histograms/box plots/scatter plots for each group by adding facet_wrap() or facet_grid() to the ggplot.
For example, to see the effect of color on price after accounting for carat and clarity, we can split the data by clarity and make separate scatter plots with log(price) as the y-variable, log(carat) as the x-variable, and the color of diamonds as the color of points.
To split the data by clarity, we need to add facet_wrap(~clarity). Note that for color=color in the R code below, the first color means the color of dots in the plot and the second color means the variable color in the diamonds data.
ggplot(diamonds, aes(x=log(carat), y=log(price), color=color)) +
  geom_point(size = 0.1) +
  facet_wrap(~clarity)
To inspect the effect of color on the price of diamonds after accounting for clarity and carat, we need to focus on diamonds of the same clarity and carat and see if the price changes with the color of diamonds. For the plots above, note that all the points in the same sub-plot are at the same clarity level, so we just need to focus on a single sub-plot to fix the clarity effect. Furthermore, points in a sub-plot with the same x-value are diamonds with the same carat and the same clarity.
As the blue dots (color = D) generally have higher y-values (price) than the yellow dots (color = J), it appears that, for diamonds with the same carat and clarity, the price increases from color J (yellow dots, least valuable) to color D (blue dots, most valuable).
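The visual impression can be checked numerically by fixing clarity and a narrow range of carat, then comparing prices across colors with aggregate(). The sketch below picks clarity VS2 and carat from 1 to just under 1.1 purely as an example; the object name one_group is illustrative, and other clarity levels or carat ranges can be substituted.

```r
library(ggplot2)
data(diamonds)

# Diamonds of one clarity level and nearly the same carat
one_group <- subset(diamonds, clarity == "VS2" & carat >= 1 & carat < 1.1)

# Median price by color within this group
print(aggregate(price ~ color, data = one_group, FUN = median))
```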
Similarly, we can examine the effect of cut on the price of diamonds after accounting for carat and clarity using the plot below. The effect of cut on the price of diamonds isn’t as strong as that of color, as points of different colors are not as well separated, though we can see that Fair-cut diamonds (blue dots) tend to have lower prices.
ggplot(diamonds, aes(x=log(carat), y=log(price), color=cut)) +
  geom_point(size = 0.1) +
  facet_wrap(~clarity)
We can even facet over 2 variables using the facet_grid() command. The plot below splits the diamonds by clarity and color. That is, points in the same subplot are of the same clarity and color. In each subplot, the color of dots represents the cut of diamonds. Though the yellow, green and blue dots are not completely separated, we can see that, among dots with the same x-value (carat), blue dots (cut = Fair) tend to have low prices (low y-values) and yellow dots (cut = Ideal) tend to have high prices (high y-values). We can hence conclude that, for diamonds with the same carat, clarity, and color, those with better cut tend to have higher prices than those with worse cut.
ggplot(diamonds, aes(x=log(carat), y=log(price), color=cut)) +
  geom_point(size = 0.2) +
  facet_grid(clarity~color) +
  theme(legend.position="top")
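With 8 clarity levels and 7 colors, the grid has 56 panels, and the strip labels show only the level names. As a sketch, the labeller argument of facet_grid() with label_both (both provided by ggplot2) prints the variable name together with the level, so that each strip reads like "color: D" or "clarity: I1".

```r
library(ggplot2)
data(diamonds)

# Same grid of scatter plots, with variable names included in the strip labels
p_grid <- ggplot(diamonds, aes(x = log(carat), y = log(price), color = cut)) +
  geom_point(size = 0.2) +
  facet_grid(clarity ~ color, labeller = label_both) +
  theme(legend.position = "top")
print(p_grid)
```

In the formula clarity ~ color, the variable before the tilde defines the rows of the grid and the one after it defines the columns.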
We can also use facet_wrap() to split the diamonds by the quality of cut and make separate histograms of log(price) for each level of cut by adding facet_wrap(~cut).
ggplot(diamonds, aes(x=log(price))) +
  geom_histogram(binwidth=0.1) +
  facet_wrap(~cut)
It’s usually better to stack the five histograms on the same horizontal scale. You can do so by specifying nrow=5 within facet_wrap(), which will arrange the 5 plots in 5 rows.
ggplot(diamonds, aes(x=log(price))) +
  geom_histogram(binwidth=0.1) +
  facet_wrap(~cut, nrow=5)
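Because the number of diamonds differs a lot between cuts, the histograms for the smaller groups can look flat when all panels share one vertical scale. As a sketch, the scales argument of facet_wrap() can free the y-axis of each panel while the x-axis stays common; ncol = 1 is used here as an equivalent way of stacking the five panels.

```r
library(ggplot2)
data(diamonds)

# Stacked histograms with a common x-axis and a separate y-axis per panel
p_free <- ggplot(diamonds, aes(x = log(price))) +
  geom_histogram(binwidth = 0.1) +
  facet_wrap(~cut, ncol = 1, scales = "free_y")
print(p_free)
```

With free y-axes the shapes of the distributions are easier to compare, but the heights of bars in different panels are no longer comparable.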
Similarly, we can use facet_wrap() for box plots. For example, the following shows the distribution of carat by the color and clarity of diamonds. This allows us to view the relation of 3 variables, carat, clarity and color, at the same time.
ggplot(diamonds, aes(x=color, y=carat)) +
  geom_boxplot() +
  facet_wrap(~clarity, nrow=2)
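The center line of each box in the plot above is a group median, so the picture can be paired with the corresponding numerical summary from aggregate(). This is a sketch that reuses the formula syntax from earlier in the lab; the object name med_carat is illustrative.

```r
library(ggplot2)
data(diamonds)

# Median carat for every combination of color and clarity
med_carat <- aggregate(carat ~ color + clarity, data = diamonds, FUN = median)
print(head(med_carat, 10))
```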