Some define Statistics as the field that focuses on turning
information into knowledge.
The first step in that process is to summarize and describe the raw
information - the data.
In this lab, we will show you how to use R to get numerical summaries of
data and how to visualize data with graphs.
The diamonds dataset is a built-in data set in the
ggplot2 library. We can access it using the
data function after loading the ggplot2
library.
If you haven’t installed ggplot2, please run the next
line. Otherwise, just skip it.
install.packages("ggplot2")
Let’s load the ggplot2 library.
library(ggplot2)
Then the diamonds data set can be loaded by the
command
data(diamonds)
The variables in the diamonds data set are
price: price in US dollars
carat: weight of the diamond
cut: quality of the cut (Fair, Good, Very Good, Premium, Ideal)
color: diamond color, from J (worst) to D (best)
clarity: a measurement of how clear the diamond is, from I1 (worst), SI2, SI1, VS2, VS1, VVS2, VVS1, to IF (best)
and five physical measurements, depth, table, x, y and z,
as shown in Figure 1 below
Figure 1
The dimension of the diamonds dataset is
dim(diamonds)
## [1] 53940 10
from which we can see the data contains 53940 rows and 10 variables.
To view the names and types of the variables, type the command
str(diamonds)
## tibble [53,940 x 10] (S3: tbl_df/tbl/data.frame)
## $ carat : num [1:53940] 0.23 0.21 0.23 0.29 0.31 0.24 0.24 0.26 0.22 0.23 ...
## $ cut : Ord.factor w/ 5 levels "Fair"<"Good"<..: 5 4 2 4 2 3 3 3 1 3 ...
## $ color : Ord.factor w/ 7 levels "D"<"E"<"F"<"G"<..: 2 2 2 6 7 7 6 5 2 5 ...
## $ clarity: Ord.factor w/ 8 levels "I1"<"SI2"<"SI1"<..: 2 3 5 4 2 6 7 3 4 5 ...
## $ depth : num [1:53940] 61.5 59.8 56.9 62.4 63.3 62.8 62.3 61.9 65.1 59.4 ...
## $ table : num [1:53940] 55 61 65 58 58 57 57 55 61 61 ...
## $ price : int [1:53940] 326 326 327 334 335 336 336 337 337 338 ...
## $ x : num [1:53940] 3.95 3.89 4.05 4.2 4.34 3.94 3.95 4.07 3.87 4 ...
## $ y : num [1:53940] 3.98 3.84 4.07 4.23 4.35 3.96 3.98 4.11 3.78 4.05 ...
## $ z : num [1:53940] 2.43 2.31 2.31 2.63 2.75 2.48 2.47 2.53 2.49 2.39 ...
Here we can see that carat, depth,
table, price, x,
y, and z are numerical variables, and
cut, color, and clarity are
ordinal categorical variables. Specifically, price is an
integer-valued variable.
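The variable types read off the str() output can also be checked programmatically. The following is a minimal sketch, assuming ggplot2 is installed and using only base R helpers (sapply(), is.numeric(), is.ordered(), levels()); it is one possible way to do the check, not part of the lab's required steps.

```r
library(ggplot2)   # provides the diamonds data set
data(diamonds)

# First class of each column; ordered factors report "ordered"
sapply(diamonds, function(col) class(col)[1])

# Names of the numerical columns
names(diamonds)[sapply(diamonds, is.numeric)]

# Names of the ordinal (ordered factor) columns
names(diamonds)[sapply(diamonds, is.ordered)]

# Levels of an ordered factor are listed from lowest to highest
levels(diamonds$clarity)
```

The last call lists the clarity levels in the same order as the "<" signs in the str() output.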
To calculate the mean, median, SD, variance, five-number summary,
IQR, minimum, maximum of the price variable in the
diamonds dataset, type
mean(diamonds$price)
median(diamonds$price)
sd(diamonds$price)
var(diamonds$price)
fivenum(diamonds$price)
IQR(diamonds$price)
min(diamonds$price)
max(diamonds$price)
aggregate() Function for Finding Data Summary “By Group”
Rather than finding data summaries over the entire data set, one
might be more interested in summarizing data by group.
The aggregate() function in R splits the data into subsets,
computes summary statistics for each, and returns the result in a
convenient form. The aggregate() function accepts R’s formula
(modeling) syntax. This involves the use of a tilde (~), which
can be read as “is a function of”. For example, one can find the mean
price of diamonds by quality of cut:
aggregate(price ~ cut , data=diamonds, mean)
## cut price
## 1 Fair 4358.758
## 2 Good 3928.864
## 3 Very Good 3981.760
## 4 Premium 4584.258
## 5 Ideal 3457.542
Surprisingly, higher quality cut diamonds are not necessarily more
expensive (e.g., the mean price of diamonds with the best cut (Ideal) is
$3457.5, lower than that of the worst cut (Fair), $4358.8). This is
because we didn’t take the weight (carat) of diamonds into
account. Diamonds with Ideal cut tend to be smaller than diamonds with
Fair cut.
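One way to see both sides of this comparison at once is to summarize price and carat in a single call. The sketch below assumes the formula interface of aggregate() with cbind() on the left-hand side of the tilde, which base R supports for summarizing several variables together; the object name means_by_cut is only illustrative.

```r
library(ggplot2)   # provides the diamonds data set
data(diamonds)

# Mean price and mean carat for each cut, computed in one call
means_by_cut <- aggregate(cbind(price, carat) ~ cut, data = diamonds, FUN = mean)
means_by_cut
```

The carat column of this result should match the output of the separate call that follows.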
aggregate(carat ~ cut , data=diamonds, mean)
## cut carat
## 1 Fair 1.0461366
## 2 Good 0.8491847
## 3 Very Good 0.8063814
## 4 Premium 0.8919549
## 5 Ideal 0.7028370
We can find the mean carat of diamonds grouped by cut
and clarity
aggregate(carat ~ cut + clarity , data=diamonds, mean)
## cut clarity carat
## 1 Fair I1 1.3610000
## 2 Good I1 1.2030208
## 3 Very Good I1 1.2819048
## 4 Premium I1 1.2870244
## 5 Ideal I1 1.2226712
## 6 Fair SI2 1.2038412
## 7 Good SI2 1.0352266
## 8 Very Good SI2 1.0643381
## 9 Premium SI2 1.1441607
## 10 Ideal SI2 1.0079253
## 11 Fair SI1 0.9646324
## 12 Good SI1 0.8303974
## 13 Very Good SI1 0.8459784
## 14 Premium SI1 0.9086014
## 15 Ideal SI1 0.8018076
## 16 Fair VS2 0.8852490
## 17 Good VS2 0.8507873
## 18 Very Good VS2 0.8111810
## 19 Premium VS2 0.8337742
## 20 Ideal VS2 0.6705660
## 21 Fair VS1 0.8798235
## 22 Good VS1 0.7576852
## 23 Very Good VS1 0.7333070
## 24 Premium VS1 0.7933082
## 25 Ideal VS1 0.6747144
## 26 Fair VVS2 0.6915942
## 27 Good VVS2 0.6149301
## 28 Very Good VVS2 0.5663887
## 29 Premium VVS2 0.6547241
## 30 Ideal VVS2 0.5862126
## 31 Fair VVS1 0.6647059
## 32 Good VVS1 0.5023118
## 33 Very Good VVS1 0.4945881
## 34 Premium VVS1 0.5348214
## 35 Ideal VVS1 0.4959599
## 36 Fair IF 0.4744444
## 37 Good IF 0.6163380
## 38 Very Good IF 0.6187687
## 39 Premium IF 0.6034783
## 40 Ideal IF 0.4550413
The tilde (~) syntax also works for
median(), sd(), var(),
min(), max(), sum(),
IQR(), etc.,
aggregate(price ~ cut , data=diamonds, median)
aggregate(price ~ cut , data=diamonds, sd)
aggregate(price ~ cut , data=diamonds, var)
aggregate(price ~ cut , data=diamonds, min)
aggregate(price ~ cut , data=diamonds, max)
aggregate(price ~ cut , data=diamonds, IQR)
The ggplot2 library is a powerful R library for making
fancy plots and visualizing data. After its release in 2007,
ggplot2 was adopted worldwide, quickly replaced the
built-in R functions plot(), hist(), and
boxplot() for making plots, and became the dominant tool
for data visualization.
We hence choose to teach students in STAT 220 using
ggplot2. The code for ggplot2 looks longer than
that for built-in R plotting, but it’s far more versatile. It is surely
worth the little extra effort to learn ggplot2.
library(ggplot2)
All code for ggplot() begins like the following
ggplot(dataframename, aes(x=...,)) + geom_XXX()
The first thing one needs to provide to ggplot() is the
name of the data frame. Then one needs to specify aes(),
shorthand for “aesthetics”, which are the variables used for the
plot and their roles: the x-variable, the y-variable, the variable that
specifies the color or shape of the
points/lines/shades, and so on. The next thing to specify is the type of
plot to make: geom_histogram() for a histogram, geom_boxplot() for a boxplot, geom_point() for a scatter plot,
and so on.
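A ggplot() call can also be built up in steps, which may make the pieces of the template easier to tell apart. The following is a minimal sketch, assuming the diamonds data from ggplot2; the object names base_plot and hist_plot are only illustrative.

```r
library(ggplot2)
data(diamonds)

# Step 1: data frame and aesthetics only; on its own this draws an empty panel
base_plot <- ggplot(diamonds, aes(x = carat))

# Step 2: add a geom layer to choose the type of plot
hist_plot <- base_plot + geom_histogram(binwidth = 0.1)

# Printing the object (or typing its name at the console) draws the plot
print(hist_plot)
```

Storing the first step in an object makes it possible to try different geoms on the same data and aesthetics without retyping them.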
For example, to make a histogram and a boxplot for the
carat variable in the diamonds data,
ggplot(diamonds, aes(x=carat)) + geom_histogram()
ggplot(diamonds, aes(x=carat)) + geom_boxplot()
You only need to specify an x-variable, aes(x=carat), for
the histogram and the boxplot.
To make a scatter plot of the price against the
carat of a diamond, we need to specify both the x- and the
y-variable, aes(x=carat, y=price).
ggplot(diamonds, aes(x=carat, y=price)) + geom_point()
In the following, we go into a bit more detail about histograms, boxplots, and scatter plots.
You can adjust the bin width of the histogram.
ggplot(diamonds, aes(x=carat)) + geom_histogram(binwidth=0.1)
ggplot(diamonds, aes(x=carat)) + geom_histogram(binwidth=0.02)
You can change the orientation of the boxplot from horizontal to
vertical by changing price from an x-variable to a
y-variable.
ggplot(diamonds, aes(x=price)) + geom_boxplot()
ggplot(diamonds, aes(y=price)) + geom_boxplot()
We can use a side-by-side boxplot to examine the relationship between a categorical variable and a numerical variable. For example, we compare the prices of diamonds with different clarity.
ggplot(diamonds, aes(x=price, y=clarity)) + geom_boxplot()
If we flip price and clarity, the boxplots
become vertical.
ggplot(diamonds, aes(x=clarity, y=price)) + geom_boxplot()
It might seem surprising that diamonds with better clarity (IF, VVS1) tend to have lower prices than those with lower clarity. This is because we didn’t adjust for carat: larger diamonds are more valuable and are also more likely to have defects or impurities. If we take diamonds of similar size (e.g., 0.7 to 1 carat) and make a side-by-side boxplot of price by clarity, then diamonds with better clarity generally have higher prices.
ggplot(subset(diamonds, carat >= 0.7 & carat < 1),
aes(x=clarity, y=price)) +
geom_boxplot()
The portion
subset(diamonds, carat >= 0.7 & carat < 1) in the
R code above asks R to use only the subset of the data with carat between
0.7 and 1 to make the plot, rather than the entire data.
You can adjust the range of carat and see if the same relationship persists.
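Trying several carat ranges is easier with a small helper. The sketch below is only one possible approach: median_price_by_clarity() is a hypothetical function written for this illustration (it is not part of ggplot2 or base R), and it combines subset() with the aggregate() formula syntax used earlier to report the median price at each clarity level.

```r
library(ggplot2)
data(diamonds)

# Hypothetical helper: median price for each clarity level among diamonds
# whose carat lies in the range [lower, upper)
median_price_by_clarity <- function(lower, upper) {
  in_range <- subset(diamonds, carat >= lower & carat < upper)
  aggregate(price ~ clarity, data = in_range, FUN = median)
}

median_price_by_clarity(0.7, 1)    # the range used in the plot above
median_price_by_clarity(1, 1.5)    # another range to compare
```

If the pattern in the boxplots holds, the medians should generally rise from the first row (lowest clarity) to the last row (highest clarity) in each range.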
In addition to a plain x-y scatter plot, one can make a coded scatter
plot, using the color of dots to represent the clarity of diamonds by
specifying color=clarity inside aes()
ggplot(diamonds, aes(x=carat, y=price, color=clarity)) + geom_point()
The variables can also be transformed. Here is a coded scatterplot between carat and price, both log-transformed, with the clarity of diamonds represented by the color of dots.
ggplot(diamonds, aes(x=log(carat), y=log(price), color=clarity)) +
geom_point()
From the above we can see that for diamonds of the same carat, those with better clarity are more valuable.
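An alternative to transforming the variables inside aes() is to leave carat and price as they are and put the axes on a logarithmic scale. The sketch below assumes ggplot2's scale_x_log10() and scale_y_log10(); these use base 10 rather than the natural logarithm computed by log(), so the tick labels differ while the pattern of the points stays the same. The name log_scale_plot is only illustrative.

```r
library(ggplot2)
data(diamonds)

# Same coded scatter plot, with log-scaled axes labeled in the original units
log_scale_plot <- ggplot(diamonds, aes(x = carat, y = price, color = clarity)) +
  geom_point() +
  scale_x_log10() +
  scale_y_log10()

print(log_scale_plot)
```

With this version the axes show carats and dollars directly, which can make the plot easier to read for an audience unfamiliar with log units.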
In addition to color, one can also use the
shape or size of points to represent the third variable,
e.g.,
ggplot(diamonds, aes(x=log(carat), y=log(price), shape=clarity)) + geom_point()
ggplot(diamonds, aes(x=log(carat), y=log(price), size=clarity)) + geom_point()
Nonetheless, for the diamonds data, using the shape or size of points to represent the clarity is not as clear as using color.
Can one see the relationship between the four variables
price, carat, clarity and
color? We can use the shape and color of dots to represent
the clarity and the color of diamonds.
ggplot(diamonds, aes(x=log(carat), y=log(price), shape=clarity, color=color)) + geom_point()
However, it’s hard to identify the shapes of points and see the
effects of clarity from the plot above, as most of the
points overlap. It’s better to plot the data “by group” using
the facet feature of ggplot. See the next
section.
Just as the aggregate() function can find data
summaries “by group”, one can split data by a grouping variable and make
separate histograms/box plots/scatter plots for each group by adding
facet_wrap() or facet_grid() to the
ggplot.
For example, to see the effect of color on
price after accounting for carat and
clarity, we can split the data by clarity and
make separate scatter plots with log(price) as the
y-variable, log(carat) as the x-variable, and the
color of diamonds as the color of points.
To split the data by clarity, we need to add
facet_wrap(~clarity). Note that for color=color in
the R code below, the first color means the color of dots
in the plot and the second color means the variable
color in the diamonds data.
ggplot(diamonds, aes(x=log(carat),y=log(price),color=color)) +
geom_point(size = 0.1)+
facet_wrap(~clarity)
To inspect the effect of color on the price of diamonds
after accounting for clarity and carat, we
need to focus on diamonds of the same clarity and
carat and see if the price changes with the
color of diamonds. For the plots above, note that all the points
in the same sub-plot are at the same clarity level, so we
just need to focus on a single sub-plot to fix the clarity
effect. Furthermore, points in a sub-plot with the same x-value are
diamonds with the same carat and the same clarity.
As the blue dots (color = D) generally have
higher y-values (price) than the yellow dots
(color = J), it appears that, for diamonds with the
same carat and clarity, the price increases from color J (yellow dots,
least valuable) to color D (blue dots, most valuable).
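The same comparison can be sketched numerically by holding clarity and carat roughly fixed and summarizing price by color. The clarity level (VS2) and the carat range (1 to 1.1) below are arbitrary choices made for illustration, not values taken from the lab, and the object names similar_diamonds and price_by_color are hypothetical.

```r
library(ggplot2)
data(diamonds)

# Keep one clarity level and a narrow carat range so that clarity and size
# are approximately held constant
similar_diamonds <- subset(diamonds,
                           clarity == "VS2" & carat >= 1 & carat < 1.1)

# Median price for each color (D is the best color, J the worst)
price_by_color <- aggregate(price ~ color, data = similar_diamonds, FUN = median)
price_by_color
```

Other clarity levels and carat ranges can be substituted to see whether the ordering of the medians is similar.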
Similarly, we can examine the effect of cut on the price
of diamonds after accounting for carat and
clarity using the plot below. The effect of
cut on the price of diamonds isn’t as strong as that of
color, as points of different colors are not as
well-separated, though we can see that fair-cut diamonds
(blue dots) tend to have lower prices.
ggplot(diamonds, aes(x=log(carat),y=log(price),color=cut)) +
geom_point(size = 0.1)+
facet_wrap(~clarity)
We can even facet over 2 variables using the
facet_grid() command. The plot below splits the diamonds by
clarity and color. That is, points in the same
subplot are of the same clarity and color. In
each subplot, the color of dots represents the cut of
diamonds. Though the yellow, green and blue dots are not completely
separated, we can see that blue dots (cut = “Fair”) tend to
have low prices (low y-values) and yellow dots (cut =
“Ideal”) tend to have high prices (high y-values) when we consider dots with the
same x-value (carat). We can hence conclude that, for
diamonds with the same carat, clarity, and
color, those with better cut tend to have higher prices
than those with worse cut.
ggplot(diamonds, aes(x=log(carat),y=log(price),color=cut)) +
geom_point(size = 0.2)+
facet_grid(clarity~color)+
theme(legend.position="top")
We can also use facet_wrap() to split the diamonds by
the quality of cut and make separate histograms of
log(price) for each level of cut by adding
facet_wrap(~cut).
ggplot(diamonds, aes(x=log(price))) +
geom_histogram(binwidth=0.1) +
facet_wrap(~cut)
It’s usually better to stack the five histograms on the same
horizontal scale. You can do so by specifying nrow=5 within
facet_wrap(), which will arrange the 5 plots in 5 rows.
ggplot(diamonds, aes(x=log(price))) +
geom_histogram(binwidth=0.1) +
facet_wrap(~cut, nrow=5)
Similarly, we can use facet_wrap() for box plots. For
example, the following shows the distribution of carat by
the color and clarity of diamonds. This allows
us to view the relation of 3 variables: carat,
clarity and color at the same time.
ggplot(diamonds, aes(x=color,y=carat)) +
geom_boxplot() +
facet_wrap(~clarity, nrow=2)
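A numerical companion to these faceted box plots can be sketched with base R's tapply(), which returns a table with one row per color and one column per clarity. The choice of the median as the summary and the object name median_carat are assumptions made for illustration.

```r
library(ggplot2)
data(diamonds)

# Median carat for every combination of color (rows) and clarity (columns)
median_carat <- tapply(diamonds$carat,
                       list(color = diamonds$color, clarity = diamonds$clarity),
                       median)
round(median_carat, 2)
```

Each column of the table corresponds to one panel of the faceted plot, and each entry to the center line of one box.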