Mosaic Package

I assume you have your R (and RStudio) installed on your computer. Before we move on to R codes for Chapter 3, let’s install an R package – mosaic. The mosaic package is designed by a team of NSF-funded educators to make R more user-friendly to introductory statistics students like you.

To use the mosaic package, you will first need to install it (if you haven’t).

install.packages("mosaic")

Note that you will only have to do this once. However, once the package is installed, you will have to load it into the current workspace before it can be used.

library(mosaic)

Note that you will have to do this every time you start a new R session.

Numerical Summaries of Data

Now let’s load the data set (I assume that you know how to change the working directory and know how to load a data file using the read.table() function).

resin = read.table("resin.txt", header=T)

You probably know how to use mean, sd, summary functions to find the mean, SD, and five-number-summary of variables. For experimental data, we are usually more interest in getting those summaries by treatment groups, e.g.,

mean(y ~ tempC, data=resin)
##      175      194      213      231      250 
## 1.932500 1.628750 1.377500 1.194286 1.056667
sd(y ~ tempC, data=resin)
##        175        194        213        231        250 
## 0.06341473 0.10480424 0.10713810 0.04577377 0.13837148
tally( ~ tempC, data=resin)
## tempC
## 175 194 213 231 250 
##   8   8   8   7   6

The above shows that the treatment group at 175 Celcius have 8 observations and the mean of these 8 y’s is 1.9325, and the SD of them is about 0.0634 and so on.

Similar syntax works for var(), median(), max(), min(), IQR(), sum().

The mosaic command favstats, allows us to compute all of these summaries by group at once.

favstats(y ~ tempC, data=resin)
##   tempC  min    Q1 median     Q3  max     mean         sd n missing
## 1   175 1.85 1.895  1.915 1.9700 2.04 1.932500 0.06341473 8       0
## 2   194 1.42 1.595  1.660 1.6725 1.76 1.628750 0.10480424 8       0
## 3   213 1.26 1.300  1.365 1.4175 1.54 1.377500 0.10713810 8       0
## 4   231 1.15 1.165  1.170 1.2150 1.28 1.194286 0.04577377 7       0
## 5   250 0.83 1.030  1.070 1.0875 1.26 1.056667 0.13837148 6       0

Boxplot

The build-in function in R for making boxplots is boxplot.

boxplot(y ~ as.factor(tempC), data=resin, xlab="Temperature (Celcius)", ylab="log10 (Failure time)")

In the mosaic package, there is a more versatile function bwplot for making boxplots.

bwplot(y ~ as.factor(tempC), data=resin, xlab="Temperature (Celcius)", ylab="log10 (Failure time)")

So far, the syntax of boxplot and bwplot look identical. we will see the power of bwplot in Chapter 8.

Scatterplot

The build-in function in R for making scatterplots is plot().

plot(resin$tempC, resin$y)

One downside of the plot function (and many other R build-in functions) is that one needs to constantly refer to the data frame. You can plot using the command plot(tempC, y) because then R will tell you that it cannot find tempC (unless you have attach to the data frame, which is another complication for beginners).

Using the mosaic package, the function for making scatterplots is qplot, and it syntax is more user friendly.

qplot(tempC, y, data=resin)

We should always label the plots properly

qplot(tempC, y, data=resin, xlab="Temperature (Celcius)", ylab="log10 (Failure time)")

Adding Fitted Lines or Curves to Scatter Plots

One advantage of qplot is that you can save it as an object and add more components later on.

p = qplot(tempC, y, data=resin, xlab="Temperature (Celcius)", ylab="log10 (Failure time)") 

Now we add a regression line to the scatter plot

p + geom_smooth(method='lm')

or add a fitted quadratic, cubic, or quartic curve,

p + geom_smooth(method='lm', formula=y~x+I(x^2))

p + geom_smooth(method='lm', formula=y~x+I(x^2)+I(x^3))

p + geom_smooth(method='lm', formula=y~x+I(x^2)+I(x^3)+I(x^4))

or add the curve of the Arrhenius model.

p + geom_smooth(method='lm', formula=y~I(1/(x+273.15)))

Note in geom_smooth you don’t have to replace the x and y in the formula with the actual variable names like tempC. The geom_smooth function knows the x and y in the formula refer to the x-variable and the y-variable of the scatterplot that you want to add those curves.

All the fitted lines/curves above come with an 95% confidence interval (the gray band around the curve). To remove the confidence interval, simply add se=FALSE, like

p + geom_smooth(method='lm', se=FALSE)

p + geom_smooth(method='lm', formula=y~x+I(x^2)+I(x^3), se=FALSE)

One can add all the curves on the same plot

p + geom_smooth(method='lm', se=FALSE, col=1) +
    geom_smooth(method='lm', formula=y~x+I(x^2), se=FALSE, col=2) +
    geom_smooth(method='lm', formula=y~x+I(x^2)+I(x^3), se=FALSE, col=3)+
    geom_smooth(method='lm', formula=y~x+I(x^2)+I(x^3)+I(x^4), se=FALSE, col=4)+
    geom_smooth(method='lm', formula=y~I(1/(x+273.15)), se=FALSE, col=5)

It looks as if there are only 3 curves, this is because the quadratic, cubic and quartic curve are nearly on top of each other.

Change Axes

The x-axis of the scatterplot should be extended to 120 degree as it is the temperature at which we want to estimate the mean failure time of glue. We can change the x-axis by

p + scale_x_continuous(limits=c(120,250))

As only five temperature levels are used: 175, 194, 213, 231, and 250. We can set the tick marks of the x-axis at those levels.

p120 = p + scale_x_continuous(limits=c(120,250), breaks=c(120,175,194,213,231,250))
p120

Now let’s see what’s the predicted log-failure-time at 120 degree Celcius.

p120 + geom_smooth(method='lm', formula=y~I(1/(x+273.15)), se=F)

However, we the fitted line only extends to 175, but not our desired 120. By default, the fitted lines or curves drawn by geom_smooth only expand the range of data, which makes sense since extrapolation beyond the range of data is generally not recommended. To force R extending the fitted curve to the full range of the plot, we can add fullrange=TRUE.

p120 + geom_smooth(method='lm', formula=y~I(1/(x+273.15)), se=FALSE, fullrange=TRUE)

Again, we add all curves on the scatterplot.

p120 + geom_smooth(method='lm', se=F, col=1, fullrange=T) +
 geom_smooth(method='lm', formula=y~x+I(x^2), se=F, col=2, fullrange=T) +
 geom_smooth(method='lm', formula=y~x+I(x^2)+I(x^3), se=F, col=3, fullrange=T)+
    geom_smooth(method='lm', formula=y~x+I(x^2)+I(x^3)+I(x^4), se=F, col=4, fullrange=T)+
    geom_smooth(method='lm', formula=y~I(1/(x+273.15)), se=F, col=5, fullrange=T) 

We should add a legend linking line color with models, but I didn’t know how.

Making the Same Scatterplots using Base R (Without mosaic library)

You can also make the same plot using the R build-in plot function instead of using the mosaic library.

To add fitted lines to the scatterplot, first we need to fit those linear and polynomial models.

lm1 = lm(y ~ tempC, data=resin)                                   # linear
lm2 = lm(y ~ tempC+I(tempC^2), data=resin)                        # quardratic
lm3 = lm(y ~ tempC+I(tempC^2)+I(tempC^3), data=resin)             # cubic
lm4 = lm(y ~ tempC+I(tempC^2)+I(tempC^3)+I(tempC^4), data=resin)  # quartic
lmarr = lm(y ~ I(1/(tempC+273.15)), data=resin)                   # Arrhenius 

Add the regression line of the linear model is simple.

plot(resin$tempC, resin$y, xlab="Temperature (Celcius)", ylab="log10 (Failure time)", xlim=c(120,250))
abline(lm1)

To draw a general curve, we can use curve(). For example, we can draw the curve of the function \(f(x)=1-4.5x+2x^3\) by the command

curve(1-4.5*x+2*x^3, from=-2, to=2)

For the quadratic model, the fitted coefficients are

lm2$coefficients
##   (Intercept)         tempC    I(tempC^2) 
##  7.417999e+00 -4.509806e-02  7.860391e-05

So the fitted curve of the quadratic model is \(7.417999 -0.04509806x+0.00007860391x^2\). We can add the fitted curve to the scatter plot by the following commands.

plot(resin$tempC, resin$y, xlab="Temperature (Celcius)", ylab="log10 (Failure time)", xlim=c(120,250))
curve(7.417999 -0.04509806*x+0.00007860391*x^2, add=T)

We want to see the predicted value at 120 degree, but the y-range of the scatterplot above is truncated at the range of the points. We can manually change the y-limits to a higher value.

plot(resin$tempC, resin$y, xlab="Temperature (Celcius)", ylab="log10 (Failure time)", xlim=c(120,250), ylim=c(0.8,3.5))
curve(7.417999 -0.04509806*x+0.00007860391*x^2, add=T)

It is bothersome to manually type the coefficients of the fitted curve. We can save the coefficients to an object coef = lm2$coefficients, then coef[1] will be the value of the first coefficient (usually the intercept), and coef[2] is the second coefficient, and so on.

plot(resin$tempC, resin$y, xlab="Temperature (Celcius)", ylab="log10 (Failure time)", xlim=c(120,250), ylim=c(0.8,3.5))
coef = lm2$coefficients
curve(coef[1]+coef[2]*x+coef[3]*x^2, add=T)

Instead of typing the expression of the curve, an alternative way is to use the predict function. One can compute the predicted values for a model for new observations. For example, the predicted log10(failure time) when tempC=120 can be obtained by the command

predict(lm2, newdata=data.frame(tempC=120))
##        1 
## 3.138128

We can simply replace 120 with x, and plot the predicted value as a function of x.

plot(resin$tempC, resin$y, xlab="Temperature (Celcius)", ylab="log10 (Failure time)", xlim=c(120,250), ylim=c(0.8,3.5))
curve(predict(lm2, newdata=data.frame(tempC=x)), add=T)

Now we added the fitted linear, quadratic, cubic, quartic model, and Arrhenius model, to the scatterplot. Note we use different line types and colors for different models.

plot(resin$tempC, resin$y, xlab="Temperature (Celcius)", ylab="log10 (Failure time)", xlim=c(120,250), ylim=c(0.8,3.2))
curve(predict(lm1, newdata=data.frame(tempC=x)), add=T, lty=1, col=1)
curve(predict(lm2, newdata=data.frame(tempC=x)), add=T, lty=2, col=2)
curve(predict(lm3, newdata=data.frame(tempC=x)), add=T, lty=3, col=3)
curve(predict(lm4, newdata=data.frame(tempC=x)), add=T, lty=4, col=4)
curve(predict(lmarr, newdata=data.frame(tempC=x)), add=T, lty=5, col=5)

Finally, we add a legend to the plot.

plot(resin$tempC, resin$y, xlab="Temperature (Celcius)", ylab="log10 (Failure time)", xlim=c(120,250), ylim=c(0.8,3.2))
curve(predict(lm1, newdata=data.frame(tempC=x)), add=T, lty=1, col=1)
curve(predict(lm2, newdata=data.frame(tempC=x)), add=T, lty=2, col=2)
curve(predict(lm3, newdata=data.frame(tempC=x)), add=T, lty=3, col=3)
curve(predict(lm4, newdata=data.frame(tempC=x)), add=T, lty=4, col=4)
curve(predict(lmarr, newdata=data.frame(tempC=x)), add=T, lty=5, col=5)
legend("topright",c("quadratic","cubic","quartic","Arrhenius","linear"),col=c(3,5,4,2,1),lty=c(3,5,4,2,1))