Notes on Galton's Sweet Pea Data

Ronald A Thisted

8 November 1998

This is a short commentary on the analysis of Galton's data on the relationship of parent to offspring seed size in sweet peas, as described in Problem 4.1 of Weisberg (page 99).

There are several ambiguities about the data and the description which need to be resolved before any sensible data analysis can be done. The first ambiguity has to do with the last column of Table 4.9, labelled ``Standard Deviation.'' The quantities that went into this calculation are not at all clear, so we have to make some informed guesses. In order to have a coherent notation, let's use the subscripts i, j, and k to denote, respectively, the size group of the parent peas, the particular friend to whom Galton gave the seeds, and the particular plants which grew in each friend's garden. Let y denote a response measurement (made in hundredths of an inch), and let nijk denote the number of peas produced by the k-th plant grown by friend j using seeds of parent-seed-size i.

One scenario is that Galton had each friend bring in his entire harvest of peas, grouped, of course, by parent size. Thus, parent size i would produce ni++ offspring seeds. [The notation with the plus sign replacing a subscript indicates summation over that subscript.] Thus, if each of nine friends planted 10 seeds, which produced on average about 40 offspring peas, then ni++ would be approximately 3600 peas. Under this scenario, the diameters of these ni++ peas would be measured, and their mean and SD calculated. Let's operate under this assumption for our analysis.

Note. Other scenarios could be entertained, which would lead to slightly different analyses. A particularly interesting alternative to consider is the following: Galton asks each friend to measure and then to report the average pea diameter produced by each of the ten plants.

First of all, let's see if the assumption we are making about the meaning of the SD column is sensible. All of the SD's are approximately 1.5--2 hundredths of an inch; this suggests that the diameters of the individual peas largely fall in the range 16 plus or minus 4 hundredths of an inch. This is reasonably consistent with the range of parent peas available to Galton, which confirms the plausibility of this interpretation.

If the final column presents SD's of individual peas, however, they do not represent the variability in the errors for the mean offspring sizes reported in the middle column of the table. Why? Each yi in Table 4.9 is really an average of ni++ items whose SD is approximately 1.5--2 in/100. Thus, our regression model is

yi = beta0 + beta1 xi + ei; where
ei ~ (0, (sigma2i)/ni++)

Since 3600 is quite large, the sample SD of these data will be close to the population SD (sigma in the model above). Thus, we can use wi = 1/(SDi)2 as weights. Once we incorporate these weights, the (weighted) errors (√{wi} x ei) now have variances equal to ni++-1. These are also unknown. However, if we assume that these sample sizes are approximately equal (so that we can write ni++ ≈ N), then the error variance from WLS is an estimate for N-1. Note that you have to use Thisted's version of the calculation (wls), not the weighted regress command in Stata, to be able to make this interpretation without additional calculation.

. infile x y sd using alr099
(7 observations read)

. generate w = 1/sd^2

. wls y x [aw=w]

WLS anova table uses renormalized weights (weight var is w)

  Source |       SS       df       MS                  Number of obs =       7
---------+------------------------------               F(  1,     5) =   28.81
   Model |  .348722402     1  .348722402               Prob > F      =  0.0030
Residual |  .060517776     5  .012103555               R-squared     =  0.8521
---------+------------------------------               Adj R-squared =  0.8225
   Total |  .409240177     6  .068206696               Root MSE      =  .11002

       y |      Coef.   Std. Err.       t     P>|t|       [95% Conf. Interval]
       x |   .2048013   .0381548      5.368   0.003       .1067212    .3028813
   _cons |   12.79641   .6811239     18.787   0.000       11.04553     14.5473
As noted above, the reciprocal of the residual mean square should be an estimate for N; this turns out to be 82.62. That is, we would now guess that each ``standard deviation'' was based on approximately 80--85 measurements. Allowing for ``crop failures,'' this is the number of plants that were produced, and hence not the number of peas produced from a particular parental type!

So it appears as if Galton's friends reported the average pea size produced by each of their ten (or so) plants in each parent-size group. Suppose that each plant produced exactly n peas. Our argument suggests that the standard deviations in the final column of Table 4.9 are probably estimates of sigma/√{n}, and so are underestimating the actual variability of pea sizes.

Lack of fit testing

Weisberg asks for a lack of fit test for the linear model. How should this be done? There are two approaches that I can think of.

First, if we take N ≈ 90, then each observation in our data set has a standard deviation of approximately sigmai/√{90}. Using weights of 90/sigma2i implies that the error mean square should be estimating a quantity very close to unity. The vwls output tests this hypothesis directly. It is reported as the ``goodness of fit'' chi squared statistic.

. gen stderr = sd/sqrt(90)

. vwls y x, sd(stderr)

Variance-weighted least-squares regression           Number of obs   =       7
Goodness-of-fit chi2(5)    =    5.45                 Model chi2(1)   =   31.39
Prob > chi2                =  0.3638                 Prob > chi2     =  0.0000
       y |      Coef.   Std. Err.       z     P>|z|       [95% Conf. Interval]
       x |   .2048013   .0365571      5.602   0.000       .1331507    .2764518
   _cons |   12.79641   .6526019     19.608   0.000       11.51734    14.07549
Changing N ≈ 90 to other values in the vicinity, say 70--100, doesn't affect the conclusion that there is little evidence for lack of fit for the straight-line model.

This conclusion, of course, depends on the assumption that each y value in the second column of Table 4.9 is the average of around 90 values. If, on the other hand, each of these is an average of 1000--2000 values (whose standard deviations are given in column 3), then there is strong evidence for lack of fit.

A second approach is to fit a parametric version of a nonlinear model, for instance, by adding a quadratic term to the model. (This is a reasonable alternative, based upon inspection of the graph of y vs x.) This approach results in the same conclusion (that there is no definitive evidence to suggest lack of fit), but the evidence here is less clear (p=0.092 for the lack of fit test).

. gen xsq = x^2

. wls y x xsq [aw=w]

WLS anova table uses renormalized weights (weight var is w)

  Source |       SS       df       MS                  Number of obs =       7
---------+------------------------------               F(  2,     4) =   28.05
   Model |  .381998671     2  .190999336               Prob > F      =  0.0044
Residual |  .027241506     4  .006810376               R-squared     =  0.9334
---------+------------------------------               Adj R-squared =  0.9002
   Total |  .409240177     6  .068206696               Root MSE      =  .08253

       y |      Coef.   Std. Err.       t     P>|t|       [95% Conf. Interval]
       x |  -1.133726   .6062199     -1.870   0.135      -2.816863    .5494097
     xsq |   .0372777   .0168643      2.210   0.092       -.009545    .0841005
   _cons |   24.66409   5.393139      4.573   0.010       9.690337    39.63785