There are several ambiguities about the data and the description which
need to be resolved before any sensible data analysis can be done.
The first ambiguity has to do with the last column of Table 4.9,
labelled ``Standard Deviation.'' The quantities that went into this
calculation are not at all clear, so we have to make some informed
guesses. In order to have a coherent notation, let's use the
subscripts i, j, and k to denote, respectively, the size group
of the parent peas, the particular friend to whom Galton gave the
seeds, and the particular plants which grew in each friend's garden.
Let y denote a response measurement (made in hundredths of
an inch), and let n_{ijk} denote the number of peas produced by the
k-th plant grown by friend j using seeds of parent-seed-size i.

One scenario is that Galton had each friend bring in his entire
harvest of peas, grouped, of course, by parent size. Thus, parent
size i would produce n_{i++} offspring seeds. [The notation with
the plus sign replacing a subscript indicates summation over that
subscript.] Thus, if each of nine friends planted 10 seeds, which
produced on average about 40 offspring peas, then n_{i++} would be
approximately 3600 peas. Under this scenario, the diameters of
these n_{i++} peas would be measured, and their mean and SD
calculated. Let's operate under this assumption for our analysis.

**Note.** Other scenarios could be entertained, which would lead
to slightly different analyses. A particularly interesting
alternative to consider is the following: Galton asks each friend to
measure and then to report the *average pea diameter* produced
by each of the ten plants.

First of all, let's see if the assumption we are making about the meaning of the SD column is sensible. All of the SD's are approximately 1.5--2 hundredths of an inch; this suggests that the diameters of the individual peas largely fall in the range 16 plus or minus 4 hundredths of an inch. This is reasonably consistent with the range of parent peas available to Galton, which confirms the plausibility of this interpretation.

If the final column presents SD's of individual peas, however, they do
**not** represent the variability in the errors for the
*mean* offspring sizes reported in the middle column of the
table. Why? Each y_{i} in Table 4.9 is really an average of
n_{i++} items whose SD is approximately 1.5--2 in/100. Thus, our
regression model is

y_{i}= beta_{0}+ beta_{1}x_{i}+ e_{i};where

e_{i}~ (0, (sigma^{2}_{i})/n_{i++})

Since 3600 is quite large, the sample SD of these data will be close
to the population SD (sigma in the model above). Thus, we can use
w_{i} = 1/(SD_{i})^{2} as weights. Once we incorporate these
weights, the (weighted) errors (√{w_{i}} x e_{i}) now have
variances equal to n_{i++}^{-1}. These are also unknown. However,
if we assume that these sample sizes are approximately equal (so that
we can write n_{i++} ≈ N), then the error variance from WLS is
an estimate for N^{-1}. Note that you have to use Thisted's
version of the calculation (`wls`), not the weighted
`regress` command in Stata, to be able to make this
interpretation without additional calculation.

. infile x y sd using alr099 (7 observations read)

. generate w = 1/sd^2

As noted above, the reciprocal of the residual mean square should be an estimate for N; this turns out to be 82.62. That is, we would now guess that each ``standard deviation'' was based on approximately 80--85 measurements. Allowing for ``crop failures,'' this is the number of plants that were produced, and hence. wls y x [aw=w] WLS anova table uses renormalized weights (weight var is w) Source | SS df MS Number of obs = 7 ---------+------------------------------ F( 1, 5) = 28.81 Model | .348722402 1 .348722402 Prob > F = 0.0030 Residual | .060517776 5 .012103555 R-squared = 0.8521 ---------+------------------------------ Adj R-squared = 0.8225 Total | .409240177 6 .068206696 Root MSE = .11002 ------------------------------------------------------------------------------ y | Coef. Std. Err. t P>|t| [95% Conf. Interval] ---------+-------------------------------------------------------------------- x | .2048013 .0381548 5.368 0.003 .1067212 .3028813 _cons | 12.79641 .6811239 18.787 0.000 11.04553 14.5473 ------------------------------------------------------------------------------

So it appears as if Galton's friends reported the average pea size produced by each of their ten (or so) plants in each parent-size group. Suppose that each plant produced exactly n peas. Our argument suggests that the standard deviations in the final column of Table 4.9 are probably estimates of sigma/√{n}, and so are underestimating the actual variability of pea sizes.

First, if we take N ≈ 90, then each observation in our data
set has a standard deviation of approximately
sigma_{i}/√{90}. Using weights of 90/sigma^{2}_{i} implies
that the error mean square should be estimating a quantity very close
to unity. The `vwls` output tests this hypothesis directly.
It is reported as the ``goodness of fit'' chi squared statistic.

. gen stderr = sd/sqrt(90)

Changing N ≈ 90 to other values in the vicinity, say 70--100, doesn't affect the conclusion that there is little evidence for lack of fit for the straight-line model.. vwls y x, sd(stderr) Variance-weighted least-squares regression Number of obs = 7 Goodness-of-fit chi2(5) = 5.45 Model chi2(1) = 31.39 Prob > chi2 = 0.3638 Prob > chi2 = 0.0000 ------------------------------------------------------------------------------ y | Coef. Std. Err. z P>|z| [95% Conf. Interval] ---------+-------------------------------------------------------------------- x | .2048013 .0365571 5.602 0.000 .1331507 .2764518 _cons | 12.79641 .6526019 19.608 0.000 11.51734 14.07549 ------------------------------------------------------------------------------

This conclusion, of course, depends on the assumption that each y
value in the second column of Table 4.9 is the average of around 90
values. If, on the other hand, each of these is an average of
1000--2000 values (whose standard deviations are given in column 3),
then there **is** strong evidence for lack of fit.

A second approach is to fit a parametric version of a nonlinear model, for instance, by adding a quadratic term to the model. (This is a reasonable alternative, based upon inspection of the graph of y vs x.) This approach results in the same conclusion (that there is no definitive evidence to suggest lack of fit), but the evidence here is less clear (p=0.092 for the lack of fit test).

. gen xsq = x^2

. wls y x xsq [aw=w] WLS anova table uses renormalized weights (weight var is w) Source | SS df MS Number of obs = 7 ---------+------------------------------ F( 2, 4) = 28.05 Model | .381998671 2 .190999336 Prob > F = 0.0044 Residual | .027241506 4 .006810376 R-squared = 0.9334 ---------+------------------------------ Adj R-squared = 0.9002 Total | .409240177 6 .068206696 Root MSE = .08253 ------------------------------------------------------------------------------ y | Coef. Std. Err. t P>|t| [95% Conf. Interval] ---------+-------------------------------------------------------------------- x | -1.133726 .6062199 -1.870 0.135 -2.816863 .5494097 xsq | .0372777 .0168643 2.210 0.092 -.009545 .0841005 _cons | 24.66409 5.393139 4.573 0.010 9.690337 39.63785 ------------------------------------------------------------------------------