Example of Poisson regression using Stata Ronald Thisted 1/26/97 The following data set consists of observations on 8553 children born during a single week of March, 1958. The columns represent the mother's age group at the time of the child's birth (<20, 20-29, 30+). The rows reflect the age at which the child was reported to have asthma or wheezing (never, by age 7, between 8 and 11, and from 12 to 16). After enterint the data into Stata, we then give the rows and columns meaningful names with the rename command, and then we also define "labels" for the rows and columns which contain explanatory text corresponding to each level of age at onset (ONSET) and maternal age (MOMAGE). Note that the second version of the table is much more useful than the first. So that we won't have to re-enter all of the labelling information, we have Stata save the data under the data set name "asthma.dta". The entire data set, together with variable names and labels, can be retrieved at a later time by giving the Stata command "use asthma". . tabi 261 4017 2146 \ 103 984 487\ 27 189 95\ 20 157 67 | col row | 1 2 3 | Total -----------+---------------------------------+---------- 1 | 261 4017 2146 | 6424 2 | 103 984 487 | 1574 3 | 27 189 95 | 311 4 | 20 157 67 | 244 -----------+---------------------------------+---------- Total | 411 5347 2795 | 8553 . rename row onset . rename col momage . label define rowlbl 1 "never" 2 "early" 3 "middle" 4 "late" . label define collbl 1 "15-19" 2 "20-29" 3 "30+" . label values momage collbl . label values onset rowlbl . tab onset momage [freq=pop] | momage onset | 15-19 20-29 30+ | Total -----------+---------------------------------+---------- never | 261 4017 2146 | 6424 early | 103 984 487 | 1574 middle | 27 189 95 | 311 late | 20 157 67 | 244 -----------+---------------------------------+---------- Total | 411 5347 2795 | 8553 . save asthma file asthma.dta saved ------------------------------------------------------------------------------ ------------------------------------------------------------------------------ At this point, we fit the independence model to the table. This would imply that age of onset of asthma does not depend on the age of the mother. . xi: poisson pop i.onset i.momage i.onset Ionset_1-4 (naturally coded; Ionset_1 omitted) i.momage Imomag_1-3 (naturally coded; Imomag_1 omitted) Iteration 0: Log Likelihood = -61.117188 Iteration 1: Log Likelihood = -60.683594 Poisson regression Number of obs = 12 Goodness-of-fit chi2(6) = 34.695 Model chi2(5) =15932.734 Prob > chi2 = 0.0000 Prob > chi2 = 0.0000 Log Likelihood = -60.684 Pseudo R2 = 0.9924 ------------------------------------------------------------------------------ pop | Coef. Std. Err. z P>|z| [95% Conf. Interval] ---------+-------------------------------------------------------------------- Ionset_2 | -1.406421 .0281246 -50.007 0.000 -1.461544 -1.351298 Ionset_3 | -3.028003 .0580612 -52.152 0.000 -3.141801 -2.914205 Ionset_4 | -3.270628 .0652229 -50.145 0.000 -3.398462 -3.142793 Imomag_2 | 2.565697 .051187 50.124 0.000 2.465373 2.666022 Imomag_3 | 1.916994 .0528287 36.287 0.000 1.813452 2.020536 _cons | 5.732352 .0497176 115.298 0.000 5.634908 5.829797 ------------------------------------------------------------------------------ Let's consider what the two chi-squared statistics are assessing. The chi-squared in the right column (with the astronomically large value of 15933) is used to compare the independence model (that is, with additive row and column effects only) to the uniformity model (which would have equal expected counts in all twelve cells of the table). The five degrees of freedom associated with this statistic are associated with the five parameter estimates other than the constant term in the model. The data clearly indicate -- without the need for a statistical test! -- that these twelve cells are not equally probable. The tiny p-value and the enormous "pseudo R-squared" statistic are essentially irrelevant. The chi-squared in the left column has six degrees of freedom. This is equal to the number of cells, less the number of parameters estimated in the model (n-k). The chi-squared statistic corresponds to the error sum of squares in an ordinary regression, and the number of degrees of freedom indicate how much additional room for modeling there is in the data set. The chi-squared statistic is a measure of the lack of fit of the current model to the data. It compares the current model to the saturated model; the statistic itself is twice the difference in the log likelihoods. This statistic indicates that the model is not adequate to account for all of the variability in the counts. To explore this further, we calculate fitted values and residuals for each cell of the table. For loglinear models, Stata's predict command calculates the log predicted counts, and stores these in the new variable you name. NOTE: predict does not give the actual expected count (m-hat) itself; you have to compute this for yourself. The Pearson residual is the signed square root of the cell's contribution to the Pearson goodness-of-fit statistic: (o-m)/sqrt(m) , where o=observed count, m=predicted count, and sqrt(m)= the square root of the predicted count. When the model fits well, these residual values should have approximately a N(0,1) distribution. Pearson residuals that are much larger than + or - 2 indicate particular cells where the model does not fit well. The "table" command is a special Stata command for displaying data which can be quite powerful. In the examples that follow, the value for Stata to put in each cell is given in the [iw= ] part of the command. ("IW" stands for "importance weight".) The format given in each case keeps Stata from printing out too many useless values after the decimal point. . predict lmhat . gen mhat = exp(lmhat) . gen pres = (pop - mhat ) / sqrt(mhat) . table onset momage [iw=pop], format(%6.2f) ----------+-------------------------- | momage onset | 15-19 20-29 30+ ----------+-------------------------- never | 261.00 4017.00 2146.00 early | 103.00 984.00 487.00 middle | 27.00 189.00 95.00 late | 20.00 157.00 67.00 ----------+-------------------------- . table onset momage [iw=mhat], format(%6.2f) ----------+-------------------------- | momage onset | 15-19 20-29 30+ ----------+-------------------------- never | 308.69 4016.03 2099.27 early | 75.64 984.00 514.36 middle | 14.94 194.42 101.63 late | 11.73 152.54 79.74 ----------+-------------------------- . table onset momage [iw=pres], format(%6.2f) ----------+-------------------- | momage onset | 15-19 20-29 30+ ----------+-------------------- never | -2.71 0.02 1.02 early | 3.15 -0.00 -1.21 middle | 3.12 -0.39 -0.66 late | 2.42 0.36 -1.43 ----------+-------------------- The table of residuals above shows that there are some serious deficiencies with the independence model, and that the inadequacies of that model seem to be concentrated in the first column -- the column of youngest mothers. There are far too few children of these mothers who are reported to be asthmatic (261 observed, compared to the model prediction of almost 309). And there are many more asthmatic children in each of the age-at-onset groups than independence would have predicted. A number of explanations come to mind. One is that younger mothers are at increased risk for having their children become asthmatic. Another possible explanation is that first children are more likely to be asthmatic, regardless of the age of the mother. [Explain why this might account for the results in the table.] We can fit a model that posits that teenage mothers have a lower probability than other mothers to have a "never asthmatic" child, but aside from this characteristic, the onset-age and maternal age are independent. In effect, we are saying that the the upper left cell of the table is the only place that independence is not operating, and we can model this with a simple interaction term added to the independence model. . gen yn = (momage==1) & (onset==1) . xi: poisson pop i.onset i.momage yn i.onset Ionset_1-4 (naturally coded; Ionset_1 omitted) i.momage Imomag_1-3 (naturally coded; Imomag_1 omitted) Iteration 0: Log Likelihood = -46.378906 Iteration 1: Log Likelihood = -46.339844 Iteration 2: Log Likelihood = -46.335938 Poisson regression Number of obs = 12 Goodness-of-fit chi2(5) = 6.000 Model chi2(6) =15961.430 Prob > chi2 = 0.3062 Prob > chi2 = 0.0000 Log Likelihood = -46.336 Pseudo R2 = 0.9942 ------------------------------------------------------------------------------ pop | Coef. Std. Err. z P>|z| [95% Conf. Interval] ---------+-------------------------------------------------------------------- Ionset_2 | -1.438004 .0288649 -49.818 0.000 -1.494579 -1.38143 Ionset_3 | -3.059587 .0584234 -52.369 0.000 -3.174095 -2.945079 Ionset_4 | -3.302212 .0655456 -50.380 0.000 -3.430679 -3.173745 Imomag_2 | 2.159211 .0850657 25.383 0.000 1.992486 2.325937 Imomag_3 | 1.510508 .0860636 17.551 0.000 1.341827 1.679189 yn | -.582087 .1056676 -5.509 0.000 -.7891918 -.3749822 _cons | 6.146607 .0856401 71.773 0.000 5.978756 6.314459 ------------------------------------------------------------------------------ . predict lmhat2 . gen mhat2=exp(lmhat2) . gen pres2 = (pop - mhat2 ) / sqrt(mhat2) . table onset momage [iw=pres2], format(%6.2f) ----------+-------------------- | momage onset | 15-19 20-29 30+ ----------+-------------------- never | -0.48 0.66 early | -0.75 0.75 -0.68 middle | 1.09 -0.06 -0.43 late | 0.68 0.66 -1.23 ----------+-------------------- . table onset momage [iw=mhat2], format(%6.2f) ----------+-------------------------- | momage onset | 15-19 20-29 30+ ----------+-------------------------- never | 261.00 4047.35 2115.65 early | 110.90 960.85 502.26 middle | 21.91 189.85 99.24 late | 17.19 148.95 77.86 ----------+--------------------------