           |                          age
    cancer |       <20     20-24     25-29     30-34       35+     |     Total
-----------+-------------------------------------------------------+----------
      CA + |       320      1206      1011       463       220     |      3220
      CA - |      1422      4432      2893      1092       406     |     10245
-----------+-------------------------------------------------------+----------
     Total |      1742      5638      3904      1555       626     |     13465

The variable y is a dummy variable indicating presence or absence of
cancer; agegrp is a categorical variable ranging from 0 to 4.  Thus each
increase of 1 in the agegrp scale corresponds to a five-year increment in
age at birth of first child.  With the data in this form (separate counts
for 1's and for 0's), the logit command in Stata will calculate logistic
regression estimates.

. logit y agegrp [weight=pop]
(frequency weights assumed)

Iteration 0:  Log Likelihood =-7406.8927
Iteration 1:  Log Likelihood =-7343.7041
Iteration 2:  Log Likelihood =  -7343.45
Iteration 3:  Log Likelihood =  -7343.45

Logit Estimates                                         Number of obs =  13465
                                                        chi2(1)       = 126.89
                                                        Prob > chi2   = 0.0000
Log Likelihood =  -7343.45                              Pseudo R2     = 0.0086

------------------------------------------------------------------------------
       y |      Coef.   Std. Err.       z     P>|z|       [95% Conf. Interval]
---------+--------------------------------------------------------------------
  agegrp |   .2224653   .0196909     11.298   0.000       .1838718    .2610589
   _cons |  -1.510933   .0382408    -39.511   0.000      -1.585884   -1.435982
------------------------------------------------------------------------------

Sometimes the data are available in a form that records only the total
number at risk in each category, together with the number of events that
occurred among the individuals in that category.  This alternative way of
expressing the data collapses the 10 cells of the table into 5 columns,
each holding an event count and a total.  To deal with binary data
arranged in this format, Stata has a separate command, blogit.

. clear
. input age ncancer ntotal

           age    ncancer     ntotal
  1. 1  320 1742
  2. 2 1206 5638
  3. 3 1011 3904
  4. 4  463 1555
  5. 5  220  626
  6. end

. gen agegrp = age-1
. blogit ncancer ntotal agegrp

Logit Estimates                                         Number of obs =  13465
                                                        chi2(1)       = 126.89
                                                        Prob > chi2   = 0.0000
Log Likelihood =  -7343.45                              Pseudo R2     = 0.0086

------------------------------------------------------------------------------
_outcome |      Coef.   Std. Err.       z     P>|z|       [95% Conf. Interval]
---------+--------------------------------------------------------------------
  agegrp |   .2224653   .0196909     11.298   0.000       .1838718    .2610589
   _cons |  -1.510933   .0382408    -39.511   0.000      -1.585884   -1.435982
------------------------------------------------------------------------------

As with the Poisson regression commands, we can calculate a fitted value
for each of the age groups.  Recall that the poisson command followed by
the predict command generates the linear predictor (log mhat for the
Poisson case).  After the blogit command, predict generates the ESTIMATED
PROBABILITY, as the following example shows.

. predict lhat
. list

            age    ncancer     ntotal     agegrp       lhat
  1.          1        320       1742          0   .1808006
  2.          2       1206       5638          1   .2161123
  3.          3       1011       3904          2   .2561641
  4.          4        463       1555          3   .3007904
  5.          5        220        626          4   .3495378

. di exp(-1.510933) /(1+exp(-1.510933))
.18080056
. di exp(-1.510933+.2224653) /(1+exp(-1.510933+.2224653))
.21611228

To generate fitted values for the cell counts in logistic regression, we
multiply the fitted probability by the category totals, in this case the
total number of women in each of the age groups.  From these, we can
calculate the Pearson residuals.  The format command tells Stata to print
two places after the decimal when it reports the differences and the
Pearson residuals.

. generate mhat = ntotal * lhat
. gen diff = ncancer - mhat
. gen pres = diff/sqrt(mhat*(1-mhat/ntotal))
. format diff pres %6.2f
. list agegrp ntotal ncancer mhat diff pres

         agegrp     ntotal    ncancer       mhat     diff     pres
  1.          0       1742        320    314.955     5.05     0.31
  2.          1       5638       1206    1218.44   -12.44    -0.40
  3.          2       3904       1011    1000.06    10.94     0.40
  4.          3       1555        463    467.729    -4.73    -0.26
  5.          4        626        220    218.811     1.19     0.10

How should we interpret this model?  First of all, the fit is excellent.
The Pearson chi-squared statistic is 0.4998 on 3 degrees of freedom; the
likelihood-ratio chi-squared is 0.4995, also on 3 degrees of freedom.  The
predicted values tell us that the probability of developing cancer ranges
from around 18% in women whose first child is born when they are teenagers
to 35% for the oldest first-time mothers.

The coefficient 0.222 is the log of the odds ratio for any two adjacent
columns, so the odds of developing breast cancer increase by the factor
exp(0.222) = 1.25 for every additional five-year delay in first birth.
That is, the odds increase by 25% per five years.  [Note: a ten-year
increase would correspond to odds that are raised by the factor
(1.25)*(1.25) = 1.56.]

Probit regression
=================

Corresponding to the logit and blogit commands are the probit and bprobit
commands in Stata, which fit linear probit models.  Here is the result of
fitting a probit model to the breast cancer data:

. bprobit ncancer ntotal agegrp

Probit Estimates                                        Number of obs =  13465
                                                        chi2(1)       = 126.81
                                                        Prob > chi2   = 0.0000
Log Likelihood = -7343.4855                             Pseudo R2     = 0.0086

------------------------------------------------------------------------------
_outcome |      Coef.   Std. Err.       z     P>|z|       [95% Conf. Interval]
---------+--------------------------------------------------------------------
  agegrp |   .1310669   .0116396     11.260   0.000       .1082537    .1538801
   _cons |  -.9158023   .0220862    -41.465   0.000      -.9590904   -.8725142
------------------------------------------------------------------------------

. predict lhatp
. gen mhatp = lhatp*ntotal
. gen diffp = ncancer-mhatp
. gen presp = diffp/sqrt(mhatp*(1-mhatp/ntotal))
. format presp lhatp diffp %6.2f
. list agegrp ntotal ncancer mhatp diffp presp lhatp, nod

         agegrp     ntotal    ncancer      mhatp    diffp    presp    lhatp
  1.          0       1742        320   313.3602     6.64     0.41     0.18
  2.          1       5638       1206   1219.524   -13.52    -0.44     0.22
  3.          2       3904       1011   1002.011     8.99     0.33     0.26
  4.          3       1555        463    467.473    -4.47    -0.25     0.30
  5.          4        626        220   217.6608     2.34     0.20     0.35

Note that the fits are virtually identical to those of the logistic
regression.

. list agegrp lhat lhatp

         agegrp     lhat    lhatp
  1.          0     0.18     0.18
  2.          1     0.22     0.22
  3.          2     0.26     0.26
  4.          3     0.30     0.30
  5.          4     0.35     0.35

How do you interpret the coefficients in the probit regression model?
Which model is easier to explain to a physician, a patient, or the general
public?

Generalized linear models
=========================

Both logistic and probit regression models are examples of generalized
linear models, which Stata can also estimate.  Here are some sample Stata
commands that you might wish to try out and compare to the results above.

. glm ncancer agegrp, f(bin ntotal)
. glm ncancer agegrp, f(bin ntotal) link(probit)
. glmpred a
. glmpred b, mu
. glmpred c, xb
. glmpred d, pearson
. list agegrp a b c d
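
The fitted values and fit statistics quoted for the logit and probit models
can be cross-checked by hand outside Stata.  The following is a minimal
sketch in Python (standard library only), not part of the original Stata
session.  It assumes the counts and the rounded coefficients printed in the
blogit and bprobit output; the function and variable names are my own, and
small differences in the last digits are to be expected because the
coefficients are rounded.

```python
import math

# Counts from the table: cases and women at risk in each age group
# (agegrp = 0, 1, 2, 3, 4).
ncancer = [320, 1206, 1011, 463, 220]
ntotal = [1742, 5638, 3904, 1555, 626]
agegrp = [0, 1, 2, 3, 4]

# Rounded coefficients copied from the blogit and bprobit output.
LOGIT_CONS, LOGIT_B = -1.510933, 0.2224653
PROBIT_CONS, PROBIT_B = -0.9158023, 0.1310669


def inv_logit(xb):
    """Inverse of the logit link: probability from the linear predictor."""
    return math.exp(xb) / (1.0 + math.exp(xb))


def std_normal_cdf(z):
    """Standard normal CDF, the inverse of the probit link."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def pearson_residuals(cases, totals, probs):
    """Binomial Pearson residuals: (observed - fitted) / sqrt(n p (1 - p))."""
    return [(y - n * p) / math.sqrt(n * p * (1.0 - p))
            for y, n, p in zip(cases, totals, probs)]


def deviance(cases, totals, probs):
    """Likelihood-ratio chi-squared against the saturated binomial model."""
    total = 0.0
    for y, n, p in zip(cases, totals, probs):
        m = n * p  # fitted number of cases
        total += y * math.log(y / m) + (n - y) * math.log((n - y) / (n - m))
    return 2.0 * total


# Fitted probabilities under each model.
logit_p = [inv_logit(LOGIT_CONS + LOGIT_B * g) for g in agegrp]
probit_p = [std_normal_cdf(PROBIT_CONS + PROBIT_B * g) for g in agegrp]

# Goodness of fit for the logit model (5 cells - 2 parameters = 3 df).
logit_pres = pearson_residuals(ncancer, ntotal, logit_p)
pearson_chi2 = sum(r * r for r in logit_pres)
lr_chi2 = deviance(ncancer, ntotal, logit_p)

# Odds ratios implied by the logit slope.
odds_ratio_5yr = math.exp(LOGIT_B)
odds_ratio_10yr = math.exp(2.0 * LOGIT_B)

# Largest gap between the logit and probit fitted probabilities.
max_gap = max(abs(a - b) for a, b in zip(logit_p, probit_p))

if __name__ == "__main__":
    for g, pl, pp in zip(agegrp, logit_p, probit_p):
        print(f"agegrp {g}: logit {pl:.7f}  probit {pp:.7f}")
    print(f"Pearson chi2 = {pearson_chi2:.4f}, LR chi2 = {lr_chi2:.4f}")
    print(f"odds ratio per 5 years  = {odds_ratio_5yr:.3f}")
    print(f"odds ratio per 10 years = {odds_ratio_10yr:.3f}")
    print(f"max |logit - probit|    = {max_gap:.4f}")
```

The probit lines also suggest one way to approach the interpretation
question: a probit coefficient shifts the argument of the standard normal
CDF, so its effect on the probability depends on the starting point, whereas
the logit coefficient converts directly to a constant odds ratio.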