           |                          age
    cancer |       <20       20-24     25-29      30-34        35+ |     Total
-----------+-------------------------------------------------------+----------
     CA +  |       320       1206       1011        463        220 |      3220 
     CA -  |      1422       4432       2893       1092        406 |     10245 
-----------+-------------------------------------------------------+----------
     Total |      1742       5638       3904       1555        626 |     13465 


The variable y is a dummy variable indicating presence or absence of 
cancer; agegrp is a categorical variable ranging from 0 to 4.  Thus each 
increase of 1 in the agegrp scale corresponds to a five-year increment in 
age at birth of first child.

With the data in this form (one record per cell, with the separate counts 
for 1's and for 0's held in the frequency-weight variable pop), the logit 
command in Stata will calculate logistic regression estimates.

. logit y agegrp [weight=pop]
(frequency weights assumed)

Iteration 0:  Log Likelihood =-7406.8927
Iteration 1:  Log Likelihood =-7343.7041
Iteration 2:  Log Likelihood =  -7343.45
Iteration 3:  Log Likelihood =  -7343.45

Logit Estimates                                         Number of obs =  13465
                                                        chi2(1)       = 126.89
                                                        Prob > chi2   = 0.0000
Log Likelihood =   -7343.45                             Pseudo R2     = 0.0086

------------------------------------------------------------------------------
       y |      Coef.   Std. Err.       z     P>|z|       [95% Conf. Interval]
---------+--------------------------------------------------------------------
  agegrp |   .2224653   .0196909     11.298   0.000       .1838718    .2610589
   _cons |  -1.510933   .0382408    -39.511   0.000      -1.585884   -1.435982
------------------------------------------------------------------------------
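
The summary statistics in the header of this output can be recovered from 
the log likelihoods in the iteration log.  The following is a minimal 
sketch in Python (standard library only, offered as a cross-check outside 
Stata, not as part of the Stata session).  The variable names are 
illustrative, and the coefficients are the rounded values printed above, so 
agreement is expected only to the displayed precision.

```python
import math

# Cell counts from the table: cases (CA +) and non-cases (CA -) by age group.
cases = [320, 1206, 1011, 463, 220]
noncases = [1422, 4432, 2893, 1092, 406]
agegrp = [0, 1, 2, 3, 4]

# Rounded estimates copied from the logit output (an assumption of this sketch).
b_cons, b_age = -1.510933, 0.2224653


def loglik(probs):
    """Bernoulli log likelihood summed over all 10 cells of the table."""
    return sum(
        y * math.log(p) + m * math.log(1.0 - p)
        for y, m, p in zip(cases, noncases, probs)
    )


# Null model: a single common probability, the overall proportion of cases.
p_bar = sum(cases) / (sum(cases) + sum(noncases))
ll_null = loglik([p_bar] * len(agegrp))

# Fitted model: inverse logit of the linear predictor for each age group.
p_fit = [1.0 / (1.0 + math.exp(-(b_cons + b_age * g))) for g in agegrp]
ll_fit = loglik(p_fit)

chi2 = 2.0 * (ll_fit - ll_null)      # likelihood-ratio chi-squared on 1 df
pseudo_r2 = 1.0 - ll_fit / ll_null   # the quantity labelled "Pseudo R2"

print(f"null log likelihood   = {ll_null:.4f}")
print(f"fitted log likelihood = {ll_fit:.4f}")
print(f"chi2(1)               = {chi2:.2f}")
print(f"pseudo R2             = {pseudo_r2:.4f}")
```

The null log likelihood corresponds to Iteration 0, and the fitted log 
likelihood to the final iteration.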

Sometimes the data are available in a form that records only the total 
number at risk in each category together with the number of events that 
occurred among individuals in that category.  This alternative way of 
expressing the data essentially collapses the information in the table from 
10 cells to the information contained in 5 columns.  To deal with binary 
data that are arranged in this format, Stata has a separate command, 
blogit.
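
To make the collapsing step concrete, here is a small sketch in Python 
(standard library only; the record layout and names are assumptions made 
for illustration).  It starts from the ten cells of the table, coded as 
(age code, outcome, count), and reduces them to one (events, total) pair 
per age code, which is the layout that blogit expects.

```python
# Ten cells of the 2 x 5 table: (age code 1-5, y = 1 for cancer, cell count).
cells = [
    (1, 1, 320), (2, 1, 1206), (3, 1, 1011), (4, 1, 463), (5, 1, 220),
    (1, 0, 1422), (2, 0, 4432), (3, 0, 2893), (4, 0, 1092), (5, 0, 406),
]

# Collapse to one record per age code: (number of events, number at risk).
grouped = {}
for age, y, count in cells:
    events, total = grouped.get(age, (0, 0))
    grouped[age] = (events + y * count, total + count)

print("age ncancer ntotal")
for age in sorted(grouped):
    ncancer, ntotal = grouped[age]
    print(age, ncancer, ntotal)
```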

. clear
. input age ncancer ntotal

           age    ncancer     ntotal  
  1. 1 320 1742
  2. 2 1206 5638
  3. 3 1011 3904
  4. 4 463 1555
  5. 5 220 626
  6. end

. gen agegrp = age-1
. blogit ncancer ntotal agegrp

Logit Estimates                                         Number of obs =  13465
                                                        chi2(1)       = 126.89
                                                        Prob > chi2   = 0.0000
Log Likelihood =   -7343.45                             Pseudo R2     = 0.0086

------------------------------------------------------------------------------
_outcome |      Coef.   Std. Err.       z     P>|z|       [95% Conf. Interval]
---------+--------------------------------------------------------------------
  agegrp |   .2224653   .0196909     11.298   0.000       .1838718    .2610589
   _cons |  -1.510933   .0382408    -39.511   0.000      -1.585884   -1.435982
------------------------------------------------------------------------------

As with the Poisson regression commands, we can calculate a fitted value 
for each of the age groups.  Recall that the poisson command followed by 
the predict command generates the linear predictor (log mhat for the 
Poisson case).  After the blogit command, predict generates the ESTIMATED 
PROBABILITY, as the following example shows.

. predict lhat
. list
           age    ncancer     ntotal     agegrp       lhat  
  1.         1        320       1742          0   .1808006  
  2.         2       1206       5638          1   .2161123  
  3.         3       1011       3904          2   .2561641  
  4.         4        463       1555          3   .3007904  
  5.         5        220        626          4   .3495378  

. di exp(-1.510933) /(1+exp(-1.510933))
.18080056

. di exp(-1.510933+.2224653) /(1+exp(-1.510933+.2224653))
.21611228
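
The two hand calculations above extend to all five age groups.  As a 
hedged cross-check in Python (standard library only; the function name 
invlogit is simply a label chosen here), the inverse logit of the linear 
predictor reproduces the lhat column to the precision of the rounded 
coefficients:

```python
import math

# Rounded estimates copied from the blogit output.
b_cons, b_age = -1.510933, 0.2224653


def invlogit(xb):
    """Inverse logit: converts a linear predictor into a probability."""
    return math.exp(xb) / (1.0 + math.exp(xb))


# Fitted probability for agegrp = 0, 1, 2, 3, 4.
lhat = [invlogit(b_cons + b_age * g) for g in range(5)]

for g, p in enumerate(lhat):
    print(g, f"{p:.7f}")
```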

To generate fitted values for the cell counts in logistic regression, we 
multiply the fitted probability by the category totals, in this case, the 
total number of women in each of the age groups.  From these, we can 
calculate the Pearson residuals.  The format command tells Stata to print 
two places after the decimal when it reports the differences and the 
Pearson residuals.

. generate mhat = ntotal * lhat
. gen diff = ncancer - mhat
. gen pres = diff/sqrt(mhat*(1-mhat/ntotal))

. format diff pres %6.2f
. list agegrp ntotal ncancer mhat diff pres

        agegrp     ntotal    ncancer      mhat    diff    pres  
  1.         0       1742        320   314.955    5.05    0.31  
  2.         1       5638       1206   1218.44  -12.44   -0.40  
  3.         2       3904       1011   1000.06   10.94    0.40  
  4.         3       1555        463   467.729   -4.73   -0.26  
  5.         4        626        220   218.811    1.19    0.10  
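
The same quantities can be checked outside Stata.  The sketch below, in 
Python with the standard library only, recomputes mhat, diff, and pres from 
the rounded coefficients, then sums the squared Pearson residuals to obtain 
the Pearson chi-squared statistic and computes the likelihood-ratio 
(deviance) statistic from the observed and fitted counts.  The names 
pearson_chi2 and deviance are labels chosen for this illustration.

```python
import math

ncancer = [320, 1206, 1011, 463, 220]
ntotal = [1742, 5638, 3904, 1555, 626]

# Rounded estimates copied from the blogit output.
b_cons, b_age = -1.510933, 0.2224653

rows = []
pearson_chi2 = 0.0
deviance = 0.0
for agegrp, (y, n) in enumerate(zip(ncancer, ntotal)):
    lhat = 1.0 / (1.0 + math.exp(-(b_cons + b_age * agegrp)))
    mhat = n * lhat                                  # fitted count of cases
    diff = y - mhat                                  # observed minus fitted
    pres = diff / math.sqrt(mhat * (1.0 - mhat / n)) # Pearson residual
    rows.append((agegrp, n, y, mhat, diff, pres))
    pearson_chi2 += pres ** 2
    # Deviance contribution from the cases and the non-cases in this group.
    deviance += 2.0 * (
        y * math.log(y / mhat) + (n - y) * math.log((n - y) / (n - mhat))
    )

for agegrp, n, y, mhat, diff, pres in rows:
    print(agegrp, n, y, f"{mhat:.3f}", f"{diff:.2f}", f"{pres:.2f}")
print(f"Pearson chi-squared = {pearson_chi2:.4f}")
print(f"LR chi-squared      = {deviance:.4f}")
```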

How should we interpret this model?  First of all, the fit is excellent.  
The Pearson chi-squared statistic is 0.4998 on 3 degrees of freedom; the 
likelihood-ratio chi-squared is 0.4995, also on 3 degrees of freedom (5 age 
groups less 2 estimated parameters).  The predicted values tell us that the 
probability of developing breast cancer ranges from around 18% in women 
whose first child is born when they are teenagers to 35% for the oldest 
first-time mothers.  The coefficient 0.222 is the log of the odds ratio for 
any two adjacent columns, so that the odds of developing breast cancer 
increase by the factor exp(0.222) = 1.25 for every additional five-year 
delay in first birth.  That is, the odds increase by about 25% per five 
years.  [Note: a ten-year increase would correspond to odds that are raised 
by the factor (1.25)*(1.25) = 1.56.]
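
The odds-ratio arithmetic can be sketched in a few lines of Python 
(standard library only).  The coefficient and its confidence limits are 
copied from the output above; exponentiating the limits to obtain an 
interval for the odds ratio is a standard step that the text does not 
carry out, so treat that part as an addition made for illustration.

```python
import math

# Coefficient and 95% confidence limits for agegrp, from the logit output.
b_age = 0.2224653
ci_low, ci_high = 0.1838718, 0.2610589

or_5yr = math.exp(b_age)          # odds ratio per five-year delay
or_10yr = math.exp(2.0 * b_age)   # two steps on the agegrp scale
or_ci = (math.exp(ci_low), math.exp(ci_high))

print(f"odds ratio per 5 years  = {or_5yr:.2f}")
print(f"odds ratio per 10 years = {or_10yr:.2f}")
print(f"95% interval (5 years)  = {or_ci[0]:.2f} to {or_ci[1]:.2f}")
```

Squaring the five-year odds ratio and exponentiating twice the coefficient 
are the same calculation, which is why the note multiplies 1.25 by itself.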


Probit regression
=================

Corresponding to the logit and blogit commands are the probit and bprobit 
commands in Stata, which fit linear probit models.  Here is the result of 
fitting a probit model to the breast cancer data:

. bprobit ncancer ntotal agegrp

Probit Estimates                                        Number of obs =  13465
                                                        chi2(1)       = 126.81
                                                        Prob > chi2   = 0.0000
Log Likelihood = -7343.4855                             Pseudo R2     = 0.0086

------------------------------------------------------------------------------
_outcome |      Coef.   Std. Err.       z     P>|z|       [95% Conf. Interval]
---------+--------------------------------------------------------------------
  agegrp |   .1310669   .0116396     11.260   0.000       .1082537    .1538801
   _cons |  -.9158023   .0220862    -41.465   0.000      -.9590904   -.8725142
------------------------------------------------------------------------------

. predict lhatp
. gen mhatp=lhatp*ntotal
. gen diffp = ncancer-mhatp
. gen presp = diffp/sqrt(mhatp*(1-mhatp/ntotal))

. format presp lhatp diffp %6.2f
. list agegrp ntotal ncancer mhatp diffp presp lhatp, nod

        agegrp     ntotal    ncancer      mhatp   diffp   presp   lhatp  
  1.         0       1742        320   313.3602    6.64    0.41    0.18  
  2.         1       5638       1206   1219.524  -13.52   -0.44    0.22  
  3.         2       3904       1011   1002.011    8.99    0.33    0.26  
  4.         3       1555        463    467.473   -4.47   -0.25    0.30  
  5.         4        626        220   217.6608    2.34    0.20    0.35  

Note that the fits are virtually identical to those of the logistic 
regression.

. list agegrp lhat lhatp

        agegrp    lhat   lhatp  
  1.         0    0.18    0.18  
  2.         1    0.22    0.22  
  3.         2    0.26    0.26  
  4.         3    0.30    0.30  
  5.         4    0.35    0.35  
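
For the probit model the fitted probability is the standard normal 
cumulative distribution function evaluated at the linear predictor.  The 
following Python sketch (standard library only; norm_cdf is a helper 
defined here, built on math.erf) recomputes both sets of fitted 
probabilities from the rounded coefficients so that they can be compared 
to more than two decimal places.

```python
import math

ntotal = [1742, 5638, 3904, 1555, 626]

# Rounded estimates copied from the blogit and bprobit outputs.
logit_cons, logit_age = -1.510933, 0.2224653
probit_cons, probit_age = -0.9158023, 0.1310669


def norm_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


lhat = [1.0 / (1.0 + math.exp(-(logit_cons + logit_age * g))) for g in range(5)]
lhatp = [norm_cdf(probit_cons + probit_age * g) for g in range(5)]
mhatp = [n * p for n, p in zip(ntotal, lhatp)]   # probit fitted counts

for g in range(5):
    print(g, f"{lhat[g]:.4f}", f"{lhatp[g]:.4f}", f"{mhatp[g]:.3f}")

max_gap = max(abs(a - b) for a, b in zip(lhat, lhatp))
print(f"largest difference in fitted probability = {max_gap:.4f}")
```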

How do you interpret the coefficients in the probit regression model?  
Which model is easier to explain to a physician, a patient, or the general 
public?


Generalized linear models
=========================

Both logistic and probit regression models are examples of generalized 
linear models, which Stata can also estimate.  Here are some sample Stata 
commands that you might wish to try out and compare to the results above.

. glm ncancer agegrp, f(bin ntotal)
. glm ncancer agegrp, f(bin ntotal) link(probit)
. glmpred a
. glmpred b, mu
. glmpred c, xb
. glmpred d, pearson
. list  agegrp a b c d
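
For readers curious about what fitting a binomial generalized linear model 
with a logit link involves, here is a minimal sketch of Fisher scoring in 
Python (standard library only).  It is an illustration under stated 
assumptions, not a description of Stata's internals: the function name 
fit_binomial_logit, the starting values of zero, and the stopping rule are 
all choices made for this example.

```python
import math

agegrp = [0, 1, 2, 3, 4]
ncancer = [320, 1206, 1011, 463, 220]
ntotal = [1742, 5638, 3904, 1555, 626]


def fit_binomial_logit(x, y, n, max_iter=25, tol=1e-10):
    """Fisher scoring for logit(p) = b0 + b1*x with grouped binomial data.

    Returns (b0, b1, se0, se1); standard errors come from the inverse of
    the information matrix evaluated at the last iteration.
    """
    b0, b1 = 0.0, 0.0
    for _ in range(max_iter):
        u0 = u1 = 0.0            # score vector
        i00 = i01 = i11 = 0.0    # information matrix (symmetric 2 x 2)
        for xi, yi, ni in zip(x, y, n):
            p = 1.0 / (1.0 + math.exp(-(b0 + b1 * xi)))
            resid = yi - ni * p          # observed minus fitted count
            w = ni * p * (1.0 - p)       # binomial variance of the count
            u0 += resid
            u1 += resid * xi
            i00 += w
            i01 += w * xi
            i11 += w * xi * xi
        det = i00 * i11 - i01 * i01
        # Scoring step: solve the 2 x 2 system (information) * delta = score.
        d0 = (i11 * u0 - i01 * u1) / det
        d1 = (i00 * u1 - i01 * u0) / det
        b0 += d0
        b1 += d1
        if abs(d0) + abs(d1) < tol:
            break
    se0 = math.sqrt(i11 / det)
    se1 = math.sqrt(i00 / det)
    return b0, b1, se0, se1


b0, b1, se0, se1 = fit_binomial_logit(agegrp, ncancer, ntotal)
print(f"agegrp  coef = {b1:.7f}  se = {se1:.7f}")
print(f"_cons   coef = {b0:.6f}  se = {se0:.7f}")
```

The estimates can be compared with the logit and blogit tables above.  A 
probit link would change the weights and the scoring step, and is not 
covered by this sketch.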