SPERMSEG: analysis of segregation distortion in single sperm data

Copyright (C) 1999 Mary Sara McPeek

=========================================================

LICENSE

This program is free software; you can redistribute it and/or modify it
under the terms of the GNU General Public License as published by the
Free Software Foundation; either version 2 of the License, or (at your
option) any later version.

This program is distributed in the hope that it will be useful, but
WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY of FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
General Public License for more details.

You should have received a copy of the GNU General Public License along
with this program (see file gpl.txt); if not, write to the Free 
Software Foundation, Inc., 59 Temple Place - Suite 330, Boston, MA 
02111-1307, USA.

I request that use of this software be cited in publications as M.S.
McPeek (1999) "SPERMSEG: analysis of segregation distortion
in single-sperm data" American Journal of Human Genetics 65:1195-1197.

To contact the author:

Mary Sara McPeek
Department of Statistics
5734 S. University Ave.
Chicago, IL 60637
email: mcpeek@galton.uchicago.edu

========================================================

TABLE OF CONTENTS

I. COMPILING SPERMSEG

II. RUNNING SPERMSEG

III. GENERAL NOTES

IV. INPUT FILE

V. DATA FILE

VI. MODEL FILE

VII. OPTIONS FILE

VIII. STARTING VALUE FILE

IX. RESTRICTIONS ON MODELS

X. ERROR MESSAGES

XI. WARNINGS GENERATED BY THE PROGRAM

XII.  EXAMPLE DATA SETS/FILES

======================================

I. COMPILING SPERMSEG

All you really need is spermseg.tar.gz.  Uncompress it using the command
"gunzip spermseg.tar.gz".  Then enter the directory where you want to 
keep the program and issue the command "tar xvf spermseg.tar".  If you 
have problems using gunzip or tar, you can just take the individual 
files from this directory.  

Edit the Makefile as necessary according to the instructions in the 
Makefile.  You should only need to make sure that the correct compiler 
and compiler options for your machine are chosen.  To create the 
executable file, within the same directory as these files issue the 
command "make".  

The program has been developed and tested on the following platforms:

Solaris (5.6 and 5.7)
SGI IRIX (6.4)
Linux (Red Hat 5.2)

The code was written to be highly portable.  If you have problems
compiling on your system, you should first talk to your system
administrator who will be most familiar with importing software to your
system.  If you have comments or questions, you can send them to

mcpeek@galton.uchicago.edu

II. RUNNING SPERMSEG

The name of the executable file is spermseg.
To run the program, make sure the executable file, input file, 
data file, model file (and, if used, options file and starting value 
file) are all in the same directory.  In that directory, type spermseg,
followed by the name of the input file, on the command line.  For
instance, to run the program using the first example input 
provided, enter the SPERMSEG directory and type

spermseg input.test1

III. GENERAL NOTES on the input, data, model, options, and starting value
files:
The information in these files is read in line by line, with extensive 
error checking.  In particular, if the program encounters a blank 
line where it is expecting some information in one of these files, or if
it reaches the end of a line without finding all the information
expected on that line, you will get an explicit error message, telling
exactly where the problem occurred.  Any amount of
white space can separate items on the same line.  Once the program has 
found all the information it is expecting
on a certain line, the rest of the line is ignored, so comments can be
made there.  In the example files, comments are indicated by the symbols
<<.

IV. INPUT FILE

See the example files input.test1, input.test2, and input.test3.

LINE 1:  The input file must have the name of the data file on the 
first line.
LINE 2:  The name of the model file goes on the second line.  
LINE 3:  The third line tells the program whether any special options
will be used.  The available options fall into 7 categories:
confidence intervals, output of cell probabilities or expected cell
counts for the data, output of separate log-likelihoods under the model for
the individual data sets, calculation of log-likelihood of the data under
full multinomial model, user-specified number of search steps for
creation of confidence intervals, user-specified starting values for 
the EM algorithm, simulation to assess goodness of fit of the chosen 
model to the data.  If none of these options are desired, enter a "0" on 
the third line.  If one or more options are desired, enter a "1" followed 
by the name of the options file.  Details on these options are given
below under OPTIONS FILE.

V. DATA FILE

See the example files data.test1 and data.test3.

LINE 1:  The data file must have the number of two-marker data sets on
the first line.  Call this number n2mark.
IF THE NUMBER ON LINE 1 IS 0, SKIP DOWN TO LINE (n2mark*2+2).
OTHERWISE:
LINE 2:  The first two-marker data set (16 counts) should be entered on
one line in the following order: 
-- b a ab B Bb Ba Bab A Ab Aa Aab AB ABb ABa ABab
where AB is the haplotype linked to the mutant copy of the gene and ab
is the haplotype linked to the normal copy of the gene.  It is the
user's choice which marker is the A marker and which is the B marker, 
but that choice should
be consistent between the data file and the model file for each
two-marker data set.
LINE 3:  The 3 recombination fractions among the markers and the gene
for the first two-marker data set should be entered on the third line.
First comes the recombination fraction between marker A and the gene,
then the recombination fraction between marker B and the gene, then the
recombination fraction between markers A and B.  All recombination
fractions must be between 0 and 1/2, with 0 allowed but not 1/2.  This 
parametrization allows for interference.
LINES 4 and 5: Just like lines 2 and 3, but for the second two-marker
data set.

...Continue this way up through LINE (n2mark*2+1).

LINE (n2mark*2+2):  This line should contain the number of one-marker
data sets.  Call this number n1mark.
IF THE NUMBER ON LINE n2mark*2+2 IS 0, THEN THAT IS THE END OF THE FILE,
OTHERWISE:
LINE (n2mark*2+3):  The first one-marker data set (4 counts) should be
entered on one line in the following order:
-- c C Cc
where C is the allele linked to the mutant copy of the gene and c is the
allele linked to the normal copy.
LINE (n2mark*2+4):  The recombination fraction between the marker and
the gene for the first one-marker data set should be entered on this
line.  It must be between 0 and 1/2, with 0 allowed but not 1/2.

...Continue this way up through LINE (n2mark*2+n1mark*2+2).

VI. MODEL FILE

See the example files model.test1, model.test2, and model.test3.

We consider 5 distinct classes of parameter: segregation parameters
(class denoted by s), probabilities of 1 sperm deposited (class
denoted by d1), probabilities of 2 sperm deposited (class denoted 
by d2), allele amplification probabilities (class denoted by a), allele
contamination probabilities (class denoted by c).  Each two-marker data
set involves 1 s parameter, 1 d1 parameter, 1 d2 parameter, 4 a
parameters, and 4 c parameters.  Each one-marker data set involves 1 s
parameter, 1 d1 parameter, 1 d2 parameter, 2 a parameters, and 2 c
parameters.  Some of these parameters may be assumed to be equal across
some of the data sets.  In the model file, each distinct parameter to 
be estimated will be assigned an index number by the user, and that
index will be used to keep track of which parameters are shared in
common by which data sets.

The model file consists of two main parts.  In the first part, for each
data set, we consider the parameters associated with that data set and
associate an index with each parameter of each data set.  Parameters 
within
the same class (i.e. s, d1, d2, a, or c) that have the same index 
associated with them are assumed to
be set equal under the model.  Parameters with different indices
assigned to them are assumed to be distinct.  Parameters from different
classes can have the same index assigned to them, but they will not be
assumed equal, i.e. common indices have meaning only within classes, not
across different classes.

In the second part of the model file, we allow for some parameters to be 
set equal to fixed values.  

(A) IF THERE ARE BOTH TWO-MARKER AND ONE-MARKER DATA SETS:

LINE 1:  For each two-marker data set, enter an index indicating which
segregation parameter should be used.  Each index is an integer between
1 and (n2mark+n1mark) inclusive, where n2mark is the number of two-marker
data sets and n1mark is the number of one-marker data sets.  (The values
of n2mark and n1mark were entered in the data file).  These 
indices will be used to label the
output parameter estimates.  If two or more data sets are given the same 
index, then they will be assumed to have equal segregation probabilities.
The indices must be entered in the same order that the data sets were
entered in the data file.
For instance, if there are 3 two-marker data sets, and you want to
assume that the first two data sets entered in the data file have the
same segregation parameter while the third is different, line 1 could
read:
1 1 2
(or 2 2 3 or 3 3 1 etc.)  The labels themselves are fairly arbitrary.
The important thing is which data sets are assumed to have equal
segregation parameters.
LINE 2: Do the same as above but for the one-marker data sets.  Note:
if a one-marker data set and a two-marker data set have the same index
assigned to their segregation parameters, 
their segregation parameters will be assumed to be equal.
LINE 3:  For each two-marker data set, enter an index indicating which
d1 parameter (=probability of 1 sperm deposited) should be used.  This 
is the same idea as line 1, but for the d1 deposit parameters.  Note
that there will be no connection between the indices chosen for
segregation parameters and the indices chosen for deposit parameters,
e.g. a segregation parameter assigned index 1 will have no relationship
with a d1 parameter assigned index 1.  However, all data sets assigned
the same d1 index will be assumed to have the same d1 parameter.  The
data sets must be taken in the same order as in the data file.  Each 
index must be
an integer between 1 and (n2mark+n1mark) inclusive, as above.
LINE 4: Do line 3 for the one-marker data sets.
LINE 5:  Do line 3 but for the d2 parameter (= probability of 2 sperm
deposited).  Note that there will be no connection between the indices
chosen for the d2 parameters and those chosen for parameters in any
other class.
LINE 6:  Do line 4 but for the d2 parameter.
LINE 7:  For the first two-marker data set, for each of the 4 alleles
enter an index indicating which amplification parameter should be used.
Each index is an integer between 1 and (4*n2mark+2*n1mark) inclusive.
These indices will be used to label the
output parameter estimates.  If two or more alleles from among any of
the one- and two-marker data sets are given the same index, then they
will be assumed to have equal amplification probabilities.
The order of the indices entered should correspond to the following
order of alleles: A B a b, where the labels A, B, a,
and b are the same as those used for this data set in constructing
the data file.  Recall, AB is the haplotype linked to the mutant 
version of the gene, ab is linked to the non-mutant version).  The
choice of which is the A/a marker and which is the B/b marker was made
by the user in constructing the data file. 
LINES 8 THROUGH (6+n2mark): Same as line 7, but for each of the
other two-marker data sets, in turn (separate line for each data set).
LINES (7+n2mark) THROUGH (6+n2mark+n1mark): For each one-marker data
set (separate line for each data set), for each of the 2 alleles enter 
an index indicating which amplification parameter should be used, as 
was done for the two-marker data sets.  The order of the indices should
correspond to the following order of alleles: C c, where C is linked to
the mutant version of the gene and c is linked to the non-mutant
version.
LINES (7+n2mark+n1mark) THROUGH (6+2*n2mark+n1mark): Do the same as
lines 7 through (6+n2mark), but for the contamination parameters instead
of the amplification parameters.
LINES (7+2*n2mark+n1mark) THROUGH (6+2*n2mark+2*n1mark): Do the same as
lines (7+n2mark) THROUGH (6+n2mark+n1mark), but for the contamination
parameters instead of the amplification parameters.
LINE (7+2*n2mark+2*n1mark): Specify the number of parameters that will
be set equal to fixed values.  This is an integer between 0 and
(11*n2mark + 7*n1mark), inclusive.  Call it nfix.
LINES (8+2*n2mark+2*n1mark) THROUGH (7+2*n2mark+2*n1mark+nfix):
Each line corresponds to one parameter that is set to a fixed value.
These lines can occur in any order.  On each line the class of parameter
to be fixed (s, d1, d2, a, or c) comes first.  Then comes the index of 
the
parameter to be fixed, where the indices are as given in the beginning
part of the file.  Finally comes the value to which the parameter is to 
be fixed.  For instance, suppose there are 5 two-marker data sets and the
last 3 of them are to have their segregation parameters set equal to 1/2
while the first two are to have freely-varying segregation parameters.
Then line 1 of the file could be
1 2 3 3 3
indicating that the last 3 two-marker data sets
share segregation parameter s_3,
and in the fixed parameter part of the file, there would be another line
s 3 .5
indicating that s_3 is to be set equal to .5.

(B) IF THERE ARE NO ONE-MARKER DATA SETS:

LINES 2, 4, 6, and so on as given above do not appear in the file.  
Thus, the second line of the file would be what is labeled LINE 3 above.
The third line of the file would be what is labeled LINE 5 above, and so
on.

(C) IF THERE ARE NO TWO-MARKER DATA SETS:

Keep in mind that only very restricted models will even be identifiable
in this case.  See the brief discussion on IDENTIFIABILITY below under
WARNINGS.

LINES 1, 3, 5, and so on as given above do not appear in the file.  Thus,
the first line of the file would be what is labeled LINE 2 above.  The
second line of the file would be what is labeled LINE 4 above, and so
on.

VII. OPTIONS FILE

See the example files options.test2 and options.test3.

LINE 1:  The first line
tells the program what to do about confidence intervals.  A "0" on the
first line indicates that no confidence intervals are to be calculated.
A "1" followed by a decimal number tells the
program to calculate confidence intervals only for the segregation
parameters, and that the confidence intervals should be of the
level specified (e.g. line 3 would be
1 .95
to get 95% confidence intervals for the segregation parameters only).
A "2" followed by a decimal number tells the
program to calculate confidence intervals for all of the estimated
parameters, and that the confidence intervals should be of the
level specified.  The level must always be greater than 0 and less than
1.
Note that calculating confidence intervals is time-consuming.
LINE 2:  On line 2 should be a 0, 1, or 2.  Each two-marker data set 
consists of 16 observed cell counts, and
each one-marker data set consists of 4 observed cell counts.  Line 2 tells 
the program whether to output expected cell counts under the fitted 
model (if 2 is input) or cell probabilities under the fitted model
(if 1 is input) or neither (if 0 is
input).  If a model does not fit the data, this information may be
useful to help figure out more specifically in what way the data
contradict the model.  Note that the log-likelihood under a model is
equal to the sum over all cells of (observed cell count)*log(cell
probability under the fitted model).
LINE 3:  A 1 on this line indicates that the log-likelihoods under the
model should be output separately for each individual data set.  A 0 on
this line indicates that only the overall log-likelihood will be
output.
LINE 4:  A 1 on this line indicates that the log-likelihood for
the full multinomial model should be output.  This is equal to sum of
(observed count)*log((observed count)/(total count for that data set)).
A 0 on this line indicates that it will not be output.
LINE 5:  To skip this option, enter a 0 on this line.  On this line, 
information is given about how many search 
steps should be used to find confidence intervals.  Confidence intervals
are found by inverting the likelihood ratio test, which is done by a
search.  No matter how many steps are used in the search, the confidence
interval should be conservative.  As more steps are used, the confidence
interval becomes smaller and closer to the level specified by the user.
However, with more steps, the procedure can become very time-consuming.
The default number of steps is 10 (meaning 10 steps to find the upper
confidence limit and 10 steps to find the lower confidence limit for
each parameter).  Using this number, the right or left
edge of the confidence interval should not be more than .001 off the
true value, and the interval should have level of confidence at least as
high as that specified by the user.  To use the default value of 10
steps, the user should enter a 0 on this line.  There are actually three
different parameters representing numbers of steps used in finding 
confidence intervals: nstep1, nstep2, and nstep3, all three of which 
default to 10.  nstep1 is the number of steps used for the s, d1, and a
parameters, which do not tend to maximize very close to the boundary of
the parameter space.  nstep2 is the number of steps used to find the
upper confidence limits and nstep3 the number of steps used to find the
lower confidence limits for d2 and c, which tend to be very close to 0.
One might wish to reduce nstep3 and increase nstep2, since nstep2
is used to search a much larger region of parameter space than nstep3.
It is reasonable that nstep1 should be somewhere in between nstep2
and nstep3.  To specify nstep1, nstep2, and nstep 3, the user 
should enter a 1 on this line followed by 3 nonnegative
integers on the same line.
LINE 6:  The sixth line gives information about whether the user will be
inputing starting values for the EM algorithm or whether the defaults
should be used.  A "0" indicates that the defaults should be used.
A "1" followed by a name of a starting value file indicates that the
user has specified starting values, for some or all parameters, in the
named starting value file.
LINE 7:  The seventh line gives information about whether a simulation 
should be done to assess goodness of fit of the fitted model to the 
data.  Note
that this is time-consuming.  If no simulation is desired, enter a 0 on
this line.  If simulation is desired, enter a 1 followed by the number
of realizations in the simulation, call this number nsim.  The
simulations test the null hypothesis that the data are generated by the
fitted model against the alternative of the full multinomial model for
the data.  In principle, a likelihood-ratio chi-square test could simply
be performed.  However, in the sperm-typing scenario, it is often hard
to determine what are the appropriate degrees of freedom to use for the
chi-square test, because there are often many parameters maximizing on
the boundary.  We can instead simulate the distribution of the
likelihood ratio statistic under the null to obtain p-values.  The 
simulation proceeds as follows:
After the likelihood is maximized, the cell probabilities for all the
data cells under the fitted model are generated.  Then nsim realizations
of the entire data set are generated by independent multinomial models 
with these cell probabilities as the multinomial probabilities, and with
the sample sizes observed in the real data sets as the sample sizes
within each simulation.  That is, taking the fitted model as the true
model, we generate nsim simulated realizations of the data set.  For
each realization, we refit the model initially entered by the user,
maximizing the likelihood for the simulated data, and obtaining the 
maximized
log-likelihood.  We also calculate the full-multinomial log-likelihood
for each simulated data set.  We compare the observed difference between
these two log-likelihoods for the real data with the differences
obtained in the simulation.  The estimated p-value is the proportion of
simulations for which this difference exceeds the difference observed in
the data.  This allows us to test the null hypothesis
that the model initially specified by the user fits the data against the
alternative of the full multinomial model. 

Note:  If the options file has zeroes at the beginnings of each of 
the first 7 lines, the result will be the same as if no options are 
specified in the input file.

VIII. STARTING VALUE FILE

See the example file start.test3. 

The default starting values used for the EM algorithm are .5 for s 
parameters, .85 for d1
parameters, .05 for d2 parameters, .8 for a parameters, and .005 for c
parameters.  The user can instead provide starting values for some 
or all of the
parameters if desired.  The format of the input is basically the same 
as for the second part of the model file, where parameters could be 
set to fixed values.

LINE 1: Specify the number of parameters for which you will provide
starting values.  This is an integer between 1 and (11*n2mark +
7*n1mark), inclusive.  Call it nstart.
LINES 2 through (nstart+1): Each line corresponds to one parameter for
which a starting value is provided.  These lines can occur in any order.
On each line the class of parameter (s, d1, d2, a, or c) comes first.
Then comes the index of the parameter, where the indices are as given in
the model file.  Finally comes the starting value for that parameter.

Notes:  
1)  If a parameter is fixed in the model file, then any starting value
assigned to it in the starting value file will be ignored.

2)  For each data set, the d1 and d2 parameters must sum to no more than
1.  The program automatically checks for compliance with this when fixed
values and starting values are input by the user.

IX. RESTRICTIONS ON MODELS:

There are some slight restrictions on the models
that can be fit by maximum likelihood using SPERMSEG.  These
restrictions affect only the deposit parameters.  I consider it 
unlikely that a user would try to fit a model that violated these
restrictions.  The restrictions are as follows:

Suppose d1_i and d2_j appear together in at least one data set.  That 
is, suppose there is a data set for which the index of the probability 
of 1 sperm deposited is assigned to be i and the index of the 
probability of 2 sperm deposited is assigned to be j.  Then we require 
that at least one of the following three conditions, 1, 2, or 3, hold:

1.  Any other data sets having d1 parameter d1_i have d2 parameter
d2_j.  Conversely, any data sets having d2 parameter d2_j have d1
parameter d1_i.

or

2.  Any other data sets having d1 parameter d1_i have d2 parameter d2_j.
Consider all data sets having d2 parameter d2_j, and consider the set of
d1 parameters included in these data sets.  Let I be the set of indices
of d1 parameters that appear in a data set with d2_j.  Then for each k
in I, any data set having d1 parameter d1_k must have d2 parameter d2_j.
Furthermore, at least one of the following must hold:
	(i) d2_j is fixed;
	(ii) no more than one of the d1 parameters with indices in I 
	is fixed.  

or

3.  This is the same as condition 2, but with the roles of d1 and d2
reversed.  Any other data sets having d2 parameter d2_j have d1
parameter d1_i.  Consider all data sets having d1 parameter d1_i, and
consider the set of d2 parameters included in these data sets.  Let J be
the set of indices of d2 parameters that appear in a data set with d1_i.
Then for each k in J, any data set having d2 parameter d2_k must have d1
parameter d1_i.  Furthermore, at least one of the following must hold:
	(i) d1_i is fixed;
	(ii) no more than one of the d2 parameters with
	indices in J may be fixed.  

For each d1_i and d2_j appearing in the same data set, we require at 
least one of these three conditions to hold.  The reason for these
restrictions has to do with the fact that the deposit parameters for a
given data set must be between 0 and 1 and must sum to no more than 1.
This gives constraints under which the likelihood is maximized, and
these constraints become very complicated if these conditions do not 
hold.

Another type of restriction on models comes from considerations of
identifiability.  See number 3 below under WARNINGS GENERATED BY
SPERMSEG.

X. ERROR MESSAGES:

SPERMSEG has extensive error-checking of input.  Error messages are very
explicit and should hopefully be self-explanatory.


XI. WARNINGS GENERATED BY SPERMSEG:

1. Fixing a value for a parameter twice, or fixing a starting value for
a parameter twice.  

SPERMSEG allows the user to set some or all parameters to fixed values 
and to provide starting values for the EM algorithm for some or all 
parameters.  In each case, the user provides a list of parameters and
values.  If the same parameter appears more than once on one of these 
lists, SPERMSEG will generate a warning saying which parameter 
is involved, what was the previous value, and what is the new value.
The program will then run normally using whatever was the last value in 
the list for that parameter.

2. Estimation of parameters on the boundary

All of the parameters in the sperm-typing model are probabilities, so
they are constrained to lie between 0 and 1.  The program calculates
the maximum likelihood estimates of the parameters subject to these
constraints.  It may be that at the maximum, some parameters take the
values 0 or 1.  These parameters are said to be estimated on the
boundary.  Using the EM algorithm, these parameter estimates will come
arbitrarily close to the boundary, though they will not actually reach
it exactly, just because of the way the EM algorithm works.  If a 
parameter estimate is within 1.e-10 of a boundary using the
EM algorithm, SPERMSEG will note that the parameter apparently maximized
on the boundary and generate a warning.  (To double check that the
parameter does actually maximize on the boundary, one could go into 
the model file and fix the value of that parameter to its boundary 
value.  Then refit the model and compare the log-likelihoods.  This is
generally not necessary, though.)

Estimation of some parameters on the boundary tends to happen when a lot
of parameters are estimated relative to the sample size or information
available in the data.  
Using a more parsimonious model or, if possible, increasing the sample 
size are two of the ways to stop this from happening.

The reason for the warning is that estimation of parameters on the
boundary can affect the likelihood ratio chi-square test.  An assumption
of that test is that the parameters are not estimated on the boundary.
In practice, if the same parameters are estimated on the boundary in
both the null and alternative models, then the likelihood ratio
chi-square test is probably still a reasonable thing to do.

In sperm-typing data sets, it is common to have some parameters (almost
always contamination parameters or probabilities of 2 sperm deposited)
estimated on the boundary.  The hypothesis tests for which one usually
wants to use the likelihood ratio chi-square involve setting some or all
of the segregation parameters equal to each other or equal to 1/2.
These models usually all have the same parameters estimated on the
boundary, so the likelihood ratio chi-square is still reasonable.
Alternatively, a model in which some or all contamination parameters 
are set
equal (and perhaps some or all probabilities of 2 sperm deposited are set
equal) may eliminate the problem, assuming such a model fits the data.

(The chi-square goodness-of-fit test is an example of a test that one may
often want to perform and that will almost always have different numbers
of parameters maximizing on the boundary between the null and
alternative models.  To get around this problem, a built-in simulation
to assess goodness of fit is included in SPERMSEG as described above
for LINE 7 under OPTIONS FILE.)

3. Identifiability warning

Lack of identifiability of a model means that different combinations of
parameter values can generate the same probability distribution for the
data.  This often results in a situation where
the parameters of the model cannot be estimated from the data.  A
trivial example would be if the data consisted of a single one-marker data
set, and the model were the full sperm-typing model for a single marker
locus, which has 7 parameters (1 segregation, 2 deposit, 2
amplification, 2 contamination).
The single one-marker data set consists of 4 counts, 3 of which are
freely varying, so clearly 7 parameters cannot be estimated.
Identifiability problems can be more subtle.  It may be that there are 3
freely varying counts and 3 parameters in the model, but the model may
still not be identifiable because the data do not contain information
needed to distinguish among different possible combinations of the
parameter values.  In general, identifiability or non-identifiability of
a model may be difficult to prove.  To learn more about identifiability,
see an intermediate statistics textbook such as Casella and Berger.

When does SPERMSEG generate the warning?

The 11 parameters of the full two-marker
sperm-typing model can be estimated from
a single two-marker data set.  The SPERMSEG program checks to see that
each parameter used for a one-marker data set also is used for a
two-marker data set.  In that case, the model is identifiable, and no
warning is generated.  Another common case that is recognized by the
program as identifiable and that will not generate a
warning is if the only parameters used in one-marker data sets that are
not also used in two-marker data sets are deposit parameters.  For all
other types of parameters, if there is at least one parameter being
estimated that appears in no two-marker data set, then SPERMSEG will
generate the identifiability warning.  The program will still perform
the analysis.  Generation of the warning does not mean that the model is
not identifiable.  It merely means that the program cannot verify
identifiability.

What should you do when you get the warning?

First check to see that you entered your model correctly.  Next, try to
verify identifiability.  In some cases you may be able to use the fact
that you can't estimate more parameters in a data set than you have
degrees of freedom.  However, that can only confirm non-identifiability,
not identifiability.  One practical way to investigate possible problems
with identifiability is to try different starting points in the
algorithm, using the starting points option described above.  A
characteristic feature of non-identifiability is that different starting
points lead to different estimates, though this can be due to other
causes as well.  In general, identifiability of a model requires a
mathematical proof.  The bottom line is that if you are unable to verify
that a model is identifiable, then you should not use that model.

XII. EXAMPLE DATA SETS/FILES:

The software comes with three examples to demonstrate the input and
output of the program.

Example 1:  This is almost the simplest possible example.
It consists of a single two-marker data set in which marker A/a is
completely linked to the gene and marker B/b is at recombination
fraction .01 (see data.test1).  The model fit to this data set (see
model.test1) is the full sperm-typing
model with 11 parameters: s, d1, d2, 4 amplification parameters (1 each
for alleles A, a, B, and b), and 4 contamination parameters (1 each for
alleles A, a, B, and b).  No additional options are requested.  The
files are:

input.test1
data.test1
model.test1

Output for this example can be found in the file output.test1.

Example 2:  This is a variation on example 1 where the model for the
data in the file model.test2 has locus-specific amplification and 
contamination parameters,
i.e. a and A are assumed to have equal amplification rates and equal
contamination rates, and b and B are assumed to have equal amplification
rates and equal contamination rates distinct from a/A's.  Here some
special options are used in the file options.test2: a 95% confidence 
interval for the segregation
parameter is calculated.  (By changing the first number in the file
options.test2 from a "1" to a "2", one could get 95% confidence
intervals for all estimated parameters, but that is more
time-consuming.)  The program outputs expected cell counts, i.e. the
number of each of the 16 types of observations that would be expected
under the fitted model.  The program also outputs the log-likelihood 
under the full multinomial model.  (See OPTIONS above for more details.)
The files are:

input.test2
data.test1
model.test2
options.test2

Output for this example can be found in the file output.test2.

Example 3:  In this example, there are 2 donors.  Donor 1 has a 2-marker
data set (call it 2-marker data set 1) in which markers M1 and M2 are
typed, where M1 is completely linked to the gene and M2 is at
recombination fraction .01.  Donor 1 has another 2-marker data set (call
it 2-marker data set 2) in which markers M1 and M3 are typed, where M1
is the same as before and M3 is at recombination fraction .02 from the
gene.  Donor 2 has a 2-marker data set (call it 2-marker data set 3) in
which markers M2 and M3 are typed, and donor 2 also has a 1-marker data
set (call it 1-marker data set 1) in which only marker M2 is typed.  See
data.test3.  The model fit to this data set (see model.test3) has
segregation parameters set equal to 1/2, experiment-specific d1
parameters, and donor-specific d2 parameters.  Each donor has his own
allele-specific amplification and contamination parameters.  Some
special options are used in the file options.test3: cell probabilities
under the fitted model are given, a separate log-likelihood is
calculated for each data set, log-likelihoods under the full multinomial
model are calculated, and starting values for the EM algorithm are
specified in the file start.test3 for some of the parameters (for the
remaining parameters, defaults are used).  (See OPTIONS above for more
details.)  The files are:

input.test3
data.test3
model.test3
options.test3
start.test3

Output for this example can be found in the file output.test3.