SPERMSEG: analysis of segregation distortion in single sperm data Copyright (C) 1999 Mary Sara McPeek ========================================================= LICENSE This program is free software; you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation; either version 2 of the License, or (at your option) any later version. This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY of FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details. You should have received a copy of the GNU General Public License along with this program (see file gpl.txt); if not, write to the Free Software Foundation, Inc., 59 Temple Place - Suite 330, Boston, MA 02111-1307, USA. I request that use of this software be cited in publications as M.S. McPeek (1999) "SPERMSEG: analysis of segregation distortion in single-sperm data" American Journal of Human Genetics 65:1195-1197. To contact the author: Mary Sara McPeek Department of Statistics 5734 S. University Ave. Chicago, IL 60637 email: mcpeek@galton.uchicago.edu ======================================================== TABLE OF CONTENTS I. COMPILING SPERMSEG II. RUNNING SPERMSEG III. GENERAL NOTES IV. INPUT FILE V. DATA FILE VI. MODEL FILE VII. OPTIONS FILE VIII. STARTING VALUE FILE IX. RESTRICTIONS ON MODELS X. ERROR MESSAGES XI. WARNINGS GENERATED BY THE PROGRAM XII. EXAMPLE DATA SETS/FILES ====================================== I. COMPILING SPERMSEG All you really need is spermseg.tar.gz. Uncompress it using the command "gunzip spermseg.tar.gz". Then enter the directory where you want to keep the program and issue the command "tar xvf spermseg.tar". If you have problems using gunzip or tar, you can just take the individual files from this directory. Edit the Makefile as necessary according to the instructions in the Makefile. You should only need to make sure that the correct compiler and compiler options for your machine are chosen. To create the executable file, within the same directory as these files issue the command "make". The program has been developed and tested on the following platforms: Solaris (5.6 and 5.7) SGI IRIX (6.4) Linux (Red Hat 5.2) The code was written to be highly portable. If you have problems compiling on your system, you should first talk to your system administrator who will be most familiar with importing software to your system. If you have comments or questions, you can send them to mcpeek@galton.uchicago.edu II. RUNNING SPERMSEG The name of the executable file is spermseg. To run the program, make sure the executable file, input file, data file, model file (and, if used, options file and starting value file) are all in the same directory. In that directory, type spermseg, followed by the name of the input file, on the command line. For instance, to run the program using the first example input provided, enter the SPERMSEG directory and type spermseg input.test1 III. GENERAL NOTES on the input, data, model, options, and starting value files: The information in these files is read in line by line, with extensive error checking. In particular, if the program encounters a blank line where it is expecting some information in one of these files, or if it reaches the end of a line without finding all the information expected on that line, you will get an explicit error message, telling exactly where the problem occurred. Any amount of white space can separate items on the same line. Once the program has found all the information it is expecting on a certain line, the rest of the line is ignored, so comments can be made there. In the example files, comments are indicated by the symbols <<. IV. INPUT FILE See the example files input.test1, input.test2, and input.test3. LINE 1: The input file must have the name of the data file on the first line. LINE 2: The name of the model file goes on the second line. LINE 3: The third line tells the program whether any special options will be used. The available options fall into 7 categories: confidence intervals, output of cell probabilities or expected cell counts for the data, output of separate log-likelihoods under the model for the individual data sets, calculation of log-likelihood of the data under full multinomial model, user-specified number of search steps for creation of confidence intervals, user-specified starting values for the EM algorithm, simulation to assess goodness of fit of the chosen model to the data. If none of these options are desired, enter a "0" on the third line. If one or more options are desired, enter a "1" followed by the name of the options file. Details on these options are given below under OPTIONS FILE. V. DATA FILE See the example files data.test1 and data.test3. LINE 1: The data file must have the number of two-marker data sets on the first line. Call this number n2mark. IF THE NUMBER ON LINE 1 IS 0, SKIP DOWN TO LINE (n2mark*2+2). OTHERWISE: LINE 2: The first two-marker data set (16 counts) should be entered on one line in the following order: -- b a ab B Bb Ba Bab A Ab Aa Aab AB ABb ABa ABab where AB is the haplotype linked to the mutant copy of the gene and ab is the haplotype linked to the normal copy of the gene. It is the user's choice which marker is the A marker and which is the B marker, but that choice should be consistent between the data file and the model file for each two-marker data set. LINE 3: The 3 recombination fractions among the markers and the gene for the first two-marker data set should be entered on the third line. First comes the recombination fraction between marker A and the gene, then the recombination fraction between marker B and the gene, then the recombination fraction between markers A and B. All recombination fractions must be between 0 and 1/2, with 0 allowed but not 1/2. This parametrization allows for interference. LINES 4 and 5: Just like lines 2 and 3, but for the second two-marker data set. ...Continue this way up through LINE (n2mark*2+1). LINE (n2mark*2+2): This line should contain the number of one-marker data sets. Call this number n1mark. IF THE NUMBER ON LINE n2mark*2+2 IS 0, THEN THAT IS THE END OF THE FILE, OTHERWISE: LINE (n2mark*2+3): The first one-marker data set (4 counts) should be entered on one line in the following order: -- c C Cc where C is the allele linked to the mutant copy of the gene and c is the allele linked to the normal copy. LINE (n2mark*2+4): The recombination fraction between the marker and the gene for the first one-marker data set should be entered on this line. It must be between 0 and 1/2, with 0 allowed but not 1/2. ...Continue this way up through LINE (n2mark*2+n1mark*2+2). VI. MODEL FILE See the example files model.test1, model.test2, and model.test3. We consider 5 distinct classes of parameter: segregation parameters (class denoted by s), probabilities of 1 sperm deposited (class denoted by d1), probabilities of 2 sperm deposited (class denoted by d2), allele amplification probabilities (class denoted by a), allele contamination probabilities (class denoted by c). Each two-marker data set involves 1 s parameter, 1 d1 parameter, 1 d2 parameter, 4 a parameters, and 4 c parameters. Each one-marker data set involves 1 s parameter, 1 d1 parameter, 1 d2 parameter, 2 a parameters, and 2 c parameters. Some of these parameters may be assumed to be equal across some of the data sets. In the model file, each distinct parameter to be estimated will be assigned an index number by the user, and that index will be used to keep track of which parameters are shared in common by which data sets. The model file consists of two main parts. In the first part, for each data set, we consider the parameters associated with that data set and associate an index with each parameter of each data set. Parameters within the same class (i.e. s, d1, d2, a, or c) that have the same index associated with them are assumed to be set equal under the model. Parameters with different indices assigned to them are assumed to be distinct. Parameters from different classes can have the same index assigned to them, but they will not be assumed equal, i.e. common indices have meaning only within classes, not across different classes. In the second part of the model file, we allow for some parameters to be set equal to fixed values. (A) IF THERE ARE BOTH TWO-MARKER AND ONE-MARKER DATA SETS: LINE 1: For each two-marker data set, enter an index indicating which segregation parameter should be used. Each index is an integer between 1 and (n2mark+n1mark) inclusive, where n2mark is the number of two-marker data sets and n1mark is the number of one-marker data sets. (The values of n2mark and n1mark were entered in the data file). These indices will be used to label the output parameter estimates. If two or more data sets are given the same index, then they will be assumed to have equal segregation probabilities. The indices must be entered in the same order that the data sets were entered in the data file. For instance, if there are 3 two-marker data sets, and you want to assume that the first two data sets entered in the data file have the same segregation parameter while the third is different, line 1 could read: 1 1 2 (or 2 2 3 or 3 3 1 etc.) The labels themselves are fairly arbitrary. The important thing is which data sets are assumed to have equal segregation parameters. LINE 2: Do the same as above but for the one-marker data sets. Note: if a one-marker data set and a two-marker data set have the same index assigned to their segregation parameters, their segregation parameters will be assumed to be equal. LINE 3: For each two-marker data set, enter an index indicating which d1 parameter (=probability of 1 sperm deposited) should be used. This is the same idea as line 1, but for the d1 deposit parameters. Note that there will be no connection between the indices chosen for segregation parameters and the indices chosen for deposit parameters, e.g. a segregation parameter assigned index 1 will have no relationship with a d1 parameter assigned index 1. However, all data sets assigned the same d1 index will be assumed to have the same d1 parameter. The data sets must be taken in the same order as in the data file. Each index must be an integer between 1 and (n2mark+n1mark) inclusive, as above. LINE 4: Do line 3 for the one-marker data sets. LINE 5: Do line 3 but for the d2 parameter (= probability of 2 sperm deposited). Note that there will be no connection between the indices chosen for the d2 parameters and those chosen for parameters in any other class. LINE 6: Do line 4 but for the d2 parameter. LINE 7: For the first two-marker data set, for each of the 4 alleles enter an index indicating which amplification parameter should be used. Each index is an integer between 1 and (4*n2mark+2*n1mark) inclusive. These indices will be used to label the output parameter estimates. If two or more alleles from among any of the one- and two-marker data sets are given the same index, then they will be assumed to have equal amplification probabilities. The order of the indices entered should correspond to the following order of alleles: A B a b, where the labels A, B, a, and b are the same as those used for this data set in constructing the data file. Recall, AB is the haplotype linked to the mutant version of the gene, ab is linked to the non-mutant version). The choice of which is the A/a marker and which is the B/b marker was made by the user in constructing the data file. LINES 8 THROUGH (6+n2mark): Same as line 7, but for each of the other two-marker data sets, in turn (separate line for each data set). LINES (7+n2mark) THROUGH (6+n2mark+n1mark): For each one-marker data set (separate line for each data set), for each of the 2 alleles enter an index indicating which amplification parameter should be used, as was done for the two-marker data sets. The order of the indices should correspond to the following order of alleles: C c, where C is linked to the mutant version of the gene and c is linked to the non-mutant version. LINES (7+n2mark+n1mark) THROUGH (6+2*n2mark+n1mark): Do the same as lines 7 through (6+n2mark), but for the contamination parameters instead of the amplification parameters. LINES (7+2*n2mark+n1mark) THROUGH (6+2*n2mark+2*n1mark): Do the same as lines (7+n2mark) THROUGH (6+n2mark+n1mark), but for the contamination parameters instead of the amplification parameters. LINE (7+2*n2mark+2*n1mark): Specify the number of parameters that will be set equal to fixed values. This is an integer between 0 and (11*n2mark + 7*n1mark), inclusive. Call it nfix. LINES (8+2*n2mark+2*n1mark) THROUGH (7+2*n2mark+2*n1mark+nfix): Each line corresponds to one parameter that is set to a fixed value. These lines can occur in any order. On each line the class of parameter to be fixed (s, d1, d2, a, or c) comes first. Then comes the index of the parameter to be fixed, where the indices are as given in the beginning part of the file. Finally comes the value to which the parameter is to be fixed. For instance, suppose there are 5 two-marker data sets and the last 3 of them are to have their segregation parameters set equal to 1/2 while the first two are to have freely-varying segregation parameters. Then line 1 of the file could be 1 2 3 3 3 indicating that the last 3 two-marker data sets share segregation parameter s_3, and in the fixed parameter part of the file, there would be another line s 3 .5 indicating that s_3 is to be set equal to .5. (B) IF THERE ARE NO ONE-MARKER DATA SETS: LINES 2, 4, 6, and so on as given above do not appear in the file. Thus, the second line of the file would be what is labeled LINE 3 above. The third line of the file would be what is labeled LINE 5 above, and so on. (C) IF THERE ARE NO TWO-MARKER DATA SETS: Keep in mind that only very restricted models will even be identifiable in this case. See the brief discussion on IDENTIFIABILITY below under WARNINGS. LINES 1, 3, 5, and so on as given above do not appear in the file. Thus, the first line of the file would be what is labeled LINE 2 above. The second line of the file would be what is labeled LINE 4 above, and so on. VII. OPTIONS FILE See the example files options.test2 and options.test3. LINE 1: The first line tells the program what to do about confidence intervals. A "0" on the first line indicates that no confidence intervals are to be calculated. A "1" followed by a decimal number tells the program to calculate confidence intervals only for the segregation parameters, and that the confidence intervals should be of the level specified (e.g. line 3 would be 1 .95 to get 95% confidence intervals for the segregation parameters only). A "2" followed by a decimal number tells the program to calculate confidence intervals for all of the estimated parameters, and that the confidence intervals should be of the level specified. The level must always be greater than 0 and less than 1. Note that calculating confidence intervals is time-consuming. LINE 2: On line 2 should be a 0, 1, or 2. Each two-marker data set consists of 16 observed cell counts, and each one-marker data set consists of 4 observed cell counts. Line 2 tells the program whether to output expected cell counts under the fitted model (if 2 is input) or cell probabilities under the fitted model (if 1 is input) or neither (if 0 is input). If a model does not fit the data, this information may be useful to help figure out more specifically in what way the data contradict the model. Note that the log-likelihood under a model is equal to the sum over all cells of (observed cell count)*log(cell probability under the fitted model). LINE 3: A 1 on this line indicates that the log-likelihoods under the model should be output separately for each individual data set. A 0 on this line indicates that only the overall log-likelihood will be output. LINE 4: A 1 on this line indicates that the log-likelihood for the full multinomial model should be output. This is equal to sum of (observed count)*log((observed count)/(total count for that data set)). A 0 on this line indicates that it will not be output. LINE 5: To skip this option, enter a 0 on this line. On this line, information is given about how many search steps should be used to find confidence intervals. Confidence intervals are found by inverting the likelihood ratio test, which is done by a search. No matter how many steps are used in the search, the confidence interval should be conservative. As more steps are used, the confidence interval becomes smaller and closer to the level specified by the user. However, with more steps, the procedure can become very time-consuming. The default number of steps is 10 (meaning 10 steps to find the upper confidence limit and 10 steps to find the lower confidence limit for each parameter). Using this number, the right or left edge of the confidence interval should not be more than .001 off the true value, and the interval should have level of confidence at least as high as that specified by the user. To use the default value of 10 steps, the user should enter a 0 on this line. There are actually three different parameters representing numbers of steps used in finding confidence intervals: nstep1, nstep2, and nstep3, all three of which default to 10. nstep1 is the number of steps used for the s, d1, and a parameters, which do not tend to maximize very close to the boundary of the parameter space. nstep2 is the number of steps used to find the upper confidence limits and nstep3 the number of steps used to find the lower confidence limits for d2 and c, which tend to be very close to 0. One might wish to reduce nstep3 and increase nstep2, since nstep2 is used to search a much larger region of parameter space than nstep3. It is reasonable that nstep1 should be somewhere in between nstep2 and nstep3. To specify nstep1, nstep2, and nstep 3, the user should enter a 1 on this line followed by 3 nonnegative integers on the same line. LINE 6: The sixth line gives information about whether the user will be inputing starting values for the EM algorithm or whether the defaults should be used. A "0" indicates that the defaults should be used. A "1" followed by a name of a starting value file indicates that the user has specified starting values, for some or all parameters, in the named starting value file. LINE 7: The seventh line gives information about whether a simulation should be done to assess goodness of fit of the fitted model to the data. Note that this is time-consuming. If no simulation is desired, enter a 0 on this line. If simulation is desired, enter a 1 followed by the number of realizations in the simulation, call this number nsim. The simulations test the null hypothesis that the data are generated by the fitted model against the alternative of the full multinomial model for the data. In principle, a likelihood-ratio chi-square test could simply be performed. However, in the sperm-typing scenario, it is often hard to determine what are the appropriate degrees of freedom to use for the chi-square test, because there are often many parameters maximizing on the boundary. We can instead simulate the distribution of the likelihood ratio statistic under the null to obtain p-values. The simulation proceeds as follows: After the likelihood is maximized, the cell probabilities for all the data cells under the fitted model are generated. Then nsim realizations of the entire data set are generated by independent multinomial models with these cell probabilities as the multinomial probabilities, and with the sample sizes observed in the real data sets as the sample sizes within each simulation. That is, taking the fitted model as the true model, we generate nsim simulated realizations of the data set. For each realization, we refit the model initially entered by the user, maximizing the likelihood for the simulated data, and obtaining the maximized log-likelihood. We also calculate the full-multinomial log-likelihood for each simulated data set. We compare the observed difference between these two log-likelihoods for the real data with the differences obtained in the simulation. The estimated p-value is the proportion of simulations for which this difference exceeds the difference observed in the data. This allows us to test the null hypothesis that the model initially specified by the user fits the data against the alternative of the full multinomial model. Note: If the options file has zeroes at the beginnings of each of the first 7 lines, the result will be the same as if no options are specified in the input file. VIII. STARTING VALUE FILE See the example file start.test3. The default starting values used for the EM algorithm are .5 for s parameters, .85 for d1 parameters, .05 for d2 parameters, .8 for a parameters, and .005 for c parameters. The user can instead provide starting values for some or all of the parameters if desired. The format of the input is basically the same as for the second part of the model file, where parameters could be set to fixed values. LINE 1: Specify the number of parameters for which you will provide starting values. This is an integer between 1 and (11*n2mark + 7*n1mark), inclusive. Call it nstart. LINES 2 through (nstart+1): Each line corresponds to one parameter for which a starting value is provided. These lines can occur in any order. On each line the class of parameter (s, d1, d2, a, or c) comes first. Then comes the index of the parameter, where the indices are as given in the model file. Finally comes the starting value for that parameter. Notes: 1) If a parameter is fixed in the model file, then any starting value assigned to it in the starting value file will be ignored. 2) For each data set, the d1 and d2 parameters must sum to no more than 1. The program automatically checks for compliance with this when fixed values and starting values are input by the user. IX. RESTRICTIONS ON MODELS: There are some slight restrictions on the models that can be fit by maximum likelihood using SPERMSEG. These restrictions affect only the deposit parameters. I consider it unlikely that a user would try to fit a model that violated these restrictions. The restrictions are as follows: Suppose d1_i and d2_j appear together in at least one data set. That is, suppose there is a data set for which the index of the probability of 1 sperm deposited is assigned to be i and the index of the probability of 2 sperm deposited is assigned to be j. Then we require that at least one of the following three conditions, 1, 2, or 3, hold: 1. Any other data sets having d1 parameter d1_i have d2 parameter d2_j. Conversely, any data sets having d2 parameter d2_j have d1 parameter d1_i. or 2. Any other data sets having d1 parameter d1_i have d2 parameter d2_j. Consider all data sets having d2 parameter d2_j, and consider the set of d1 parameters included in these data sets. Let I be the set of indices of d1 parameters that appear in a data set with d2_j. Then for each k in I, any data set having d1 parameter d1_k must have d2 parameter d2_j. Furthermore, at least one of the following must hold: (i) d2_j is fixed; (ii) no more than one of the d1 parameters with indices in I is fixed. or 3. This is the same as condition 2, but with the roles of d1 and d2 reversed. Any other data sets having d2 parameter d2_j have d1 parameter d1_i. Consider all data sets having d1 parameter d1_i, and consider the set of d2 parameters included in these data sets. Let J be the set of indices of d2 parameters that appear in a data set with d1_i. Then for each k in J, any data set having d2 parameter d2_k must have d1 parameter d1_i. Furthermore, at least one of the following must hold: (i) d1_i is fixed; (ii) no more than one of the d2 parameters with indices in J may be fixed. For each d1_i and d2_j appearing in the same data set, we require at least one of these three conditions to hold. The reason for these restrictions has to do with the fact that the deposit parameters for a given data set must be between 0 and 1 and must sum to no more than 1. This gives constraints under which the likelihood is maximized, and these constraints become very complicated if these conditions do not hold. Another type of restriction on models comes from considerations of identifiability. See number 3 below under WARNINGS GENERATED BY SPERMSEG. X. ERROR MESSAGES: SPERMSEG has extensive error-checking of input. Error messages are very explicit and should hopefully be self-explanatory. XI. WARNINGS GENERATED BY SPERMSEG: 1. Fixing a value for a parameter twice, or fixing a starting value for a parameter twice. SPERMSEG allows the user to set some or all parameters to fixed values and to provide starting values for the EM algorithm for some or all parameters. In each case, the user provides a list of parameters and values. If the same parameter appears more than once on one of these lists, SPERMSEG will generate a warning saying which parameter is involved, what was the previous value, and what is the new value. The program will then run normally using whatever was the last value in the list for that parameter. 2. Estimation of parameters on the boundary All of the parameters in the sperm-typing model are probabilities, so they are constrained to lie between 0 and 1. The program calculates the maximum likelihood estimates of the parameters subject to these constraints. It may be that at the maximum, some parameters take the values 0 or 1. These parameters are said to be estimated on the boundary. Using the EM algorithm, these parameter estimates will come arbitrarily close to the boundary, though they will not actually reach it exactly, just because of the way the EM algorithm works. If a parameter estimate is within 1.e-10 of a boundary using the EM algorithm, SPERMSEG will note that the parameter apparently maximized on the boundary and generate a warning. (To double check that the parameter does actually maximize on the boundary, one could go into the model file and fix the value of that parameter to its boundary value. Then refit the model and compare the log-likelihoods. This is generally not necessary, though.) Estimation of some parameters on the boundary tends to happen when a lot of parameters are estimated relative to the sample size or information available in the data. Using a more parsimonious model or, if possible, increasing the sample size are two of the ways to stop this from happening. The reason for the warning is that estimation of parameters on the boundary can affect the likelihood ratio chi-square test. An assumption of that test is that the parameters are not estimated on the boundary. In practice, if the same parameters are estimated on the boundary in both the null and alternative models, then the likelihood ratio chi-square test is probably still a reasonable thing to do. In sperm-typing data sets, it is common to have some parameters (almost always contamination parameters or probabilities of 2 sperm deposited) estimated on the boundary. The hypothesis tests for which one usually wants to use the likelihood ratio chi-square involve setting some or all of the segregation parameters equal to each other or equal to 1/2. These models usually all have the same parameters estimated on the boundary, so the likelihood ratio chi-square is still reasonable. Alternatively, a model in which some or all contamination parameters are set equal (and perhaps some or all probabilities of 2 sperm deposited are set equal) may eliminate the problem, assuming such a model fits the data. (The chi-square goodness-of-fit test is an example of a test that one may often want to perform and that will almost always have different numbers of parameters maximizing on the boundary between the null and alternative models. To get around this problem, a built-in simulation to assess goodness of fit is included in SPERMSEG as described above for LINE 7 under OPTIONS FILE.) 3. Identifiability warning Lack of identifiability of a model means that different combinations of parameter values can generate the same probability distribution for the data. This often results in a situation where the parameters of the model cannot be estimated from the data. A trivial example would be if the data consisted of a single one-marker data set, and the model were the full sperm-typing model for a single marker locus, which has 7 parameters (1 segregation, 2 deposit, 2 amplification, 2 contamination). The single one-marker data set consists of 4 counts, 3 of which are freely varying, so clearly 7 parameters cannot be estimated. Identifiability problems can be more subtle. It may be that there are 3 freely varying counts and 3 parameters in the model, but the model may still not be identifiable because the data do not contain information needed to distinguish among different possible combinations of the parameter values. In general, identifiability or non-identifiability of a model may be difficult to prove. To learn more about identifiability, see an intermediate statistics textbook such as Casella and Berger. When does SPERMSEG generate the warning? The 11 parameters of the full two-marker sperm-typing model can be estimated from a single two-marker data set. The SPERMSEG program checks to see that each parameter used for a one-marker data set also is used for a two-marker data set. In that case, the model is identifiable, and no warning is generated. Another common case that is recognized by the program as identifiable and that will not generate a warning is if the only parameters used in one-marker data sets that are not also used in two-marker data sets are deposit parameters. For all other types of parameters, if there is at least one parameter being estimated that appears in no two-marker data set, then SPERMSEG will generate the identifiability warning. The program will still perform the analysis. Generation of the warning does not mean that the model is not identifiable. It merely means that the program cannot verify identifiability. What should you do when you get the warning? First check to see that you entered your model correctly. Next, try to verify identifiability. In some cases you may be able to use the fact that you can't estimate more parameters in a data set than you have degrees of freedom. However, that can only confirm non-identifiability, not identifiability. One practical way to investigate possible problems with identifiability is to try different starting points in the algorithm, using the starting points option described above. A characteristic feature of non-identifiability is that different starting points lead to different estimates, though this can be due to other causes as well. In general, identifiability of a model requires a mathematical proof. The bottom line is that if you are unable to verify that a model is identifiable, then you should not use that model. XII. EXAMPLE DATA SETS/FILES: The software comes with three examples to demonstrate the input and output of the program. Example 1: This is almost the simplest possible example. It consists of a single two-marker data set in which marker A/a is completely linked to the gene and marker B/b is at recombination fraction .01 (see data.test1). The model fit to this data set (see model.test1) is the full sperm-typing model with 11 parameters: s, d1, d2, 4 amplification parameters (1 each for alleles A, a, B, and b), and 4 contamination parameters (1 each for alleles A, a, B, and b). No additional options are requested. The files are: input.test1 data.test1 model.test1 Output for this example can be found in the file output.test1. Example 2: This is a variation on example 1 where the model for the data in the file model.test2 has locus-specific amplification and contamination parameters, i.e. a and A are assumed to have equal amplification rates and equal contamination rates, and b and B are assumed to have equal amplification rates and equal contamination rates distinct from a/A's. Here some special options are used in the file options.test2: a 95% confidence interval for the segregation parameter is calculated. (By changing the first number in the file options.test2 from a "1" to a "2", one could get 95% confidence intervals for all estimated parameters, but that is more time-consuming.) The program outputs expected cell counts, i.e. the number of each of the 16 types of observations that would be expected under the fitted model. The program also outputs the log-likelihood under the full multinomial model. (See OPTIONS above for more details.) The files are: input.test2 data.test1 model.test2 options.test2 Output for this example can be found in the file output.test2. Example 3: In this example, there are 2 donors. Donor 1 has a 2-marker data set (call it 2-marker data set 1) in which markers M1 and M2 are typed, where M1 is completely linked to the gene and M2 is at recombination fraction .01. Donor 1 has another 2-marker data set (call it 2-marker data set 2) in which markers M1 and M3 are typed, where M1 is the same as before and M3 is at recombination fraction .02 from the gene. Donor 2 has a 2-marker data set (call it 2-marker data set 3) in which markers M2 and M3 are typed, and donor 2 also has a 1-marker data set (call it 1-marker data set 1) in which only marker M2 is typed. See data.test3. The model fit to this data set (see model.test3) has segregation parameters set equal to 1/2, experiment-specific d1 parameters, and donor-specific d2 parameters. Each donor has his own allele-specific amplification and contamination parameters. Some special options are used in the file options.test3: cell probabilities under the fitted model are given, a separate log-likelihood is calculated for each data set, log-likelihoods under the full multinomial model are calculated, and starting values for the EM algorithm are specified in the file start.test3 for some of the parameters (for the remaining parameters, defaults are used). (See OPTIONS above for more details.) The files are: input.test3 data.test3 model.test3 options.test3 start.test3 Output for this example can be found in the file output.test3.