Input

Required Input Files:

1. phenotype data file (default filename is "pedigree")

This file contains the pedigree and phenotype information. Individuals who are not listed in this file will not be included in the analysis.

    1    1    7    6    1    1
    1    2    7    6    2    2
    1    3    7    6    1    2
    1    7    0    0    1    1
    1    6    0    0    2    0
    2    11   18   19   2    2
    2    12   18   19   1    1
    2    18   0    0    1    0
    2    19   15   16   2    1
    2    15   0    0    1    0
    2    16   0    0    2    0
    (1)  (2)  (3)  (4)  (5)  (6)

    (1) family ID (positive integer)
    (2) individual ID (positive integer; must be unique)
    (3) father's ID (0 if the individual is a founder)
    (4) mother's ID (0 if the individual is a founder)
    (5) sex (1=male, 2=female)
    (6) affection status (0=unknown, 1=unaffected, 2=affected)

Sampled individuals who are unrelated to anyone else in the sample should be included by giving each such person their own unique family ID (as well as unique individual ID) and setting both parents' IDs to 0.

There is no limit on the number of individuals nor on the number of families. Each individual should be entered only once. The individual ID is required to be unique (e.g. it cannot be reused in a different family). Individuals from the same family should appear in a single cluster, though there is no requirement on the order of individuals within a family nor on the order in which different families are listed.

The default filename is "pedigree". To specify a different filename, use the command-line flag -pheno followed by the filename. For example, to use a phenotype data file called "myphenofile", you could type the command

    ./ATRIUM -pheno myphenofile

2. marker data file (default filename is "markid")

This file contains the marker data. All markers should be on the same chromosome. (To analyze more than one chromosome, a separate run must be performed for each chromosome, with each chromosome having its own marker datafile.)

    marker     chromosome  position  orientation  allele0  allele1  1    2    3    7    6    11   12   18   19   15   16
    rs7909677  10          101955    +            A        G        AG   AA   AA   AA   AG   AG   GG   GG   AG   AG   AG
    rs9419560  10          142201    +            A        G        AA   GG   AG   AG   AG   AG   NN   GG   AG   AA   GG
    rs9419419  10          153707    -            T        C        TC   TC   CC   TC   TC   TT   TC   TT   TC   TT   CC
    (1)        (2)         (3)       (4)          (5)      (6)      (7)  (8)  (9)  (10) (11) (12) (13) (14) (15) (16) (17)

    (1) marker rs number
    (2) chromosome
    (3) physical position
    (4) strand orientation ("+"=same strand as HapMap, "-"=opposite strand from HapMap)
    (5) nucleotide for allele 0
    (6) nucleotide for allele 1
    (7)... marker genotypes (NN for missing genotype)

The first row of the file must contain the column headings. The headings for the first 6 columns can be arbitrary, but should not contain any space characters. Columns 7 and beyond contain marker genotype data for the sampled individuals, and each of these columns must have the corresponding individual's ID number as the heading. The order of the individuals is not required to be the same as the order in the pedigree file; the column headings specify the order.

There is no limit on the number of markers. However, all markers should be on the same chromosome. All individuals in the marker data file should also appear in the phenotype data file; otherwise, they will not be included in the analysis. The number of columns should be the same for every marker; use NN for a missing genotype.

The default filename is "markid". To specify a different filename, use the command-line flag -geno followed by the filename. For example, to use a marker data file called "mymarkfile" you could type the command

    ./ATRIUM -geno mymarkfile

3. The IBD coefficient file (default filename is "ibdcoef")

This file contains condensed identity coefficients for every pair of eligible individuals within each family (including an individual with himself/herself), where an individual is eligible if he or she has either (1) known affection status or (2) non-missing genotype for at least one marker. (E.g.
an individual with unknown phenotype is still eligible if he or she has any non-missing genotype information.) IBD coefficients should be included for every pair of eligible individuals who have the same family ID (including each individual with himself/herself). A sampled individual who does not share a family ID with anyone else in the sample would be represented in the ibdcoef file by a single line that gives the IBD coefficients for the person with himself/herself.

The IBD coefficient file has the following format:

    1    1    0   0   0   0   0   0   1     0     0
    1    2    0   0   0   0   0   0   0.25  0.5   0.25
    1    3    0   0   0   0   0   0   0.25  0.5   0.25
    1    7    0   0   0   0   0   0   0     1     0
    1    6    0   0   0   0   0   0   0     1     0
    2    2    0   0   0   0   0   0   1     0     0
    2    3    0   0   0   0   0   0   0.25  0.5   0.25
    2    7    0   0   0   0   0   0   0     1     0
    2    6    0   0   0   0   0   0   0     1     0
    3    3    0   0   0   0   0   0   1     0     0
    3    7    0   0   0   0   0   0   0     1     0
    3    6    0   0   0   0   0   0   0     1     0
    7    7    0   0   0   0   0   0   1     0     0
    7    6    0   0   0   0   0   0   0     0     1
    6    6    0   0   0   0   0   0   1     0     0
    11   11   0   0   0   0   0   0   1     0     0
    11   12   0   0   0   0   0   0   0.25  0.5   0.25
    .    .    .   .   .   .   .   .   .     .     .
    .    .    .   .   .   .   .   .   .     .     .
    (1)  (2)  (3) (4) (5) (6) (7) (8) (9)   (10)  (11)

    (1) individual 1 ID
    (2) individual 2 ID
    (3)..(11) condensed identity coefficients 1 through 9 between individuals 1 and 2
        (Ken Lange's book, Mathematical and Statistical Methods for Genetic Analysis, has a good description of condensed identity coefficients)

Note that ATRIUM currently permits only outbred individuals in the analysis. For a pair of outbred individuals, columns (3)-(8) will always be 0, and columns (9), (10) and (11) represent the probabilities of sharing 2, 1 or 0 alleles IBD, respectively. If, for any pair of individuals, at least one of the values in columns (3)-(8) is not zero, these individuals will be excluded from the analysis.

The individual IDs in this file should correspond to the individual IDs used in the phenotype data file.
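
The column layout above can be checked before a run with a short script. The following is only a sketch in Python, not part of ATRIUM: the function names are hypothetical, and the exact-zero comparison is an assumption about how "not zero" is evaluated.

```python
def parse_ibd_line(line):
    """Split one IBD coefficient line into (id1, id2, nine coefficients)."""
    fields = line.split()
    if len(fields) != 11:
        raise ValueError(f"expected 11 columns, got {len(fields)}")
    id1, id2 = int(fields[0]), int(fields[1])
    coefs = [float(x) for x in fields[2:]]
    return id1, id2, coefs


def is_outbred_pair(coefs):
    """True when columns (3)-(8), i.e. coefficients 1 through 6, are all zero.

    Assumption: "not zero" is taken literally, with no numerical tolerance.
    """
    return all(c == 0.0 for c in coefs[:6])


def ibd_sharing(coefs):
    """Return (P(2 IBD), P(1 IBD), P(0 IBD)) from columns (9), (10), (11)."""
    return coefs[6], coefs[7], coefs[8]


# Three rows taken from the example file above: self, sib pair, unrelated pair.
sample = """\
1 1 0 0 0 0 0 0 1 0 0
1 2 0 0 0 0 0 0 0.25 0.5 0.25
7 6 0 0 0 0 0 0 0 0 1
"""

for row in sample.splitlines():
    id1, id2, coefs = parse_ibd_line(row)
    print(id1, id2, is_outbred_pair(coefs), ibd_sharing(coefs))
```

A pair for which is_outbred_pair returns False corresponds to the case described above in which the individuals would be excluded from the analysis.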

The software program that can be used to obtain IBD coefficients is:

    The IdCoefs software by Mark Abney, which can be found at
    http://home.uchicago.edu/~abney/Software.html

The IdCoefs software computes condensed identity coefficients for pairs of individuals within each family. The output of IdCoefs can be directly used as input to ATRIUM.

The default filename is "ibdcoef". To specify a different filename, use the command-line flag -ibd followed by the filename. For example, to use an IBD coefficient file called "myibdfile" you could type the command

    ./ATRIUM -ibd myibdfile

4. The multilocus LD database file (default filename is "database")

This file contains information on the joint distribution of untyped SNPs with their tag SNPs in the reference panel.

The software program that can be used to obtain the multilocus LD database file is:

    The tuna_db program of the TUNA package by William Wen and Dan Nicolae.
    The tuna_db program can be found at
    http://www.stat.uchicago.edu/~wen/tuna

The output of tuna_db has the exact format required for the ATRIUM software.

The resulting multilocus LD database file has the following format (one record per line):

    rs7909677 1 101955 A G 0.3508 0.1257 0.3508 4 rs2060138:rs4881551:rs3125023:rs1476130 0:18:26:0.6923_1:0:2:0.0000_2:2:2:1.0000_3:3:3:1.0000_4:1:1:1.0000_5:19:19:1.0000_7:22:22:1.0000_8:1:1:1.0000_9:3:3:1.0000_10:1:1:1.0000_11:37:37:1.0000_12:2:2:1.0000_15:1:1:1.0000
    rs11591988 0 116070 C T 0.1846 0.1685 0.1685 1 rs10794885 0:76:77:0.9870_1:31:43:0.7209
    rs2379071 0 116237 A G 0.4012 0.2414 0.3699 2 rs9419560:rs2060138 0:9:30:0.3000_1:3:3:1.0000_2:2:86:0.0233_3:1:1:1.0000
    rs12773042 0 117636 C G 0.2642 0.1493 0.2521 3 rs2060138:rs4881551:rs4880568 0:0:3:0.0000_1:0:19:0.0000_3:0:26:0.0000_4:9:27:0.3333_5:2:5:0.4000_6:0:3:0.0000_7:0:37:0.0000
    . . . . . . . . . . .
    . . . . . . . . . . .
    (1) (2) (3) (4) (5) (6) (7) (8) (9) (10) (11)

    (1) marker rs number
    (2) typed or not (0=untyped, 1=typed)
    (3) physical position
    (4) nucleotide for allele 0
    (5) nucleotide for allele 1
    (6) maximum multilocus LD measure M_D
    (7) maximum pairwise LD measure r^2
    (8) multilocus LD measure M_D with tag SNPs listed in column (10)
    (9) number of tag SNPs
    (10) list of tag SNPs
    (11) information on the joint distribution of untyped SNPs with their tag SNPs in the reference panel

Detailed explanation of column (11):

For a given untyped SNP, column (11) is divided into h subfields, where h is the number of tag SNP haplotypes, for the given untyped SNP, that occur in the reference panel, and where the subfields are separated by underscores ("_"). Each subfield is further separated into 4 entries, where the entries are separated by colons (":"). The first 3 entries must be integers, and the 4th entry is a double-precision number.

For a given untyped SNP, each tag SNP haplotype that occurs in the reference panel is coded as a nonnegative integer, corresponding to a binary representation. For example, tag SNP haplotype 0000 is coded as 0, 1000 is coded as 1, 0100 is coded as 2, 1100 is coded as 3, etc., where the order of the tag SNPs is the same as in (10).

Each subfield corresponds to a tag SNP haplotype, and the haplotype code must be the first entry in the subfield. E.g. if tag SNP haplotype 0000 occurs in the reference panel, then, in its subfield, the first entry would be 0. Similarly, if tag SNP haplotype 1000 occurs in the reference panel, then, in its subfield, the first entry would be 1. If a haplotype does not appear in the reference panel, then there should be no subfield for that haplotype.

The second entry in the subfield corresponding to tag SNP haplotype H is the count of haplotypes in the reference panel for which the tag SNP haplotype is H and the untyped SNP allele is 1.
The third entry in the subfield corresponding to tag SNP haplotype H is the total count of type H haplotypes in the reference panel. The fourth entry in each subfield is equal to (entry 2)/(entry 3), which represents the estimated conditional probability of allele 1 at the untyped SNP given haplotype H at the tag SNPs.

Special note on phased versus unphased reference panel:

As of October 2009, the current implementation of tuna_db requires a phased reference panel. ATRIUM allows an unphased reference panel, but we do not currently provide a routine to generate the multilocus LD database input file in that case. In order to generate such a database input file yourself, you could replace entry 4 of each subfield with an estimated conditional probability, in the reference panel, of allele 1 at the untyped SNP given haplotype H at the tag SNPs, where this estimated conditional probability could be obtained as a ratio of the corresponding haplotype frequency estimates from an EM algorithm approach or from one of the current imputation models/methods. It is important to note that ATRIUM actually ignores entries 2 and 3 in each subfield of item (11), but reads them in as integers, so for an unphased reference panel, entries 2 and 3 could be set to be arbitrary integers.

The default filename is "database". To specify a different filename, use the command-line flag -db followed by the filename. For example, to use a multilocus LD database file called "myldfile" you could type the command

    ./ATRIUM -db myldfile

5. The parameter file (default filename is "parameter")

This file contains one number: an estimate of the population prevalence of the binary trait. This prevalence value is used in the calculation of the ATRIUM statistic. This should not be prevalence in the case-control sample, but rather the "general population" prevalence for an appropriate reference population.

The default filename is "parameter".
To specify a different filename, use the command-line flag -r followed by the filename. For example, to use a parameter file called "myprev" you could type the command

    ./ATRIUM -r myprev

Optional Input:

6. M_D threshold value (default value is 0.4)

This value is the threshold (which must be a number between 0 and 1) for the minimum allowable amount of information on an untyped SNP, based on its tag SNPs, where information is measured by M_D (Nicolae 2006). M_D quantifies how much of the information, on a given untyped SNP, is captured by its tag SNPs (where 0 is no information and 1 is perfect information). An untyped SNP is considered for testing only if its M_D value, based on its tag SNPs, is strictly greater than this threshold.

The default M_D threshold is 0.4. To change the M_D threshold, use the command-line flag -md followed by the threshold value. For example, to use a more stringent M_D threshold of .75, you could type the command

    ./ATRIUM -md .75

7. r^2 threshold value (default value is 1)

This value is the threshold (which must be a number between 0 and 1) for the maximum allowable r^2 between an untyped SNP and any of its tag SNPs, where r^2 is the square of the correlation coefficient. An untyped SNP is considered for testing only if its maximum r^2 with any tag SNP is strictly less than this threshold.

The default r^2 threshold is 1. To change the r^2 threshold, use the command-line flag -r2 followed by the threshold value. For example, to use a more stringent threshold of .9, you could type the command

    ./ATRIUM -r2 .9
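
To see how the two thresholds interact with the multilocus LD database file (item 4), the sketch below reads one database record, decodes column (11), and applies both strict comparisons. It is written in Python for illustration only and is not ATRIUM's own code. The function names are hypothetical, and it assumes that the M_D value compared with -md is column (8) and that the r^2 value compared with -r2 is column (7); the text above does not state which columns ATRIUM uses.

```python
def decode_haplotype(code, n_tags):
    """Expand an integer haplotype code into a 0/1 string over the tag SNPs.

    The first tag SNP listed in column (10) is the lowest-order bit, so with
    four tag SNPs code 1 is "1000" and code 2 is "0100".
    """
    return "".join(str((code >> i) & 1) for i in range(n_tags))


def parse_column11(field, n_tags):
    """Map each tag SNP haplotype to entry 4, the estimated P(allele 1 | H)."""
    probs = {}
    for subfield in field.split("_"):
        # Entries 2 and 3 are counts; only entry 4 is kept here.
        code, _allele1_count, _total_count, prob = subfield.split(":")
        probs[decode_haplotype(int(code), n_tags)] = float(prob)
    return probs


def parse_record(line):
    """Parse one whitespace-separated, 11-column database record."""
    f = line.split()
    n_tags = int(f[8])
    return {
        "marker": f[0],
        "typed": f[1] == "1",
        "max_r2": float(f[6]),  # column (7), assumed to be the -r2 quantity
        "md": float(f[7]),      # column (8), assumed to be the -md quantity
        "tags": f[9].split(":"),
        "p_allele1": parse_column11(f[10], n_tags),
    }


def passes_thresholds(md, max_r2, md_threshold=0.4, r2_threshold=1.0):
    """Both comparisons are strict, as described for -md and -r2."""
    return md > md_threshold and max_r2 < r2_threshold


# Third record from the example database file in item 4.
record = parse_record(
    "rs2379071 0 116237 A G 0.4012 0.2414 0.3699 2 rs9419560:rs2060138 "
    "0:9:30:0.3000_1:3:3:1.0000_2:2:86:0.0233_3:1:1:1.0000"
)
print(record["p_allele1"])
print(passes_thresholds(record["md"], record["max_r2"]))
print(passes_thresholds(record["md"], record["max_r2"], md_threshold=0.3))
```

Under these assumptions, the example SNP rs2379071 (column (8) equal to 0.3699) would not be tested at the default M_D threshold of 0.4, but would be tested if the threshold were lowered to 0.3.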