Mary Sara McPeek
Professor, Departments of Statistics and Human Genetics, and the College;
Member, Committee on Genetics, Genomics and Systems Biology; Senior Fellow,
Computation Institute
My research focuses on applications of probability and statistics to
genetics and molecular biology. Following are some of my recent and ongoing
projects:
I. Case-control association testing with related individuals (Bourgain
et al. (2003) Am J Hum Genet 73: 612-626; Thornton and McPeek (2007) Am
J Hum Genet 81:321-337; CC-QLS and MQLS software available on this site)
In Bourgain et al. (2003), we developed a quasi-likelihood score (QLS) method for case-control
association testing in samples that contain related individuals. The test
statistic is constructed based on null and alternative means and the null
covariance matrix of a function of genotype indicators. Choice of an alternative
mean model affects the power, but not the validity of the test. The alternative
mean model used in the WQLS of Bourgain et al. (2003) is based on a simple
case-control allele frequency difference. We implemented our method in
a computationally efficient algorithm, and we applied it to a Hutterite
sample (from an isolated population with a large, inbred pedigree) in which
we detected a highly significant novel association between atopy (an asthma-related
phenotype) and an amino-acid polymorphism in the P-selectin gene. We demonstrated
that, for the chosen alternative mean model, our QLS test is asymptotically
locally most powerful in a general class of linear tests.
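As a rough illustration of how a statistic of this kind can be assembled from a
null mean, an alternative mean and a null covariance matrix, here is a minimal
Python sketch. It is not the CC-QLS code. It assumes that the genotype of
individual i is coded as y_i = (copies of the tested allele)/2, that the null
covariance is proportional to a matrix phi built from kinship and inbreeding
coefficients, and that the alternative mean shifts the allele frequency in cases
only. The function and argument names are hypothetical.

```python
import numpy as np

def qls_case_control_statistic(y, is_case, phi):
    """Sketch of a quasi-likelihood score statistic for one biallelic marker.

    y       : length-n sequence, (copies of the tested allele) / 2
    is_case : length-n sequence of booleans, True for cases
    phi     : n x n matrix; off-diagonal entries are 2 * kinship coefficient,
              diagonal entries are 1 + inbreeding coefficient (assumed known)
    """
    y = np.asarray(y, dtype=float)
    c = np.asarray(is_case, dtype=float)
    phi = np.asarray(phi, dtype=float)
    ones = np.ones_like(y)

    phi_inv_ones = np.linalg.solve(phi, ones)   # Phi^{-1} 1
    phi_inv_case = np.linalg.solve(phi, c)      # Phi^{-1} 1_case

    # Null mean: generalized least squares estimate of the allele frequency.
    p_hat = (phi_inv_ones @ y) / (phi_inv_ones @ ones)
    # Null variance scale for y_i coded as allele count / 2.
    sigma2 = 0.5 * p_hat * (1.0 - p_hat)

    # Score for the assumed alternative mean p * 1 + r * 1_case, evaluated at r = 0.
    score = phi_inv_case @ (y - p_hat * ones)
    information = c @ phi_inv_case - (c @ phi_inv_ones) ** 2 / (ones @ phi_inv_ones)
    return score ** 2 / (sigma2 * information)
```

Under these assumptions the statistic would be referred to a chi^2 distribution
with 1 degree of freedom. With phi equal to the identity matrix (unrelated,
non-inbred individuals) it reduces to a comparison of the case allele frequency
with the overall allele frequency.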
We followed up on this work with Thornton and McPeek (2007), in which
the major development was a novel construction of an alternative mean model
with a direct connection to genetic models (a reversal of conditioning
under the assumption of a very general mode of inheritance with a small
effect of the locus on the trait). In the resulting alternative mean model
for the genotype indicators, the expected frequency of a predisposing allele
in an individual depends not only on the individual's phenotype, but on
the phenotypes of relatives as well. This is a desirable property because
complex genetic models imply an enrichment for predisposing variants in
affected individuals with affected relatives compared to affected individuals
without affected relatives. Our resulting MQLS case-control association
test has optimality properties similar to those in Bourgain et al. (2003) but
for the improved alternative model, leading to a substantial power improvement
in simulations under various multilocus models. At the same time the MQLS
test retains the appealing computational simplicity of the method of Bourgain
et al. (2003). Other properties of the MQLS include: (1) it is applicable
to completely general combinations of family and case-control designs,
including samples from isolated founder populations; (2) it can incorporate
both unaffected controls and controls of unknown phenotype into the same
analysis; and (3) it can incorporate phenotype information on relatives
with missing genotype data. Using the method to reanalyze the GAW 14 COGA
data, we detected highly significant association to an alcoholism-related
phenotype for four different SNPs. Three of these four significant associations
were not detected in previous studies. Our software, including source code,
is freely available on my website: the CC-QLS package implements the methods
of Bourgain et al. (2003), and the MQLS package is an expanded version
of CC-QLS which incorporates the methods of Thornton and McPeek (2007).
II. Multipoint linkage disequilibrium mapping with block haplotype structure
(Zheng and McPeek (2004) Springer Lecture Notes in Computer Science 2983:113-123;
Zheng and McPeek (2007) Am J Hum Genet 80:112-125)
In Zheng and McPeek (2004), we developed a class of hidden Markov models
for background LD based on the block structure of haplotypes, and we fit
these models to dense SNP data from an outbred Caucasian population. Our
class of models allows a fairly general graph structure of preferred and
non-preferred transitions based on the haplotype block structure. It allows
for common haplotypes and uncommon haplotypes in each block, and it captures
the idea of ancestral haplotypes. We use a parametric bootstrap approach
to assess goodness of fit, which allows a wide latitude in choice of test
statistic. We implemented an additional layer of Monte Carlo to assess
the type I error of the parametric bootstrap procedure for assessment of
goodness of fit.
In Zheng and McPeek (2007), we followed up on this work by applying our
models of background LD based on block haplotype structure to the problem
of multipoint LD mapping from dense SNP data in case-control samples from
an outbred population. We developed a virtual variant approach that characterizes
untyped SNPs by various partitions of the set of haplotypes within a block
into two disjoint subsets, corresponding to two alleles. We demonstrated
that the virtual variant method greatly increases power for detection of
untyped common variants associated with a trait. Because full multipoint
LD mapping can be slow, we exploited the haplotype block information to
develop a fast single-block multipoint mapping method. Our methods are
appropriate for genotype data and take into account the uncertainty in
phase. Our simulations indicate that the most important gains from taking
into account the haplotype block structure at the analysis stage of multipoint
LD mapping come from (1) greatly increased power to detect association
with untyped variants, and (2) greatly improved localization of untyped
variants associated with the trait.
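To make the idea of a two-allele partition of block haplotypes concrete, the
following Python sketch enumerates every way of splitting a set of distinct
haplotypes into two non-empty subsets. This is only an illustration: the text
above says that "various partitions" are used, and the sketch simply assumes all
of them are of interest. The function name and the toy haplotype labels are
hypothetical.

```python
from itertools import combinations

def two_allele_partitions(haplotypes):
    """List every split of distinct haplotypes into two non-empty subsets.

    Each split plays the role of a candidate untyped biallelic variant:
    haplotypes in the first subset carry one allele, the rest carry the other.
    """
    haps = list(haplotypes)
    first, rest = haps[0], haps[1:]
    partitions = []
    # Keep the first haplotype in subset A so each unordered split appears once.
    # 'size' stops before len(rest) so that subset B is never empty.
    for size in range(len(rest)):
        for chosen in combinations(rest, size):
            subset_a = (first,) + chosen
            subset_b = tuple(h for h in rest if h not in chosen)
            partitions.append((subset_a, subset_b))
    return partitions

# Toy block with four distinct haplotypes over three SNPs.
for subset_a, subset_b in two_allele_partitions(["000", "011", "101", "110"]):
    print(subset_a, "|", subset_b)
```

A block with k distinct haplotypes yields 2^(k-1) - 1 such splits, so exhaustive
enumeration is only practical when the number of haplotypes per block is small.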
III. Multipoint linkage disequilibrium mapping by the decay of haplotype
sharing (McPeek and Strahs (1999) Am J Hum Genet 65:858-875; Strahs and
McPeek (2003) Festschrift for Terry Speed pp. 343-366; Zhang, Schneider,
Ober, McPeek (2005) Genet Epi 29:128-140; DHSMAP and DHSMAP_PVM software
available on this site)
In McPeek and Strahs (1999), we proposed a multipoint approach to linkage
disequilibrium mapping. For each individual, a likelihood for multilocus
data is calculated, while incorporation of dependence of recombinational
history among related individuals is based on estimating equations that
can be thought of as generalizations of both quasi-likelihood and composite
likelihood. McPeek and Strahs (1999) assumed a conditional coalescent model
for the relationships among individuals.
In Strahs and McPeek (2003), we addressed the problems of (1) modeling
background LD in an outbred population and (2) incorporating the background
LD model into our decay of haplotype sharing method in outbred samples.
We developed a Markov model of order 2 for background LD in haplotypes
of moderately dense SNPs, and we developed a hidden Markov implementation
of the model for use with unphased genotype data in our decay of haplotype sharing
method. We used the AIC and BIC model selection criteria to compare models
of background LD and found that the Markov(2) model provided a major improvement
over a Markov(1) model. Within the context of the decay of haplotype sharing
method, we demonstrated the importance of appropriate modeling of background
LD, and we developed a mapping-in-controls diagnostic to detect whether
lack of fit of the background model may be influencing
the analysis. Software for the method, including source code, is freely
available on this site.
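For readers unfamiliar with the model selection criteria mentioned in the
Strahs and McPeek (2003) summary, the standard definitions of AIC and BIC can be
written as the short Python sketch below. The log-likelihoods, parameter counts
and sample size in the example are invented solely to exercise the functions and
are not results from the paper.

```python
import math

def aic(log_likelihood, n_params):
    """Akaike information criterion; smaller values are preferred."""
    return 2 * n_params - 2 * log_likelihood

def bic(log_likelihood, n_params, n_obs):
    """Bayesian information criterion; smaller values are preferred."""
    return n_params * math.log(n_obs) - 2 * log_likelihood

# Hypothetical fits of two background-LD models to the same data.
fits = {
    "Markov(1)": {"log_likelihood": -1250.0, "n_params": 40},
    "Markov(2)": {"log_likelihood": -1100.0, "n_params": 80},
}
n_obs = 200  # hypothetical number of observations

for name, fit in fits.items():
    print(name,
          aic(fit["log_likelihood"], fit["n_params"]),
          bic(fit["log_likelihood"], fit["n_params"], n_obs))
```

Both criteria trade goodness of fit against the number of free parameters; BIC
penalizes additional parameters more heavily once the sample size exceeds about
seven observations.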
In Zhang et al. (2005), we addressed the problem of multilocus linkage
disequilibrium (LD) mapping of a trait-associated variant from case-control
samples in which some individuals may be related, with special attention
to the extreme case of an isolated founder population. Our method, which
we call DHS-R, is an extension of our previous decay of haplotype sharing
(DHS) method. The DHS-R method shares the main features of the DHS method:
(i) it allows construction of a confidence interval for the location of
a trait-associated variant; (ii) it allows for missing observations and
unphased genotype data, with the uncertainty in the haplotypes taken into
account in the analysis; (iii) it allows for heterogeneity, mutation, recombination,
and background LD. The main advances of the DHS-R are (i) the ability to
include individuals of arbitrary known relationship (including inbreeding)
in the case and control samples; (ii) an extension to allow partially-phased
haplotypes derived from case-parent trio genotype data; and (iii) an extension
to allow for genotyping error in the model. Our method, which uses a hidden
Markov model for likelihood calculation and maximization, has the advantage
of being computationally feasible even in a large, complex pedigree. Simulations
based on a 13-generation, 1623-member Hutterite pedigree demonstrated accurate
coverage of the confidence intervals for location of the variant. We applied
the method to fine-mapping of a susceptibility locus for the asthma-associated
phenotype, bronchial hyperresponsiveness (BHR), in the Hutterites, on a
region of chromosome 19.
IV. Application of quasi-likelihood to testing for Hardy-Weinberg in
samples with related individuals (Bourgain, Abney, Schneider, Ober, McPeek
(2004) Genetics 168:2349-2361)
In Bourgain et al. (2004), we demonstrated that when the classical chi^2
goodness-of-fit test for Hardy-Weinberg equilibrium (HWE) is used on samples
with related individuals, the type I error can be greatly inflated. In
particular the test is inappropriate in population isolates where the individuals
are related through multiple lines of descent. In Bourgain et al. (2004),
we proposed a novel quasi-likelihood score (QLS) test of HWE suitable for
any sample with related individuals. Performed conditional on the pedigree
structure, our test detects departures from HWE that are not due to the
genealogy.
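For reference, the classical chi^2 goodness-of-fit test discussed above (not
our QLS test) can be sketched in Python as follows. The sketch assumes a
biallelic marker, unrelated individuals, and nonzero expected genotype counts;
the function names are illustrative.

```python
import math

def hwe_chi_square(n_aa, n_ab, n_bb):
    """Classical 1-df chi^2 statistic for HWE from genotype counts AA, AB, BB."""
    n = n_aa + n_ab + n_bb
    p = (2 * n_aa + n_ab) / (2 * n)   # sample frequency of allele A
    q = 1.0 - p
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    observed = (n_aa, n_ab, n_bb)
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

def chi_square_1df_pvalue(statistic):
    """Upper tail probability of a chi^2 variable with 1 degree of freedom."""
    return math.erfc(math.sqrt(statistic / 2.0))

stat = hwe_chi_square(30, 40, 30)
print(stat, chi_square_1df_pvalue(stat))
```

The p-value from this test is only trustworthy when the sampled individuals are
independent, which is exactly the assumption that fails in samples with related
individuals.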
V. Best linear unbiased estimation of allele frequencies (McPeek, Wu,
Ober (2004) Biometrics 60:359-367)
In McPeek et al. (2004), we addressed the problem of efficient allele
frequency estimation in an isolated founder population in which all individuals
are related by a large, complex pedigree with multiple inbreeding loops.
We developed a quasi-likelihood (QL) estimator, which for this problem
is also the best linear unbiased estimator, where the QL estimator weights
the individuals based on their kinship to all the other individuals in
the sample. We developed and implemented an efficient algorithm for computing
the estimate and its variance, and we applied our method to allele frequency
estimation in (1) a Hutterite data set containing over 800 individuals
related by a 13-generation 1623-person pedigree as well as in (2) an outbred
sample of 996 individuals drawn from 85 moderate-size pedigrees. Notably,
our QL estimator performs very similarly to the maximum likelihood estimator
(when it is feasible to calculate the latter), but is substantially easier
to calculate, making it feasible to use for large numbers of markers even
in large, complex pedigrees. In the context of high-density scans, its
accuracy and computational efficiency make it a valuable tool in samples
composed of moderate-size pedigrees as well. Our software, including source
code, is freely available on this site as part of the CC-QLS package.
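The kinship-based weighting described above can be illustrated with a small
Python sketch of a generalized least squares estimator. This is not the code in
the CC-QLS package. It assumes genotypes are coded as (copies of the allele)/2
and that the covariance of these values is proportional to a known matrix phi
(off-diagonal entries 2 * kinship coefficient, diagonal entries 1 + inbreeding
coefficient). All names are hypothetical.

```python
import numpy as np

def blue_allele_frequency(y, phi):
    """Sketch of a best linear unbiased estimate of an allele frequency.

    Returns the estimate, a plug-in variance estimate, and the weight
    given to each individual (the weights sum to 1).
    """
    y = np.asarray(y, dtype=float)
    phi = np.asarray(phi, dtype=float)
    ones = np.ones_like(y)
    phi_inv_ones = np.linalg.solve(phi, ones)   # Phi^{-1} 1
    total = ones @ phi_inv_ones                 # 1' Phi^{-1} 1
    weights = phi_inv_ones / total
    p_hat = weights @ y
    var_hat = 0.5 * p_hat * (1.0 - p_hat) / total
    return p_hat, var_hat, weights

# Toy sample: two full siblings (kinship 1/4) and one unrelated individual.
phi = [[1.0, 0.5, 0.0],
       [0.5, 1.0, 0.0],
       [0.0, 0.0, 1.0]]
p_hat, var_hat, weights = blue_allele_frequency([1.0, 1.0, 0.0], phi)
print(p_hat, var_hat, weights)
```

In this toy example each sibling receives weight 2/7 and the unrelated
individual receives weight 3/7, so related individuals are downweighted
relative to a simple sample average.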
VI. Identification of polymorphisms that explain a linkage result (Sun,
Cox, McPeek (2002) Am J Hum Genet 70:399-411; STEPC software freely available
on the web)
In Sun et al. (2002), we developed a statistical method for identification
of polymorphisms that explain a linkage result. Given many polymorphic
sites genotyped in a region showing strong linkage with a trait, our goal
is to determine which site or combination of sites in the region influences
susceptibility to the trait. Our approach is to use linkage data to identify
the polymorphisms whose genotypes could fully explain the observed linkage
to the region. The information provided by this analysis is different from
that provided by either linkage or association studies. Our approach is
based on the observation that if a particular site is the only site in
the region that influences the trait, then conditional on the genotypes
at that site for the affected relatives, there should be no unexplained
over-sharing among the affecteds in the region. Our method is applicable
to sibships and allows for a very general model for how the site influences
the trait, including epistasis with unlinked loci, environmental effects
and gene-environment interaction. We perform hypothesis tests and derive
a confidence set for the true causal polymorphic site, under the assumption
that there is only one site in the region influencing the trait. Future
work will initially focus on the problem of multiple causal sites present
in the region.
VII. Analysis of quantitative trait loci in the Hutterites (Abney, McPeek,
Ober (2000) Am J Hum Genet 66:629-650; Abney, McPeek, Ober (2001) Am J
Hum Genet 68:1302-1307; Ober, Abney, McPeek (2001) Am J Hum Genet 69:1068-1079;
Newman et al. (2001) Am J Hum Genet 69:1146-1148; Abney, Ober, McPeek (2002)
Am J Hum Genet 70:920-934; Newman et al. 2003; Newman et al. 2004; Weiss
et al. 2004)
In Abney et al. (2000; 2001; 2002), we developed statistical methods
for analysis of quantitative traits in founder populations. We have applied
the methods to genetic analysis in a Hutterite population. The complexity
of this large inbred pedigree poses special challenges and makes many standard
types of analyses computationally onerous or completely infeasible. At
the same time, certain features of this population make it extremely promising
for genetic analysis of complex traits: a small number of founders, presumably
leading to reduced genetic heterogeneity, and a close-knit social structure and
communal living, which are expected to reduce environmental heterogeneity.
Methods of analysis must generally be tailor-made for application to founder
populations, and major computational problems must often be overcome. We
have developed and implemented variance component methods and linkage disequilibrium
mapping methods designed especially for founder populations. We have also
developed a novel permutation-based assessment of significance that is
applicable to data on related individuals, based on a general class of
matrix decompositions, of which the Cholesky decomposition is a special
case.
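One way to picture a permutation scheme for correlated trait data, using the
Cholesky special case named above, is the Python sketch below. It is an
illustration under stated assumptions rather than the published procedure: it
assumes the null mean vector and the trait covariance matrix sigma have already
been estimated, removes the correlation with the Cholesky factor, permutes the
decorrelated residuals, and then restores the correlation. All names are
hypothetical.

```python
import numpy as np

def permuted_trait_replicates(y, null_mean, sigma, n_perm, seed=0):
    """Generate permutation replicates of a trait vector for related individuals.

    y         : length-n trait values
    null_mean : length-n fitted mean under the null hypothesis (assumed given)
    sigma     : n x n positive definite trait covariance (assumed given)
    """
    rng = np.random.default_rng(seed)
    y = np.asarray(y, dtype=float)
    null_mean = np.asarray(null_mean, dtype=float)
    chol = np.linalg.cholesky(np.asarray(sigma, dtype=float))  # sigma = chol @ chol.T

    # Decorrelate the residuals so that permuting them is reasonable.
    decorrelated = np.linalg.solve(chol, y - null_mean)

    replicates = np.empty((n_perm, y.size))
    for k in range(n_perm):
        shuffled = rng.permutation(decorrelated)
        # Restore the mean and covariance structure of the original data.
        replicates[k] = null_mean + chol @ shuffled
    return replicates
```

A test statistic recomputed on each replicate would then give an empirical null
distribution that respects the dependence among relatives.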
VIII. Relationship inference (McPeek and Sun 2000; Sun, Abney, McPeek
2001; Sun, Wilder, McPeek 2002; McPeek 2002; PREST software freely available
on the web)
Lei Sun and I have developed several approaches for the problem of detecting
relationship errors in pedigrees on the basis of genome screen data collected
for linkage studies. We have developed methods for simple outbred pedigrees
as well as for the much more difficult situation of a large, complex, inbred
pedigree. Part of this work is related to identifiability of hidden Markov
models and efficient methods for determination of the orbits of the group
of symmetries on the hypercube that leave certain sets invariant.
IX. Optical mapping (Tong, Mets, McPeek (2007))
Multi-color optical mapping is a new technique being developed in the
Mets lab at U. of C. to obtain detailed physical maps (indicating relative
positions of various recognition sites) of DNA molecules. We consider a
study design in which the data consist of noisy observations of multiple
copies of a DNA molecule marked with colors at recognition sites. The primary
goal is to estimate a physical map. A secondary goal is to estimate error
rates associated with the experiment, which are potentially useful for
analysis and refinement of the biochemical steps in the mapping procedure.
We propose statistical models for various sources of error and use maximum
likelihood estimation (MLE) to construct a physical map and estimate error
rates. To overcome difficulties arising in the maximization process, a
latent-variable Markov chain version of the model is proposed, and the
EM algorithm is used for maximization. In addition, a simulated annealing
procedure is applied to maximize the profile likelihood over the discrete
space of sequences of colors. We apply the methods to simulated data on
the bacteriophage lambda genome.
X. Other work includes
A. Statistical models for recombination and interference (Speed, McPeek,
Evans (1992) PNAS 89:3103-3106; Evans, McPeek, Speed (1993) Theor Pop Biol
43:80-90; McPeek and Speed (1995) Genetics 139:1031-1044; Zhao, Speed,
McPeek (1995) Genetics 139:1045-1056; Zhao, McPeek, Speed (1995) Genetics
139:1057-1065; Armstrong, McPeek, Speed (2006) Biostatistics 7:374-386)
B. Optimal allele-sharing statistics for genetic mapping of affected
pedigree members (McPeek 1999)
C. Statistical inference for sperm-typing data (Leeflang, McPeek, Arnheim
1996; Grewal et al. 1999; McPeek 1999; Girardet et al. 2000)