Search Procedures

This file describes the search procedure for
likelihood/quasi-likelihood maximization as well as the search
parameters max_res, E_int, max_cand and map_res.  (Advice about
setting these parameters is given in "Tips".)

To implement the DHS method for LD mapping, we seek to maximize the
likelihood (or quasi-likelihood) over 1/tau, p, ancestral haplotype,
and variant location, simultaneously.  For a given ancestral haplotype
and variant location, we maximize the likelihood over 1/tau and p
using the Baum/E-M algorithm in a hidden Markov framework, as
described in McPeek and Strahs (1999).  To maximize over all
parameters simultaneously, we implement a directed search over
ancestral haplotype and variant location, maximizing the likelihood
over 1/tau and p for each combination, and choosing the set of
parameters for which the likelihood is highest.  The search over
variant location is straightforward, as the maximized likelihood and
maximizing parameter values change sufficiently smoothly with location
to make a grid search feasible.  Thus, our search strategies focus on
the problem of searching over ancestral haplotype.  Note that the
number of possible ancestral haplotypes is m^n for n loci each
with m alleles.  Thus, an exhaustive search quickly becomes infeasible
as n grows.
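
To give a feel for how quickly m^n grows, the count can be tabulated
with a few lines of Python.  This is only an illustration; the
function name and the choice of m=4 alleles per locus are assumptions
made for the example and are not part of DHSMAP.

```python
def n_ancestral_haplotypes(m, n):
    """Number of distinct haplotypes for n loci with m alleles each (m^n)."""
    return m ** n

if __name__ == "__main__":
    # m=4 is an arbitrary illustrative value, not a DHSMAP default.
    for n in (5, 10, 20, 40):
        print(n, n_ancestral_haplotypes(4, n))
```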

We currently implement a three-stage search procedure that we find
performs well in practice.  It is based on the following observations:
(1) Ancestral haplotype estimation is generally much easier around
the peak of the likelihood curve (i.e., when the parameter
representing variant location is close to its maximizing value) than
in an area of very low likelihood (i.e., when the parameter
representing variant location is set to a value for which the
likelihood will be low when maximized over the other parameters), and
(2) the set of best ancestral haplotypes across different locations of
the variant is generally quite small.  A central strategy of the first
two stages of our approach to ancestral haplotype estimation is the
idea of growing the haplotype out from a given location.  That is, we
fix a site, and consider all 2-locus haplotypes for the 2 markers
flanking that site.  We rank them by log-likelihood and keep the best
"max_cand" (as in "maximum number of candidate ancestral haplotypes").
Then we add the next-nearest marker and consider all possible
haplotypes obtained by combining any of the best max_cand haplotypes
at the first 2 markers with any allele at the 3rd marker, and we keep
the best max_cand of those, and so on.  At the last step, we take the
best haplotype from among those obtained by combining any of the best
max_cand haplotypes at the first n-1 markers with any allele at the
nth marker. We call the above procedure "growing the haplotype" from
the given site.
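
The haplotype-growing procedure described above can be sketched as a
beam search of width max_cand.  The following Python is a minimal
sketch under stated assumptions, not the DHSMAP implementation: the
function names, the dict representation of a partial haplotype, and
the user-supplied "score" callable (standing in for the log-likelihood
maximized over 1/tau and p) are all hypothetical.  Marker positions
are assumed to be sorted in increasing order, and at least two markers
are assumed.

```python
from itertools import product

def marker_order(positions, site):
    """Marker indices in the order they are added: the two markers
    flanking `site` first, then the rest by distance from `site`."""
    left = [i for i, p in enumerate(positions) if p <= site]
    right = [i for i, p in enumerate(positions) if p > site]
    first = ([left[-1]] if left else []) + ([right[0]] if right else [])
    rest = sorted((i for i in range(len(positions)) if i not in first),
                  key=lambda i: abs(positions[i] - site))
    return first + rest

def grow_haplotype(positions, alleles, site, score, max_cand):
    """Grow a haplotype out from `site`, keeping the best max_cand
    partial haplotypes at each step.

    alleles[i] lists the alleles at marker i.  score(partial) is a
    hypothetical stand-in for the maximized log-likelihood of a partial
    haplotype given as a dict {marker index: allele}.
    """
    order = marker_order(positions, site)
    a, b = order[0], order[1]
    # All 2-locus haplotypes at the first two markers.
    cands = [{a: x, b: y} for x, y in product(alleles[a], alleles[b])]
    cands = sorted(cands, key=score, reverse=True)[:max_cand]
    # Add one marker at a time, extending every retained candidate
    # by every allele at the new marker.
    for m in order[2:]:
        ext = [{**h, m: x} for h in cands for x in alleles[m]]
        cands = sorted(ext, key=score, reverse=True)[:max_cand]
    best = cands[0]
    hap = tuple(best[i] for i in range(len(positions)))
    return hap, score(best)
```

With m alleles at each of n markers, this sketch scores at most
m^2 + (n-2)*max_cand*m partial haplotypes, compared with m^n for an
exhaustive search.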
 
In the first stage of our three-stage approach, we put the variant at
each position on a coarse grid.  The points of the grid are determined
by the marker map and the parameter "max_res" (given in cM) as
follows: between markers l and l+1, there are s(d/(max_res)) evenly
spaced points, where d is the distance (cM) between markers l and l+1
and s(x) is the smallest integer greater than x for any real x.  (Note
on terminology: max_res can be thought of as defining an upper bound
on the distance between grid points or a lower bound on resolution, so
might be more aptly called "min_res".)  At each point of the grid, we
perform the above haplotype-growing procedure, in each case growing
the haplotype from the putative position of the variant.  From this,
we obtain, for each position of the variant, an estimated ancestral
haplotype and a corresponding log-likelihood.  Let t be the position
of the variant for which the corresponding log-likelihood is the
largest.  In the second stage, we again put the variant at each
position on the coarse grid and perform the above haplotype-growing
procedure, but this time we always grow the haplotype from the fixed
location t, instead of from the putative variant position.  If the
approximate location of the trait-associated variant is known, the
user may instead specify an interval ("E_int"), in which case the
haplotype is grown from the midpoint of that interval rather than
from t.  From this, we obtain, for each position of the variant, a
second estimated ancestral haplotype.  In the third stage, we define
the set S to consist of all ancestral haplotypes estimated in the
first or second stage.  Then, we put the
variant at each position on a fine grid and maximize the likelihood
over S for each position of the variant.  Alternatively, the user can
specify the set S of ancestral haplotypes over which to maximize,
bypassing stages 1 and 2 of the procedure.  The fine grid of locations
used in the third stage is determined by the user through max_res
and "map_res" ("mapping resolution").  Between markers l and l+1,
DHSMAP maximizes the likelihood for s(d/(max_res))*[map_res] variant
locations. For example, if the distance between 2 markers is 0.25 cM,
max_res=0.2 cM, and map_res=20, DHSMAP will maximize the likelihood at
2*20=40 locations between the markers.
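
The grid-size arithmetic above can be checked with a short Python
sketch.  The function names are hypothetical, and s(x) is implemented
literally as defined in the text (the smallest integer strictly
greater than x), so an exact ratio such as 3 gives 4 rather than 3.
Exact fractions are used so that a ratio like 0.6/0.2 is not thrown
off by floating-point rounding.  Only the number of points is
computed; where the points fall within the interval is not specified
here.

```python
import math
from fractions import Fraction

def s(x):
    """Smallest integer strictly greater than x, as defined in the text."""
    return math.floor(x) + 1

def coarse_count(d, max_res):
    """Number of coarse-grid points between two markers d cM apart."""
    ratio = Fraction(str(d)) / Fraction(str(max_res))
    return s(ratio)

def fine_count(d, max_res, map_res):
    """Number of fine-grid variant locations between the two markers."""
    return coarse_count(d, max_res) * map_res

if __name__ == "__main__":
    # The example from the text: d=0.25 cM, max_res=0.2 cM, map_res=20.
    print(coarse_count(0.25, 0.2), fine_count(0.25, 0.2, 20))
```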

As a result of stage 3, we obtain the maximized likelihood and
maximizing parameter values for every putative location of the variant
on the fine grid.  These are used for estimation and confidence
intervals for location of the variant, as well as for estimation of
the amount of LD and degree of heterogeneity.  In the examples we have
considered, including some very complex cases, we have found this
procedure to work well.
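
The overall flow of the three stages can be summarized in the
following Python sketch.  It is an outline under stated assumptions
rather than DHSMAP's code: "grow(site, variant_pos)" and
"loglik(hap, variant_pos)" are hypothetical callables supplied by the
caller, haplotypes are assumed to be hashable (e.g. tuples), and
E_int is assumed to be given as a (low, high) pair of map positions.

```python
def three_stage_search(coarse_grid, fine_grid, grow, loglik, e_int=None):
    """Return {variant position on fine grid: (best haplotype, log-likelihood)}.

    grow(site, variant_pos) -> (haplotype, log-likelihood)   [hypothetical]
    loglik(haplotype, variant_pos) -> log-likelihood          [hypothetical]
    """
    # Stage 1: grow from the putative variant position itself.
    stage1 = {x: grow(x, x) for x in coarse_grid}

    # Growth site for stage 2: midpoint of E_int if given,
    # otherwise the stage-1 position t with the largest log-likelihood.
    if e_int is not None:
        t = (e_int[0] + e_int[1]) / 2
    else:
        t = max(stage1, key=lambda x: stage1[x][1])

    # Stage 2: same coarse grid, but always grow from the fixed site t.
    stage2 = {x: grow(t, x) for x in coarse_grid}

    # Stage 3: maximize over the set S of all haplotypes found so far,
    # at every variant position on the fine grid.
    S = {h for h, _ in stage1.values()} | {h for h, _ in stage2.values()}
    result = {}
    for x in fine_grid:
        best = max(S, key=lambda h: loglik(h, x))
        result[x] = (best, loglik(best, x))
    return result
```

The returned dictionary holds one entry per fine-grid position; the
position with the largest log-likelihood gives the point estimate of
variant location in this sketch.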