A Bayesian regression model for genome-wide summary data
What are genome-wide summary data?
Data: phenotype \(Y\) and genotype \(X\)
Size: \(n\) (>10K) individuals and \(p\) (>1M) variants (SNPs)
\[
Y:= \left[ \begin{array}{c}
y_{1} \\
y_{2} \\
\vdots \\
y_{n} \end{array} \right]
~~~
X:= \left[ \begin{array}{ccccc}
x_{11} & \ldots & x_{1j} & \ldots & x_{1p}\\
x_{21} & \ldots & x_{2j} & \ldots & x_{2p}\\
\vdots & \ddots & \vdots & \ddots & \vdots \\
x_{n1} & \ldots & x_{nj} & \ldots & x_{np}\end{array} \right]
\]
Model: single-SNP analysis
\[
\left[ \begin{array}{c}
y_{1} \\
y_{2} \\
\vdots \\
y_{n} \end{array} \right]
\sim
\left[ \begin{array}{c}
x_{1j} \\
x_{2j} \\
\vdots \\
x_{nj} \end{array} \right]
~~ \leadsto ~~
\left\{ \begin{array}{cl}
\hat{\beta}_j: & \textsf{marginal effect estimate} \\
\hat{\sigma}_j: & \textsf{standard error of}~\hat{\beta}_j
\end{array}\right.
\]
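As a rough illustration of how the two summary statistics above arise, the sketch below regresses a phenotype on a single SNP by ordinary least squares and returns the marginal effect estimate and its standard error. The function name `single_snp_summary`, the inclusion of an intercept, and the toy data are assumptions made for this example; the slides do not specify how the summary data were computed.

```python
import math


def single_snp_summary(y, x):
    """Simple linear regression of phenotype y on one SNP's genotypes x.

    Returns (beta_hat, se): the marginal effect estimate and its standard
    error. Fitting an intercept is an assumption of this sketch.
    """
    n = len(y)
    if n != len(x) or n < 3:
        raise ValueError("need equal-length inputs with at least 3 individuals")
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    # Centered sum of squares and cross-products.
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    if sxx == 0:
        raise ValueError("genotype has no variation")
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    beta_hat = sxy / sxx
    intercept = y_bar - beta_hat * x_bar
    # Residual sum of squares, with n - 2 residual degrees of freedom.
    rss = sum((yi - intercept - beta_hat * xi) ** 2 for xi, yi in zip(x, y))
    se = math.sqrt(rss / (n - 2) / sxx)
    return beta_hat, se


# Toy data (made up): genotypes coded as 0/1/2 copies of an allele.
genotypes = [0, 1, 2, 0, 1, 2]
phenotypes = [1.0, 2.1, 2.9, 1.2, 1.9, 3.1]
beta_hat, se = single_snp_summary(phenotypes, genotypes)
print(beta_hat, se)
```

Repeating this for each column \(j\) of \(X\) gives one pair \((\hat{\beta}_j, \hat{\sigma}_j)\) per SNP, which is all that the summary data retain.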
We are interested in a statistical problem motivated by genetics.
Why do we consider a multiple-SNP model \(Y \sim X_1 + X_2 + \ldots + X_p\)?
Review: Multivariate linear models for GWAS (Sabatti, 2013).
References: Moser et al. (2015); Loh et al. (2015); Bottolo et al. (2013); Zhou et al. (2013); Carbonetto & Stephens (2012); Guan & Stephens (2011); Kang et al. (2010); Logsdon et al. (2010); Yang et al. (2010); Hoggart et al. (2008); Wu et al. (2009); Servin & Stephens (2007).
Example: SNP heritability of adult height
SNP heritability (genetics) \(\longleftrightarrow\) \(R^2\) (statistics)
- RSS on summary data (# of SNPs 1.1M; sample size 253K): 52.1%, [50.3%, 53.9%]
- Linear mixed models on full data (# of SNPs 1.1M; sample size 6K): 49.8%, [41.2%, 58.4%]
References: Wood et al. (2014); Yang et al. (2011).
Having a likelihood opens the door to various applications in genetics.
Eg.1: Estimate genetic architecture
Eg.2: Assess gene set enrichment (Today’s Focus)
Eg.3: Partition heritability by annotations
What is gene set enrichment?
- Phenotype: low-density lipoprotein (Teslovich et al. 2010)
- Pathway: chylomicron-mediated lipid transport (17 genes)
- Annotation: Is the SNP near a pathway gene? (yes or no)
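The yes/no annotation above can be sketched as a simple interval check. This is a minimal illustration only: the function name `annotate_snps`, the 100 kb window used to define "near", and the coordinates are all hypothetical and not taken from the slides.

```python
def annotate_snps(snps, genes, window=100_000):
    """Flag each SNP as 1 if it lies within `window` base pairs of any gene.

    snps:  list of (chromosome, position)
    genes: list of (chromosome, start, end)
    The window size is an assumed definition of "near".
    """
    flags = []
    for chrom, pos in snps:
        near = any(
            chrom == g_chrom and g_start - window <= pos <= g_end + window
            for g_chrom, g_start, g_end in genes
        )
        flags.append(1 if near else 0)
    return flags


# Made-up coordinates for one pathway gene and three SNPs.
pathway_genes = [("4", 1_000_000, 1_050_000)]
snps = [("4", 950_000), ("4", 1_200_000), ("5", 1_020_000)]
flags = annotate_snps(snps, pathway_genes)
print(flags)
```

Only the first SNP is flagged here: the second is on the right chromosome but outside the window, and the third is on a different chromosome.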
Recent Reviews: de Leeuw et al. (2016); Pers (2016); Mooney et al. (2014); Wang et al. (2010).
We apply the method to 31 phenotypes and 3977 gene sets.
This application is not small:
# of Parameters = 31 \(\times\) (3913+64) \(\times\) 1.1 Million \(\approx\) 136 Billion
- 31 human phenotypes
- 3913 biological pathways curated by experts
- 64 tissue-based gene sets derived from GTEx data
- 1.1 million common SNPs
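The parameter count can be checked with a few lines of arithmetic. The variable names are illustrative, and "1.1 million" is treated as exactly 1,100,000 for the purpose of this check.

```python
n_phenotypes = 31
n_pathways = 3913
n_tissue_sets = 64
n_snps = 1_100_000  # "1.1 million" taken as exactly 1.1e6 for illustration

n_gene_sets = n_pathways + n_tissue_sets
n_parameters = n_phenotypes * n_gene_sets * n_snps

print(n_gene_sets)                 # total number of gene sets
print(n_parameters)                # total number of parameters
print(round(n_parameters / 1e9))   # in billions, rounded
```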
One graduate student can get this done, aided by:
- Publicly available summary data
- Variational Bayes algorithms
- Banded matrix approximation
- Parallel computing
- Hierarchical data format (HDF5)
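To give a sense of why a banded matrix approximation saves work, the sketch below zeroes every entry of a square matrix that lies more than a fixed number of positions from the diagonal and counts how many entries remain. It assumes the matrix in question is a SNP-by-SNP correlation matrix, which the slide does not state; the function names, the bandwidth, and the toy matrix are likewise hypothetical.

```python
def band(matrix, bandwidth):
    """Return a copy of a square matrix with entries beyond the band set to 0."""
    p = len(matrix)
    return [
        [matrix[i][j] if abs(i - j) <= bandwidth else 0.0 for j in range(p)]
        for i in range(p)
    ]


def entries_in_band(p, bandwidth):
    """Number of entries kept by a band of the given width in a p x p matrix."""
    return sum(
        1 for i in range(p) for j in range(p) if abs(i - j) <= bandwidth
    )


# Toy correlation matrix whose entries decay with distance between SNPs.
p = 5
full = [[0.5 ** abs(i - j) for j in range(p)] for i in range(p)]
banded = band(full, bandwidth=1)

print(banded[0])                 # first row after banding
print(entries_in_band(p, 1))     # entries kept, out of p * p
```

For fixed bandwidth the number of retained entries grows linearly in the number of SNPs rather than quadratically, which is the point of the approximation at the million-SNP scale.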
Example: Blood lipid & MTTP gene
References: Global Lipids Genetics Consortium (2013); Teslovich et al. (2010).
Example: Alzheimer’s disease & Liver
References: The GTEx Consortium (2015); Lambert et al. (2013).
What did I learn from this project?
Basic concepts help solve modern problems.
- Bayes’ theorem; Likelihood function
Data integration is a powerful idea.
- Bayesian hierarchical modelling
Fast computation is important for large-scale applications.
- Variational inference; Parallel computing; HDF5
Association does not imply causation.
- Short-term goal: checking bias and confounding
- Long-term goal: learning “causality” from data
What’s next?
Develop a user-friendly black-box R package
Apply the methods to new datasets
We are glad to help!
Acknowledgments
Joint work with M. Stephens (Thesis Advisor)
Discussions with X. He and P. Carbonetto
N. Knoblauch: R package development
M. Turchin: red blood cell data
K. Dey: GTEx clustering results