Genome-wide enrichment analyses of 31 human traits
Enrichment analysis combines multiple sources of association.
- Genotype-phenotype: genome-wide summary data
- Genotype-genotype: linkage disequilibrium
- Gene-gene: biological pathway, RNA-seq, …
Let’s keep this talk “jargon-free”!
1. Genome-wide summary data
2. Enrichment analysis
What are genome-wide summary data?
Data: phenotype \(Y\) and genotype \(X\)
Size: \(n\) (>10K) individuals and \(p\) (>1M) variants (SNPs)
\[
Y:= \left[ \begin{array}{c}
y_{1} \\
y_{2} \\
\vdots \\
y_{n} \end{array} \right]
~~~
X:= \left[ \begin{array}{ccccc}
x_{11} & \ldots & x_{1j} & \ldots & x_{1p}\\
x_{21} & \ldots & x_{2j} & \ldots & x_{2p}\\
\vdots & \ddots & \vdots & \ddots & \vdots \\
x_{n1} & \ldots & x_{nj} & \ldots & x_{np}\end{array} \right]
\]
Model: single-SNP analysis
\[
\left[ \begin{array}{c}
y_{1} \\
y_{2} \\
\vdots \\
y_{n} \end{array} \right]
\sim
\left[ \begin{array}{c}
x_{1j} \\
x_{2j} \\
\vdots \\
x_{nj} \end{array} \right]
~~ \leadsto ~~
\left\{ \begin{array}{cl}
\hat{\beta}_j: & \textsf{marginal effect estimate} \\
\hat{\sigma}_j: & \textsf{standard error of}~\hat{\beta}_j
\end{array}\right.
\]
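A minimal sketch of the single-SNP analysis above, assuming ordinary least squares with an intercept. The function name `single_snp_summary` and the toy numbers are illustrative and not from the talk. It regresses the phenotype on one genotype column and returns the marginal effect estimate and its standard error; their ratio is the single-SNP z-score.

```python
import math

def single_snp_summary(y, x):
    """Simple linear regression of phenotype y on one SNP's genotypes x.

    Returns (beta_hat, se): the marginal effect estimate and its standard error.
    """
    n = len(y)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Centered sums of squares and cross-products.
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    beta_hat = sxy / sxx
    intercept = mean_y - beta_hat * mean_x
    # Residual variance with n - 2 degrees of freedom (intercept and slope).
    rss = sum((yi - intercept - beta_hat * xi) ** 2 for xi, yi in zip(x, y))
    sigma2 = rss / (n - 2)
    se = math.sqrt(sigma2 / sxx)
    return beta_hat, se

# Toy data (made up): genotypes coded as 0/1/2 copies of an allele.
x_j = [0, 1, 2, 1, 0, 2, 1, 0]
y = [1.0, 2.1, 2.9, 1.8, 1.2, 3.2, 2.0, 0.9]

beta_hat, se = single_snp_summary(y, x_j)
z = beta_hat / se  # single-SNP z-score
print(f"beta_hat={beta_hat:.3f}, se={se:.3f}, z={z:.1f}")
```

Repeating this for each of the \(p\) columns of \(X\) gives the genome-wide summary data: one pair \((\hat{\beta}_j, \hat{\sigma}_j)\) per SNP.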
What is gene set enrichment?
- Phenotype: low-density lipoprotein (Teslovich et al. 2010)
- Pathway: chylomicron-mediated lipid transport (17 genes)
- Annotation: Is the SNP near a pathway gene? (yes or no)
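The yes/no annotation can be sketched as follows. This is only an illustration: the 100 kb window, the function name `annotate_snps`, and all coordinates are assumptions made for the example, not values from the talk.

```python
def annotate_snps(snp_positions, gene_intervals, window=100_000):
    """Return 1 for each SNP within `window` bp of any pathway gene, else 0.

    snp_positions: list of (chromosome, position)
    gene_intervals: list of (chromosome, start, end)
    The window size is a hypothetical choice for illustration.
    """
    annotations = []
    for chrom, pos in snp_positions:
        near = any(
            chrom == g_chrom and g_start - window <= pos <= g_end + window
            for g_chrom, g_start, g_end in gene_intervals
        )
        annotations.append(1 if near else 0)
    return annotations

# Made-up coordinates for two pathway genes and four SNPs.
pathway_genes = [("1", 1_000_000, 1_050_000), ("2", 5_000_000, 5_020_000)]
snps = [("1", 950_000), ("1", 1_200_000), ("2", 5_010_000), ("3", 5_010_000)]

a = annotate_snps(snps, pathway_genes)
print(a)  # [1, 0, 1, 0]
```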
Recent Reviews: de Leeuw et al. (2016); Pers (2016); Mooney et al. (2014); Wang et al. (2010).
The idea is simple, but there are (at least) two statistical issues.
1. We should relax the significance threshold for “green” SNPs, but by how much?
Threshold \(\longleftarrow\) Function (Pathway, Phenotype)
2. The inflated pattern of the “green” curve can be due to correlation between SNPs.
An extreme example:
- SNP 1 has a large genetic effect on a trait.
- SNPs 2-100 have zero effect, but all are in high LD with SNP 1.
- Thus, SNPs 1-100 all show very large single-SNP z-scores.
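The extreme example can be checked with a small simulation. This is a sketch under made-up settings (2000 individuals, allele frequency 0.3, effect size 0.5, and LD induced by copying SNP 1's genotype with probability 0.95); none of these numbers come from the talk.

```python
import math
import random

def z_score(y, x):
    """Single-SNP z-score from simple linear regression of y on x."""
    n = len(y)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    beta = sxy / sxx
    rss = sum((b - my - beta * (a - mx)) ** 2 for a, b in zip(x, y))
    se = math.sqrt(rss / (n - 2) / sxx)
    return beta / se

rng = random.Random(1)  # fixed seed so the run is reproducible
n = 2000

def draw_genotype():
    # Number of copies (0, 1, or 2) of an allele with frequency 0.3.
    return sum(rng.random() < 0.3 for _ in range(2))

# SNP 1 is the only SNP with a nonzero effect on the trait.
snp1 = [draw_genotype() for _ in range(n)]
y = [0.5 * g + rng.gauss(0, 1) for g in snp1]

# SNPs 2-100: zero effect, but in high LD with SNP 1.
ld_snps = [
    [g if rng.random() < 0.95 else draw_genotype() for g in snp1]
    for _ in range(99)
]
# One SNP independent of SNP 1, for comparison.
null_snp = [draw_genotype() for _ in range(n)]

z_causal = z_score(y, snp1)
z_ld = [z_score(y, x) for x in ld_snps]
z_null = z_score(y, null_snp)
print(f"causal z={z_causal:.1f}, smallest LD z={min(z_ld):.1f}, null z={z_null:.1f}")
```

In this setup every SNP in LD with SNP 1 should inherit a large z-score even though its own effect is zero, which is why counting significant SNPs near pathway genes can overstate enrichment.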
We apply the method to 31 phenotypes and 3977 gene sets.
This application is not small:
# of Parameters = 31 \(\times\) (3913+64) \(\times\) 1.1 Million \(\approx\) 136 Billion
- 31 human phenotypes
- 3913 biological pathways curated by experts
- 64 tissue-based gene sets derived from GTEx data
- 1.1 million common SNPs
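The count above is plain arithmetic on the four numbers in the list, reproduced here as a quick check (variable names are illustrative):

```python
n_phenotypes = 31
n_pathways = 3913
n_tissue_sets = 64
n_snps = 1_100_000

# Every phenotype is paired with every gene set, across all SNPs.
n_gene_sets = n_pathways + n_tissue_sets
n_parameters = n_phenotypes * n_gene_sets * n_snps

print(f"{n_gene_sets} gene sets, about {n_parameters / 1e9:.0f} billion parameters")
# prints: 3977 gene sets, about 136 billion parameters
```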
One graduate student can get this done, aided by:
- Publicly available summary data
- Variational Bayes algorithms
- Banded matrix approximation
- Parallel computing
- Hierarchical data format (HDF5)
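The talk does not spell out the banded matrix approximation, so the following is only one plausible reading: treat the LD (correlation) matrix as banded by zeroing correlations between SNPs more than a fixed number of positions apart. The function name `band`, the bandwidth, and the toy matrix are assumptions for illustration.

```python
def band(matrix, bandwidth):
    """Keep entries with |i - j| <= bandwidth; set all others to zero."""
    p = len(matrix)
    return [
        [matrix[i][j] if abs(i - j) <= bandwidth else 0.0 for j in range(p)]
        for i in range(p)
    ]

# Toy LD matrix: correlation decays as 0.5 ** |i - j| with distance.
p = 5
ld = [[0.5 ** abs(i - j) for j in range(p)] for i in range(p)]

ld_banded = band(ld, bandwidth=1)
nonzero = sum(v != 0.0 for row in ld_banded for v in row)
print(f"{nonzero} of {p * p} entries kept")  # 13 of 25 entries kept
```

With a fixed bandwidth, the number of stored entries grows linearly in the number of SNPs rather than quadratically, which is the appeal at the scale of a million SNPs.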
Example: Blood lipid & MTTP gene
References: Global Lipids Genetics Consortium (2013); Teslovich et al. (2010).
Example: Alzheimer’s disease & Liver
References: The GTEx Consortium (2015); Lambert et al. (2013).
What did I learn from this project?
Data integration is a powerful idea.
- Bayesian hierarchical modelling
Fast computation is important for large-scale applications.
- Variational inference; parallel computing; HDF5
Association does not imply causation.
- Short-term goal: checking bias and confounding
- Long-term goal: learning “causality” from data
What’s next?
Develop a user-friendly black-box R package
Apply the methods to new datasets
We are glad to help!
Acknowledgments
Joint work with M. Stephens (Thesis Advisor)
Discussions with X. He and P. Carbonetto
N. Knoblauch: R package development
M. Turchin: red blood cell data
K. Dey: GTEx clustering results