A Bayesian regression model for genome-wide summary data
What are genome-wide summary data?
Data: phenotype Y and genotype X
Size: n (>10K) individuals and p (>1M) variants (SNPs)
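Y := \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{bmatrix}, \quad X := \begin{bmatrix} x_{11} & \cdots & x_{1j} & \cdots & x_{1p} \\ x_{21} & \cdots & x_{2j} & \cdots & x_{2p} \\ \vdots & & \vdots & & \vdots \\ x_{n1} & \cdots & x_{nj} & \cdots & x_{np} \end{bmatrix}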
Model: single-SNP analysis
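\begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{bmatrix} \sim \begin{bmatrix} x_{1j} \\ x_{2j} \\ \vdots \\ x_{nj} \end{bmatrix} \rightsquigarrow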
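The single-SNP analysis regresses Y on one column of X at a time. A minimal Python sketch follows, under assumptions the slide does not spell out: the summary data are taken to be a per-SNP effect estimate and its standard error, genotypes are coded 0/1/2, and each regression includes an intercept. The function name `single_snp_summary` and the simulated data are illustrative only.

```python
import math
import random


def single_snp_summary(y, x):
    """Regress y on a single SNP x (with intercept).

    Returns (beta_hat, se): the least-squares slope and its standard error.
    """
    n = len(y)
    x_mean = sum(x) / n
    y_mean = sum(y) / n
    sxx = sum((xi - x_mean) ** 2 for xi in x)
    sxy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
    beta_hat = sxy / sxx
    intercept = y_mean - beta_hat * x_mean
    # Residual sum of squares and the usual unbiased variance estimate.
    rss = sum((yi - intercept - beta_hat * xi) ** 2 for xi, yi in zip(x, y))
    sigma2_hat = rss / (n - 2)
    se = math.sqrt(sigma2_hat / sxx)
    return beta_hat, se


# Toy data (assumption): n = 200 individuals, p = 5 SNPs, only SNP 0 has an effect.
random.seed(0)
n, p = 200, 5
X = [[random.choice([0, 1, 2]) for _ in range(p)] for _ in range(n)]
y = [0.5 * row[0] + random.gauss(0, 1) for row in X]

# One (beta_hat, se) pair per SNP: this is all that is kept from X and Y.
summary = [single_snp_summary(y, [row[j] for row in X]) for j in range(p)]
```

Only the p pairs in `summary` would be shared, not the individual-level X and Y. This is why summary data can be released publicly for very large samples.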
We are interested in a statistical problem motivated by genetics.

Why do we consider the multiple-SNP model Y \sim X_1 + X_2 + \ldots + X_p?
Review: Multivariate linear models for GWAS (Sabatti, 2013).

References: Moser et al. (2015); Loh et al. (2015); Bottolo et al. (2013); Zhou et al. (2013); Carbonetto & Stephens (2012); Guan & Stephens (2011); Kang et al. (2010); Logsdon et al. (2010); Yang et al. (2010); Hoggart et al. (2008); Wu et al. (2009); Servin & Stephens (2007).
Example: SNP heritability of adult height
SNP heritability (genetics) \longleftrightarrow R^2 (statistics)

- RSS on summary data (# of SNPs 1.1M; sample size 253K):
52.1%, [50.3%, 53.9%]
- Linear mixed models on full data (# of SNPs 1.1M; sample size 6K):
49.8%, [41.2%, 58.4%]
References: Wood et al. (2014); Yang et al. (2011).
Having a likelihood opens the door to various applications in genetics.
Example 1: Estimate genetic architecture

Example 2: Assess gene set enrichment (Today’s Focus)

Example 3: Partition heritability by annotations

What is gene set enrichment?
- Phenotype: low-density lipoprotein (Teslovich et al. 2010)
- Pathway: chylomicron-mediated lipid transport (17 genes)
- Annotation: Is the SNP near a pathway gene? (yes or no)
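The yes/no annotation can be sketched in Python as a position lookup. The slide does not say how "near" is defined, so the 100 kb window below is a hypothetical choice. The gene coordinates and the function name `is_near_pathway_gene` are also made up for illustration.

```python
# Hypothetical window: how close (in base pairs) a SNP must be to count as "near".
WINDOW_BP = 100_000


def is_near_pathway_gene(chrom, pos, pathway_genes, window=WINDOW_BP):
    """Return True if the SNP lies within `window` bp of any pathway gene.

    pathway_genes is a list of (chromosome, start, end) tuples.
    """
    return any(
        chrom == g_chrom and g_start - window <= pos <= g_end + window
        for g_chrom, g_start, g_end in pathway_genes
    )


# Toy pathway with two genes (the real pathway on the slide has 17).
pathway_genes = [("1", 1_000_000, 1_050_000), ("4", 5_000_000, 5_020_000)]

# Toy SNPs as (chromosome, position).
snps = [("1", 950_000), ("1", 2_000_000), ("4", 5_010_000), ("7", 5_010_000)]

annotation = [is_near_pathway_gene(c, p, pathway_genes) for c, p in snps]
print(annotation)  # [True, False, True, False]
```

The resulting binary vector has one entry per SNP. It is the only pathway-specific input an enrichment analysis of this kind needs.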

Recent Reviews: de Leeuw et al. (2016); Pers (2016); Mooney et al. (2014); Wang et al. (2010).
We apply the method to 31 phenotypes and 3977 gene sets.
This application is not small:
# of Parameters = 31 \times (3913+64) \times 1.1 Million \approx 136 Billion
- 31 human phenotypes
- 3913 biological pathways curated by experts
- 64 tissue-based gene sets derived from GTEx data
- 1.1 million common SNPs
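The parameter count follows directly from the four numbers in the list. A quick check in Python (variable names are illustrative):

```python
n_phenotypes = 31
n_pathways = 3913
n_tissue_sets = 64
n_snps = 1_100_000

n_gene_sets = n_pathways + n_tissue_sets            # 3977 gene sets in total
n_parameters = n_phenotypes * n_gene_sets * n_snps  # one parameter per (phenotype, gene set, SNP)

print(f"{n_parameters:,}")  # 135,615,700,000, i.e. about 136 billion
```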
One graduate student can get this done, aided by:
- Publicly available summary data
- Variational Bayes algorithms
- Banded matrix approximation
- Parallel computing
- Hierarchical data format (HDF5)
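To illustrate the banded matrix approximation from the list, here is a minimal Python sketch. It assumes the matrix being approximated is a SNP-by-SNP correlation matrix and that banding simply zeroes every entry more than a fixed number of positions from the diagonal. The slides give no details, so this is the simplest form of the technique, not necessarily the one used in the talk. The toy matrix and the name `band` are illustrative.

```python
def band(matrix, bandwidth):
    """Keep entries with |i - j| <= bandwidth; set all others to zero."""
    n = len(matrix)
    return [
        [matrix[i][j] if abs(i - j) <= bandwidth else 0.0 for j in range(n)]
        for i in range(n)
    ]


# Toy correlation matrix (assumption): correlation decays as 0.5 ** |i - j|.
n_snps = 6
R = [[0.5 ** abs(i - j) for j in range(n_snps)] for i in range(n_snps)]

R_banded = band(R, bandwidth=1)

# Storage drops from n^2 entries to roughly n * (2 * bandwidth + 1).
nonzero = sum(1 for row in R_banded for v in row if v != 0.0)
print(nonzero)  # 16
```

With p above one million, a dense p-by-p matrix is impractical to store, whereas a band of fixed width grows only linearly in p.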
Example: Blood lipid & MTTP gene

References: Global Lipids Genetics Consortium (2013); Teslovich et al. (2010).
Example: Alzheimer’s disease & Liver

References: The GTEx Consortium (2015); Lambert et al. (2013).
What did I learn from this project?
Basic concepts help solve modern problems.
- Bayes’ theorem; Likelihood function
Data integration is a powerful idea.
- Bayesian hierarchical modelling
Fast computation is important for large-scale applications.
- Variational inference; Parallel computing; HDF5
Association does not imply causation.
- Short-term goal: checking bias and confounding
- Long-term goal: learning “causality” from data
What’s next?
Develop a user-friendly black-box R package
Apply the methods to new datasets
We are glad to help!
Acknowledgments
Joint work with M. Stephens (Thesis Advisor)
Discussions with X. He and P. Carbonetto
N. Knoblauch: R package development
M. Turchin: red blood cell data
K. Dey: GTEx clustering results

Xiang Zhu <xiangzhu@uchicago.edu>
2017-02-22 @ Stanford Statistics