What are genome-wide summary data?

Data: phenotype $\sf Y$ and genotype $\sf X$

Size: $\sf n$ (>10K) individuals and $\sf p$ (>1M) genetic variants (SNPs)

${\sf Y}:= \left[ \begin{array}{c} {\sf y_1} \\ {\sf y_2} \\ \vdots \\ {\sf y_n} \end{array} \right] ~~~ {\sf X}:= \left[ \begin{array}{ccccc} \mathsf{x_{11}} & \ldots & \mathsf{x_{1j}} & \ldots & \mathsf{x_{1p}}\\ \mathsf{x_{21}} & \ldots & \mathsf{x_{2j}} & \ldots & \mathsf{x_{2p}}\\ \vdots & \ldots & \vdots & \ldots & \vdots \\ \mathsf{x_{n1}} & \ldots & \mathsf{x_{nj}} & \ldots & \mathsf{x_{np}}\end{array} \right]$

Model: single-SNP association analysis

$\left[ \begin{array}{c} \mathsf{y_{1}} \\ \mathsf{y_{2}} \\ \vdots \\ \mathsf{y_{n}} \end{array} \right] \sim \left[ \begin{array}{c} \mathsf{x_{1j}} \\ \mathsf{x_{2j}} \\ \vdots \\ \mathsf{x_{nj}} \end{array} \right] ~~ \leadsto ~~ \left\{ \begin{array}{cl} \mathsf{\hat{\beta}_j}: & \textsf{marginal effect estimate} \\ \mathsf{\hat{\sigma}_j}: & \textsf{standard error of}~\mathsf{\hat{\beta}_j} \end{array}\right.$
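
The single-SNP analysis above can be sketched in a few lines. This is a minimal illustration only, assuming ordinary least squares with an intercept and a residual variance on n - 2 degrees of freedom; the slides do not specify these details, and the function name `single_snp_summary` and the toy data are hypothetical.

```python
import math
import random


def single_snp_summary(y, x):
    """Regress phenotype y on a single SNP x (with intercept).

    Returns (beta_hat, se): the marginal effect estimate and its
    standard error, i.e. the per-SNP summary data.
    """
    n = len(y)
    x_mean = sum(x) / n
    y_mean = sum(y) / n
    sxx = sum((xi - x_mean) ** 2 for xi in x)
    sxy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
    beta_hat = sxy / sxx
    # Residual sum of squares around the fitted line.
    rss = sum(
        (yi - y_mean - beta_hat * (xi - x_mean)) ** 2 for xi, yi in zip(x, y)
    )
    # Assumption: n - 2 degrees of freedom (intercept + slope).
    se = math.sqrt(rss / (n - 2) / sxx)
    return beta_hat, se


# Hypothetical toy data: genotypes coded 0/1/2, true effect 0.5.
rng = random.Random(1)
x = [rng.choice([0, 1, 2]) for _ in range(500)]
y = [0.5 * xi + rng.gauss(0.0, 1.0) for xi in x]
beta_hat, se = single_snp_summary(y, x)
print(f"beta_hat = {beta_hat:.3f}, se = {se:.3f}")
```

Repeating this for each of the p columns of X yields the genome-wide summary data, one (estimate, standard error) pair per SNP.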

We are interested in a statistical problem motivated by genetics.

Why do we consider multiple-SNP models ${\sf Y} \sim {\sf X_1} + {\sf X_2} + \ldots + {\sf X_p}$ ?

Review: Multivariate linear models for GWAS (Sabatti, 2013).

Select references: Moser et al. (2015); Loh et al. (2015); Bottolo et al. (2013); Zhou et al. (2013); Carbonetto & Stephens (2012); Guan & Stephens (2011); Kang et al. (2010); Logsdon et al. (2010); Yang et al. (2010); Hoggart et al. (2008); Wu et al. (2009); Servin & Stephens (2007).

Why do we consider single-SNP summary data $\{\hat{\beta}_{\sf j}, \hat{\sigma}^{\sf 2}_{\sf j}\}$ ?

Review: Pasaniuc & Price (Nature Reviews Genetics, 2017).

Editorial: Nature Genetics, July 2012.

How do we fit multiple-SNP models using single-SNP summary data?

Reference: Zhu & Stephens (2017, To appear). http://dx.doi.org/10.1101/042457

Regression with Summary Statistics (RSS)

Example: SNP heritability of adult height

SNP heritability (genetics) $\longleftrightarrow$ ${\sf R^2}$ (statistics)

RSS on summary data (# of SNPs 1.1M; sample size 253K):
52.1%, [50.3%, 53.9%]
Linear mixed models on full data (# of SNPs 1.1M; sample size 6K):
49.8%, [41.2%, 58.4%]

References: Wood et al. (2014); Zhou et al. (2013); Yang et al. (2011); Guan & Stephens (2011); Yang et al. (2010).

Having a likelihood opens the door to various applications in genetics.

RSS Likelihood + Different Priors $\leadsto$ Genetic Applications

Eg.1: Estimate genetic architecture

$\beta_{\sf j}\sim \pi_{\sf 0}\cdot \delta_{\sf 0} + {\textstyle \sum}_{\sf k=1}^{\sf K} \pi_{\sf k} \cdot {\sf N}({\sf 0}, \sigma^{\sf 2}_{\sf k})$

Eg.2: Assess gene set enrichment (Today’s Focus)

$\beta_{\sf j}\sim ({\sf 1}-\pi_{\sf j})\cdot \delta_{\sf 0} + \pi_{\sf j}\cdot {\sf N}({\sf 0}, \sigma^{\sf 2}_{\beta})$
$\textsf{log}_{\sf 10}[\pi_{\sf j}/({\sf 1}-\pi_{\sf j})] = \theta_{\sf 0} + \theta \cdot {\sf 1}\{\textsf{SNP j}\in \textsf{gene set}\}$

Eg.3: Partition heritability by annotations

$\beta_{\sf j}\sim {\sf N}({\sf 0}, \sigma^{\sf 2}_{\sf j})$
${\sf log}(\sigma^{\sf 2}_{\sf j}) = {\sf w}_0 + {\textstyle \sum}_{\sf g=1}^{\sf G} {\sf w}_{\sf g} \cdot {\sf 1}\{\textsf{SNP j}\in \textsf{category g}\}$
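
To make the annotation-dependent priors in Eg.2 and Eg.3 concrete, the sketch below maps annotations to prior parameters. It is illustrative only: the function names and example values are hypothetical, and the `log` in Eg.3 is assumed to be the natural logarithm.

```python
import math


def inclusion_probability(theta0, theta, in_gene_set):
    """Eg.2: prior probability pi_j that SNP j has a nonzero effect.

    Solves log10[pi_j / (1 - pi_j)] = theta0 + theta * 1{SNP j in gene set}.
    """
    log10_odds = theta0 + theta * (1 if in_gene_set else 0)
    odds = 10.0 ** log10_odds
    return odds / (1.0 + odds)


def prior_variance(w0, w, categories):
    """Eg.3: prior effect variance sigma_j^2 for SNP j.

    `w` holds one weight per category; `categories` is the set of
    (0-based) category indices that SNP j belongs to.
    Assumption: log(sigma_j^2) uses the natural logarithm.
    """
    log_var = w0 + sum(w[g] for g in categories)
    return math.exp(log_var)


# Hypothetical values: baseline log10 odds of -4, enrichment theta = 2.
pi_outside = inclusion_probability(-4.0, 2.0, in_gene_set=False)
pi_inside = inclusion_probability(-4.0, 2.0, in_gene_set=True)
print(pi_outside, pi_inside)
```

With these hypothetical values, a positive θ raises the prior odds for gene-set SNPs by a factor of 10^θ, which is how enrichment enters the model.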

What is gene set enrichment?

Phenotype: low-density lipoprotein cholesterol
Pathway: chylomicron-mediated lipid transport (17 genes)
Annotation: Is the SNP near a pathway gene? (yes or no)

Data source: http://csg.sph.umich.edu/abecasis/public/lipids2010 (Teslovich et al. 2010).
Recent reviews: de Leeuw et al. (2016); Pers (2016); Mooney et al. (2014); Wang et al. (2010).

The “enrichment” idea is simple, but there are (at least) two technical issues.

1. If the gene set is truly enriched, we should relax the significance threshold for “green” SNPs, but by how much?

Data-driven Threshold $\leftarrow$ Function (Pathway, Phenotype)
Maintained type 1 error + Improved power

2. The inflated pattern of the “green” curve can be due to correlation between SNPs (linkage disequilibrium, LD), rather than enrichment.

SNP 1 has a large genetic effect on a trait.
SNPs 2-100 have zero effect, but all are in high LD with SNP 1.
Thus, SNPs 1-100 all show very large single-SNP z-scores.
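
The LD scenario above can be reproduced with a small simulation. This is a sketch under simplifying assumptions that are not from the slides: genotypes are drawn as continuous standard normals rather than 0/1/2 counts, the correlation with SNP 1 is set to 0.95, and the sample size and effect size are hypothetical.

```python
import math
import random


def z_score(y, x):
    """Single-SNP z-score: OLS slope divided by its standard error."""
    n = len(y)
    x_mean = sum(x) / n
    y_mean = sum(y) / n
    sxx = sum((xi - x_mean) ** 2 for xi in x)
    sxy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
    beta_hat = sxy / sxx
    rss = sum(
        (yi - y_mean - beta_hat * (xi - x_mean)) ** 2 for xi, yi in zip(x, y)
    )
    se = math.sqrt(rss / (n - 2) / sxx)
    return beta_hat / se


rng = random.Random(0)
n, p, rho = 1000, 100, 0.95  # hypothetical sample size, SNP count, LD level

# SNP 1 is the only SNP with a true effect on the trait.
snp1 = [rng.gauss(0.0, 1.0) for _ in range(n)]
# SNPs 2-100 have zero effect but correlate with SNP 1 at about rho.
noise_scale = math.sqrt(1.0 - rho ** 2)
snps = [snp1] + [
    [rho * a + noise_scale * rng.gauss(0.0, 1.0) for a in snp1]
    for _ in range(p - 1)
]
# The trait depends on SNP 1 alone.
y = [0.5 * a + rng.gauss(0.0, 1.0) for a in snp1]

z = [z_score(y, x) for x in snps]
print(f"z for SNP 1: {z[0]:.1f}; smallest z among SNPs 2-100: {min(z[1:]):.1f}")
```

Every SNP in the block shows a large z-score even though only one has a nonzero effect, so counting marginally significant SNPs near pathway genes can overstate enrichment.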

We develop a method that systematically utilizes enrichment information.

Select references for M1: Veyrieras et al. (2008); Carbonetto & Stephens (2013); Pickrell (2014); Kichaev et al. (2014); Chen et al. (2016); Li & Kellis (2016); Wen et al. (2017).

We apply the method to 31 complex traits and 4,026 gene sets.

This application is not small:

# of Parameters = 31 $\times$ (3,913+113) $\times$ 1.1 Million $\approx$ 137 Billion

31 complex human phenotypes
3,913 biological pathways curated by experts
113 tissue-based gene sets derived from GTEx RNA-seq
1.1 million common SNPs from 1000 Genomes
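
The parameter count above follows directly from the four quantities listed. A short check of the arithmetic (variable names are illustrative):

```python
n_traits = 31                 # complex human phenotypes
n_gene_sets = 3913 + 113      # curated pathways + tissue-based gene sets
n_snps = 1_100_000            # common SNPs from 1000 Genomes

total = n_traits * n_gene_sets * n_snps
print(f"{n_gene_sets:,} gene sets, {total:,} parameters")
```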

One graduate student can get this done, aided by:

Publicly available summary data
Variational Bayes algorithms
Parallel programming
Hierarchical data format (HDF5)
UChicago Research Computing Center

We make our full results and software publicly available.

Results:
http://xiangzhu.github.io/rss-gsea/results
Software:
https://github.com/stephenslab/rss
Demonstration:
http://stephenslab.github.io/rss/Example-5
R package (in progress):
https://github.com/stephenslab/rssr (N. Knoblauch)

Example: Blood lipid & MTTP gene

Reference: Teslovich et al. (2010).

Example: Blood lipid & MTTP gene

References: Global Lipids Genetics Consortium (2013); Teslovich et al. (2010).

Example: Alzheimer’s disease & Liver

References: Xi et al. (Submitted); Dey et al. (2017); The GTEx Consortium (2015); Lambert et al. (2013).

What did I learn from this thesis?

Basic concepts help solve modern problems.

Bayes’ theorem; Likelihood; Regression; “Enrichment”

Data integration is a powerful idea.

Bayesian hierarchical modelling; “Multiomics”

Large applications require fast computation.

MCMC; Variational inference; Banded matrix; Parallel computing

What’s next?

False discovery rate control [STAT]

Barber & Candès (2015); Stephens (2017); Brzyski et al. (2017) …

Confounding adjustment [STAT]

Leek & Storey (2007); Sun et al. (2012); Risso et al. (2014) …

Advanced computation [STAT]

Liang & Wong (2000); Neal (2011); Hoffman et al. (2013) …

Infer “genetic architecture” [GEN]

Park et al. (2010); Thompson et al. (2015); Speed et al. (2017) …

Model “multiomics” data [GEN]

Nicolae et al. (2010); He et al. (2013); Pickrell (2014) …

Future work: Infer “genetic architecture”

Reference: Zhu & Stephens (2017, Submitted). https://doi.org/10.1101/160770

Future work: Model “multiomics” data

References: Boyle et al. (2017); Hasin et al. (2017); Javierre et al. (2016); Li et al. (2016); Richardson et al. (2016); Ritchie et al. (2015).

Acknowledgments

Thesis Advisor: M. Stephens

Thesis Committee: R. F. Barber, X. He

Statistics Faculty: D. L. Nicolae, P. McCullagh

Past Members: P. Carbonetto, X. Zhou, X. Wen, Y. Guan

Members: J. Blischak, S. Zhao, G. Wang, D. Gerard

N. Knoblauch: `R` package development

M. C. Turchin: red blood cell data

K. K. Dey: GTEx clustering results

O. I. Olopade Lab: breast cancer data

I. Moskowitz Lab: atrial fibrillation data

Acknowledgments

UChicago Statistics & Human Genetics

GWAS Consortia:

Teslovich et al. (2010); Manning et al. (2012); Morris et al. (2012); van der Harst et al. (2012); Köttgen et al. (2013); Lambert et al. (2013); Okada et al. (2014); Ripke et al. (2014); Wood et al. (2014); Day et al. (2015); Liu et al. (2015); Locke et al. (2015); Nikpay et al. (2015); Shungin et al. (2015); Okbay et al. (2016); van Rheenen et al. (2016)

A Bayesian regression model for genome-wide summary data

Xiang Zhu <xiangzhu@uchicago.edu>

2017-06-30 @ UChicago Statistics