What are genome-wide summary data?

Data: phenotype $Y$ and genotype $X$

Size: $n$ (>10K) individuals and $p$ (>1M) variants (SNPs)

$Y:= \left[ \begin{array}{c} y_{1} \\ y_{2} \\ \vdots \\ y_{n} \end{array} \right] ~~~ X:= \left[ \begin{array}{ccccc} x_{11} & \ldots & x_{1j} & \ldots & x_{1p}\\ x_{21} & \ldots & x_{2j} & \ldots & x_{2p}\\ \vdots & \vdots & \vdots & \ldots & \vdots \\ x_{n1} & \ldots & x_{nj} & \ldots & x_{np}\end{array} \right]$

Model: single-SNP analysis

$\left[ \begin{array}{c} y_{1} \\ y_{2} \\ \vdots \\ y_{n} \end{array} \right] \sim \left[ \begin{array}{c} x_{1j} \\ x_{2j} \\ \vdots \\ x_{nj} \end{array} \right] ~~ \leadsto ~~ \left\{ \begin{array}{cl} \hat{\beta}_j: & \textsf{marginal effect estimate} \\ \hat{\sigma}_j: & \textsf{standard error of}~\hat{\beta}_j \end{array}\right.$
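
The single-SNP analysis above can be sketched in a few lines: regress $Y$ on one column of $X$ at a time and keep only $\hat{\beta}_j$ and $\hat{\sigma}_j$. This is a minimal illustration using ordinary least squares with an intercept; the function names and the toy data are assumptions made for the example, not part of the talk.

```python
import math

def single_snp_summary(y, x):
    """Regress phenotype y on one SNP's genotypes x (with an intercept).

    Returns (beta_hat, se): the marginal effect estimate and its standard error.
    """
    n = len(y)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    beta_hat = sxy / sxx
    # Residual sum of squares; n - 2 degrees of freedom (intercept + slope).
    rss = sum((yi - y_bar - beta_hat * (xi - x_bar)) ** 2 for xi, yi in zip(x, y))
    se = math.sqrt(rss / (n - 2) / sxx)
    return beta_hat, se

def summary_data(y, X):
    """Apply the single-SNP regression to every column j of X (rows = individuals)."""
    p = len(X[0])
    return [single_snp_summary(y, [row[j] for row in X]) for j in range(p)]

# Toy data (hypothetical): 6 individuals, 2 SNPs coded as 0/1/2 allele counts.
Y = [1.0, 2.1, 2.9, 1.2, 2.0, 3.1]
X = [[0, 2], [1, 1], [2, 0], [0, 1], [1, 2], [2, 0]]
for j, (b, s) in enumerate(summary_data(Y, X), start=1):
    print(f"SNP {j}: beta_hat = {b:.3f}, se = {s:.3f}")
```

Only the $p$ pairs returned by `summary_data` need to be shared, which is what makes summary data far easier to distribute than the full $n \times p$ genotype matrix.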

We are interested in a statistical problem motivated by genetics.

Why do we consider the multiple-SNP model $Y \sim X_1 + X_2 + \ldots + X_p$ ?

Review: Multivariate linear models for GWAS (Sabatti, 2013).

References: Moser et al. (2015); Loh et al. (2015); Bottolo et al. (2013); Zhou et al. (2013); Carbonetto & Stephens (2012); Guan & Stephens (2011); Kang et al. (2010); Logsdon et al. (2010); Yang et al. (2010); Hoggart et al. (2008); Wu et al. (2009); Servin & Stephens (2007).

Why do we consider single-SNP summary data $\{\hat{\beta}_j, \hat{\sigma}^2_j\}$ ?

Review: Pasaniuc & Price (Nature Reviews Genetics, 2017).

Source: http://www.nature.com/ng/journal/v44/n7/full/ng.2345.html

How do we fit multiple-SNP models using single-SNP summary data?

Reference: Zhu & Stephens (2016). http://dx.doi.org/10.1101/042457

Regression with Summary Statistics (RSS)
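
For reference, the general form of the likelihood as given in Zhu & Stephens (2016) is

$L_{\sf rss}(\beta)=N(\widehat{\beta}; \widehat{S}\widehat{R}\widehat{S}^{-1}\beta, \widehat{S}\widehat{R}\widehat{S})$

where $\widehat{\beta}$ collects the single-SNP estimates $\hat{\beta}_j$, $\widehat{S}$ is the diagonal matrix built from their standard errors, and $\widehat{R}$ is an estimate of the correlation (LD) matrix among SNPs, typically obtained from a reference panel.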

Special case: $L_{\sf rss}(\beta)=N(\widehat{\beta}; \beta, \widehat{S}^2)$ when $\widehat{R}$ is the identity matrix
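
In this special case the likelihood factorizes over SNPs, so its logarithm is a sum of univariate normal log-densities. The sketch below evaluates it directly; the function name and the numbers are hypothetical, chosen only to illustrate the formula.

```python
import math

def rss_loglik_identity(beta, beta_hat, se):
    """Log of N(beta_hat; beta, S^2) with S = diag(se).

    With an identity correlation matrix the SNPs are independent,
    so the log-likelihood is a sum of univariate normal log-densities.
    """
    total = 0.0
    for b, bh, s in zip(beta, beta_hat, se):
        z = (bh - b) / s
        total += -0.5 * math.log(2.0 * math.pi) - math.log(s) - 0.5 * z * z
    return total

# Hypothetical summary data for three SNPs.
beta_hat = [0.12, -0.05, 0.30]
se = [0.04, 0.05, 0.10]

# The likelihood is maximized at beta = beta_hat and drops at beta = 0.
print(rss_loglik_identity(beta_hat, beta_hat, se))
print(rss_loglik_identity([0.0, 0.0, 0.0], beta_hat, se))
```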

“Implied likelihood” based on confidence interval (Efron, 1993)

We derive RSS using asymptotic theory.

Example: SNP heritability of adult height

SNP heritability (genetics) $\longleftrightarrow$ $R^2$ (statistics)

RSS on summary data (# of SNPs 1.1M; sample size 253K):
52.1%, [50.3%, 53.9%]
Linear mixed models on full data (# of SNPs 1.1M; sample size 6K):
49.8%, [41.2%, 58.4%]
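
To make the link between SNP heritability and $R^2$ concrete, the following simulation sketch computes the proportion of phenotypic variance explained by the genetic component, $\mathrm{Var}(X\beta)/\mathrm{Var}(Y)$, in synthetic data. Everything here (sample size, allele frequency, effect distribution, target value) is an assumption chosen for illustration; it is not the procedure used to obtain the height estimates above.

```python
import random

def variance(v):
    """Unbiased sample variance."""
    m = sum(v) / len(v)
    return sum((a - m) ** 2 for a in v) / (len(v) - 1)

def simulate_pve(n=2000, p=20, target=0.5, seed=1):
    """Simulate y = X beta + noise and return Var(X beta) / Var(y)."""
    rng = random.Random(seed)
    # Genotypes: allele counts 0/1/2 with allele frequency 0.3 (assumed).
    X = [[(rng.random() < 0.3) + (rng.random() < 0.3) for _ in range(p)]
         for _ in range(n)]
    beta = [rng.gauss(0.0, 1.0) for _ in range(p)]
    g = [sum(x * b for x, b in zip(row, beta)) for row in X]
    # Scale the noise so that Var(g) / Var(y) is close to `target`.
    noise_sd = (variance(g) * (1.0 - target) / target) ** 0.5
    y = [gi + rng.gauss(0.0, noise_sd) for gi in g]
    return variance(g) / variance(y)

print(f"simulated proportion of variance explained: {simulate_pve():.3f}")
```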

References: Wood et al. (2014); Yang et al. (2011).

Having a likelihood opens the door to various applications in genetics.

RSS Likelihood + Different Priors $\leadsto$ Genetic Applications

Eg.1: Estimate genetic architecture

Eg.2: Assess gene set enrichment (Today’s Focus)

Eg.3: Partition heritability by annotations

What is gene set enrichment?

Phenotype: low-density lipoprotein (Teslovich et al. 2010)
Pathway: chylomicron-mediated lipid transport (17 genes)
Annotation: Is the SNP near a pathway gene? (yes or no)
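
The yes/no annotation can be sketched as a simple interval check: a SNP is annotated 1 if it falls within some window around any gene in the pathway, and 0 otherwise. The 100 kb window, the function name, and the coordinates below are assumptions for illustration; the slides do not state how "near" is defined.

```python
def near_pathway_gene(snp, genes, window=100_000):
    """Return 1 if the SNP lies within `window` bp of any pathway gene
    on the same chromosome, else 0. The default window is an assumption."""
    chrom, pos = snp
    for g_chrom, start, end in genes:
        if g_chrom == chrom and start - window <= pos <= end + window:
            return 1
    return 0

# Hypothetical coordinates, for illustration only (not real gene positions).
pathway_genes = [("chr1", 1_000_000, 1_050_000), ("chr2", 5_000_000, 5_020_000)]
snps = [("chr1", 950_000), ("chr1", 1_200_000), ("chr2", 5_100_000), ("chr3", 1_000_000)]

annotation = [near_pathway_gene(s, pathway_genes) for s in snps]
print(annotation)
```

The resulting 0/1 vector has one entry per SNP and is the only pathway-specific input the enrichment analysis needs.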

Recent Reviews: de Leeuw et al. (2016); Pers (2016); Mooney et al. (2014); Wang et al. (2010).

We develop a method that systematically leverages enrichment information.

We apply the method to 31 phenotypes and 3977 gene sets.

This application is not small:

Number of parameters = 31 $\times$ (3913+64) $\times$ 1.1 Million $\approx$ 136 Billion

31 human phenotypes
3913 biological pathways curated by experts
64 tissue-based gene sets derived from GTEx data
1.1 million common SNPs
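
The parameter count can be checked with a few lines of arithmetic, using only the numbers listed above (the variable names are illustrative):

```python
n_phenotypes = 31
n_gene_sets = 3913 + 64      # biological pathways + tissue-based gene sets
n_snps = 1_100_000           # 1.1 million common SNPs

n_params = n_phenotypes * n_gene_sets * n_snps
print(f"{n_gene_sets} gene sets, {n_params:,} parameters "
      f"(about {n_params / 1e9:.0f} billion)")
```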

One graduate student can get this done, aided by:

Publicly available summary data
Variational Bayes algorithms
Banded matrix approximation
Parallel computing
Hierarchical data format (HDF5)
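
As one illustration of the items above, a banded matrix approximation keeps only the entries of a SNP correlation matrix that lie close to the diagonal, on the premise that distant SNPs are nearly uncorrelated. The sketch below is a generic version of that idea; the bandwidth, the function names, and the toy matrix are assumptions, and the slides do not specify how the actual software builds or stores its banded matrices.

```python
def band(R, bandwidth):
    """Zero every entry of the square matrix R that is farther than
    `bandwidth` positions from the diagonal."""
    p = len(R)
    return [[R[i][j] if abs(i - j) <= bandwidth else 0.0 for j in range(p)]
            for i in range(p)]

def band_storage(p, bandwidth):
    """Number of entries kept by the banded matrix (a dense one has p * p)."""
    return sum(min(p - 1, i + bandwidth) - max(0, i - bandwidth) + 1
               for i in range(p))

# Toy correlation matrix for four SNPs (hypothetical values).
R = [[1.00, 0.80, 0.30, 0.05],
     [0.80, 1.00, 0.60, 0.10],
     [0.30, 0.60, 1.00, 0.70],
     [0.05, 0.10, 0.70, 1.00]]

for row in band(R, 1):
    print(row)

# Storage grows linearly in p for a fixed bandwidth, not quadratically.
print(band_storage(10_000, 100), "entries kept versus", 10_000 ** 2, "dense")
```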

Our full results are publicly available.

Results: http://xiangzhu.github.io/rss-gsea/_book
Software: https://github.com/stephenslab/rss

Pathway enrichment mirrors trait category.

Example: Blood lipid & MTTP gene

References: Global Lipids Genetics Consortium (2013); Teslovich et al. (2010).

Example: Alzheimer’s disease & Liver

References: The GTEx Consortium (2015); Lambert et al. (2013).

What did I learn from this project?

Basic concepts help solve modern problems.

Bayes’ theorem; Likelihood function

Data integration is a powerful idea.

Bayesian hierarchical modelling

Fast computation is important for large-scale applications.

Variational inference; Parallel computing; HDF5

Association does not imply causation.

Short-term goal: checking bias and confounding
Long-term goal: learning “causality” from data

A Bayesian regression model for genome-wide summary data

Xiang Zhu <xiangzhu@uchicago.edu>

2017-02-22 @ Stanford Statistics