A Bayesian regression model for genome-wide summary data

Xiang Zhu <xiangzhu@uchicago.edu>

2017-02-22 @ Stanford Statistics

What are genome-wide summary data?

Data: phenotype \(Y\) and genotype \(X\)

Size: \(n\) (>10K) individuals and \(p\) (>1M) variants (SNPs)

\[ Y:= \left[ \begin{array}{c} y_{1} \\ y_{2} \\ \vdots \\ y_{n} \end{array} \right] ~~~ X:= \left[ \begin{array}{ccccc} x_{11} & \ldots & x_{1j} & \ldots & x_{1p}\\ x_{21} & \ldots & x_{2j} & \ldots & x_{2p}\\ \vdots & \vdots & \vdots & \ldots & \vdots \\ x_{n1} & \ldots & x_{nj} & \ldots & x_{np}\end{array} \right] \]

Model: single-SNP analysis

\[ \left[ \begin{array}{c} y_{1} \\ y_{2} \\ \vdots \\ y_{n} \end{array} \right] \sim \left[ \begin{array}{c} x_{1j} \\ x_{2j} \\ \vdots \\ x_{nj} \end{array} \right] ~~ \leadsto ~~ \left\{ \begin{array}{cl} \hat{\beta}_j: & \textsf{marginal effect estimate} \\ \hat{\sigma}_j: & \textsf{standard error of}~\hat{\beta}_j \end{array}\right. \]
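
The single-SNP analysis above can be sketched in a few lines. This is a minimal illustration, assuming simple least-squares regression of a centred phenotype on each centred SNP in turn; the function name `single_snp_summary` and the simulated data are hypothetical and not taken from the talk.

```python
import numpy as np

def single_snp_summary(y, X):
    """Regress y on each column of X separately; return (beta_hat, se_hat)."""
    y = np.asarray(y, dtype=float)
    X = np.asarray(X, dtype=float)
    n = y.shape[0]
    # Centre phenotype and genotypes so that no explicit intercept is needed.
    yc = y - y.mean()
    Xc = X - X.mean(axis=0)
    xtx = np.sum(Xc ** 2, axis=0)              # x_j' x_j for each SNP j
    beta_hat = Xc.T @ yc / xtx                 # marginal effect estimates
    # Residual sum of squares of each simple regression (n - 2 d.o.f.).
    resid_ss = np.sum(yc ** 2) - beta_hat ** 2 * xtx
    se_hat = np.sqrt(resid_ss / (n - 2) / xtx) # standard errors of beta_hat
    return beta_hat, se_hat

# Simulated toy data (assumption): 500 individuals, 5 SNPs coded 0/1/2.
rng = np.random.default_rng(0)
n, p = 500, 5
X = rng.binomial(2, 0.3, size=(n, p))
y = 0.5 * X[:, 0] + rng.normal(size=n)         # only the first SNP has an effect
beta_hat, se_hat = single_snp_summary(y, X)
```

Each SNP yields one pair \((\hat{\beta}_j, \hat{\sigma}_j)\); these pairs are the summary data, and the individual-level \(X\) and \(Y\) are no longer needed once they are computed.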

We are interested in a statistical problem motivated by genetics.

Why do we consider the multiple-SNP model \(Y \sim X_1 + X_2 + \ldots + X_p\)?

Review: Multivariate linear models for GWAS (Sabatti, 2013).

References: Moser et al. (2015); Loh et al. (2015); Bottolo et al. (2013); Zhou et al. (2013); Carbonetto & Stephens (2012); Guan & Stephens (2011); Kang et al. (2010); Logsdon et al. (2010); Yang et al. (2010); Hoggart et al. (2008); Wu et al. (2009); Servin & Stephens (2007).

Why do we consider single-SNP summary data \(\{\hat{\beta}_j, \hat{\sigma}^2_j\}\)?

Review: Pasaniuc & Price (Nature Reviews Genetics, 2017).

Source: http://www.nature.com/ng/journal/v44/n7/full/ng.2345.html

How do we fit multiple-SNP models using single-SNP summary data?

Reference: Zhu and Stephens (2016). http://dx.doi.org/10.1101/042457

Regression with Summary Statistics (RSS)

Special case: \(L_{\sf rss}(\beta)=N(\widehat{\beta}; \beta, \widehat{S}^2)\) when \(\widehat{R}\) is identity
“Implied likelihood” based on confidence interval (Efron, 1993)
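
The special case above is consistent with the general form given in Zhu and Stephens (2016), \(L_{\sf rss}(\beta)=N(\widehat{\beta}; \widehat{S}\widehat{R}\widehat{S}^{-1}\beta, \widehat{S}\widehat{R}\widehat{S})\), where \(\widehat{S}\) is the diagonal matrix of standard errors and \(\widehat{R}\) is an estimated SNP correlation (LD) matrix. That general form is not written out in this text, so the sketch below is an illustration under that assumption; the function name `rss_loglik` and the toy numbers are hypothetical.

```python
import numpy as np

def rss_loglik(beta, beta_hat, se_hat, R):
    """Log of N(beta_hat; S R S^{-1} beta, S R S), with S = diag(se_hat)."""
    S = np.diag(se_hat)
    S_inv = np.diag(1.0 / se_hat)
    mean = S @ R @ S_inv @ beta
    cov = S @ R @ S
    resid = beta_hat - mean
    _, logdet = np.linalg.slogdet(cov)          # log-determinant, numerically stable
    quad = resid @ np.linalg.solve(cov, resid)  # resid' cov^{-1} resid
    p = beta_hat.shape[0]
    return -0.5 * (p * np.log(2 * np.pi) + logdet + quad)

# Toy summary data for three SNPs (made-up numbers).
beta_hat = np.array([0.10, -0.05, 0.02])
se_hat = np.array([0.03, 0.04, 0.05])
beta = np.zeros(3)

# When R is the identity, the likelihood factorises into independent normals.
ll_identity = rss_loglik(beta, beta_hat, se_hat, np.eye(3))
ll_indep = np.sum(-0.5 * np.log(2 * np.pi * se_hat ** 2)
                  - 0.5 * ((beta_hat - beta) / se_hat) ** 2)
```

Comparing `ll_identity` with `ll_indep` checks the special case numerically: with an identity \(\widehat{R}\), the two values agree.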

We derive RSS using asymptotic theory.

Example: SNP heritability of adult height

SNP heritability (genetics) \(\longleftrightarrow\) \(R^2\) (statistics)

  • RSS on summary data (# of SNPs 1.1M; sample size 253K):
    52.1%, [50.3%, 53.9%]
  • Linear mixed models on full data (# of SNPs 1.1M; sample size 6K):
    49.8%, [41.2%, 58.4%]

References: Wood et al. (2014); Yang et al. (2011).

Having a likelihood opens the door to various applications in genetics.

RSS Likelihood + Different Priors \(\leadsto\) Genetic Applications

Example 1: Estimate genetic architecture

Example 2: Assess gene set enrichment (Today’s Focus)

Example 3: Partition heritability by annotations

What is gene set enrichment?

  • Phenotype: low-density lipoprotein (Teslovich et al. 2010)
  • Pathway: chylomicron-mediated lipid transport (17 genes)
  • Annotation: Is the SNP near a pathway gene? (yes or no)
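
The yes/no annotation in the last bullet can be sketched as a simple interval check. The talk does not say how "near" is defined, so the 100 kb window, the function name `near_gene`, and all coordinates below are hypothetical.

```python
def near_gene(snp, genes, window=100_000):
    """Return 1 if the SNP lies within `window` bp of any gene, else 0.

    snp   : (chromosome, position)
    genes : iterable of (chromosome, start, end) for the pathway genes
    """
    chrom, pos = snp
    for g_chrom, start, end in genes:
        if g_chrom == chrom and start - window <= pos <= end + window:
            return 1
    return 0

# Made-up pathway with two genes and three SNPs to annotate.
pathway_genes = [("chr1", 1_000_000, 1_050_000), ("chr2", 5_000_000, 5_020_000)]
snps = [("chr1", 1_120_000), ("chr1", 2_000_000), ("chr2", 4_950_000)]
annotation = [near_gene(s, pathway_genes) for s in snps]
print(annotation)  # [1, 0, 1]
```

The result is one binary indicator per SNP, which is the form of annotation that an enrichment analysis can take as input.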

Recent Reviews: de Leeuw et al. (2016); Pers (2016); Mooney et al. (2014); Wang et al. (2010).

We develop a method that systematically leverages enrichment information.

We apply the method to 31 phenotypes and 3977 gene sets.

This application is not small:

# of Parameters = 31 \(\times\) (3913+64) \(\times\) 1.1 Million \(\approx\) 136 Billion

  • 31 human phenotypes
  • 3913 biological pathways curated by experts
  • 64 tissue-based gene sets derived from GTEx data
  • 1.1 million common SNPs
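
The parameter count follows directly from the four numbers listed above; a quick arithmetic check, using only those figures:

```python
n_phenotypes = 31
n_gene_sets = 3913 + 64          # pathways + tissue-based gene sets
n_snps = 1_100_000               # common SNPs

n_analyses = n_phenotypes * n_gene_sets   # one analysis per phenotype-gene set pair
n_parameters = n_analyses * n_snps        # one effect per SNP in each analysis

print(n_gene_sets, n_analyses, n_parameters)
# 3977 123287 135615700000
```

That is roughly 135.6 billion, which rounds to the 136 billion quoted above.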

One graduate student can get this done, aided by:

  • Publicly available summary data
  • Variational Bayes algorithms
  • Banded matrix approximation
  • Parallel computing
  • Hierarchical data format (HDF5)
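
As one illustration of the banded matrix approximation listed above, the sketch below zeroes out every entry of a correlation matrix that lies more than a fixed number of positions from the diagonal. The talk does not say how the band is chosen, so the `bandwidth` parameter, the function name `band`, and the toy matrix are hypothetical.

```python
import numpy as np

def band(R, bandwidth):
    """Keep entries with |i - j| <= bandwidth; set all others to zero."""
    R = np.asarray(R, dtype=float)
    idx = np.arange(R.shape[0])
    mask = np.abs(idx[:, None] - idx[None, :]) <= bandwidth
    return np.where(mask, R, 0.0)

# Toy correlation matrix whose entries decay with distance (made-up values).
p, rho = 6, 0.6
idx = np.arange(p)
R = rho ** np.abs(idx[:, None] - idx[None, :])

R_banded = band(R, bandwidth=2)
n_nonzero = int(np.count_nonzero(R_banded))
```

Only entries close to the diagonal survive, so storage and matrix-vector products scale with the bandwidth rather than with the full number of SNP pairs.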

Our full results are publicly available.

Results: http://xiangzhu.github.io/rss-gsea/_book
Software: https://github.com/stephenslab/rss

Pathway enrichment mirrors trait category.

Example: Blood lipid & MTTP gene

References: Global Lipids Genetics Consortium (2013); Teslovich et al. (2010).

Example: Alzheimer’s disease & Liver

References: The GTEx Consortium (2015); Lambert et al. (2013).

What did I learn from this project?

Basic concepts help solve modern problems.

  • Bayes’ theorem; Likelihood function

Data integration is a powerful idea.

  • Bayesian hierarchical modelling

Fast computation is important for large-scale applications.

  • Variational inference; Parallel computing; HDF5

Association does not imply causation.

  • Short-term goal: checking bias and confounding
  • Long-term goal: learning “causality” from data

What’s next?

Add more members to the family of RSS models

Develop a user-friendly black-box R package

Apply the methods to new datasets

We are glad to help!

Acknowledgments

Joint work with M. Stephens (Thesis Advisor)

Discussions with X. He and P. Carbonetto

N. Knoblauch: R package development

M. Turchin: red blood cell data

K. Dey: GTEx clustering results