Processing math: 44%

Bayesian modelling of GWAS summary statistics

Xiang Zhu @ Stephens Lab

02/27/2015

Data: Available to Everyone

Global Lipids 2013

GIANT Height 2014

Data: Generation Workflow

Motivation

Why do we want to work with GWAS summary statistics?

Inevitable information loss when you summarize the individual-level data !!!

We want to use individual-level data, but we don't pass the following test :(

Are you a member of a large consortium? If yes, move on …
Are you allowed to share your cohort with other members? If yes, move on …
Does every member conduct the study in the same way? If yes, move on …
Is there enough space to store the individual-level data? If yes, move on …
Is there enough resources to analyze the merged data? If yes, move on …
And you (a very rare, lucky guy) can skip the rest of talk today :)

Statistical Problem

Problem set-up

multiple-SNP model: ${\bf y}_{n\times 1}=X_{n\times p}{\beta}_{p\times 1}+\epsilon_{n\times 1}$
standard analysis: observe $(X,{\bf y})$ and infer $\beta$
single-SNP summary statistics: effect size estimate $\widehat{\beta}_j$ and standard error $s_j$
problem: only observe $(\widehat{\beta}_j, s_j)$ , how to infer $\beta$ ?

Exsisting tool: GCTA-COJO

approximate COnditional & JOint analysis (Yang et al, Nature genetics, 2012)
identified 36 loci with multiple associated variants for height (data: GIANT, 2010)
identified 90, 26 and 31 loci containing 2, 3 and >= 4 signals respectively (data: GIANT, 2014)

Novelty of our work

We model the summary statistics, rather than using them to approximate the standard analysis (Yang et al, 2012).

Assumptions

Phenotype adjustment/transformation: ${\bf y}\stackrel{\mbox{a.s.}}\sim{\cal N}({\bf 0}, \sigma_y^2~I_n)$
Column-centered genotype matrix: $\overline{X}_j = 0,~~j=1,\ldots, p$
Multi-SNP model for individual-level data: ${\bf y}|X,\beta,\tau \sim{\cal N}(X\beta, \tau^{-1}I_n)$
Single-SNP summary statistics are computed from $(X, {\bf y})$ : $\widehat{\beta}_j:=(X_j^\prime X_j)^{-1}X_j^\prime{\bf y},~~s_j^2:= (nX_j^\prime X_j)^{-1}{({\bf y}-X_j\widehat{\beta}_j)^{\prime}({\bf y}-X_j\widehat{\beta}_j)}$
"Bless of small per-SNP heritability ": $s_j^2=(X_j^\prime X_j)^{-1}\sigma_y^2$

Caveat

Assumption 4: most of summary statistics come from GWAS meta-analysis, NOT mega-analysis
Assumption 5: small (?) phenotypic variation explained by one SNP $j$ and all SNPs that SNP $j$ tags ${\bf y}^{\prime}{\bf y}\approx ({\bf y}-X_j\widehat{\beta}_j)^{\prime}({\bf y}-X_j\widehat{\beta}_j)~~\Rightarrow~~s_j^2 \approx(nX_j^{\sf T}X_j)^{-1}{\bf y}^{\prime}{\bf y}\stackrel{\text{large}~n}\approx (X_j^{\sf T}X_j)^{-1}\sigma_y^2$

Verify: Meta == Mega in GWAS ?

Olkin and Sampson (Biometrics, 1998): continuous outcome; constant treatment effects and error variances across clinical trials
Mathew and Nordstrom (Biometrics, 1999): different error variances across trials
Lin and Zeng (Genetic Epidemiology, 2010): empirical illustration of equivalence using real data; binary outcome
Lin and Zeng (Biometrika, 2010): theoretical results about the asymptotic equivalence

Lin and Zeng (Genetic Epidemiology, 2010)

Verify "Bless of per-SNP heritability":

$s_j^2 = (X_j^{\sf T}X_j)^{-1}\sigma_y^2$

For each SNP $j$ , define $\eta_j={n_js_j^2}/({n_js_j^2 +\hat{\beta}_j^2}).$

For large sample size,

$\hat{\sigma}^2_y:={\bf y}^{\sf T}{\bf y}/n_j \approx \sigma_y^2$ and thus,

$\eta_j=s_j^2 \cdot \frac{{\bf y}^{\sf T}{\bf y}}{n_js_j^2 +\hat{\beta}_j^2}\cdot \frac{n_j}{{\bf y}^{\sf T}{\bf y}}=\frac{{s_j^2}\cdot {X_j^{\sf T}X_j}}{\hat{\sigma}^2_y}\approx \frac{{s_j^2}\cdot {X_j^{\sf T}X_j}}{{\sigma}^2_y}.$

If "Bless of per-SNP heritability" holds, each

$\eta_j$ should be approximately 1. Let's look at

$\log_{10}(1-\eta_j)$ in several real GWAS.

Model: Overview

$\widehat{\beta}|\beta, S, R \sim {\cal N}(SRS^{-1}{\beta}, SRS)$

$\widehat{\beta}:=(\widehat{\beta}_1,\ldots,\widehat{\beta}_p)^\prime$ , where $\widehat{\beta}_j$ is the estimated single-SNP effect for SNP $j$
$S:={\rm Diag}(s_1,\ldots, s_p)$ , where $s_j$ is the standard error of $\widehat{\beta}_j$
$R$ is an estimated SNP-correlation matrix (i.e. linkage disequilibrium, LD)
$\beta:=(\beta_1,\ldots,\beta_p)^\prime$ , where $\beta_j$ is the multi-SNP effect size for SNP $j$

We term this model Regression with Summary Statistics (RSS).

Model: Overview

$\widehat{\beta}|\beta, S, R \sim {\cal N}(SRS^{-1}{\beta}, SRS)$

Mean vector of $\widehat{\beta}$

$\widehat{\beta}_j$ includes the effects of all SNPs that it tags ${\sf E}(\widehat{\beta}_j|\beta, S, R)=s_j\cdot\sum_{i=1}^pR_{ij}s_i^{-1}\beta_i$

Covariance matrix of $\widehat{\beta}$

$\widehat{\beta}_j$ and

$\widehat{\beta}_k$ are correlated if SNP

$j$ and

$k$ are in LD

${\sf Cov}(\widehat{\beta}_j,\widehat{\beta}_k|\beta, S, R)=s_js_kR_{jk}$

Model: Sanity Check–Moments

$R^{-\frac{1}{2}}S^{-1}(\widehat{\beta}-SRS^{-1}\beta)|\beta, S, R\sim{\cal N}(0,I)$

Model: Sanity Check–Normality

$R^{-\frac{1}{2}}S^{-1}(\widehat{\beta}-SRS^{-1}\beta)|\beta, S, R\sim{\cal N}(0,I)$

Model: Derivation

Step 1: Justify conditioning modelling

$\begin{aligned} p({\beta}|{\bf y}, X) &= p({\beta}|\widehat{\beta}, S, R^{\sf s})~~~~\mbox{(sufficiency of GWAS summary statistics)}\\ &\propto p(\widehat{\beta}, S, R^{\sf s}|{\beta}) p({\beta})~~~~\mbox{(Bayes' Theorem)}\\ &= p(\widehat{\beta}|S, R^{\sf s}, {\beta})p(S,R^{\sf s}|{\beta})p({\beta})\\ &\propto p(\widehat{\beta}|S, R^{\sf s}, {\beta})p(\beta)~~~~(\because p(S,R^{\sf s}|{\beta})=P(S,R^{\sf s})) \end{aligned}$

Remarks:

conditioning modelling: first used in classic linear regression (Fisher, 1925)
$R^{\sf s}$ : the sample SNP-correlation matrix of (unobserved) genotype $X$ collected in the study cohort
sufficiency theory: motivated by Assumption 2 in Adaptive Shrinkage (stephens999/ash)
$p(S,R^{\sf s}|{\beta})=P(S,R^{\sf s})$ is due to the "Bless of small per-SNP heritability": $s_j^2 = (X_j^{\sf T}X_j)^{-1}\sigma_y^2$

Model: Derivation

Step 2: Specify conditional distribution

Assume a multi-SNP model: ${\bf y}|X,{\beta},\tau \sim {\cal N}(X\beta, \tau^{-1}I_n)$

Invoke the "Bless of small per-SNP heritability": ${\sf E}(\widehat{\beta}|X,{\beta},\tau) = SR^{\sf s}S^{-1}{\beta},~~{\sf Var}(\widehat{\beta}|X,{\beta},\tau) = (\tau\sigma_y^2)^{-1}SR^{\sf s}S$

Under the null of multi-SNP model, i.e. $\beta\equiv 0$ : ${\sigma_y^2}={\sf Var}({\bf y}|\tau)={\sf Var}({\bf y}|X,{\beta\equiv 0},\tau)=\tau^{-1}$

Approximate

$p(\widehat{\beta}|S, R^{\sf s}, {\beta})$ by a multivariate normal distribution:

$\widehat{\beta}|S, R^{\sf s}, {\beta}\sim{\cal N}(SR^{\sf s}S^{-1}\beta, SR^{\sf s}S)$

Model: Derivation

Step 3: Deal with unknown $R^{\sf s}$

High-level description

Use public genotypes (e.g. HapMap, 1000 Genome) to approximate $R^{\sf s}$
"Plug-in" approach: replace the unknown $R^{\sf s}$ with the approximation $R$

Interesting finding

A good "guess" of $R^{\sf s}$ is also a good "guess" of population-level LD.

Practical choice: Wen & Stephens (AOAS, 2010)

shrinkage estimator based on the genetic map and reference sample size
convenient computational property: a sparse banded matrix

Model: Connection with Existing Methods

LD score regression (Nature genetics, 2015) for unstructured sample: ${\sf E}(\chi^2_j)=1+n{h^2}\ell_j/{p},~~\ell_j:=\sum_{k=1}^p r_{jk}^2~(\mbox{LD score of variant}~j)$

Under a polygenic model $\beta\sim{\cal N}(0, (h^2/p)I_p)$ , the marginal likelihood of RSS is given by ${\widehat\beta}\sim{\cal N}(0,~SRS+(nh^2/p)SR^2S)$

Rewrite the marginal likelihood in terms of $z-$ score, $Z_j=\widehat{\beta}_j/s_j$ and ${\bf Z}=S^{-1}{\widehat\beta}$ ${\bf Z} \sim{\cal N}(0, R+(nh^2/p)R^2)$

${\sf E}(\chi_j^2)={\sf E}(Z_j^2)={\sf Var}(Z_j)+{\sf E}^2(Z_j)=1+n{h^2}\ell_j/{p}$

${\sf Var}(\chi_j^2)={\sf Var}(Z_j^2)={\sf E}(Z_j^4)-{\sf E}^2(Z_j^2)=2+2n{h^2}\ell_j/{p}$ Two things that they noticed but did not account for in LD score regression:

$\chi^2$ statistics at SNPs in LD are correlated
the variance of $\chi^2$ statistics increases with LD score

Model: Prior

Our goal:

We observe only the summary statistics $(\widehat{\beta}, S)$ and try to draw posterior inference on $\beta$ given certain types of prior.

WTCCC (Nature, 2007):

"for any given trait, there will be few (if any) large effects, a handful of modest effects and a substantial number of genes generating small or very small increases in disease risk".

Two types of prior:

Bayesian sparse linear mixed model (BSLMM) prior
Adaptive shrinkage (ASH) prior

Model: BSLMM prior

$\beta_j \sim \pi~{\cal N}(0, \sigma_p^2+\sigma_\beta^2) + (1-\pi)~{\cal N}(0, \sigma_p^2)$
$h := (\sigma_p^2+\pi\sigma_\beta^2)\cdot \sum_{j=1}^p({n_js_j^2})^{-1},~~\rho := {\pi\sigma_\beta^2}/({\sigma_p^2+\pi\sigma_\beta^2})$
$\log\pi \sim {\sf Unif}(\log(1/p),\log 1),~~h,~\rho \sim {\sf Unif}(0,1)$

Why do we define $h$ and $\rho$ ?

Let

$V(\cdot)$ be the sample variance operator. Due to the "Bless of small per-SNP heritability",

$\small h=\frac{{\sf E}(V(X{\beta})|\pi,\sigma_p^2,\sigma_\beta^2)}{{\sf E}(V({\bf y}))},~~\rho=\frac{{\sf E}(V(X{\beta})|\pi,\sigma_p^2\equiv 0,\sigma_\beta^2)}{{\sf E}(V(X{\beta})|\pi,\sigma_p^2,\sigma_\beta^2)}$

Model: ASH prior

$\beta_j \sim \sum_{k=1}^K\omega_k~{\cal N}(0,\sigma_k^2)$
$\omega \sim {\sf Dirichlet}(\lambda,...\lambda)$
$\lambda \sim {\sf Uniform}(0, 10)$

Simplify computation: choose large $K$ and fix the values of $\sigma_k^2$

Presumably, "silly" value of

$\sigma_k^2$ will be "kicked out" by data in the end.

Applications

Identifying genetic association

target region: Bayes Factor (BF), Servin & Stephens (PLoS Genetics, 2007) ${\rm H}_0:~\beta\equiv{\bf 0}~\leftrightarrow~{\rm H}_1:~\beta\sim{\cal N}({\bf 0},\sigma^2 I_p),~~~{\sf SBF}=p(\widehat{\beta}|S,R,{\rm H}_1)~/~p(\widehat{\beta}|S,R,{\rm H}_0)$
whole genome: Posterior Inclusion Probability (PIP), Guan & Stephens (AOAS, 2011) ${\sf SPIP}(j)={\sf Pr}(\gamma_j=1|\widehat{\beta},S,R),~{\sf SBF}(j)=\frac{{\sf SPIP}(j)/{\sf Pr}(\gamma_j=1)}{(1-{\sf SPIP}(j))/{\sf Pr}(\gamma_j=0)}$

Estimating "chip heritability"

PVE: proportion of phenotypic variation explained by available SNPs ${\sf PVE}(\beta)={\beta^\prime X^\prime X\beta}/{{\bf y}^\prime{\bf y}}={\sigma_y^2\beta^\prime S^{-1}R^{\sf s}S^{-1}\beta}/({n\widehat{\sigma}_y^2})~\rightarrow~{\sf SPVE}(\beta):= \beta^\prime S^{-1}RS^{-1}\beta/n$
estimate "chip heritability" $h_g^2=\int {\sf SPVE}(\beta)~p(\beta|\widehat{\beta}, S, R){\rm d}\beta$

Issues

Qualitative phenotypes

Introduce genetic liability: ${\sf Pr}(Y=k)=f_k(L),~~f_k:{\mathbb R}\mapsto [0,1]$

Individual-based analysis: phenotype $Y$ observed, liability $L$ unobserved
Summary-based analysis: phenotype $Y$ observed, liability $L$ unobserved
$(Y,X)~\rightarrow~(L,X)~\rightarrow~(\widehat{\beta}, S)$

Partially overlapped samples

Modify correlation matrix: $\widehat{\beta}|S,R,\beta\sim{\cal N}(SRS^{-1}\beta, S^2)$

Summary

John von Neumann (1947):

"Truth is much too complicated to allow anything but approximations."

During Stage 1 of RSS, we try to work out:

How to approximate the sampling distribution of $(\widehat{\beta}, S)$ ?
How to approximate the posterior $p(\beta|\widehat{\beta},S,R)$ ?
How to approximate some quantities of interest (e.g. PVE) ?

Software:

https://github.com/stephenslab/rss (still internal)

MCMC is currently implemented to simulate posterior draws
? Scalable algorithms: variational inference, expectation propagation etc

On-going: Pathway / Functional Annotations

biological pathway enrichment/depletion in association
$\beta_j\sim(1-\pi_j)\delta_0+\pi_j{\cal N}(0,\sigma^2),~~{\sf logit}(\pi_j)=\theta_0+\theta\cdot{\bf 1}\{\mbox{SNP}j~\mbox{is in the pathway}\}$

functional-category-specific enrichment/depletion in heritability
$\beta_j\sim{\cal N}(0,\sigma_j^2),~~\log(\sigma_j^2)=a_0+\sum_{g=1}^G a_g\cdot{\bf 1}\{\mbox{SNP}j~\mbox{belongs to category }g\}$

Carbonetto and Stephens (PLoS Genetics, 2013): analyzed individual-level data of WTCCC

Next Step: Multiple Correlated Traits

Global Lipids (Nature genetics, 2013):

Overlap of 157 lipid-related loci

Association with metabolic & cardiovascular traits

coronary artery disease: 40 loci
body mass index: 32 loci
diastolic blood pressure: 29 loci
waist-hip ratio: 27 loci

Acknowledgement

Thesis Advisor:

Dr. Matthew Stephens @Uchicago

Discussions:

Dr. Xin He @Uchicago, Dr. Yongtao Guan @BCM,

Data in PPS cluster:

Dr. Peter Carbonetto @Ancestry.com

Helpful Feedback:

Current and Previous Members @Stephens Lab

Bayesian modelling of GWAS summary statistics

Data: Available to Everyone

Data: Generation Workflow

Motivation

Statistical Problem

Problem set-up

Exsisting tool: GCTA-COJO

Novelty of our work

Assumptions

Caveat

Verify: Meta == Mega in GWAS ?

Model: Overview

Model: Overview

Mean vector of ˆβ\widehat{\beta}

Covariance matrix of ˆβ\widehat{\beta}

Model: Sanity Check–Moments

Model: Sanity Check–Normality

Model: Derivation

Step 1: Justify conditioning modelling

Model: Derivation

Step 2: Specify conditional distribution

Model: Derivation

Step 3: Deal with unknown R^{\sf s}R^{\sf s}

High-level description

Interesting finding

Practical choice: Wen & Stephens (AOAS, 2010)

Model: Connection with Existing Methods

Model: Prior

Our goal:

WTCCC (Nature, 2007):

Two types of prior:

Model: BSLMM prior

Why do we define hh and \rho\rho?

Model: ASH prior

Applications

Identifying genetic association

Estimating "chip heritability"

Issues

Qualitative phenotypes

Partially overlapped samples

Summary

John von Neumann (1947):

Software:

On-going: Pathway / Functional Annotations

Next Step: Multiple Correlated Traits

Acknowledgement

Thesis Advisor:

Discussions:

Data in PPS cluster:

Helpful Feedback:

Thank you !

Mean vector of $\widehat{\beta}$

Covariance matrix of $\widehat{\beta}$

Step 3: Deal with unknown $R^{\sf s}$

Why do we define $h$ and $\rho$ ?