Processing math: 44%

Bayesian modelling of GWAS summary statistics

Xiang Zhu @ Stephens Lab

02/27/2015

Data: Available to Everyone

Global Lipids 2013

GIANT Height 2014

Data: Generation Workflow

Motivation

Why do we want to work with GWAS summary statistics?

Inevitable information loss when you summarize the individual-level data !!!

We want to use individual-level data, but we don't pass the following test :(

  • Are you a member of a large consortium? If yes, move on …
  • Are you allowed to share your cohort with other members? If yes, move on …
  • Does every member conduct the study in the same way? If yes, move on …
  • Is there enough space to store the individual-level data? If yes, move on …
  • Is there enough resources to analyze the merged data? If yes, move on …
  • And you (a very rare, lucky guy) can skip the rest of talk today :)

Statistical Problem

Problem set-up

  • multiple-SNP model: yn×1=Xn×pβp×1+ϵn×1
  • standard analysis: observe (X,y) and infer β
  • single-SNP summary statistics: effect size estimate ˆβj and standard error sj
  • problem: only observe (ˆβj,sj), how to infer β ?

Exsisting tool: GCTA-COJO

  • approximate COnditional & JOint analysis (Yang et al, Nature genetics, 2012)
  • identified 36 loci with multiple associated variants for height (data: GIANT, 2010)
  • identified 90, 26 and 31 loci containing 2, 3 and >= 4 signals respectively (data: GIANT, 2014)

Novelty of our work

We model the summary statistics, rather than using them to approximate the standard analysis (Yang et al, 2012).

Assumptions

  1. Phenotype adjustment/transformation: ya.s.N(0,σ2y In)
  2. Column-centered genotype matrix: ¯Xj=0,  j=1,,p
  3. Multi-SNP model for individual-level data: y|X,β,τN(Xβ,τ1In)
  4. Single-SNP summary statistics are computed from (X,y) : ˆβj:=(XjXj)1Xjy,  s2j:=(nXjXj)1(yXjˆβj)(yXjˆβj)
  5. "Bless of small per-SNP heritability ": s2j=(XjXj)1σ2y

Caveat

  • Assumption 4: most of summary statistics come from GWAS meta-analysis, NOT mega-analysis
  • Assumption 5: small (?) phenotypic variation explained by one SNP j and all SNPs that SNP j tags yy(yXjˆβj)(yXjˆβj)    s2j(nXTjXj)1yylarge n(XTjXj)1σ2y

Verify: Meta == Mega in GWAS ?

  • Olkin and Sampson (Biometrics, 1998): continuous outcome; constant treatment effects and error variances across clinical trials
  • Mathew and Nordstrom (Biometrics, 1999): different error variances across trials
  • Lin and Zeng (Genetic Epidemiology, 2010): empirical illustration of equivalence using real data; binary outcome
  • Lin and Zeng (Biometrika, 2010): theoretical results about the asymptotic equivalence

Lin and Zeng (Genetic Epidemiology, 2010)

Verify "Bless of per-SNP heritability": s2j=(XTjXj)1σ2y

For each SNP j, define ηj=njs2j/(njs2j+ˆβ2j).

For large sample size, ˆσ2y:=yTy/njσ2y and thus, ηj=s2jyTynjs2j+ˆβ2jnjyTy=s2jXTjXjˆσ2ys2jXTjXjσ2y.
If "Bless of per-SNP heritability" holds, each ηj should be approximately 1. Let's look at log10(1ηj) in several real GWAS.

Model: Overview

ˆβ|β,S,RN(SRS1β,SRS)

  • ˆβ:=(ˆβ1,,ˆβp) , where ˆβj is the estimated single-SNP effect for SNP j
  • S:=Diag(s1,,sp) , where sj is the standard error of ˆβj
  • R is an estimated SNP-correlation matrix (i.e. linkage disequilibrium, LD)
  • β:=(β1,,βp), where βj is the multi-SNP effect size for SNP j
We term this model Regression with Summary Statistics (RSS).

Model: Overview

ˆβ|β,S,RN(SRS1β,SRS)

Mean vector of ˆβ

ˆβj includes the effects of all SNPs that it tags E(ˆβj|β,S,R)=sjpi=1Rijs1iβi

Covariance matrix of ˆβ

ˆβj and ˆβk are correlated if SNP j and k are in LD Cov(ˆβj,ˆβk|β,S,R)=sjskRjk

Model: Sanity Check–Moments

R12S1(ˆβSRS1β)|β,S,RN(0,I)

Model: Sanity Check–Normality

R12S1(ˆβSRS1β)|β,S,RN(0,I)

Model: Derivation

Step 1: Justify conditioning modelling

p(β|y,X)=p(β|ˆβ,S,Rs)    (sufficiency of GWAS summary statistics)p(ˆβ,S,Rs|β)p(β)    (Bayes' Theorem)=p(ˆβ|S,Rs,β)p(S,Rs|β)p(β)p(ˆβ|S,Rs,β)p(β)    (

Remarks:

  • conditioning modelling: first used in classic linear regression (Fisher, 1925)
  • R^{\sf s}: the sample SNP-correlation matrix of (unobserved) genotype X collected in the study cohort
  • sufficiency theory: motivated by Assumption 2 in Adaptive Shrinkage (stephens999/ash)
  • p(S,R^{\sf s}|{\beta})=P(S,R^{\sf s}) is due to the "Bless of small per-SNP heritability": s_j^2 = (X_j^{\sf T}X_j)^{-1}\sigma_y^2

Model: Derivation

Step 2: Specify conditional distribution

Assume a multi-SNP model: {\bf y}|X,{\beta},\tau \sim {\cal N}(X\beta, \tau^{-1}I_n)

Invoke the "Bless of small per-SNP heritability": {\sf E}(\widehat{\beta}|X,{\beta},\tau) = SR^{\sf s}S^{-1}{\beta},~~{\sf Var}(\widehat{\beta}|X,{\beta},\tau) = (\tau\sigma_y^2)^{-1}SR^{\sf s}S

Under the null of multi-SNP model, i.e. \beta\equiv 0 : {\sigma_y^2}={\sf Var}({\bf y}|\tau)={\sf Var}({\bf y}|X,{\beta\equiv 0},\tau)=\tau^{-1}

Approximate p(\widehat{\beta}|S, R^{\sf s}, {\beta}) by a multivariate normal distribution: \widehat{\beta}|S, R^{\sf s}, {\beta}\sim{\cal N}(SR^{\sf s}S^{-1}\beta, SR^{\sf s}S)

Model: Derivation

Step 3: Deal with unknown R^{\sf s}

High-level description

  • Use public genotypes (e.g. HapMap, 1000 Genome) to approximate R^{\sf s}
  • "Plug-in" approach: replace the unknown R^{\sf s} with the approximation R

Interesting finding

A good "guess" of R^{\sf s} is also a good "guess" of population-level LD.

Practical choice: Wen & Stephens (AOAS, 2010)

  • shrinkage estimator based on the genetic map and reference sample size
  • convenient computational property: a sparse banded matrix

Model: Connection with Existing Methods

LD score regression (Nature genetics, 2015) for unstructured sample: {\sf E}(\chi^2_j)=1+n{h^2}\ell_j/{p},~~\ell_j:=\sum_{k=1}^p r_{jk}^2~(\mbox{LD score of variant}~j)

Under a polygenic model \beta\sim{\cal N}(0, (h^2/p)I_p), the marginal likelihood of RSS is given by {\widehat\beta}\sim{\cal N}(0,~SRS+(nh^2/p)SR^2S)

Rewrite the marginal likelihood in terms of z-score, Z_j=\widehat{\beta}_j/s_j and {\bf Z}=S^{-1}{\widehat\beta} {\bf Z} \sim{\cal N}(0, R+(nh^2/p)R^2)

{\sf E}(\chi_j^2)={\sf E}(Z_j^2)={\sf Var}(Z_j)+{\sf E}^2(Z_j)=1+n{h^2}\ell_j/{p}

{\sf Var}(\chi_j^2)={\sf Var}(Z_j^2)={\sf E}(Z_j^4)-{\sf E}^2(Z_j^2)=2+2n{h^2}\ell_j/{p} Two things that they noticed but did not account for in LD score regression:

  • \chi^2 statistics at SNPs in LD are correlated
  • the variance of \chi^2 statistics increases with LD score

Model: Prior

Our goal:

We observe only the summary statistics (\widehat{\beta}, S) and try to draw posterior inference on \beta given certain types of prior.

WTCCC (Nature, 2007):

"for any given trait, there will be few (if any) large effects, a handful of modest effects and a substantial number of genes generating small or very small increases in disease risk".

Two types of prior:

  • Bayesian sparse linear mixed model (BSLMM) prior
  • Adaptive shrinkage (ASH) prior

Model: BSLMM prior

  • \beta_j \sim \pi~{\cal N}(0, \sigma_p^2+\sigma_\beta^2) + (1-\pi)~{\cal N}(0, \sigma_p^2)
  • h := (\sigma_p^2+\pi\sigma_\beta^2)\cdot \sum_{j=1}^p({n_js_j^2})^{-1},~~\rho := {\pi\sigma_\beta^2}/({\sigma_p^2+\pi\sigma_\beta^2})
  • \log\pi \sim {\sf Unif}(\log(1/p),\log 1),~~h,~\rho \sim {\sf Unif}(0,1)

Why do we define h and \rho?

Let V(\cdot) be the sample variance operator. Due to the "Bless of small per-SNP heritability", \small h=\frac{{\sf E}(V(X{\beta})|\pi,\sigma_p^2,\sigma_\beta^2)}{{\sf E}(V({\bf y}))},~~\rho=\frac{{\sf E}(V(X{\beta})|\pi,\sigma_p^2\equiv 0,\sigma_\beta^2)}{{\sf E}(V(X{\beta})|\pi,\sigma_p^2,\sigma_\beta^2)}

Model: ASH prior

  • \beta_j \sim \sum_{k=1}^K\omega_k~{\cal N}(0,\sigma_k^2)
  • \omega \sim {\sf Dirichlet}(\lambda,...\lambda)
  • \lambda \sim {\sf Uniform}(0, 10)

Simplify computation: choose large K and fix the values of \sigma_k^2

Presumably, "silly" value of \sigma_k^2 will be "kicked out" by data in the end.

Applications

Identifying genetic association

  • target region: Bayes Factor (BF), Servin & Stephens (PLoS Genetics, 2007) {\rm H}_0:~\beta\equiv{\bf 0}~\leftrightarrow~{\rm H}_1:~\beta\sim{\cal N}({\bf 0},\sigma^2 I_p),~~~{\sf SBF}=p(\widehat{\beta}|S,R,{\rm H}_1)~/~p(\widehat{\beta}|S,R,{\rm H}_0)
  • whole genome: Posterior Inclusion Probability (PIP), Guan & Stephens (AOAS, 2011) {\sf SPIP}(j)={\sf Pr}(\gamma_j=1|\widehat{\beta},S,R),~{\sf SBF}(j)=\frac{{\sf SPIP}(j)/{\sf Pr}(\gamma_j=1)}{(1-{\sf SPIP}(j))/{\sf Pr}(\gamma_j=0)}

Estimating "chip heritability"

  • PVE: proportion of phenotypic variation explained by available SNPs {\sf PVE}(\beta)={\beta^\prime X^\prime X\beta}/{{\bf y}^\prime{\bf y}}={\sigma_y^2\beta^\prime S^{-1}R^{\sf s}S^{-1}\beta}/({n\widehat{\sigma}_y^2})~\rightarrow~{\sf SPVE}(\beta):= \beta^\prime S^{-1}RS^{-1}\beta/n
  • estimate "chip heritability" h_g^2=\int {\sf SPVE}(\beta)~p(\beta|\widehat{\beta}, S, R){\rm d}\beta

Issues

Qualitative phenotypes

Introduce genetic liability: {\sf Pr}(Y=k)=f_k(L),~~f_k:{\mathbb R}\mapsto [0,1]

  • Individual-based analysis: phenotype Y observed, liability L unobserved
  • Summary-based analysis: phenotype Y observed, liability L unobserved
  • (Y,X)~\rightarrow~(L,X)~\rightarrow~(\widehat{\beta}, S)

Partially overlapped samples

Modify correlation matrix: \widehat{\beta}|S,R,\beta\sim{\cal N}(SRS^{-1}\beta, S^2)

Summary

John von Neumann (1947):

"Truth is much too complicated to allow anything but approximations."

During Stage 1 of RSS, we try to work out:

  • How to approximate the sampling distribution of (\widehat{\beta}, S) ?
  • How to approximate the posterior p(\beta|\widehat{\beta},S,R) ?
  • How to approximate some quantities of interest (e.g. PVE) ?

Software:

https://github.com/stephenslab/rss (still internal)

  • MCMC is currently implemented to simulate posterior draws
  • ? Scalable algorithms: variational inference, expectation propagation etc

On-going: Pathway / Functional Annotations

  • biological pathway enrichment/depletion in association
    \beta_j\sim(1-\pi_j)\delta_0+\pi_j{\cal N}(0,\sigma^2),~~{\sf logit}(\pi_j)=\theta_0+\theta\cdot{\bf 1}\{\mbox{SNP}j~\mbox{is in the pathway}\}
  • functional-category-specific enrichment/depletion in heritability
    \beta_j\sim{\cal N}(0,\sigma_j^2),~~\log(\sigma_j^2)=a_0+\sum_{g=1}^G a_g\cdot{\bf 1}\{\mbox{SNP}j~\mbox{belongs to category }g\}

Carbonetto and Stephens (PLoS Genetics, 2013): analyzed individual-level data of WTCCC

Next Step: Multiple Correlated Traits

Global Lipids (Nature genetics, 2013):

Overlap of 157 lipid-related loci

Association with metabolic & cardiovascular traits

  • coronary artery disease: 40 loci
  • body mass index: 32 loci
  • diastolic blood pressure: 29 loci
  • waist-hip ratio: 27 loci

Acknowledgement

Thesis Advisor:

Dr. Matthew Stephens @Uchicago

Discussions:

Dr. Xin He @Uchicago, Dr. Yongtao Guan @BCM,

Data in PPS cluster:

Dr. Peter Carbonetto @Ancestry.com

Helpful Feedback:

Current and Previous Members @Stephens Lab

Thank you !