02/27/2015

Data: Available to Everyone

Global Lipids 2013

GIANT Height 2014

Data: Generation Workflow

Motivation

Why do we want to work with GWAS summary statistics?

Inevitable information loss when you summarize the individual-level data !!!

We want to use individual-level data, but we don't pass the following test :(

  • Are you a member of a large consortium? If yes, move on …
  • Are you allowed to share your cohort with other members? If yes, move on …
  • Does every member conduct the study in the same way? If yes, move on …
  • Is there enough space to store the individual-level data? If yes, move on …
  • Is there enough resources to analyze the merged data? If yes, move on …
  • And you (a very rare, lucky guy) can skip the rest of talk today :)

Statistical Problem

Problem set-up

  • multiple-SNP model: \({\bf y}_{n\times 1}=X_{n\times p}{\beta}_{p\times 1}+\epsilon_{n\times 1}\)
  • standard analysis: observe \((X,{\bf y})\) and infer \(\beta\)
  • single-SNP summary statistics: effect size estimate \(\widehat{\beta}_j\) and standard error \(s_j\)
  • problem: only observe \((\widehat{\beta}_j, s_j)\), how to infer \(\beta\) ?

Exsisting tool: GCTA-COJO

  • approximate COnditional & JOint analysis (Yang et al, Nature genetics, 2012)
  • identified 36 loci with multiple associated variants for height (data: GIANT, 2010)
  • identified 90, 26 and 31 loci containing 2, 3 and >= 4 signals respectively (data: GIANT, 2014)

Novelty of our work

We model the summary statistics, rather than using them to approximate the standard analysis (Yang et al, 2012).

Assumptions

  1. Phenotype adjustment/transformation: \({\bf y}\stackrel{\mbox{a.s.}}\sim{\cal N}({\bf 0}, \sigma_y^2~I_n)\)
  2. Column-centered genotype matrix: \(\overline{X}_j = 0,~~j=1,\ldots, p\)
  3. Multi-SNP model for individual-level data: \({\bf y}|X,\beta,\tau \sim{\cal N}(X\beta, \tau^{-1}I_n)\)
  4. Single-SNP summary statistics are computed from \((X, {\bf y})\) : \[\widehat{\beta}_j:=(X_j^\prime X_j)^{-1}X_j^\prime{\bf y},~~s_j^2:= (nX_j^\prime X_j)^{-1}{({\bf y}-X_j\widehat{\beta}_j)^{\prime}({\bf y}-X_j\widehat{\beta}_j)}\]
  5. "Bless of small per-SNP heritability ": \(s_j^2=(X_j^\prime X_j)^{-1}\sigma_y^2\)

Caveat

  • Assumption 4: most of summary statistics come from GWAS meta-analysis, NOT mega-analysis
  • Assumption 5: small (?) phenotypic variation explained by one SNP \(j\) and all SNPs that SNP \(j\) tags \[{\bf y}^{\prime}{\bf y}\approx ({\bf y}-X_j\widehat{\beta}_j)^{\prime}({\bf y}-X_j\widehat{\beta}_j)~~\Rightarrow~~s_j^2 \approx(nX_j^{\sf T}X_j)^{-1}{\bf y}^{\prime}{\bf y}\stackrel{\text{large}~n}\approx (X_j^{\sf T}X_j)^{-1}\sigma_y^2\]

Verify: Meta == Mega in GWAS ?

  • Olkin and Sampson (Biometrics, 1998): continuous outcome; constant treatment effects and error variances across clinical trials
  • Mathew and Nordstrom (Biometrics, 1999): different error variances across trials
  • Lin and Zeng (Genetic Epidemiology, 2010): empirical illustration of equivalence using real data; binary outcome
  • Lin and Zeng (Biometrika, 2010): theoretical results about the asymptotic equivalence

Lin and Zeng (Genetic Epidemiology, 2010)

Verify "Bless of per-SNP heritability": \(s_j^2 = (X_j^{\sf T}X_j)^{-1}\sigma_y^2\)

For each SNP \(j\), define \(\eta_j={n_js_j^2}/({n_js_j^2 +\hat{\beta}_j^2}).\)

For large sample size, \(\hat{\sigma}^2_y:={\bf y}^{\sf T}{\bf y}/n_j \approx \sigma_y^2\) and thus, \[\eta_j=s_j^2 \cdot \frac{{\bf y}^{\sf T}{\bf y}}{n_js_j^2 +\hat{\beta}_j^2}\cdot \frac{n_j}{{\bf y}^{\sf T}{\bf y}}=\frac{{s_j^2}\cdot {X_j^{\sf T}X_j}}{\hat{\sigma}^2_y}\approx \frac{{s_j^2}\cdot {X_j^{\sf T}X_j}}{{\sigma}^2_y}.\]
If "Bless of per-SNP heritability" holds, each \(\eta_j\) should be approximately 1. Let's look at \(\log_{10}(1-\eta_j)\) in several real GWAS.

Model: Overview

\[\widehat{\beta}|\beta, S, R \sim {\cal N}(SRS^{-1}{\beta}, SRS)\]

  • \(\widehat{\beta}:=(\widehat{\beta}_1,\ldots,\widehat{\beta}_p)^\prime\) , where \(\widehat{\beta}_j\) is the estimated single-SNP effect for SNP \(j\)
  • \(S:={\rm Diag}(s_1,\ldots, s_p)\) , where \(s_j\) is the standard error of \(\widehat{\beta}_j\)
  • \(R\) is an estimated SNP-correlation matrix (i.e. linkage disequilibrium, LD)
  • \(\beta:=(\beta_1,\ldots,\beta_p)^\prime\), where \(\beta_j\) is the multi-SNP effect size for SNP \(j\)
We term this model Regression with Summary Statistics (RSS).

Model: Overview

\[\widehat{\beta}|\beta, S, R \sim {\cal N}(SRS^{-1}{\beta}, SRS)\]

Mean vector of \(\widehat{\beta}\)

\(\widehat{\beta}_j\) includes the effects of all SNPs that it tags \[{\sf E}(\widehat{\beta}_j|\beta, S, R)=s_j\cdot\sum_{i=1}^pR_{ij}s_i^{-1}\beta_i\]

Covariance matrix of \(\widehat{\beta}\)

\(\widehat{\beta}_j\) and \(\widehat{\beta}_k\) are correlated if SNP \(j\) and \(k\) are in LD \[{\sf Cov}(\widehat{\beta}_j,\widehat{\beta}_k|\beta, S, R)=s_js_kR_{jk}\]

Model: Sanity Check–Moments

\(R^{-\frac{1}{2}}S^{-1}(\widehat{\beta}-SRS^{-1}\beta)|\beta, S, R\sim{\cal N}(0,I)\)

Model: Sanity Check–Normality

\(R^{-\frac{1}{2}}S^{-1}(\widehat{\beta}-SRS^{-1}\beta)|\beta, S, R\sim{\cal N}(0,I)\)

Model: Derivation

Step 1: Justify conditioning modelling

\[ \begin{aligned} p({\beta}|{\bf y}, X) &= p({\beta}|\widehat{\beta}, S, R^{\sf s})~~~~\mbox{(sufficiency of GWAS summary statistics)}\\ &\propto p(\widehat{\beta}, S, R^{\sf s}|{\beta}) p({\beta})~~~~\mbox{(Bayes' Theorem)}\\ &= p(\widehat{\beta}|S, R^{\sf s}, {\beta})p(S,R^{\sf s}|{\beta})p({\beta})\\ &\propto p(\widehat{\beta}|S, R^{\sf s}, {\beta})p(\beta)~~~~(\because p(S,R^{\sf s}|{\beta})=P(S,R^{\sf s})) \end{aligned} \]

Remarks:

  • conditioning modelling: first used in classic linear regression (Fisher, 1925)
  • \(R^{\sf s}\): the sample SNP-correlation matrix of (unobserved) genotype \(X\) collected in the study cohort
  • sufficiency theory: motivated by Assumption 2 in Adaptive Shrinkage (stephens999/ash)
  • \(p(S,R^{\sf s}|{\beta})=P(S,R^{\sf s})\) is due to the "Bless of small per-SNP heritability": \(s_j^2 = (X_j^{\sf T}X_j)^{-1}\sigma_y^2\)

Model: Derivation

Step 2: Specify conditional distribution

Assume a multi-SNP model: \({\bf y}|X,{\beta},\tau \sim {\cal N}(X\beta, \tau^{-1}I_n)\)

Invoke the "Bless of small per-SNP heritability": \[{\sf E}(\widehat{\beta}|X,{\beta},\tau) = SR^{\sf s}S^{-1}{\beta},~~{\sf Var}(\widehat{\beta}|X,{\beta},\tau) = (\tau\sigma_y^2)^{-1}SR^{\sf s}S\]

Under the null of multi-SNP model, i.e. \(\beta\equiv 0\) : \[{\sigma_y^2}={\sf Var}({\bf y}|\tau)={\sf Var}({\bf y}|X,{\beta\equiv 0},\tau)=\tau^{-1}\]

Approximate \(p(\widehat{\beta}|S, R^{\sf s}, {\beta})\) by a multivariate normal distribution: \[\widehat{\beta}|S, R^{\sf s}, {\beta}\sim{\cal N}(SR^{\sf s}S^{-1}\beta, SR^{\sf s}S)\]

Model: Derivation

Step 3: Deal with unknown \(R^{\sf s}\)

High-level description

  • Use public genotypes (e.g. HapMap, 1000 Genome) to approximate \(R^{\sf s}\)
  • "Plug-in" approach: replace the unknown \(R^{\sf s}\) with the approximation \(R\)

Interesting finding

A good "guess" of \(R^{\sf s}\) is also a good "guess" of population-level LD.

Practical choice: Wen & Stephens (AOAS, 2010)

  • shrinkage estimator based on the genetic map and reference sample size
  • convenient computational property: a sparse banded matrix

Model: Connection with Existing Methods

LD score regression (Nature genetics, 2015) for unstructured sample: \[{\sf E}(\chi^2_j)=1+n{h^2}\ell_j/{p},~~\ell_j:=\sum_{k=1}^p r_{jk}^2~(\mbox{LD score of variant}~j)\]

Under a polygenic model \(\beta\sim{\cal N}(0, (h^2/p)I_p)\), the marginal likelihood of RSS is given by \[{\widehat\beta}\sim{\cal N}(0,~SRS+(nh^2/p)SR^2S)\]

Rewrite the marginal likelihood in terms of \(z-\)score, \(Z_j=\widehat{\beta}_j/s_j\) and \({\bf Z}=S^{-1}{\widehat\beta}\) \[{\bf Z} \sim{\cal N}(0, R+(nh^2/p)R^2)\]

\[{\sf E}(\chi_j^2)={\sf E}(Z_j^2)={\sf Var}(Z_j)+{\sf E}^2(Z_j)=1+n{h^2}\ell_j/{p}\]

\[{\sf Var}(\chi_j^2)={\sf Var}(Z_j^2)={\sf E}(Z_j^4)-{\sf E}^2(Z_j^2)=2+2n{h^2}\ell_j/{p}\] Two things that they noticed but did not account for in LD score regression:

  • \(\chi^2\) statistics at SNPs in LD are correlated
  • the variance of \(\chi^2\) statistics increases with LD score

Model: Prior

Our goal:

We observe only the summary statistics \((\widehat{\beta}, S)\) and try to draw posterior inference on \(\beta\) given certain types of prior.

WTCCC (Nature, 2007):

"for any given trait, there will be few (if any) large effects, a handful of modest effects and a substantial number of genes generating small or very small increases in disease risk".

Two types of prior:

  • Bayesian sparse linear mixed model (BSLMM) prior
  • Adaptive shrinkage (ASH) prior

Model: BSLMM prior

  • \(\beta_j \sim \pi~{\cal N}(0, \sigma_p^2+\sigma_\beta^2) + (1-\pi)~{\cal N}(0, \sigma_p^2)\)
  • \(h := (\sigma_p^2+\pi\sigma_\beta^2)\cdot \sum_{j=1}^p({n_js_j^2})^{-1},~~\rho := {\pi\sigma_\beta^2}/({\sigma_p^2+\pi\sigma_\beta^2})\)
  • \(\log\pi \sim {\sf Unif}(\log(1/p),\log 1),~~h,~\rho \sim {\sf Unif}(0,1)\)

Why do we define \(h\) and \(\rho\)?

Let \(V(\cdot)\) be the sample variance operator. Due to the "Bless of small per-SNP heritability", \[\small h=\frac{{\sf E}(V(X{\beta})|\pi,\sigma_p^2,\sigma_\beta^2)}{{\sf E}(V({\bf y}))},~~\rho=\frac{{\sf E}(V(X{\beta})|\pi,\sigma_p^2\equiv 0,\sigma_\beta^2)}{{\sf E}(V(X{\beta})|\pi,\sigma_p^2,\sigma_\beta^2)}\]

Model: ASH prior

  • \(\beta_j \sim \sum_{k=1}^K\omega_k~{\cal N}(0,\sigma_k^2)\)
  • \(\omega \sim {\sf Dirichlet}(\lambda,...\lambda)\)
  • \(\lambda \sim {\sf Uniform}(0, 10)\)

Simplify computation: choose large \(K\) and fix the values of \(\sigma_k^2\)

Presumably, "silly" value of \(\sigma_k^2\) will be "kicked out" by data in the end.

Applications

Identifying genetic association

  • target region: Bayes Factor (BF), Servin & Stephens (PLoS Genetics, 2007) \[{\rm H}_0:~\beta\equiv{\bf 0}~\leftrightarrow~{\rm H}_1:~\beta\sim{\cal N}({\bf 0},\sigma^2 I_p),~~~{\sf SBF}=p(\widehat{\beta}|S,R,{\rm H}_1)~/~p(\widehat{\beta}|S,R,{\rm H}_0)\]
  • whole genome: Posterior Inclusion Probability (PIP), Guan & Stephens (AOAS, 2011) \[{\sf SPIP}(j)={\sf Pr}(\gamma_j=1|\widehat{\beta},S,R),~{\sf SBF}(j)=\frac{{\sf SPIP}(j)/{\sf Pr}(\gamma_j=1)}{(1-{\sf SPIP}(j))/{\sf Pr}(\gamma_j=0)}\]

Estimating "chip heritability"

  • PVE: proportion of phenotypic variation explained by available SNPs \[{\sf PVE}(\beta)={\beta^\prime X^\prime X\beta}/{{\bf y}^\prime{\bf y}}={\sigma_y^2\beta^\prime S^{-1}R^{\sf s}S^{-1}\beta}/({n\widehat{\sigma}_y^2})~\rightarrow~{\sf SPVE}(\beta):= \beta^\prime S^{-1}RS^{-1}\beta/n\]
  • estimate "chip heritability" \(h_g^2=\int {\sf SPVE}(\beta)~p(\beta|\widehat{\beta}, S, R){\rm d}\beta\)

Issues

Qualitative phenotypes

Introduce genetic liability: \({\sf Pr}(Y=k)=f_k(L),~~f_k:{\mathbb R}\mapsto [0,1]\)

  • Individual-based analysis: phenotype \(Y\) observed, liability \(L\) unobserved
  • Summary-based analysis: phenotype \(Y\) observed, liability \(L\) unobserved
  • \((Y,X)~\rightarrow~(L,X)~\rightarrow~(\widehat{\beta}, S)\)

Partially overlapped samples

Modify correlation matrix: \(\widehat{\beta}|S,R,\beta\sim{\cal N}(SRS^{-1}\beta, S^2)\)

Summary

John von Neumann (1947):

"Truth is much too complicated to allow anything but approximations."

During Stage 1 of RSS, we try to work out:

  • How to approximate the sampling distribution of \((\widehat{\beta}, S)\) ?
  • How to approximate the posterior \(p(\beta|\widehat{\beta},S,R)\) ?
  • How to approximate some quantities of interest (e.g. PVE) ?

Software:

https://github.com/stephenslab/rss (still internal)

  • MCMC is currently implemented to simulate posterior draws
  • ? Scalable algorithms: variational inference, expectation propagation etc

On-going: Pathway / Functional Annotations

  • biological pathway enrichment/depletion in association
    \[\beta_j\sim(1-\pi_j)\delta_0+\pi_j{\cal N}(0,\sigma^2),~~{\sf logit}(\pi_j)=\theta_0+\theta\cdot{\bf 1}\{\mbox{SNP}j~\mbox{is in the pathway}\}\]
  • functional-category-specific enrichment/depletion in heritability
    \[\beta_j\sim{\cal N}(0,\sigma_j^2),~~\log(\sigma_j^2)=a_0+\sum_{g=1}^G a_g\cdot{\bf 1}\{\mbox{SNP}j~\mbox{belongs to category }g\}\]

Carbonetto and Stephens (PLoS Genetics, 2013): analyzed individual-level data of WTCCC

Next Step: Multiple Correlated Traits

Global Lipids (Nature genetics, 2013):

Overlap of 157 lipid-related loci

Association with metabolic & cardiovascular traits

  • coronary artery disease: 40 loci
  • body mass index: 32 loci
  • diastolic blood pressure: 29 loci
  • waist-hip ratio: 27 loci

Acknowledgement

Thesis Advisor:

Dr. Matthew Stephens @Uchicago

Discussions:

Dr. Xin He @Uchicago, Dr. Yongtao Guan @BCM,

Data in PPS cluster:

Dr. Peter Carbonetto @Ancestry.com

Helpful Feedback:

Current and Previous Members @Stephens Lab

Thank you !