Global Lipids 2013
GIANT Height 2014
Xiang Zhu @ Stephens Lab
02/27/2015
Why do we want to work with GWAS summary statistics?
Inevitable information loss when you summarize the individual-level data !!!
We want to use individual-level data, but we don't pass the following test :(
Lin and Zeng (Genetic Epidemiology, 2010)
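For each SNP j, define \eta_j = n_j s_j^2/(n_j s_j^2+\widehat{\beta}_j^2).
For large sample size, \widehat{\sigma}_y^2 := {\bf y}^T{\bf y}/n_j \approx \sigma_y^2 and thus \eta_j = s_j^2\cdot\frac{{\bf y}^T{\bf y}}{n_j s_j^2+\widehat{\beta}_j^2}\cdot\frac{n_j}{{\bf y}^T{\bf y}} = s_j^2\cdot\frac{X_j^TX_j}{\widehat{\sigma}_y^2} \approx s_j^2\cdot\frac{X_j^TX_j}{\sigma_y^2}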
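The identity behind \eta_j can be checked numerically. The sketch below is only an illustration under assumptions the slide does not state: a single-SNP regression without intercept, and s_j^2 defined as the residual sum of squares divided by n_j X_j^TX_j. The function name and the toy data are hypothetical.

```python
def single_snp_regression(x, y):
    """Simple linear regression of y on x without intercept.

    Returns (beta_hat, s2), where s2 = RSS / (n * x'x) is an assumed
    definition of the squared standard error (denominator n, not n - 2).
    """
    n = len(x)
    xtx = sum(v * v for v in x)
    xty = sum(a * b for a, b in zip(x, y))
    beta_hat = xty / xtx
    rss = sum((b - a * beta_hat) ** 2 for a, b in zip(x, y))
    return beta_hat, rss / (n * xtx)


# Toy genotype and phenotype vectors (made up for illustration).
x = [1.0, -1.0, 2.0, 0.0, -2.0]
y = [0.5, -1.0, 1.5, 0.2, -0.7]
n = len(x)

beta_hat, s2 = single_snp_regression(x, y)
xtx = sum(v * v for v in x)
sigma2_y_hat = sum(v * v for v in y) / n

# Left-hand side: the definition of eta_j.
eta_def = n * s2 / (n * s2 + beta_hat ** 2)
# Right-hand side: s_j^2 * X_j'X_j / sigma_y^2-hat.
eta_alt = s2 * xtx / sigma2_y_hat
```

Under these assumptions the two expressions agree up to rounding, because n_j s_j^2 + \widehat{\beta}_j^2 = {\bf y}^T{\bf y}/X_j^TX_j.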
\widehat{\beta}|\beta,S,R \sim {\cal N}(SRS^{-1}\beta, SRS)
\widehat{\beta}_j includes the effects of all SNPs that it tags: {\sf E}(\widehat{\beta}_j|\beta,S,R) = s_j\cdot\sum_{i=1}^p R_{ij}s_i^{-1}\beta_i
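A minimal sketch of the mean and covariance of this likelihood, written with plain Python lists for a toy two-SNP example. The function names and the numbers are hypothetical; S is passed as the vector of its diagonal entries s_j.

```python
def rss_mean(beta, s, R):
    """Mean S R S^{-1} beta: entry j is s_j * sum_i R_ij * beta_i / s_i."""
    p = len(beta)
    return [s[j] * sum(R[i][j] * beta[i] / s[i] for i in range(p))
            for j in range(p)]


def rss_cov(s, R):
    """Covariance S R S: entry (i, j) is s_i * R_ij * s_j."""
    p = len(s)
    return [[s[i] * R[i][j] * s[j] for j in range(p)] for i in range(p)]


# Two SNPs in LD (r = 0.5); only the first has a nonzero true effect.
R = [[1.0, 0.5],
     [0.5, 1.0]]
s = [0.1, 0.2]
beta = [1.0, 0.0]

mean = rss_mean(beta, s, R)
cov = rss_cov(s, R)
```

In this toy case the second SNP has no effect of its own, yet its expected single-SNP estimate is nonzero because it tags the first SNP.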
Remarks:
stephens999/ash
Assume a multi-SNP model: {\bf y}|X,{\beta},\tau \sim {\cal N}(X\beta, \tau^{-1}I_n)
Invoke the "blessing of small per-SNP heritability": {\sf E}(\widehat{\beta}|X,{\beta},\tau) = SR^{\sf s}S^{-1}{\beta},~~{\sf Var}(\widehat{\beta}|X,{\beta},\tau) = (\tau\sigma_y^2)^{-1}SR^{\sf s}S
Under the null of multi-SNP model, i.e. \beta\equiv 0 : {\sigma_y^2}={\sf Var}({\bf y}|\tau)={\sf Var}({\bf y}|X,{\beta\equiv 0},\tau)=\tau^{-1}
Approximate p(\widehat{\beta}|S, R^{\sf s}, {\beta}) by a multivariate normal distribution: \widehat{\beta}|S, R^{\sf s}, {\beta}\sim{\cal N}(SR^{\sf s}S^{-1}\beta, SR^{\sf s}S)
A good "guess" of R^{\sf s} is also a good "guess" of population-level LD.
LD score regression (Nature Genetics, 2015) for an unstructured sample: {\sf E}(\chi^2_j)=1+n{h^2}\ell_j/{p},~~\ell_j:=\sum_{k=1}^p r_{jk}^2~(\mbox{LD score of variant}~j)
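As a small illustration of the formula, the sketch below computes LD scores from a toy correlation matrix and plugs them into the expected chi-square. The function names, the matrix, and the values of n and h^2 are made up for the example.

```python
def ld_scores(R):
    """LD score of variant j: sum over k of r_jk^2 (row sums of squares)."""
    return [sum(r * r for r in row) for row in R]


def expected_chisq(R, n, h2):
    """Expected chi-square statistic 1 + n * h2 * l_j / p for each variant."""
    p = len(R)
    return [1.0 + n * h2 * l / p for l in ld_scores(R)]


# Three variants: the first two are in LD (r = 0.5), the third is independent.
R = [[1.0, 0.5, 0.0],
     [0.5, 1.0, 0.0],
     [0.0, 0.0, 1.0]]

scores = ld_scores(R)
chisq = expected_chisq(R, n=1000, h2=0.3)
```

Variants with higher LD scores tag more of the polygenic signal, so their expected chi-square statistics are larger.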
Under a polygenic model \beta\sim{\cal N}(0, (h^2/p)I_p), the marginal likelihood of RSS is given by {\widehat\beta}\sim{\cal N}(0,~SRS+(nh^2/p)SR^2S)
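One way to see where this covariance comes from is the law of total variance applied to the RSS likelihood. The last step below relies on an assumption that the slide does not state explicitly: s_j^2 \approx 1/n for every SNP (for example, standardized phenotype and genotypes with small per-SNP effects), so that S^{-2}\approx nI_p.

```latex
\begin{aligned}
{\sf Var}(\widehat{\beta})
  &= {\sf E}\{{\sf Var}(\widehat{\beta}|\beta)\}
   + {\sf Var}\{{\sf E}(\widehat{\beta}|\beta)\} \\
  &= SRS + (SRS^{-1})\,(h^2/p)I_p\,(SRS^{-1})^T \\
  &= SRS + (h^2/p)\,SRS^{-2}RS \\
  &\approx SRS + (nh^2/p)\,SR^2S
   \qquad (\mbox{assuming}~S^{-2}\approx nI_p)
\end{aligned}
```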
Rewrite the marginal likelihood in terms of z-scores, Z_j=\widehat{\beta}_j/s_j and {\bf Z}=S^{-1}{\widehat\beta}: {\bf Z} \sim{\cal N}(0, R+(nh^2/p)R^2)
{\sf E}(\chi_j^2)={\sf E}(Z_j^2)={\sf Var}(Z_j)+{\sf E}^2(Z_j)=1+n{h^2}\ell_j/{p}
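The link between the z-score covariance and the LD score can be checked numerically. The sketch below builds R+(nh^2/p)R^2 for a toy matrix and compares its diagonal with 1+nh^2\ell_j/p. The helper names and the numbers are hypothetical.

```python
def matmul(A, B):
    """Product of two square matrices stored as lists of rows."""
    p = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(p)) for j in range(p)]
            for i in range(p)]


def z_covariance(R, n, h2):
    """Marginal covariance of the z-scores: R + (n * h2 / p) * R^2."""
    p = len(R)
    c = n * h2 / p
    R2 = matmul(R, R)
    return [[R[i][j] + c * R2[i][j] for j in range(p)] for i in range(p)]


R = [[1.0, 0.5, 0.0],
     [0.5, 1.0, 0.0],
     [0.0, 0.0, 1.0]]
n, h2, p = 1000, 0.3, 3

cov_z = z_covariance(R, n, h2)
diag = [cov_z[j][j] for j in range(p)]
# The same quantity via LD scores: 1 + n * h2 * l_j / p.
via_ld_score = [1.0 + n * h2 * sum(r * r for r in R[j]) / p for j in range(p)]
```

The diagonal of R^2 is exactly the vector of LD scores because R is symmetric, which is why the two computations agree. The off-diagonal entries show that the z-scores of SNPs in LD are correlated.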
{\sf Var}(\chi_j^2)={\sf Var}(Z_j^2)={\sf E}(Z_j^4)-{\sf E}^2(Z_j^2)=2(1+n{h^2}\ell_j/{p})^2
Two things that they noticed but did not account for in LD score regression:
We observe only the summary statistics (\widehat{\beta}, S) and try to draw posterior inference on \beta given certain types of prior.
"for any given trait, there will be few (if any) large effects, a handful of modest effects and a substantial number of genes generating small or very small increases in disease risk".
Simplify computation: choose a large K and fix the values of \sigma_k^2
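The fixed-grid idea can be sketched in a much simpler setting than RSS. The code below assumes an ash-style mixture-of-normals prior, \beta_j\sim\sum_k\pi_k{\cal N}(0,\sigma_k^2), and independent SNPs (R=I), so that marginally \widehat{\beta}_j\sim\sum_k\pi_k{\cal N}(0,\sigma_k^2+s_j^2). Both assumptions, the EM updates, the function names, and the toy data are illustrative only and are not the fitting procedure used in RSS.

```python
import math
from statistics import NormalDist


def normal_pdf(x, var):
    """Density of N(0, var) at x."""
    return math.exp(-0.5 * x * x / var) / math.sqrt(2.0 * math.pi * var)


def em_mixture_weights(beta_hat, s, sigma2_grid, n_iter=200):
    """EM for the mixture weights pi_k with the variance grid held fixed."""
    K = len(sigma2_grid)
    pi = [1.0 / K] * K
    # Marginal density of each estimate under each grid component.
    dens = [[normal_pdf(b, s2 + se * se) for s2 in sigma2_grid]
            for b, se in zip(beta_hat, s)]
    for _ in range(n_iter):
        totals = [0.0] * K
        for row in dens:
            mix = sum(w * d for w, d in zip(pi, row))
            for k in range(K):
                totals[k] += pi[k] * row[k] / mix  # responsibility of k
        pi = [t / len(dens) for t in totals]
    return pi


# Toy "null" data: evenly spaced N(0, 1) quantiles, standard errors all 1.
n = 200
beta_hat = [NormalDist().inv_cdf((i + 0.5) / n) for i in range(n)]
s = [1.0] * n
grid = [0.0, 1.0, 100.0]  # the last value is deliberately far too large

weights = em_mixture_weights(beta_hat, s, grid)
```

In this toy example the weight on the oversized variance shrinks toward zero over the EM iterations, while the grid itself is never updated.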
Presumably, "silly" values of \sigma_k^2 will be "kicked out" by the data in the end.
Introduce genetic liability: {\sf Pr}(Y=k)=f_k(L),~~f_k:{\mathbb R}\to [0,1]
Modify correlation matrix: \widehat{\beta}|S,R,\beta\sim{\cal N}(SRS^{-1}\beta, S^2)
"Truth is much too complicated to allow anything but approximations."
During Stage 1 of RSS, we try to work out:
https://github.com/stephenslab/rss (still internal)
Carbonetto and Stephens (PLoS Genetics, 2013): analyzed individual-level data of WTCCC
Global Lipids (Nature Genetics, 2013):
Overlap of 157 lipid-related loci
Association with metabolic & cardiovascular traits
Dr. Matthew Stephens @Uchicago
Dr. Xin He @Uchicago, Dr. Yongtao Guan @BCM,
Dr. Peter Carbonetto @Ancestry.com
Current and Previous Members @Stephens Lab