Global Lipids 2013
GIANT Height 2014
02/27/2015
Why do we want to work with GWAS summary statistics?
Summarizing individual-level data inevitably loses information!
We want to use individual-level data, but we don't pass the following test :(
Lin and Zeng (Genetic Epidemiology, 2010)
For each SNP \(j\), define \(\eta_j={n_js_j^2}/({n_js_j^2 +\hat{\beta}_j^2}).\)
For large sample size, \(\hat{\sigma}^2_y:={\bf y}^{\sf T}{\bf y}/n_j \approx \sigma_y^2\) and thus, \[\eta_j=s_j^2 \cdot \frac{{\bf y}^{\sf T}{\bf y}}{n_js_j^2 +\hat{\beta}_j^2}\cdot \frac{n_j}{{\bf y}^{\sf T}{\bf y}}=\frac{{s_j^2}\cdot {X_j^{\sf T}X_j}}{\hat{\sigma}^2_y}\approx \frac{{s_j^2}\cdot {X_j^{\sf T}X_j}}{{\sigma}^2_y}.\]
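As a quick illustration of the definition of \(\eta_j\) above, the sketch below evaluates it from per-SNP summary statistics only. The function name `eta` and the numbers are hypothetical, chosen purely for illustration; they do not come from any of the studies cited here.

```python
def eta(beta_hat, se, n):
    """Return eta_j = n * s_j^2 / (n * s_j^2 + beta_hat_j^2)."""
    ns2 = n * se ** 2
    return ns2 / (ns2 + beta_hat ** 2)


# Made-up summary statistics for two SNPs with the same standard error.
n = 10_000
weak = eta(beta_hat=0.01, se=0.01, n=n)    # small estimated effect
strong = eta(beta_hat=0.10, se=0.01, n=n)  # larger estimated effect

print(weak, strong)
```

By construction \(\eta_j\) lies in \((0, 1]\) and moves away from 1 as \(\hat{\beta}_j^2\) grows relative to \(n_js_j^2\).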
\[\widehat{\beta}|\beta, S, R \sim {\cal N}(SRS^{-1}{\beta}, SRS)\]
\(\widehat{\beta}_j\) includes the effects of all SNPs that it tags \[{\sf E}(\widehat{\beta}_j|\beta, S, R)=s_j\cdot\sum_{i=1}^pR_{ij}s_i^{-1}\beta_i\]
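To make the tagging effect concrete, here is a minimal sketch that evaluates \(s_j\sum_i R_{ij}s_i^{-1}\beta_i\) for a toy three-SNP example. The helper `expected_beta_hat` and all numeric values are assumptions made for illustration, not part of any released software.

```python
def expected_beta_hat(beta, s, R):
    """E(beta_hat_j | beta, S, R) = s_j * sum_i R[i][j] * beta_i / s_i."""
    p = len(beta)
    return [
        s[j] * sum(R[i][j] * beta[i] / s[i] for i in range(p))
        for j in range(p)
    ]


# Toy LD matrix: SNPs 0 and 1 are correlated (r = 0.8); SNP 2 is independent.
R = [
    [1.0, 0.8, 0.0],
    [0.8, 1.0, 0.0],
    [0.0, 0.0, 1.0],
]
s = [0.01, 0.01, 0.01]   # hypothetical standard errors
beta = [0.05, 0.0, 0.0]  # only SNP 0 has a nonzero multi-SNP effect

mean = expected_beta_hat(beta, s, R)
print(mean)
```

In this toy setting SNP 1 has no effect of its own, yet its expected single-SNP estimate is nonzero because it tags SNP 0; the independent SNP 2 stays at zero.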
Remarks:
stephens999/ash
Assume a multi-SNP model: \({\bf y}|X,{\beta},\tau \sim {\cal N}(X\beta, \tau^{-1}I_n)\)
Invoke the "blessing of small per-SNP heritability": \[{\sf E}(\widehat{\beta}|X,{\beta},\tau) = SR^{\sf s}S^{-1}{\beta},~~{\sf Var}(\widehat{\beta}|X,{\beta},\tau) = (\tau\sigma_y^2)^{-1}SR^{\sf s}S\]
Under the null of the multi-SNP model, i.e. \(\beta\equiv 0\): \[{\sigma_y^2}={\sf Var}({\bf y}|\tau)={\sf Var}({\bf y}|X,{\beta\equiv 0},\tau)=\tau^{-1}\]
Approximate \(p(\widehat{\beta}|S, R^{\sf s}, {\beta})\) by a multivariate normal distribution: \[\widehat{\beta}|S, R^{\sf s}, {\beta}\sim{\cal N}(SR^{\sf s}S^{-1}\beta, SR^{\sf s}S)\]
A good "guess" of \(R^{\sf s}\) is also a good "guess" of population-level LD.
LD score regression (Nature genetics, 2015) for an unstructured sample: \[{\sf E}(\chi^2_j)=1+n{h^2}\ell_j/{p},~~\ell_j:=\sum_{k=1}^p r_{jk}^2~(\mbox{LD score of variant}~j)\]
Under a polygenic model \(\beta\sim{\cal N}(0, (h^2/p)I_p)\), the marginal likelihood of RSS is given by \[{\widehat\beta}\sim{\cal N}(0,~SRS+(nh^2/p)SR^2S)\]
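A sketch of the intermediate step, under the added assumption (not stated on the slide) that \(s_j^{-2}\approx n\) for every SNP, i.e. \(S^{-2}\approx nI_p\): combine the RSS likelihood \(\widehat{\beta}|\beta\sim{\cal N}(SRS^{-1}\beta, SRS)\) with the prior and apply the law of total variance.
\[{\sf E}(\widehat{\beta})=SRS^{-1}\,{\sf E}(\beta)=0\]
\[{\sf Var}(\widehat{\beta})={\sf E}\{{\sf Var}(\widehat{\beta}|\beta)\}+{\sf Var}\{{\sf E}(\widehat{\beta}|\beta)\}=SRS+(h^2/p)\,SRS^{-2}RS\approx SRS+(nh^2/p)\,SR^2S\]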
Rewrite the marginal likelihood in terms of \(z\)-scores, \(Z_j=\widehat{\beta}_j/s_j\) and \({\bf Z}=S^{-1}{\widehat\beta}\): \[{\bf Z} \sim{\cal N}(0, R+(nh^2/p)R^2)\]
\[{\sf E}(\chi_j^2)={\sf E}(Z_j^2)={\sf Var}(Z_j)+{\sf E}^2(Z_j)=1+n{h^2}\ell_j/{p}\]
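The identity above can be checked numerically: the \(j\)-th diagonal entry of \(R+(nh^2/p)R^2\) equals \(1+nh^2\ell_j/p\), because \((R^2)_{jj}=\sum_k r_{jk}^2=\ell_j\) for a symmetric \(R\). The sketch below uses a made-up \(3\times 3\) LD matrix and hypothetical values of \(n\) and \(h^2\); `matmul` is a small helper written for this example.

```python
def matmul(A, B):
    """Plain-Python product of two square matrices given as lists of rows."""
    p = len(A)
    return [
        [sum(A[i][k] * B[k][j] for k in range(p)) for j in range(p)]
        for i in range(p)
    ]


# Made-up symmetric LD (correlation) matrix for three variants.
R = [
    [1.0, 0.5, 0.0],
    [0.5, 1.0, 0.2],
    [0.0, 0.2, 1.0],
]
n, h2 = 1000, 0.3  # hypothetical sample size and heritability
p = len(R)
c = n * h2 / p

# LD score of variant j: sum over k of r_jk^2 (includes the k = j term).
ld_scores = [sum(r * r for r in row) for row in R]

# Var(Z_j) read off the diagonal of R + (n h^2 / p) R^2.
R2 = matmul(R, R)
var_z = [R[j][j] + c * R2[j][j] for j in range(p)]

# LD score regression form: 1 + n h^2 l_j / p.
expected_chi2 = [1 + c * l for l in ld_scores]

print(ld_scores, var_z, expected_chi2)
```

The two lists `var_z` and `expected_chi2` agree up to floating-point rounding for any symmetric correlation matrix.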
\[{\sf Var}(\chi_j^2)={\sf Var}(Z_j^2)={\sf E}(Z_j^4)-{\sf E}^2(Z_j^2)=2\left(1+n{h^2}\ell_j/{p}\right)^2\]
Two things that they noticed but did not account for in LD score regression:
We observe only the summary statistics \((\widehat{\beta}, S)\) and try to draw posterior inference on \(\beta\) given certain types of prior.
"for any given trait, there will be few (if any) large effects, a handful of modest effects and a substantial number of genes generating small or very small increases in disease risk".
Simplify computation: choose large \(K\) and fix the values of \(\sigma_k^2\)
Presumably, "silly" values of \(\sigma_k^2\) will be "kicked out" by data in the end.
Introduce genetic liability: \({\sf Pr}(Y=k)=f_k(L),~~f_k:{\mathbb R}\mapsto [0,1]\)
Modify correlation matrix: \(\widehat{\beta}|S,R,\beta\sim{\cal N}(SRS^{-1}\beta, S^2)\)
"Truth is much too complicated to allow anything but approximations."
During Stage 1 of RSS, we try to work out:
https://github.com/stephenslab/rss (still internal)
Carbonetto and Stephens (PLoS Genetics, 2013): analyzed individual-level data of WTCCC
Global Lipids (Nature genetics, 2013):
Overlap of 157 lipid-related loci
Association with metabolic & cardiovascular traits
Dr. Matthew Stephens @Uchicago
Dr. Xin He @Uchicago, Dr. Yongtao Guan @BCM,
Dr. Peter Carbonetto @Ancestry.com
Current and Previous Members @Stephens Lab