A Bayesian regression model for genome-wide summary data

Xiang Zhu <xiangzhu@uchicago.edu>

2017-06-30 @ UChicago Statistics

What are genome-wide summary data?

Data: phenotype \(\sf Y\) and genotype \(\sf X\)

Size: \(\sf n\) (>10K) individuals and \(\sf p\) (>1M) genetic variants (SNPs)

\[ {\sf Y}:= \left[ \begin{array}{c} {\sf y_1} \\ {\sf y_2} \\ \vdots \\ {\sf y_n} \end{array} \right] ~~~ {\sf X}:= \left[ \begin{array}{ccccc} \mathsf{x_{11}} & \ldots & \mathsf{x_{1j}} & \ldots & \mathsf{x_{1p}}\\ \mathsf{x_{21}} & \ldots & \mathsf{x_{2j}} & \ldots & \mathsf{x_{2p}}\\ \vdots & \ldots & \vdots & \ldots & \vdots \\ \mathsf{x_{n1}} & \ldots & \mathsf{x_{nj}} & \ldots & \mathsf{x_{np}}\end{array} \right] \]

Model: single-SNP association analysis

\[ \left[ \begin{array}{c} \mathsf{y_{1}} \\ \mathsf{y_{2}} \\ \vdots \\ \mathsf{y_{n}} \end{array} \right] \sim \left[ \begin{array}{c} \mathsf{x_{1j}} \\ \mathsf{x_{2j}} \\ \vdots \\ \mathsf{x_{nj}} \end{array} \right] ~~ \leadsto ~~ \left\{ \begin{array}{cl} \mathsf{\hat{\beta}_j}: & \textsf{marginal effect estimate} \\ \mathsf{\hat{\sigma}_j}: & \textsf{standard error of}~\mathsf{\hat{\beta}_j} \end{array}\right. \]
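To make the single-SNP analysis concrete, here is a minimal sketch of how the summary data could be computed from individual-level data. It assumes a quantitative phenotype and an ordinary least-squares fit with an intercept for each SNP separately; the slides do not specify the regression, so this is an illustration rather than the exact procedure used in any particular study. The function name `single_snp_summary` is hypothetical.

```python
import numpy as np

def single_snp_summary(X, y):
    """Regress y on each column of X separately (simple linear regression
    with an intercept) and return the marginal effect estimates and their
    standard errors. Assumes no SNP is constant across individuals."""
    X = np.asarray(X, dtype=float)   # n x p genotype matrix
    y = np.asarray(y, dtype=float)   # length-n phenotype vector
    n = X.shape[0]

    # Centering both sides absorbs the intercept.
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()

    sxx = (Xc ** 2).sum(axis=0)          # per-SNP sum of squares
    beta_hat = Xc.T @ yc / sxx           # marginal effect estimates

    # Residual variance per SNP, with n - 2 degrees of freedom.
    resid = yc[:, None] - Xc * beta_hat
    sigma2 = (resid ** 2).sum(axis=0) / (n - 2)
    se = np.sqrt(sigma2 / sxx)           # standard errors of beta_hat
    return beta_hat, se

# Toy usage: 4 individuals, 1 SNP coded as allele counts.
beta_hat, se = single_snp_summary([[0], [1], [2], [1]], [1.0, 2.0, 2.0, 3.0])
```

Each SNP is fit on its own, so the output is one pair \((\hat{\beta}_{\sf j}, \hat{\sigma}_{\sf j})\) per column of \(\sf X\), and correlations among SNPs are ignored at this stage.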

We are interested in a statistical problem motivated by genetics.

Why do we consider multiple-SNP models \({\sf Y} \sim {\sf X_1} + {\sf X_2} + \ldots + {\sf X_p}\)?

Review: Multivariate linear models for GWAS (Sabatti, 2013).

Selected references: Moser et al. (2015); Loh et al. (2015); Bottolo et al. (2013); Zhou et al. (2013); Carbonetto & Stephens (2012); Guan & Stephens (2011); Kang et al. (2010); Logsdon et al. (2010); Yang et al. (2010); Hoggart et al. (2008); Wu et al. (2009); Servin & Stephens (2007).

Why do we consider single-SNP summary data \(\{\hat{\beta}_{\sf j}, \hat{\sigma}^{\sf 2}_{\sf j}\}\)?

Review: Pasaniuc & Price (Nature Reviews Genetics, 2017).

Editorial: Nature Genetics, July 2012.