A One-hour Tour of Statistical Genetics: GWAS and Beyond

Xiang Zhu <xiangzhu@uchicago.edu>

2016-12-01 @ NHS Tutorial Session

Disclaimer

More of a discussion than a tutorial:

  • Hard to cover 100% of GWAS in \(\leq\) 1 hour
  • Biased toward my own (mis)understanding over the past \(\leq\) 4 years

Feel free to ask questions at any time!

Resources

PLINK:
http://pngu.mgh.harvard.edu/~purcell/plink/

Paper series:
http://www.nature.com/nrg/series/gwas/index.html

Phenotype \(Y\) & Genotype \(X\)

\[ Y:= \left[ \begin{array}{c} y_{1} \\ y_{2} \\ \vdots \\ y_{n} \end{array} \right] ~~~ X:= \left[ \begin{array}{ccccc} x_{11} & \ldots & x_{1j} & \ldots & x_{1p}\\ x_{21} & \ldots & x_{2j} & \ldots & x_{2p}\\ \vdots & & \vdots & & \vdots \\ x_{n1} & \ldots & x_{nj} & \ldots & x_{np}\end{array} \right] \]


Single-SNP analysis:
correlate \(Y\) with each column of \(X\)

\[ Y:= \left[ \begin{array}{c} y_{1} \\ y_{2} \\ \vdots \\ y_{n} \end{array} \right] \sim \left[ \begin{array}{c} x_{1j} \\ x_{2j} \\ \vdots \\ x_{nj} \end{array} \right] := X_j \]
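A minimal sketch of the single-SNP scan on simulated data, assuming a quantitative phenotype and simple linear regression with an intercept. The function name `single_snp_scan`, the allele frequency, and the choice of causal SNP are all hypothetical, picked only for illustration.

```python
import numpy as np

def single_snp_scan(X, y):
    """Regress y on each column of X separately (with an intercept).

    Returns per-SNP effect estimates, standard errors, and z-scores.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    n = X.shape[0]
    Xc = X - X.mean(axis=0)           # centering absorbs the intercept
    yc = y - y.mean()
    sxx = (Xc ** 2).sum(axis=0)       # per-SNP sum of squares
    beta = Xc.T @ yc / sxx            # simple-regression slope for each SNP
    resid_ss = (yc ** 2).sum() - beta ** 2 * sxx
    se = np.sqrt(resid_ss / (n - 2) / sxx)
    return beta, se, beta / se

rng = np.random.default_rng(0)
n, p = 500, 20
X = rng.binomial(2, 0.3, size=(n, p))    # simulated genotypes coded 0/1/2
y = 0.5 * X[:, 3] + rng.normal(size=n)   # assumption: only SNP index 3 is causal
beta, se, z = single_snp_scan(X, y)
print("SNP with the largest |z|:", int(np.argmax(np.abs(z))))
```

Each column is tested on its own, so the scan costs one pass over \(X\) and says nothing about how SNPs act together.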

Today’s agenda

  • part of \(X\) missing?

  • \(X\leftarrow\)hidden \(Z\rightarrow Y\)?

  • look at \(X_1,\ldots,X_p\) jointly?

  • change definitions of \(X\) and \(Y\)?

  • \(X\) and \(Y\) not available?

  • extra info beyond \(X\) and \(Y\)?

\(X\) missing? \(\leadsto\) Genotype imputation

Marchini and Howie (2010):
http://www.nature.com/nrg/journal/v11/n7/pdf/nrg2796.pdf

Imputed \(X\) = True \(X\) + Error

Guan and Stephens (2008):
http://dx.doi.org/10.1371/journal.pgen.1000279

General question: SNP quality control (QC)

QC protocol used by GIANT
http://www.genepi-regensburg.de/easyqc/

\(X\leftarrow\)hidden \(Z\rightarrow Y\)? \(\leadsto\) Confounding

(Source: http://mga.bionet.nsc.ru/~yurii/courses/ge03-2012/confounding.pdf)

Recent review: Price et al. (2010)
http://www.nature.com/nrg/journal/v11/n7/full/nrg2813.html
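A toy simulation of the \(X\leftarrow Z\rightarrow Y\) picture, assuming a hidden variable \(Z\) that labels two subpopulations with different allele frequencies and different phenotype means. The SNP has no direct effect on the phenotype. The helper `ols_z` and all numeric settings are hypothetical. In real data \(Z\) is not observed and is typically replaced by proxies such as principal components or a kinship matrix.

```python
import numpy as np

def ols_z(design, y, k):
    """z-score of coefficient k in an OLS fit of y on the design matrix."""
    coef = np.linalg.lstsq(design, y, rcond=None)[0]
    resid = y - design @ coef
    dof = design.shape[0] - design.shape[1]
    sigma2 = resid @ resid / dof
    cov = sigma2 * np.linalg.inv(design.T @ design)
    return coef[k] / np.sqrt(cov[k, k])

rng = np.random.default_rng(1)
n = 2000
z_pop = np.repeat([0.0, 1.0], n // 2)      # hidden population label Z
freq = np.where(z_pop == 1, 0.6, 0.1)      # allele frequency depends on Z
x = rng.binomial(2, freq).astype(float)    # genotype X
y = 1.0 * z_pop + rng.normal(size=n)       # phenotype depends on Z only

ones = np.ones(n)
z_naive = ols_z(np.column_stack([ones, x]), y, 1)            # Y ~ X
z_adjusted = ols_z(np.column_stack([ones, x, z_pop]), y, 1)  # Y ~ X + Z
print("naive z:", z_naive, "adjusted z:", z_adjusted)
```

The naive test picks up the allele frequency difference between groups. Including \(Z\) as a covariate removes that path.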

Look at \(X\) jointly? \(\leadsto\) Polygenic modelling

  • single-SNP model: \(Y \sim X_j\), \(j=1,2,\ldots,p\)
  • multiple-SNP model: \(Y \sim X_1 + X_2 + \ldots + X_p\)
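A sketch of fitting the multiple-SNP model jointly. Ridge regression is used here purely as a simple stand-in for polygenic modelling. It is an assumption for illustration, not the method of any paper cited on these slides. The penalty `lam` and the simulated effect sizes are hypothetical.

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Jointly estimate all SNP effects with an L2 penalty of strength lam."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    Xc = X - X.mean(axis=0)            # centering absorbs the intercept
    yc = y - y.mean()
    p = Xc.shape[1]
    return np.linalg.solve(Xc.T @ Xc + lam * np.eye(p), Xc.T @ yc)

rng = np.random.default_rng(2)
n, p = 300, 50
X = rng.binomial(2, 0.3, size=(n, p))           # simulated genotypes
true_effects = rng.normal(scale=0.1, size=p)    # many small effects
y = X @ true_effects + rng.normal(size=n)
beta_joint = ridge_fit(X, y, lam=10.0)
print(beta_joint[:5])
```

With `lam = 0` and \(n > p\) this reduces to ordinary least squares on all SNPs at once. A positive `lam` shrinks the estimates, which is what keeps the fit stable when \(p\) approaches or exceeds \(n\).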

For more information, see Zhou, Carbonetto and Stephens (2013)
http://dx.doi.org/10.1371/journal.pgen.1003264

(Source: http://www.mdpi.com/2073-4425/5/2/270)

Try different definitions of \(X\) and \(Y\)?

PrediXcan (2015):
http://www.nature.com/ng/journal/v47/n9/full/ng.3367.html
TWAS (2016):
http://www.nature.com/ng/journal/v48/n3/full/ng.3506.html

  • Stage 1: Gene expression (\(Y\)) \(\sim\) Genotype (\(X\))
  • Stage 2: Phenotype (\(Y\)) \(\sim\) Gene expression (\(X\))
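The two stages can be sketched on simulated data as follows. Ordinary least squares stands in for the penalized regressions used in practice, and every name and number here (`weights`, sample sizes, effect sizes) is a hypothetical choice for illustration rather than anything taken from the PrediXcan or TWAS papers.

```python
import numpy as np

def ols(design, y):
    """Least-squares coefficients of y on the design matrix."""
    return np.linalg.lstsq(design, y, rcond=None)[0]

def add_intercept(M):
    """Prepend a column of ones to a vector or matrix of predictors."""
    M = np.asarray(M, dtype=float)
    return np.column_stack([np.ones(M.shape[0]), M])

rng = np.random.default_rng(3)
p = 10
weights = np.linspace(-0.5, 0.5, p)   # hypothetical true SNP-to-expression weights

# Stage 1: learn expression ~ genotype in a reference panel
G_ref = rng.binomial(2, 0.3, size=(400, p))
expr_ref = G_ref @ weights + rng.normal(size=400)
w_hat = ols(add_intercept(G_ref), expr_ref)

# Stage 2: predict expression in the GWAS cohort, then phenotype ~ predicted expression
G_gwas = rng.binomial(2, 0.3, size=(1000, p))
expr_true = G_gwas @ weights + rng.normal(size=1000)   # unobserved in practice
pheno = 0.8 * expr_true + rng.normal(size=1000)
expr_pred = add_intercept(G_gwas) @ w_hat
gamma = ols(add_intercept(expr_pred), pheno)[1]
print("estimated expression-phenotype slope:", gamma)
```

Only genotypes are needed in the GWAS cohort. Measured expression is required only in the reference panel.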

No \(X\) and \(Y\)? \(\leadsto\) Summary statistics

Two types of GWAS data:

  • Individual-level genotypes and phenotypes
  • Single-SNP GWAS summary statistics

Pasaniuc & Price (2016): http://www.nature.com/nrg/journal/vaop/ncurrent/full/nrg.2016.142.html

Pages 4-12 of Alkes Price’s slides (ASHG, 2015)

Example 1: LD score regression

Single-SNP test statistic can be “inflated” due to:

  • polygenicity: many small genetic effects
  • confounding biases: latent structure

Expected \(\chi^2\) stat = Slope \(\cdot\) LD score + Confounding biases

http://www.nature.com/ng/journal/v47/n3/full/ng.3211.html
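A bare-bones sketch of the regression above, assuming unweighted least squares and LD scores computed as row sums of squared sample correlations. The published method uses windowed, bias-corrected \(r^2\) and a weighted regression, none of which is reproduced here. The function names are hypothetical. In the cited paper the slope is proportional to sample size times heritability divided by the number of SNPs, and the intercept is 1 plus a term for confounding.

```python
import numpy as np

def ld_scores(X):
    """LD score of each SNP: sum of squared correlations with all SNPs."""
    R = np.corrcoef(np.asarray(X, dtype=float), rowvar=False)
    return (R ** 2).sum(axis=1)

def ldsc_fit(chi2, ld):
    """Unweighted regression of chi-square statistics on LD scores."""
    chi2 = np.asarray(chi2, dtype=float)
    ld = np.asarray(ld, dtype=float)
    design = np.column_stack([np.ones_like(ld), ld])
    intercept, slope = np.linalg.lstsq(design, chi2, rcond=None)[0]
    return slope, intercept

# Toy usage: LD scores from simulated genotypes, and a fit to made-up statistics
rng = np.random.default_rng(4)
X = rng.binomial(2, 0.3, size=(200, 30))
ld = ld_scores(X)
chi2 = rng.chisquare(df=1, size=30)   # placeholder null statistics
slope, intercept = ldsc_fit(chi2, ld)
print("slope:", slope, "intercept:", intercept)
```

An intercept well above 1 points to confounding, while a positive slope points to polygenic signal that grows with LD.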

Example 2: Regression with summary statistics

We only need a likelihood based on summary data:
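As a sketch, the likelihood in the cited Zhu and Stephens preprint has the form below. The symbols are assumptions, since they are not defined elsewhere in these slides: \(\hat{\beta}\) is the vector of single-SNP effect estimates, \(\hat{S}\) is the diagonal matrix of (approximately) their standard errors, \(\hat{R}\) is an LD (correlation) matrix estimated from a reference panel, and \(\beta\) is the vector of multiple-SNP effects.

\[ \hat{\beta} \mid \beta \sim N\left( \hat{S} \hat{R} \hat{S}^{-1} \beta, ~ \hat{S} \hat{R} \hat{S} \right) \]

When \(\hat{R}\) is the identity (no LD), the mean reduces to \(\beta\) and the covariance to \(\hat{S}^2\), so each SNP's estimate is centered on its own effect with its own standard error.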

For more details, see Zhu and Stephens (2016+)
http://dx.doi.org/10.1101/042457
Or, ask me (CLSC, Room 412)

Get extra info? \(\leadsto\) Integrated approaches

The same unit of observation \(\leadsto\) human genome

Examples based on individual-level data (skipped):

Examples of integrated analysis

He et al. (2013): GWAS + eQTL
http://www.cell.com/ajhg/abstract/S0002-9297(13)00159-6

Pickrell (2014): GWAS + Functional annotations
http://www.cell.com/ajhg/abstract/S0002-9297(14)00106-2

Questions & Discussions