Genome-wide enrichment analyses of 31 human traits

Xiang Zhu <xiangzhu@uchicago.edu>

2017-01-11 @ UChicago HG Work in Progress Seminar

Examining associations between variables is a useful tool in genetics.

Enrichment analysis combines multiple sources of association.

Genotype-phenotype: genome-wide summary data
Genotype-genotype: linkage disequilibrium
Gene-gene: biological pathway, RNA-seq, …

Let’s keep this talk “jargon-free”!

1. Genome-wide summary data

2. Enrichment analysis

What are genome-wide summary data?

Data: phenotype \(Y\) and genotype \(X\)

Size: \(n\) (>10K) individuals and \(p\) (>1M) variants (SNPs)

\[ Y:= \left[ \begin{array}{c} y_{1} \\ y_{2} \\ \vdots \\ y_{n} \end{array} \right] ~~~ X:= \left[ \begin{array}{ccccc} x_{11} & \ldots & x_{1j} & \ldots & x_{1p}\\ x_{21} & \ldots & x_{2j} & \ldots & x_{2p}\\ \vdots & \ddots & \vdots & \ddots & \vdots \\ x_{n1} & \ldots & x_{nj} & \ldots & x_{np}\end{array} \right] \]

Model: single-SNP analysis

\[ \left[ \begin{array}{c} y_{1} \\ y_{2} \\ \vdots \\ y_{n} \end{array} \right] \sim \left[ \begin{array}{c} x_{1j} \\ x_{2j} \\ \vdots \\ x_{nj} \end{array} \right] ~~ \leadsto ~~ \left\{ \begin{array}{cl} \hat{\beta}_j: & \textsf{marginal effect estimate} \\ \hat{\sigma}_j: & \textsf{standard error of}~\hat{\beta}_j \end{array}\right. \]
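As a rough illustration of how the summary data arise, here is a minimal sketch that regresses the phenotype on one SNP at a time and records \(\hat{\beta}_j\) and \(\hat{\sigma}_j\). It assumes ordinary least squares with an intercept, which the slide does not specify; the function names and the toy data are hypothetical.

```python
import math
import random


def single_snp_summary(y, x):
    """Regress y on a single SNP x (with intercept).

    Returns (beta_hat, se_hat): the marginal effect estimate
    and its standard error.
    """
    n = len(y)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    beta_hat = sxy / sxx
    # Residual sum of squares; n - 2 degrees of freedom (intercept + slope).
    rss = sum((yi - y_bar - beta_hat * (xi - x_bar)) ** 2
              for xi, yi in zip(x, y))
    se_hat = math.sqrt(rss / (n - 2) / sxx)
    return beta_hat, se_hat


def summarize_all_snps(y, X):
    """X holds n rows of p genotypes; returns one (beta_hat, se_hat) per SNP."""
    p = len(X[0])
    return [single_snp_summary(y, [row[j] for row in X]) for j in range(p)]


# Toy data (made up): 500 individuals, 3 SNPs, only SNP 1 affects the trait.
rng = random.Random(0)
n, p = 500, 3
X = [[rng.choice([0, 1, 2]) for _ in range(p)] for _ in range(n)]
y = [0.5 * row[0] + rng.gauss(0, 1) for row in X]
summary = summarize_all_snps(y, X)
```

Each entry of `summary` depends on only one column of \(X\), which is why these statistics can be computed and shared without the individual-level data.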

Why are we interested in single-SNP summary data?

Review: Pasaniuc & Price (Nature Reviews Genetics, 2017).

Source: http://www.nature.com/ng/journal/v44/n7/full/ng.2345.html

What is gene set enrichment?

Phenotype: low-density lipoprotein (Teslovich et al. 2010)
Pathway: chylomicron-mediated lipid transport (17 genes)
Annotation: Is the SNP near a pathway gene? (yes or no)
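The yes-or-no annotation can be sketched as a simple window check. The 100 kb window, the function name, and all coordinates below are assumptions for illustration only; the slide does not say how "near" is defined.

```python
WINDOW = 100_000  # assumed: 100 kb on either side of a gene


def near_any_gene(snp, genes, window=WINDOW):
    """Return True if the SNP lies within `window` bp of any gene.

    snp   : (chromosome, position)
    genes : list of (chromosome, start, end)
    """
    chrom, pos = snp
    return any(
        chrom == g_chrom and g_start - window <= pos <= g_end + window
        for g_chrom, g_start, g_end in genes
    )


# Made-up coordinates, not real pathway genes.
pathway_genes = [("1", 1_000_000, 1_050_000), ("2", 5_000_000, 5_020_000)]
snps = [("1", 950_000), ("1", 1_200_000), ("2", 5_010_000), ("3", 5_010_000)]

annotation = [int(near_any_gene(s, pathway_genes)) for s in snps]
print(annotation)  # [1, 0, 1, 0]
```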

Recent Reviews: de Leeuw et al. (2016); Pers (2016); Mooney et al. (2014); Wang et al. (2010).

The idea is simple, but there are (at least) two statistical issues.

1. We should relax the significance threshold for “green” SNPs, but by how much?

Threshold \(\longleftarrow\) Function (Pathway, Phenotype)

2. The inflated pattern of the “green” curve can be due to correlation between SNPs.

An extreme example:

SNP 1 has a large genetic effect on a trait.
SNPs 2-100 have zero effect, but all are in high LD with SNP 1.
Thus, SNPs 1-100 all show very large single-SNP z-scores.
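The extreme example can be reproduced with a small simulation. This is only a sketch: the sample size, effect size, and the way LD is induced (SNPs 2-100 are near-copies of SNP 1, each genotype resampled with a small probability) are all assumptions chosen for illustration.

```python
import math
import random


def z_score(y, x):
    """Single-SNP z-score: OLS slope of y on x divided by its standard error."""
    n = len(y)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    sxx = sum((a - x_bar) ** 2 for a in x)
    sxy = sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y))
    beta = sxy / sxx
    rss = sum((b - y_bar - beta * (a - x_bar)) ** 2 for a, b in zip(x, y))
    return beta / math.sqrt(rss / (n - 2) / sxx)


rng = random.Random(1)
n, p = 2000, 100
resample = 0.02  # assumed: chance that a neighbor's genotype is redrawn

# SNP 1: the only SNP with a nonzero effect.
snp1 = [rng.choice([0, 1, 2]) for _ in range(n)]

# SNPs 2-100: near-copies of SNP 1, hence in high LD with it.
snps = [snp1] + [
    [g if rng.random() > resample else rng.choice([0, 1, 2]) for g in snp1]
    for _ in range(p - 1)
]

# The trait depends on SNP 1 alone.
y = [0.5 * g + rng.gauss(0, 1) for g in snp1]

z = [z_score(y, x) for x in snps]
n_large = sum(abs(v) > 5 for v in z)  # count of SNPs with |z| > 5
```

In this setup every one of the 100 SNPs ends up with a large z-score even though 99 of them have no effect, so counting "significant" SNPs near pathway genes would overstate the evidence.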

We develop a method that systematically leverages enrichment information.

We apply the method to 31 phenotypes and 3977 gene sets.

This application is not small:

# of Parameters = 31 \(\times\) (3913+64) \(\times\) 1.1 Million \(\approx\) 136 Billion

31 human phenotypes
3913 biological pathways curated by experts
64 tissue-based gene sets derived from GTEx data
1.1 million common SNPs
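The parameter count follows directly from the four numbers listed above; a quick check, taking "1.1 million" as exactly 1,100,000:

```python
n_traits = 31
n_pathways = 3913
n_tissue_sets = 64
n_snps = 1_100_000  # "1.1 million", taken as a round number

n_gene_sets = n_pathways + n_tissue_sets
n_params = n_traits * n_gene_sets * n_snps

print(n_gene_sets)        # 3977
print(f"{n_params:,}")    # 135,615,700,000
```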

One graduate student can get this done, aided by:

Publicly available summary data
Variational Bayes algorithms
Banded matrix approximation
Parallel computing
Hierarchical data format (HDF5)
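To give a feel for the banded matrix approximation, the sketch below keeps only the entries of an LD matrix within a fixed distance of the diagonal and sets the rest to zero. This is one common way to band a matrix, not necessarily the exact scheme used in this work; the bandwidth and the toy LD matrix are made up.

```python
def band(matrix, bandwidth):
    """Zero out entries more than `bandwidth` positions off the diagonal."""
    p = len(matrix)
    return [
        [matrix[i][j] if abs(i - j) <= bandwidth else 0.0 for j in range(p)]
        for i in range(p)
    ]


# Toy LD matrix: correlation decays with distance, r_ij = 0.5 ** |i - j|.
p = 6
ld = [[0.5 ** abs(i - j) for j in range(p)] for i in range(p)]

ld_banded = band(ld, bandwidth=2)
kept = sum(v != 0.0 for row in ld_banded for v in row)
print(kept, "of", p * p, "entries kept")  # 24 of 36 entries kept
```

With roughly a million SNPs, storing only a band of fixed width makes memory grow linearly in \(p\) rather than quadratically, which is the appeal when LD decays with distance.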

Our full results are publicly available.

Results: http://xiangzhu.github.io/rss-gsea/_book
Software: https://github.com/stephenslab/rss

Pathway enrichment mirrors trait category.

Example: Blood lipid & MTTP gene

References: Global Lipids Genetics Consortium (2013); Teslovich et al. (2010).

Example: Alzheimer’s disease & Liver

References: The GTEx Consortium (2015); Lambert et al. (2013).

What did I learn from this project?

Data integration is a powerful idea.

Bayesian hierarchical modelling

Fast computation is important for large-scale applications.

Variational inference; Parallel computing; HDF5

Association does not imply causation.

Short-term goal: checking bias and confounding
Long-term goal: learning “causality” from data

What’s next?

Add more members to the family of RSS models

Develop a user-friendly, black-box `R` package

https://github.com/stephenslab/rssr