Course description

This course examines the problems of multiple testing and statistical inference from a modern point of view. High-dimensional data is now common in many applications across the biological, physical, and social sciences. With this increased capacity to generate and analyze data, classical statistical methods may no longer ensure the reliability or replicability of scientific discoveries. We will examine a range of modern methods that provide statistical inference tools for large-scale data analysis. The course will have weekly assignments and a final project, each with theoretical and computational components.


Prerequisite: Stat 24400 or equivalent.


Course syllabus.

Course materials

R code for COPD/statin gene expression data: COPD_statin_gene_expr.R
R code from week 4 for COPD/statin data set & leukemia data set: COPD_statin_mixturemodel.R, leukdata.R
R code for week 6: IBD_GWAS_script.R (use data IBD_GWAS_data.csv), simulation_GWAS.R, err_TypeM_TypeS.R, prediction_intervals.R
Week 7 - demo for conditioning on the event that a multivariate Gaussian lies in a set: gaussian_condition_on_set.R, gaussian_condition_on_set.pdf, gaussian_condition_on_slice.pdf


Problem set 1: assignment ProbSet1.pdf
P-hacking challenge: assignment p-hacking_challenge.pdf, data set p-hacking_dataset.txt
Problem set 2: assignment ProbSet2.pdf
Real data analysis critique: assignment real_data_critique.pdf
Problem set 3: assignment ProbSet3.pdf
Problem set 4: assignment ProbSet4.pdf, scripts emaildata_getdata.R and emaildata_simpleexample.R
*Optional* Problem set 5: assignment ProbSet5.pdf, code conditional_affine.R (provides a function needed for the assignment)

Final project

Final project info: final_project_ideas.pdf