Exact p-value computation and exact False Discovery Rate estimation for genome-wide association studies
Jerome Wojcik, Mickael Guedj, Karl Forner
Merck Serono, 9, chemin des Mines, 1202 Geneva, Switzerland
Genome-wide case-control association studies aim to identify markers that differ significantly between sick and healthy populations. With the development of large-scale technologies allowing the genotyping of thousands of single nucleotide polymorphisms (SNPs) comes the practical issue of selecting the most probable set of associated markers. To overcome the problems induced by computational noise and multiple testing, highly accurate yet tractable statistical methods are required. We have developed a pipeline of tools and exact algorithms (i.e. non-asymptotic and not permutation-based) to meet these requirements. First, exact and unbiased association probability values (p-values) are computed for allelic and genotypic frequencies. Second, exact False Discovery Rates (FDRs) are estimated for various statistics, including classical single-study statistics (Pearson allelic or genotypic), an allelic-genotypic meta-statistic, and a multi-study replication statistic. All algorithms have been optimized to be computationally tractable. Benchmarked on simulated data, they outperform previous methods in terms of accuracy and computation time. For example, a genome-wide association study involving 1,000 individuals and 500,000 SNPs is comprehensively analyzed and FDR-controlled in 2 hours. Finally, we exemplify the benefits of these novel methods by applying them to the analysis of experimental genotyping data from three multiple sclerosis case-control association studies.
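To make concrete what "exact" (non-asymptotic, not permutation-based) can mean for a Pearson allelic statistic, the following is a minimal sketch and not the authors' optimized algorithm. It assumes a 2x2 table of allele counts with fixed margins, so that the null distribution is hypergeometric, and it assumes alleles can be treated as independent draws. The function names `pearson_chi2` and `exact_allelic_pvalue` are hypothetical and chosen for illustration only.

```python
from math import comb


def pearson_chi2(a, b, c, d):
    """Pearson chi-square statistic for the 2x2 table [[a, b], [c, d]]."""
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        # Degenerate table (an empty row or column): no evidence of association.
        return 0.0
    return n * (a * d - b * c) ** 2 / denom


def exact_allelic_pvalue(case_minor, case_major, ctrl_minor, ctrl_major):
    """Exact p-value of the Pearson allelic statistic (illustrative sketch).

    Assumption: both margins of the allele-count table are fixed, so each
    possible table has a hypergeometric probability. The p-value is the total
    probability of all tables whose statistic is at least the observed one.
    """
    cases = case_minor + case_major      # alleles observed in cases
    controls = ctrl_minor + ctrl_major   # alleles observed in controls
    minor = case_minor + ctrl_minor      # total minor-allele count
    observed = pearson_chi2(case_minor, case_major, ctrl_minor, ctrl_major)

    total = comb(cases + controls, minor)  # number of equally likely allocations
    favourable = 0
    # Enumerate every feasible minor-allele count among cases.
    for a in range(max(0, minor - controls), min(cases, minor) + 1):
        c = minor - a
        stat = pearson_chi2(a, cases - a, c, controls - c)
        # Small relative tolerance guards against floating-point ties.
        if stat >= observed * (1 - 1e-12):
            favourable += comb(cases, a) * comb(controls, c)
    # Exact integer counts, converted to a float only at the very end.
    return favourable / total


if __name__ == "__main__":
    print(exact_allelic_pvalue(3, 1, 1, 3))
```

For the toy table above (3 minor and 1 major allele in cases, 1 minor and 3 major in controls), 34 of the 70 equally likely allocations give a statistic at least as large as the observed one, so the p-value is 34/70. The enumeration is linear in the number of alleles per SNP; reaching genome-wide throughput would need further optimization, which this sketch does not attempt.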