Mingyao Li, PhD, Department of Biostatistics, Epidemiology and Informatics, University of Pennsylvania
Peter Mueller, PhD, Department of Mathematics, Department of Statistics and Data Sciences, University of Texas at Austin
Lu Mao, PhD, Department of Biostatistics and Medical Informatics, University of Wisconsin-Madison
Snehalata Huzurbazar, PhD, Department of Biostatistics, West Virginia University
Meetings of the Eastern North American Region of the International Biometric Society (a.k.a. "ENAR meetings") are held in late March or early April each year and reflect the broad interests of the Society, including both quantitative techniques and application areas. Faculty and student presenters from the Department of Biostatistics regularly participate, giving invited talks, contributed talks, and poster presentations.
The Joint Statistical Meetings, known simply as "JSM", is the largest gathering of statisticians held annually in North America. Faculty and student presenters from the Department of Biostatistics regularly participate, giving invited talks, contributed talks, and poster presentations. Our students often receive top awards and participate in the affiliated career marketplace at the event.
Next-generation sequencing (NGS) technology has emerged as a powerful tool for characterizing genomic profiles. Among its applications, RNA sequencing (RNA-Seq) and methylation sequencing (Methyl-Seq) have become standard tools for transcriptomic and epigenetic monitoring, respectively. Although the costs of NGS experiments have steadily decreased, the remaining expense and bioinformatic complexity are still obstacles for many biomedical projects. Unlike data from earlier microarray technologies, NGS data are discrete counts and should be modeled as such. In addition to sample size, sequencing depth is also directly related to experimental costs. Consequently, given a total budget and a pre-specified unit experimental cost, study design in RNA-Seq/Methyl-Seq is a multi-dimensional constrained optimization problem rather than the one-dimensional sample size calculation of a traditional hypothesis testing setting. In the first part of this dissertation, we proposed a statistical framework, “RNASeqDesign”, that uses pilot data for power calculation and study design of RNA-Seq experiments. The approach was based on fitting a mixture model to the p-value distribution from pilot data and on a parametric bootstrap procedure to infer genome-wide power for choosing the optimal sample size and sequencing depth. We further illustrated five practical study design tasks for practitioners. We performed simulations and real data applications to evaluate performance and compare with existing methods.
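To make the budget-constrained design problem concrete, the following is a minimal sketch, not the RNASeqDesign implementation. It enumerates two-group designs (samples per group, sequencing depth) that fit a total budget and picks the one with the highest power. All names, the cost structure (a fixed cost per sample plus a cost per million reads), and especially `toy_power` are assumptions made for illustration; in practice the power estimate would come from a method such as the pilot-data procedure described above.

```python
import math
from typing import Callable, Iterable, Iterator, Optional, Tuple


def feasible_designs(
    budget: float,
    cost_per_sample: float,
    cost_per_million_reads: float,
    depths_millions: Iterable[float],
) -> Iterator[Tuple[int, float]]:
    """Yield (n_per_group, depth) pairs for a two-group design within budget.

    Assumed cost model (hypothetical): each sample costs a fixed amount
    plus an amount proportional to its sequencing depth in millions of reads.
    """
    for depth in depths_millions:
        unit_cost = cost_per_sample + cost_per_million_reads * depth
        # Largest per-group sample size affordable at this depth.
        n_max = int(budget // (2 * unit_cost))
        for n in range(2, n_max + 1):
            yield n, depth


def best_design(
    budget: float,
    cost_per_sample: float,
    cost_per_million_reads: float,
    depths_millions: Iterable[float],
    power_fn: Callable[[int, float], float],
) -> Optional[Tuple[int, float, float]]:
    """Return the feasible (n_per_group, depth, power) with the highest power."""
    best = None
    for n, depth in feasible_designs(
        budget, cost_per_sample, cost_per_million_reads, depths_millions
    ):
        power = power_fn(n, depth)
        if best is None or power > best[2]:
            best = (n, depth, power)
    return best


def toy_power(n_per_group: int, depth_millions: float) -> float:
    """Placeholder power curve (an assumption, not an estimate from data).

    It only encodes that power rises with both sample size and depth,
    with diminishing returns in depth.
    """
    return 1.0 - math.exp(-0.05 * n_per_group * math.sqrt(depth_millions))


if __name__ == "__main__":
    print(best_design(20000, 200, 10, [10, 20, 40], toy_power))
```

The point of the sketch is the trade-off: deeper sequencing raises the unit cost and so lowers the affordable sample size, which is why sample size and depth have to be chosen jointly.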
In the second part of this dissertation, we proposed another statistical framework, “MethylSeqDesign”, specifically for Methyl-Seq data. There were two main challenges. First, power calculation for Methyl-Seq data required a powerful statistical test based on a beta-binomial model. Second, the human genome contains an extremely large number of CpG sites (about 30M), so many CpG sites have very shallow coverage. We focused on a region-/capture-based method, which produces more counts per region/window and thereby makes power calculation feasible.
Public health significance: As sequencing costs keep dropping, RNA-Seq and Methyl-Seq experiments will become more prevalent, and more projects with large sample sizes can be expected. We believe our work will provide practical guidance for future study design aimed at understanding disease mechanisms and improving disease diagnosis and treatment.
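As a small illustration of the beta-binomial model mentioned above, the sketch below computes the log-probability of observing k methylated reads out of n total reads in a region. It shows only the model, not the MethylSeqDesign test itself. The parameterization by a mean methylation level `mu` and a dispersion `phi` is one common convention and is an assumption here, as are the function names.

```python
import math


def log_beta(a: float, b: float) -> float:
    """Logarithm of the beta function B(a, b)."""
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)


def betabinom_logpmf(k: int, n: int, mu: float, phi: float) -> float:
    """Log-probability of k methylated reads out of n under a beta-binomial.

    mu  : mean methylation proportion, 0 < mu < 1
    phi : dispersion, 0 < phi < 1 (assumed convention: phi = 1 / (a + b + 1),
          so values near 0 approach the ordinary binomial)
    """
    # Convert (mu, phi) to the shape parameters of the underlying beta.
    a = mu * (1.0 - phi) / phi
    b = (1.0 - mu) * (1.0 - phi) / phi
    log_choose = (
        math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
    )
    return log_choose + log_beta(k + a, n - k + b) - log_beta(a, b)


if __name__ == "__main__":
    # Probability of 6 methylated reads out of 20 at mu = 0.3, phi = 0.1.
    print(math.exp(betabinom_logpmf(6, 20, 0.3, 0.1)))
```

The extra dispersion parameter lets the variance exceed that of a binomial with the same mean, which is the usual motivation for this model with replicated methylation counts. With very shallow coverage (small n) each site carries little information, which is consistent with the move to region-level counts described above.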
Last Updated On Friday, July 07, 2017 by Valenti, Renee Nerozzi
Created On Tuesday, March 07, 2017