Zhe Sun of the Department of Biostatistics defends her dissertation on “Novel Statistical Methods in Analyzing Single Cell RNA Sequencing Data”.
Committee Chairpersons: Ying Ding, PhD, Department of Biostatistics and Wei Chen, PhD, Department of Pediatrics
Kong Chen, PhD, Department of Medicine
Ming Hu, PhD, Department of Quantitative Health Sciences, Cleveland Clinic
Yongseok Park, PhD, Department of Biostatistics
Graduate faculty of the University and all other interested parties are invited to attend.
Understanding biological systems requires knowledge of their individual components. Single cell RNA sequencing (scRNA-Seq) has become a revolutionary tool for investigating cell-to-cell transcriptomic heterogeneity, which cannot be captured by population-averaged measurements such as bulk RNA-Seq. The newly developed droplet-based system enables parallel processing with digital counting of thousands of single cells in a single experiment, leading to the discovery of novel cell types and facilitating new biological discoveries. This dissertation focuses on developing novel statistical methods for analyzing droplet-based scRNA-Seq data, including clustering methods to identify cell types from single or multiple individuals, and a joint clustering approach to analyze paired data from scRNA-Seq and Cellular Indexing of Transcriptomes and Epitopes by sequencing (CITE-Seq), a state-of-the-art technology that allows detection of cell surface proteins and transcriptome profiling within the same cell.
In the first part of this dissertation, I developed DIMM-SC, a Dirichlet mixture model that explicitly models raw unique molecular identifier (UMI) counts for clustering droplet-based scRNA-Seq data and produces cluster memberships with uncertainties. Both simulation studies and real data applications demonstrated that, overall, DIMM-SC achieves substantially improved clustering accuracy and much lower clustering variability than existing clustering methods. In the second part, I developed BAMM-SC, a novel Bayesian hierarchical Dirichlet mixture model to cluster droplet-based scRNA-Seq data from population studies. BAMM-SC takes raw count data as input and accounts for data heterogeneity and batch effects among multiple individuals in a unified Bayesian hierarchical model framework. Extensive simulation studies and applications to multiple in-house experimental scRNA-Seq datasets using blood, lung and skin cells from humans or mice demonstrated that BAMM-SC outperformed existing clustering methods with considerably improved clustering accuracy, particularly in the presence of heterogeneity among individuals. In the third part, I developed RE-DIMM-SC, a novel random effects model that jointly clusters paired data from scRNA-Seq and CITE-Seq. Simulations and analyses of in-house real datasets demonstrated the validity and advantages of this method for understanding the heterogeneity and dynamics of various cell populations in complex multicellular tissues or organs.
PUBLIC HEALTH SIGNIFICANCE: Recent droplet-based single cell sequencing technology and its extensions have brought revolutionary insights into cell heterogeneity and molecular processes at single cell resolution. I believe the statistical approaches for single cell data proposed in this dissertation will improve the identification and characterization of cell subtypes from heterogeneous tissues, which is essential to fully understanding cell identity and cell function.