Zhou "Ark" Fang of the Department of Biostatistics defends his dissertation on "Integration and Missing Data Handling in Multiple Omics Studies".
Graduate faculty of the University and all other interested parties are invited to attend.
In modern high-throughput multi-omics data analysis, data integration and missing data handling are common problems in discovering regulatory mechanisms associated with complex diseases and in boosting power and accuracy. Moreover, in genotype imputation, or the genotyping problem, integrating linkage disequilibrium (LD) and identity-by-descent (IBD) information becomes essential to achieve universally superior performance. In pathway analysis, when multiple studies of different conditions are jointly analyzed, simultaneously discovering differential and consensual pathways, while reducing the pathway redundancy introduced by combining public pathway databases, is valuable for knowledge discovery. This dissertation focuses on the development of a Bayesian multi-omics data integration model with missingness handling, a novel genotype imputation method incorporating both LD and IBD information, and a comparative pathway analysis integration method.
In the first chapter of this dissertation, inspired by the popular Integrative Bayesian Analysis of Genomics data (iBAG), we propose a full Bayesian model that allows incorporation of samples with missing omics data. Simulation results show that the new full Bayesian approach improves outcome prediction accuracy and feature selection performance when the sample size is limited and the proportion of missingness is large. However, when the sample size is large or the proportion of missingness is low, incorporating samples with missingness may introduce extra inference uncertainty. Therefore, we also propose a self-learning cross-validation (CV) scheme to facilitate imputation decisions. Simulations and a real application to a child asthma dataset demonstrate superior performance of the CV decision scheme when various types of missingness mechanisms are evaluated.
In the second chapter, we propose a novel genotype inference method, namely LDIV, to integrate both LD and IBD information. To evaluate our approach, we simulated individuals in different family structures, with variants of all levels of rarity sequenced at a wide range of depths. Simulation and real data results showed that, with an informative family structure, LDIV could significantly increase genotype accuracy across variants of different rarity.
The third chapter presents a meta-analytic integration tool, Comparative Pathway Integrator (CPI), to discover consensual and differential enrichment patterns using the adaptively weighted Fisher method, reduce pathway redundancy by consensus clustering, and assist explanation of the pathway clusters with a novel text mining algorithm. We applied CPI to jointly analyze six psychiatric disorder transcriptomic studies to demonstrate its effectiveness, and found functions confirmed by previous biological studies as well as novel enrichment patterns.
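As a rough illustration of the adaptively weighted Fisher idea mentioned above, the following minimal Python sketch combines per-study p-values for a single pathway. It is not taken from CPI itself: the function names `chi2_sf_even` and `aw_fisher` are hypothetical, the weights are assumed to be binary (0 or 1), and input p-values are assumed to lie in (0, 1]. For each candidate subset of studies, the weighted Fisher statistic is compared with a chi-square distribution with twice as many degrees of freedom as included studies, and the subset giving the smallest p-value is kept. The resulting weights suggest which studies share the enrichment signal (consensual versus differential patterns).

```python
import math


def chi2_sf_even(x, m):
    """Upper-tail probability of a chi-square variable with 2*m degrees of freedom.

    For even degrees of freedom the tail has the closed form
    exp(-x/2) * sum_{j=0}^{m-1} (x/2)^j / j!, so no external library is needed.
    """
    half = x / 2.0
    term = 1.0   # current term (x/2)^j / j!, starting at j = 0
    total = 1.0
    for j in range(1, m):
        term *= half / j
        total += term
    return math.exp(-half) * total


def aw_fisher(pvalues):
    """Sketch of an adaptively weighted Fisher combination with 0/1 weights.

    Returns (minimum p-value over weight vectors, list of 0/1 weights).
    Assumes every p-value lies in (0, 1].
    """
    # For a fixed number m of included studies, the largest statistic (and so
    # the smallest p-value) uses the m smallest input p-values, so it is enough
    # to scan the studies in increasing order of p-value.
    order = sorted(range(len(pvalues)), key=lambda k: pvalues[k])
    stat = 0.0
    best_p, best_m = None, 0
    for m, k in enumerate(order, start=1):
        stat += -2.0 * math.log(pvalues[k])
        p = chi2_sf_even(stat, m)
        if best_p is None or p < best_p:
            best_p, best_m = p, m
    weights = [0] * len(pvalues)
    for k in order[:best_m]:
        weights[k] = 1
    return best_p, weights


if __name__ == "__main__":
    # Hypothetical enrichment p-values for one pathway across four studies.
    p_min, w = aw_fisher([1e-6, 0.8, 1e-5, 0.5])
    print(p_min, w)
```

Note that the minimized p-value is itself a test statistic rather than a calibrated p-value; in practice its significance would have to be assessed against its own null distribution, for example by permutation, a step this sketch omits.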