Large-Scale Machine Learning Algorithms for Biomedical Data Science
Data science is accelerating the translation of biological and biomedical data into advances in the detection, diagnosis, treatment, and prevention of diseases, including through the recently announced BRAIN and Precision Medicine initiatives. Sparsity is an intrinsic property of real-world data, so sparse learning has emerged as a powerful tool for building highly interpretable models of high-dimensional data at low computational cost, and it offers great opportunities to analyze big, complex, and diverse datasets. To address challenging problems in current biomedical data science, we proposed several novel large-scale structured sparse learning models for multi-dimensional data integration, heterogeneous multi-task learning, group- and graph-structured data analysis, and longitudinal feature learning. To handle big-data computation, we also proposed distributed asynchronous stochastic gradient and coordinate descent methods that efficiently solve convex and non-convex problems.
We applied our new large-scale machine learning models to multi-modal and longitudinal neuroimaging and genome-wide array data in imaging genetics, discovering phenotypic and genotypic biomarkers that characterize the neurodegenerative process in the progression of Alzheimer’s disease and other complex brain disorders. We also used these models to analyze Electronic Medical Records (EMR) to predict readmission of heart failure patients and drug side effects; to identify histopathological image markers and multi-dimensional cancer genomic biomarkers in The Cancer Genome Atlas (TCGA) for precision medicine; to predict performance and guide the design of nanoparticle synthesis in Materials Genome research; and to detect DTI- and fMRI-based brain circuitry patterns in the Human Connectome.