John Pleis of the Department of Biostatistics defends his dissertation on "Mixtures of Discrete and Continuous Variables: Considerations for Dimension Reduction".
Graduate faculty of the University and all other interested parties are invited to attend.
When working with high-dimensional data, we are often presented with statistical and computational challenges. Statistically, a major concern is model overfitting, and we wish to take advantage of an assumed low-dimensional data substructure to reduce the dimensionality of the problem. Because of the current scale of some of these types of problems, dimension reduction methods have been an active research area for several years and many different methods have been developed (e.g., regularization, matrix rank constraints, principal components).
Classical methods, such as principal components, have been utilized frequently but often rely on Gaussian distributional assumptions. When this assumption is not tenable, such as when we have mixtures of different types of data, dimension reduction approaches are more limited; the body of existing research on the analysis of multiple types of data is not nearly as developed. As a result, this dissertation focuses on certain aspects of dimension reduction for mixtures of continuous and discrete data. Of particular interest is Mardia’s hypothesis that the first m eigenvalues contain all the important variation for the multivariate data. However, this hypothesis test is based on the assumption of multivariate normality. At this point, we are not aware of methodological developments addressing such aspects of principal components analysis for mixtures of continuous and discrete data. Thus, we will first focus on methodology based on the general location model (GLOM), attributed to Olkin and Tate, for deriving the joint probability distribution of continuous and discrete random variables as the product of conditional and marginal probability distributions. While existing hypothesis testing approaches generally assume multivariate normality, this dissertation will alter that paradigm slightly by focusing on the finite k-component mixture of multivariate normal distributions. Under the multivariate Gaussian assumption, the distribution of the sample covariance matrix has been taken to be a central Wishart distribution. If the multivariate data are instead assumed to have been generated from a finite k-component mixture of multivariate normal distributions, an initial focus of this dissertation is to determine the distribution of the sample covariance matrix under this scenario.
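As a brief illustration of the factorization described above, the general location model in its standard form can be written as follows. The symbols (X for the discrete cell indicator, Y for the continuous vector of dimension p, cell probabilities \pi_j, cell means \mu_j, and a common covariance \Sigma) are notation assumed here for illustration and are not taken from the dissertation.

```latex
% Joint distribution as marginal (discrete) times conditional (continuous)
f(x, y) = P(X = x)\, f(y \mid X = x),
\qquad Y \mid X = j \sim N_p(\mu_j, \Sigma), \quad P(X = j) = \pi_j .

% Marginalizing over the k discrete cells gives a finite mixture of normals
f(y) = \sum_{j=1}^{k} \pi_j \, \phi_p(y;\, \mu_j, \Sigma),
\qquad \sum_{j=1}^{k} \pi_j = 1 .
```

Under this notation, the marginal distribution of the continuous variables is a finite k-component mixture of multivariate normal distributions, which is the setting considered in the remainder of the abstract.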
For this effort, we initially examine the distribution of the sample variance in a univariate setting by assuming the data were generated from a finite k-component mixture of Gaussian distributions. By inverting the Laplace transform using concepts from complex analysis, we will show that the distribution of the sample variance is a mixture of gamma distributions with different scale parameters. Because such a distribution can be challenging to work with, we apply an approach similar to the Satterthwaite method to approximate the mixture distribution with a classical gamma distribution with one scale parameter via matching of the first two moments.
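The moment-matching step can be sketched in a few lines of Python. This is a minimal illustration of Satterthwaite-type matching for a generic mixture of gamma distributions, not the dissertation's derivation: the function name and the inputs (weights, shapes, scales) are hypothetical, and the actual mixture weights and parameters for the sample variance would come from the results described above.

```python
def match_gamma_to_mixture(weights, shapes, scales):
    """Return (shape, scale) of a single gamma distribution whose mean and
    variance equal those of a finite mixture of gamma distributions.

    All argument names are illustrative assumptions. Component i has
    weight weights[i], shape shapes[i], and scale scales[i].
    """
    if abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("mixture weights must sum to 1")

    # First raw moment of the mixture: sum_i w_i * a_i * b_i
    mean = sum(w * a * b for w, a, b in zip(weights, shapes, scales))

    # Second raw moment of the mixture: sum_i w_i * a_i * (a_i + 1) * b_i^2
    second = sum(
        w * a * (a + 1.0) * b * b for w, a, b in zip(weights, shapes, scales)
    )
    var = second - mean ** 2

    # A gamma(shape, scale) has mean shape*scale and variance shape*scale^2,
    # so matching both moments gives the two expressions below.
    return mean ** 2 / var, var / mean
```

With a single component the function returns that component's own shape and scale, which is a convenient sanity check. Matching only two moments does not guarantee that the tails of the mixture are well approximated.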
In a similar manner, we examine the distribution of the sample covariance matrix in a multivariate setting by assuming the data were generated from a finite k-component mixture of multivariate normal distributions. Also, we will derive marginal distributions when the joint distribution of the data is a finite k-component mixture of multivariate Gaussian distributions. Through the use of the Laplace transform and Jacobians of a matrix argument, we will show that the distribution of the sample covariance matrix is a mixture of Wishart distributions with differing scale parameters. Similar to the univariate setting, such a distribution can be challenging to work with; a multivariate extension of the Satterthwaite method via moment matching may be of interest. However, because the number of means, variances, and covariances is larger than the number of unknowns, a different approach is demonstrated. By using scalar representations of the sample covariance matrix (e.g., determinant, trace), we will illustrate approximating the mixture of Wishart distributions with a Wishart distribution having one scale parameter. Because the approximating distribution may have fractional degrees of freedom, we examine some considerations for simulating data from a Wishart distribution with fractional degrees of freedom. Finally, we evaluate the adequacy of the approximation method through the use of matrix norms (e.g., 1-norm, 2-norm, Frobenius). Concluding remarks will summarize the impact of these developments on Mardia’s hypothesis test of proportionate eigenvalues.
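One common way to simulate from a Wishart distribution with non-integer degrees of freedom is the Bartlett decomposition, sketched below with NumPy. This is offered as an illustration under stated assumptions and is not necessarily the approach taken in the dissertation; the function name and arguments are hypothetical. The construction requires the degrees of freedom to exceed p - 1, where p is the dimension.

```python
import numpy as np


def sample_wishart_fractional(df, scale, rng):
    """Draw one p x p Wishart(df, scale) matrix via the Bartlett decomposition.

    df may be any real number greater than p - 1 (it need not be an integer).
    The function name and signature are illustrative assumptions.
    """
    scale = np.asarray(scale, dtype=float)
    p = scale.shape[0]
    if df <= p - 1:
        raise ValueError("df must exceed p - 1")

    # Lower Cholesky factor of the scale matrix: scale = chol @ chol.T
    chol = np.linalg.cholesky(scale)

    # Bartlett factor: standard normals strictly below the diagonal ...
    a = np.tril(rng.standard_normal((p, p)), k=-1)
    # ... and square roots of chi-square draws on the diagonal, with
    # degrees of freedom df, df - 1, ..., df - (p - 1), all positive.
    diag_df = df - np.arange(p)
    a[np.diag_indices(p)] = np.sqrt(rng.chisquare(diag_df))

    la = chol @ a
    return la @ la.T
```

A draw can be obtained with, for example, `sample_wishart_fractional(4.5, np.eye(3), np.random.default_rng(0))`. The chi-square draws accept fractional degrees of freedom, which is what allows the non-integer case. The distance between two covariance estimates can then be summarized with `np.linalg.norm(a - b, ord=1)`, `ord=2`, or `ord="fro"` for the 1-norm, 2-norm, and Frobenius norm respectively.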
KEYWORDS: mixture distribution; eigenvalues; high-dimensional; Wishart; fractional degrees of freedom