Thaleia Ntiniakou, "A framework for employing probabilistic topic models on gene expression data", Master Thesis, School of Electrical and Computer Engineering, Technical University of Crete, Chania, Greece, 2019
https://doi.org/10.26233/heallink.tuc.83929
One of the most important problems in computational biology is extracting knowledge and identifying patterns in real-world biological datasets. In particular, microarray analysis experiments measure gene expression, the fundamental process by which gene products such as proteins are created, and which gives rise to the phenotype. Gene expression data can be analyzed to uncover genes, or groups of genes, that are responsible for the development of specific diseases. In this thesis, we employ Probabilistic Topic Modeling (PTM), a category of unsupervised learning algorithms, for gene expression data analysis. PTM was first introduced and applied for extracting latent "topics" in text documents. Here we use it to uncover the genetic patterns that drive biological processes and trigger specific diseases. More precisely, this thesis contributes a generic framework that allows the use of any PTM algorithm of choice for gene expression data analysis. Our framework allows the incorporation of data preprocessing and transformation techniques, so that gene expression data can be converted into the "bag of words" format that the majority of Probabilistic Topic Models require as input. Following this potential data transformation, the PTM algorithm of choice is employed to extract probabilistic topics, that is, the hidden probability distributions (themes) over the genes (words), which govern the creation of biological samples (documents). The extracted topics are subsequently used for dimensionality reduction, specifically feature selection and feature extraction of the most important features (genes) that characterize the dataset. Finally, the framework comes complete with modern topic visualization techniques. We populate our framework with various data transformation algorithms, and with two PTM techniques: Latent Dirichlet Allocation (LDA), a well-established PTM technique, and Latent Process Decomposition (LPD), an algorithm introduced specifically for the microarray setting. One of the data transformation algorithms we employ is novel, designed specifically for the task at hand. Moreover, we propose the novel use of two scoring methods ("KL-divergence" and "Relevance Score") to assist our feature selection efforts. We conduct a systematic evaluation of our techniques for feature selection and feature extraction tasks in this setting, using two real-world gene expression datasets: a recent dataset associated with muscle tissue conditions, and a frequently used breast cancer-related dataset. Overall, our results indicate that PTM algorithms can be quite successful in dimensionality reduction tasks in this setting, exhibiting performance that is usually at least comparable to that of the baseline algorithms used for evaluation, with the performance of LPD in feature extraction tasks being particularly noteworthy. Moreover, interesting conclusions on the efficacy of our various data transformation algorithms when combined with LDA are drawn in the process. Finally, this thesis demonstrates and helps underscore the fact that PTMs allow for the easy visualization of the hidden underlying genetic patterns at work in gene expression processes, and can therefore provide much needed assistance to biologists attempting to identify interesting classes of genes (i.e., carrying out gene annotation and enrichment analysis tasks).
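
The two steps that bracket the topic model, turning expression values into "bag of words" counts and ranking genes within an extracted topic, might be sketched as follows. This is a minimal illustration under stated assumptions, not the thesis's implementation: `to_bag_of_words` is a hypothetical shift-and-scale transformation (it is not the novel transformation mentioned above), the relevance formula is assumed to be the one commonly used in topic visualization tools, λ·log p(g|t) + (1−λ)·log(p(g|t)/p(g)), and all function names, gene names, and numbers are made up for the example.

```python
import math


def to_bag_of_words(expression, scale=10):
    """Convert a samples x genes matrix of real-valued expression levels
    into non-negative integer pseudo-counts, one "document" per sample.

    Assumption: a simple shift-and-scale, used here only as a stand-in
    for whichever data transformation the framework is configured with.
    """
    lowest = min(min(row) for row in expression)
    return [[round((value - lowest) * scale) for value in row]
            for row in expression]


def relevance(topic_gene_probs, gene_marginals, lam=0.6):
    """Score each gene for one topic (assumed definition):
    lam * log p(g|t) + (1 - lam) * log(p(g|t) / p(g)).
    """
    return [lam * math.log(p_gt) + (1 - lam) * math.log(p_gt / p_g)
            for p_gt, p_g in zip(topic_gene_probs, gene_marginals)]


def top_genes(scores, names, k):
    """Return the k gene names with the highest scores."""
    ranked = sorted(zip(scores, names), reverse=True)
    return [name for _, name in ranked[:k]]


# Toy data: 2 samples x 3 genes (hypothetical values).
counts = to_bag_of_words([[-1.0, 0.0, 2.0], [0.5, 1.0, -1.0]])

# Hypothetical topic-gene probabilities, as a fitted topic model would
# provide them, and the overall gene frequencies in the corpus.
scores = relevance([0.5, 0.3, 0.2], [0.25, 0.3, 0.45])
selected = top_genes(scores, ["geneA", "geneB", "geneC"], k=2)
```

With `lam` close to 1 the ranking follows the raw within-topic probability, while values close to 0 favor genes that are unusually frequent in the topic compared with the dataset as a whole. The resulting top-`k` list is one way to read "feature selection" from a topic.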