Abstract
We evaluated how well unsupervised machine learning models predict cell state from single-cell RNA sequencing (scRNAseq) data. scRNAseq quantifies the number of mRNA molecules in individual cells, providing information on regulatory relationships between genes and on the trajectories of distinct cell lineages during development. Different cells may express different genes at different levels, and these differences are key to analyzing the molecular phenotypes of cells. We tested nine clustering models for their ability to distinguish subclonal cell populations in scRNAseq data by predicting cell state. Two scRNAseq datasets were examined: one with cells separated by cell-cycle phase (S, G1, or G2/M) and one with distinct tumor and non-tumor cells. The pipeline consisted of data filtering, dimensionality reduction with Principal Component Analysis, naive projection with Uniform Manifold Approximation and Projection, and cluster analysis using Ward, BIRCH, Gaussian Mixture Model, DBSCAN, Spectral, Affinity Propagation, Average Linkage Agglomerative Clustering, Mean Shift, and K-Means models. Together these represent centroid-, density-, hierarchical-, distribution-, and graph-based clustering. Models that failed to generate the true number of cells of each type were eliminated from further evaluation. The remaining models were scored on accuracy by comparing true and predicted cell types, paired through a mapping algorithm. The true molecular phenotypes of the plotted cells were annotated onto the data points so that the clusters could be examined biologically. Six models separated G2 from S cells; Spectral and Ward clustering performed best, with ~60% accuracy on average. Similarly, seven models divided the total population into tumorous and non-tumorous subclonal cell populations; Ward and BIRCH clustering performed best, with ~80% accuracy on average.
Overall, unsupervised machine learning models were effective in predicting molecular phenotypes of cells by accurately identifying subclonal cell populations and analyzing the relationship between gene expression levels and cell state. BIRCH, Ward, and Spectral clustering showed the highest performance.
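The abstract does not specify the mapping algorithm used to score predicted against true cell types. As an illustration only, the sketch below shows one common way such a score could be computed: cluster ids are arbitrary, so every one-to-one assignment of cluster ids to true labels is tried and the one with the most matching cells is kept. The function name best_mapping_accuracy and the toy labels are hypothetical and are not taken from the study. The exhaustive search is practical only for a handful of classes, as in the two datasets described above.

```python
from itertools import permutations


def best_mapping_accuracy(true_labels, cluster_ids):
    """Score a clustering against known labels.

    Tries every one-to-one assignment of cluster ids to true labels and
    returns (accuracy, mapping) for the assignment with the most matches.
    Hypothetical helper; the study's own mapping algorithm is not described.
    """
    if len(true_labels) != len(cluster_ids):
        raise ValueError("true_labels and cluster_ids must have equal length")
    if not true_labels:
        raise ValueError("at least one cell is required")

    labels = sorted(set(true_labels))
    clusters = sorted(set(cluster_ids))
    # Assumption: a clustering with the wrong number of groups is rejected
    # rather than scored.
    if len(clusters) != len(labels):
        raise ValueError("number of clusters differs from number of cell types")

    # Contingency counts: how many cells of true type t fell in cluster c.
    counts = {}
    for t, c in zip(true_labels, cluster_ids):
        counts[(c, t)] = counts.get((c, t), 0) + 1

    best_hits, best_map = -1, None
    for perm in permutations(labels):
        hits = sum(counts.get((c, t), 0) for c, t in zip(clusters, perm))
        if hits > best_hits:
            best_hits, best_map = hits, dict(zip(clusters, perm))

    return best_hits / len(true_labels), best_map


# Toy data (made up): six cells, one tumor cell placed in the wrong cluster.
true = ["tumor", "tumor", "tumor", "normal", "normal", "normal"]
clusters = [1, 1, 0, 0, 0, 0]
accuracy, mapping = best_mapping_accuracy(true, clusters)
print(round(accuracy, 3), mapping)
```

For larger numbers of cell types, the same optimum could be found with an assignment solver such as the Hungarian algorithm instead of enumerating permutations.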
Citation Format: Anastasia Dunca, Frederick R. Adler, Mark W. Smithson. Predicting molecular phenotypes with scRNAseq: An assessment of unsupervised machine learning methods [abstract]. In: Proceedings of the AACR Virtual Special Conference on Radiation Science and Medicine; 2021 Mar 2-3. Philadelphia (PA): AACR; Clin Cancer Res 2021;27(8_Suppl):Abstract nr PO-064.