Single-cell (sc) transcriptomics has revolutionized our understanding of the biological characteristics and dynamics of cancer development. It can help us identify rare cell subpopulations and understand mechanisms associated with tumorigenesis, progression, and response to therapy. The most important step in the analysis of any scRNA-seq dataset is subpopulation identification, usually performed via unsupervised clustering, followed by gene marker identification. We created a highly customizable workflow for sc data analysis, implemented in Common Workflow Language (CWL) on the Cancer Genomics Cloud (CGC) platform. The NCI-funded CGC platform, powered by Seven Bridges, provides a collaborative cloud-based computation infrastructure that co-locates computation, over 750 bioinformatics workflows, and 3+ PB of data, making the analysis of large datasets accessible to researchers from any environment. The “Multi-Sample Clustering and Gene Marker Identification with Seurat 4.1.0” workflow comprises the following steps: Loading scRNA-seq Expression Datasets, Quality Control and Preprocessing, and Clustering and Identification of Gene Markers. Our solution supports gene-cell count matrices generated by several commonly used quantifiers (for example, Cell Ranger counts, Salmon Alevin, Kallisto BUStools, STAR) from single or multiple sc datasets from different batches, as well as single or multiple single-cell samples combined in a single SingleCellExperiment object. The pipeline's versatility comes from the options implemented at each step. Quality control can be performed manually or automatically, and preprocessing offers several options for normalization (LogNormalize, Deconvolution, SCnorm, and Linnorm) and for batch effect correction (Seurat and Harmony). For clustering, the pipeline uses Seurat's graph-based approach, with options for different clustering resolutions.
After identifying gene markers for each cluster, a researcher can test differential expression using various methods, including wilcox, bimod, roc, and DESeq2. Here, we demonstrate the application of this workflow to a typical sc analysis by processing an open-access dataset of 61k cells isolated from embryonal mouse pons and forebrain, two major brain tumor locations. We used different clustering resolutions to achieve different degrees of granularity and identified cluster-specific marker genes, which were used to identify vulnerable cell populations. To enable researchers to use this analysis as a guideline, we made it available as a public project. Further development of single-cell sequencing techniques will undoubtedly improve our understanding of tumor biology and highlight promising drug targets. CGC’s cloud-based computation infrastructure, along with its numerous available cancer datasets and easy-to-use single-cell data processing workflows, will be instrumental in this process.
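As an illustration of the LogNormalize option named above, the following is a minimal Python sketch, not the workflow's own code (the workflow runs Seurat in R). It assumes the usual definition of log-normalization: each cell's counts are divided by that cell's total, multiplied by a scale factor (10,000 is assumed here), and transformed with log(1 + x). The function names `log_normalize` and `cluster_mean_difference` are hypothetical, and the mean-difference score is only a simple stand-in for the statistical tests listed in the abstract (wilcox, bimod, roc, DESeq2).

```python
import math


def log_normalize(counts, scale_factor=10_000):
    """Log-normalize a cells x genes matrix given as a list of lists.

    Each count is divided by its cell's total, multiplied by
    scale_factor (an assumed default), and passed through log(1 + x).
    Cells with zero total counts are returned as all zeros.
    """
    normalized = []
    for cell in counts:
        total = sum(cell)
        if total == 0:
            normalized.append([0.0] * len(cell))
            continue
        normalized.append(
            [math.log1p(count / total * scale_factor) for count in cell]
        )
    return normalized


def cluster_mean_difference(normalized, labels, cluster, gene_index):
    """Mean normalized expression inside a cluster minus the mean outside it.

    This is an illustrative score only, not one of the statistical
    tests offered by the workflow.
    """
    inside = [row[gene_index]
              for row, label in zip(normalized, labels) if label == cluster]
    outside = [row[gene_index]
               for row, label in zip(normalized, labels) if label != cluster]
    if not inside or not outside:
        raise ValueError("need at least one cell inside and outside the cluster")
    return sum(inside) / len(inside) - sum(outside) / len(outside)


if __name__ == "__main__":
    # Toy matrix: 3 cells x 2 genes, with hypothetical cluster labels.
    raw_counts = [[1, 3], [0, 0], [5, 5]]
    cluster_labels = ["a", "b", "a"]
    norm = log_normalize(raw_counts)
    print(cluster_mean_difference(norm, cluster_labels, "a", 0))
```

A positive score means the gene is, on average, more highly expressed inside the cluster than outside it; in a real analysis a proper statistical test with multiple-testing correction would replace this score.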

Citation Format: Nevena Vukojicic, Aleksandar Danicic, Zelia Worman, Rowan Beck, Dalibor Veljkovic, Marko Matic, Jack DiGiovanna, Brandi Davis-Dusenbery. Highly customizable multi-sample single cell RNA-Seq pipeline on the CGC [abstract]. In: Proceedings of the American Association for Cancer Research Annual Meeting 2023; Part 1 (Regular and Invited Abstracts); 2023 Apr 14-19; Orlando, FL. Philadelphia (PA): AACR; Cancer Res 2023;83(7_Suppl):Abstract nr 2075.