Defining the developmental origins of cancer can help uncover cellular mechanisms of cancer development and progression and identify effective treatments, but doing so has been challenging. In this issue of Cancer Discovery, Moiso and colleagues constructed a developmental map of 33 cancer types, based on which they deconvoluted tumors into developmental components and constructed a deep learning classifier capable of high-accuracy tumor type prediction.
Cancer is a heterogeneous disease. Within the same organ, cancer can arise from different cell types of origin, which may give rise to distinct subtypes, and the identity of the cell of origin can profoundly influence the course of tumor evolution and the eventual phenotypes of cancer, such as the acquisition of genetic and epigenetic changes, activation of malignant programs, metastatic ability, and response to therapy (1). It is, therefore, critical to assess the cells of origin of cancers and tailor treatment strategies accordingly, in particular for cancer of unknown primary (CUP), which remains a challenging clinical problem. The correct identification of the cell of origin is of the utmost importance in oncology, as it may help us to better understand the underlying biology that is essential for developing better patient stratification and more precise and effective therapies. However, defining the developmental origins of human cancers has proven difficult, as the accumulation of genetic drivers and nongenetic cancer cell plasticity (2), together with environmental factors (e.g., smoking, carcinogens; ref. 3) and the tumor microenvironment, can affect tumor development, promote subtype transition, and shape the phenotypes of cancers. As a result, the developmental landscape of human cancer has yet to be systematically characterized.
In this issue, Moiso and colleagues (4) performed a systematic developmental deconvolution of The Cancer Genome Atlas (TCGA) tumors across 33 cancer types to map their developmental trajectories. In their analysis, the authors used single-cell RNA sequencing (scRNA-seq) data from the Mouse Organogenesis Cell Atlas (MOCA) study (5) as the reference dataset. The MOCA dataset contains around 2 million single cells grouped into 56 developmental subtrajectories with comprehensive annotation, providing a single-cell transcriptional landscape of the mammalian organogenesis process (E9.5 to E13.5; ref. 5). The investigators first computed the rank-based correlation coefficients for commonly expressed genes (n = 15,929) between each TCGA sample and each MOCA single cell. The correlation coefficients were then averaged over all samples of the same TCGA tissue type and all cells of the same MOCA developmental subtrajectory to derive a similarity score for each tissue type and subtrajectory pair. The similarity scores were further scaled, followed by hierarchical clustering to generate a developmental map between 62 TCGA tissue types and 56 MOCA developmental subtrajectories. The resultant map provides a global view of relationships between tissue types and developmental trajectories. The authors were able to further reproduce their observed correlations in two independent datasets: a formalin-fixed paraffin-embedded tissue cohort (n = 40) and a single-cell cohort of human fetal tissues cataloging later embryonic stages of mid-gestation development (6). These results demonstrate the robustness of their correlation approach, supporting the notion that these observed correlations were likely due to underlying biological relationships.
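The correlation-and-averaging step described above can be illustrated with a minimal sketch. This is not the authors' code: the function names, the toy data, and the choice of row-wise z-scoring for the "scaling" step are assumptions made for illustration, and ties in the ranking are broken by position rather than averaged. Hierarchical clustering of the scaled matrix is omitted.

```python
import numpy as np

def rank_rows(x):
    """Ordinal ranks within each row; ties are broken by position (a simplification)."""
    return np.argsort(np.argsort(x, axis=1), axis=1).astype(float)

def spearman_matrix(a, b):
    """Rank-based (Spearman) correlation between every row of a and every row of b."""
    ra, rb = rank_rows(a), rank_rows(b)
    ra -= ra.mean(axis=1, keepdims=True)
    rb -= rb.mean(axis=1, keepdims=True)
    ra /= np.linalg.norm(ra, axis=1, keepdims=True)
    rb /= np.linalg.norm(rb, axis=1, keepdims=True)
    return ra @ rb.T  # shape: (rows of a) x (rows of b)

def similarity_map(samples, sample_labels, cells, cell_labels):
    """Average sample-by-cell correlations per (tissue type, subtrajectory) pair."""
    corr = spearman_matrix(samples, cells)
    s_lab, c_lab = np.asarray(sample_labels), np.asarray(cell_labels)
    tissues = sorted(set(sample_labels))
    trajectories = sorted(set(cell_labels))
    scores = np.array([[corr[np.ix_(s_lab == t, c_lab == j)].mean()
                        for j in trajectories] for t in tissues])
    # Assumed scaling: z-score each tissue type across subtrajectories.
    scaled = (scores - scores.mean(axis=1, keepdims=True)) / scores.std(axis=1, keepdims=True)
    return tissues, trajectories, scaled

# Toy data (hypothetical): two expression profiles over 50 shared genes.
rng = np.random.default_rng(0)
profiles = rng.normal(size=(2, 50))
cells = np.repeat(profiles, 5, axis=0) + rng.normal(scale=0.1, size=(10, 50))
cell_labels = ["traj_1"] * 5 + ["traj_2"] * 5
samples = np.repeat(profiles, 3, axis=0) + rng.normal(scale=0.1, size=(6, 50))
sample_labels = ["tissue_A"] * 3 + ["tissue_B"] * 3

tissues, trajectories, scaled = similarity_map(samples, sample_labels, cells, cell_labels)
print(tissues, trajectories)
print(scaled)
```

In this toy setting, each tissue type should score highest against the subtrajectory whose profile it was generated from; with real data, the rows and columns of the scaled matrix would then be ordered by hierarchical clustering to produce the map.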
The authors leveraged gene expression data on matched normal tissues from TCGA and performed tumor–normal comparisons of developmental relationships. As expected, their analyses suggest abnormal vascularization and dedifferentiated cells in some tumors relative to their matched normal tissues. To better characterize tumor cell dedifferentiation, they integrated the transcriptional dynamics of mouse organogenesis from the MOCA study (5) and investigated how developmental lineage relationships between samples and trajectories changed at different embryonic times when comparing tumor and normal tissues. Perhaps not surprisingly, they observed a significant pan-cancer enrichment for the earlier embryonic period and, consistently, a shift toward a lower embryonic period score in tumors compared with normal tissues. These results are consistent with our current understanding that cellular plasticity is a widespread property and hallmark capability of cancer (7), but it remains unclear (and likely indistinguishable) whether the observed lower embryonic period score in cancer cells was due to dedifferentiation of mature cells back to progenitor states or blocked differentiation (7). Notably, their analysis also revealed some unexpected relationships. For example, a high degree of similarity was found in lung-derived tumors including both lung adenocarcinoma and lung squamous cell carcinoma with gut-derived trajectories such as stomach and midgut/hindgut, whereas such similarity was not observed in normal lung tissues. Interestingly, it is known that the lung bud develops from the anterior foregut in embryonic week 3; these data are therefore in line with previous reports suggesting that some embryonic genes and embryonic developmental programs are reexpressed in cancer cells (8). It is noteworthy that oncofetal reprogramming has been demonstrated in both malignant cells and the tumor microenvironment and appears as an emerging area of cancer research (9). 
It will be of great interest to systematically characterize fetal-like cells and their roles in tumor development and progression, which may provide new avenues for therapeutic targeting of tumor-specific signaling networks.
Next, the team attempted to systematically deconvolute bulk tumor transcriptomes into component developmental programs. For each sample, a deconvolution score was generated for each developmental subtrajectory at each embryonic time point for a total of 214 scores [developmental components (DC)]. To assess the performance of the deconvolution method, the authors examined a variety of annotated malignant and nonmalignant cell types from 13 different scRNA-seq studies and obtained consistent developmental signals (whether from a single cell or aggregated from 1,000 single cells) for each cell type across different studies. In addition, for a hepatocellular carcinoma sample sequenced by scRNA-seq, they deconvoluted the combined gene expression of all nonmalignant and malignant single cells from that tumor, compared the resultant DCs with those of one bulk-sequenced hepatocellular carcinoma sample from the TCGA, and observed a high similarity between them. The authors went on to deconvolute all TCGA samples. A more detailed examination showed various levels of sample-to-sample variation in certain DCs within a tumor type but overall distinct profiles across tumor types. Notably, uniform manifold approximation and projection (UMAP) dimensionality reduction of the 214 DC scores for each sample clustered most tumor types distinctly according to the tissue of origin. Collectively, these results suggested that their developmental deconvolution method is able to capture relevant signals from both malignant cells and admixed normal cells and that the resultant DC scores can potentially be used to resolve different tumor types.
Motivated by these observations, the authors sought to build a machine learning model to classify malignancies. They constructed a developmental multilayer perceptron (D-MLP) classifier using DC scores as input. When trained on DC scores, the D-MLP classifier reached an overall “top 1” prediction concordance of 73%, with an area under the receiver operating characteristic curve (ROC-AUC) of 0.974, and, strikingly, with high precision and recall for most tumor types. The authors further validated the high performance of the D-MLP classifier by applying it to annotated malignant cells from scRNA-seq studies. In addition, they assessed how tumor purity affects classification and showed that for samples with >20% tumor purity, the D-MLP classifier remains relatively robust, with an ROC-AUC of >0.787; it is noteworthy that it outperforms benchmark classifiers trained directly on gene expression data. The high performance of the D-MLP classifier is likely due to two main reasons. First, unlike other gene expression classifiers, the D-MLP classifier uses developmental trajectories to reduce the dimensionality of gene expression data, which helps avoid model overfitting without compromising predictive power. Second, a key feature of a multilayer perceptron neural network model is its ability to simultaneously perform feature extraction and classification, and it takes advantage of backpropagation, an algorithm that iteratively adjusts the weights of the network, thereby increasing the model's accuracy.
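To make the role of backpropagation concrete, the following is a minimal sketch of a multilayer perceptron trained by gradient descent. It is a generic one-hidden-layer network, not the D-MLP itself: the architecture, hyperparameters, function names, and toy data are all assumptions, and the 10 input features stand in for the 214 DC scores.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)  # subtract row max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def forward(params, X):
    """Return hidden activations (ReLU) and class probabilities (softmax)."""
    W1, b1, W2, b2 = params
    H = np.maximum(0.0, X @ W1 + b1)
    return H, softmax(H @ W2 + b2)

def train_mlp(X, y, n_classes, hidden=16, lr=0.1, epochs=500, seed=0):
    """Full-batch gradient descent on cross-entropy loss (hypothetical settings)."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    params = [rng.normal(0, 0.1, (d, hidden)), np.zeros(hidden),
              rng.normal(0, 0.1, (hidden, n_classes)), np.zeros(n_classes)]
    Y = np.eye(n_classes)[y]  # one-hot labels
    for _ in range(epochs):
        H, P = forward(params, X)
        # Backpropagation: push the output error back through each layer.
        dZ2 = (P - Y) / n                         # gradient at the output logits
        dZ1 = (dZ2 @ params[2].T) * (H > 0)       # gradient at the hidden pre-activations
        grads = [X.T @ dZ1, dZ1.sum(axis=0), H.T @ dZ2, dZ2.sum(axis=0)]
        for p, g in zip(params, grads):
            p -= lr * g                           # in-place weight update
    return params

def predict(params, X):
    """Top-1 prediction: the class with the highest predicted probability."""
    return forward(params, X)[1].argmax(axis=1)

# Toy data (hypothetical): three "tumor types" as noisy clusters in a 10-feature space.
rng = np.random.default_rng(1)
centers = rng.normal(0, 1.5, (3, 10))
y = np.repeat(np.arange(3), 40)
X = centers[y] + rng.normal(0, 0.4, (120, 10))

params = train_mlp(X, y, n_classes=3)
print("training top-1 concordance:", (predict(params, X) == y).mean())
```

The hidden layer acts as a learned feature extractor and the output layer as the classifier, which is the sense in which the two are performed simultaneously. A realistic evaluation would report concordance on held-out samples rather than on the training set.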
Lastly, to showcase the potential value of the D-MLP classifier in clinical settings, the authors gathered a cohort of 52 patients with CUP. Analysis of the relationships between DC scores and D-MLP classifier predictions suggested that some CUPs could be resolved by transcriptomic profiling, whereas others may express overlapping developmental programs or are likely dedifferentiated. One case was selected for further investigation, and, inspiringly, whereas the CUP remained undiagnosed with available techniques including hematoxylin and eosin examination of tissue, IHC staining, mutation, and molecular profiling, the D-MLP classifier accurately predicted its cell of origin. With regard to patients with CUPs, chemotherapy regimens have been disappointing, and the overall survival has not improved over the last decade. Understanding the cells of origin and underlying biology of CUPs is therefore an indispensable step toward improving patient care. These results, although preliminary, suggest that new computational approaches such as the D-MLP classifier could present a chance to aid in the identification of the developmental origins of CUPs, provide insights into CUP biology, and enable targeted, effective therapeutic interventions to improve outcomes.
In summary, this elegant study by Moiso and colleagues presents an innovative approach for the developmental deconvolution of cancer, provides an atlas of tumor developmental origins, and offers a deep learning classifier that shows promise in improving CUP diagnostics. Although this approach appears to be robust, there is still room for improvement. First, although MOCA represents a complete single-cell developmental reference atlas of mammalian organogenesis, there are differences in species and developmental stages between human and mouse, and not all human cell lineages and subtypes match 1:1 with mouse cell lineages and subtypes (6). Therefore, constructing a new classifier using human single-cell atlas datasets will be of great importance. Second, the current deconvolution approach focuses on distinguishing different tissue types, and the challenge for the future will be to improve its ability to distinguish different entities within a tissue type (e.g., lung adenocarcinoma vs. squamous cell carcinoma) and possibly within an individual tumor (e.g., mixed-histology cancer). However, this approach could be limited by the possibility that tumors originating from different cells of origin could be indistinguishable based on gene expression profiles.
Importantly, cell of origin and tumor subtype are not directly linked, and more work is therefore needed to better understand the underlying developmental heterogeneity within tumors. Given the inter- and intratumoral heterogeneity as well as subtype transdifferentiation, it will be interesting to understand to what extent the cancer cell state, as measured by RNA expression, is dictated by the cell of origin rather than the acquired phenotypic plasticity. Notably, the tumor–normal comparison revealed some intriguing lineage relationships. Further study of these relationships will be essential to better understand tumorigenesis. For example, interrogating reexpressed embryonic genes as well as the embryonic developmental programs in cancer cells will likely reveal important insights into oncofetal reprogramming, which may represent new therapeutic avenues. Moreover, emerging evidence also suggests fetal-like reprogramming of the tumor microenvironment, such as the reemergence of fetal-associated endothelial cells and tumor-associated macrophages in the tumor ecosystem (10). This raises new challenges and intriguing questions. In particular, accurately deconvoluting malignant and nonmalignant compartments, with or without fetal-like features, from bulk-sequenced tumors remains a critical challenge. In addition, this study also showed some discordant D-MLP predictions and revealed potentially interesting new classification schemes that will require further investigation and careful validation in larger cohorts. Lastly, and perhaps more importantly, the D-MLP classifier shows promise in revealing the developmental origins of CUPs and the potential to supplement diagnostic decision-making, but challenges may arise in clinical settings and affect its prediction performance. Further prospective assessment within multi-institutional clinical trials will be important to comprehensively evaluate its clinical significance and limitations.
Author's Disclosures
No disclosures were reported.
Acknowledgments
L. Wang is supported in part by the NIH/NCI (R01CA266280 and U01CA264583), the Cancer Prevention & Research Institute of Texas (RP200385 and RP220101), the Break Through Cancer Foundation, and the start-up research fund and the University Cancer Foundation via the Institutional Research Grant Program at The University of Texas MD Anderson Cancer Center.