Abstract
Consensus molecular subtyping (CMS) of colorectal cancer has the potential to reshape the colorectal cancer landscape. We developed and validated an assay that is applicable to formalin-fixed, paraffin-embedded (FFPE) samples of colorectal cancer and implemented the assay in a Clinical Laboratory Improvement Amendments (CLIA)-certified laboratory.
We performed an in silico experiment to build an optimal CMS classifier using a training set of 1,329 samples from 12 studies and a validation set of 1,329 samples from 14 studies. We constructed an assay on the basis of NanoString CodeSets for the top 472 genes and performed analyses on paired flash-frozen (FF)/FFPE samples from 175 colorectal cancers to adapt the classifier to FFPE samples using a subset of genes found to be concordant between FF and FFPE. We then tested the classifier's reproducibility and repeatability and validated it in a CLIA-certified laboratory. We assessed the prognostic significance of CMS in 345 patients pooled across three clinical trials.
The best classifier was a weighted support vector machine with high accuracy across platforms and gene lists (>0.95), and the 472-gene model outperformed existing classifiers. We constructed subsets of 99 and 200 genes with high FF/FFPE concordance and adapted an FFPE-based classifier that had strong classification accuracy (>80%) relative to “gold standard” CMS. The classifier was robust to sample type and RNA quality, and demonstrated poor prognosis for CMS1 and CMS3 and good prognosis for CMS2 in metastatic colorectal cancer (P < 0.001).
We developed and validated a colorectal cancer CMS assay that is ready for use in clinical trials, both to assess prognosis in standard-of-care settings and to explore its value as a predictor of therapy response.
In this article, we have developed a gene expression assay for consensus molecular subtyping (CMS) of colorectal cancer that shows prognostic relevance of colorectal cancer CMS. The assay was validated in a Clinical Laboratory Improvement Amendments–certified laboratory and is applicable for clinical trials in the current format.
Introduction
Colorectal cancer is the third most common cancer and a leading cause of cancer-related deaths worldwide. Several articles were published introducing colorectal cancer molecular subtyping systems, each partitioning colorectal cancer into three to six subtypes (1–7). The translational value of these works was limited by their relatively small sample sizes and lack of consensus regarding which of the six subtyping systems best captured the tumor heterogeneity and had superior utility as predictive and/or prognostic marker. In this context, Guinney and colleagues (8) assembled the colorectal subtyping consortium (CRCSC) that sought to identify consensus molecular subtypes (CMS) by assembling a database of gene expression measurements from 4,151 patients with colorectal cancer from a collection of 18 international studies, having each of the six subtyping systems applied to each of these samples, and then using a network analysis to identify consensus clusters. The four consensus subtypes were identified primarily on the basis of the biologic characteristics of colorectal cancer. However, findings by Guinney and colleagues, and subsequent other studies, have demonstrated prognostic and predictive value of CMS in colorectal cancer (9–17).
To fully realize these potential benefits of CMS, it is necessary to have a robust, reliable single-sample classifier to discern a colorectal cancer patient's CMS from the tumor tissue. As part of an international consortium (8), we previously presented a random forest classifier that included 5,973 genes and a “single-sample” classifier based on nearest centroid predictor applied using 693 genes, built primarily using microarrays designed for use with flash-frozen (FF) samples. Efforts are needed to build a more parsimonious single-sample classifier using fewer genes that can be reliably run on RNA extracted from formalin-fixed, paraffin embedded (FFPE) samples. In this article, we introduce an FFPE-based CMS classifier using the NanoString platform that has strong accuracy for predicting the CMS in colorectal cancer samples including those from the CRCSC study. This gene classifier was discovered and validated in silico by using the CRCSC datasets and subsequently optimized on the basis of degree of correlation across tissue types, that is, FFPE versus FF samples, and platform type, NanoString versus Affymetrix. Subsequently, we validated this FFPE tissue–based gene classifier in a Clinical Laboratory Improvement Amendments (CLIA)-certified molecular diagnostic laboratory (MDL) and demonstrate prognostic significance of CMS in colorectal cancer.
Materials and Methods
Development and validation of CMS classifier on CRCSC
We performed in silico development and validation of the CMS classifier using the samples and datasets that were part of the “consensus” set in CRCSC, meaning that they had so-called “gold standard” CMS status, defined on the basis of agreement among the six different subtyping systems, against which we could compare to assess classification accuracy of our CMS classifier. Details of the discovery and validation approaches, datasets used, including tissue type, total number of samples, total number of consensus samples, and preprocessing method are shown in Fig. 1A, Supplementary Table S1, and Supplementary Materials and Methods S1 and S2.
Classification modeling strategies
The classification modeling strategies we considered included linear discriminant analysis (18, 19), quadratic discriminant analysis (20, 21), K-nearest neighbor (22, 23), random forest (8, 24), rotation forest (25, 26), weighted support vector machine (wSVM; refs. 27, 28), distance-weighted discrimination (29, 30), and ensemble methods comprising voting schemes across these classifiers (Supplementary Materials and Methods S3). We split the training dataset V1 into four subsets containing 332, 332, 332, and 333 samples, respectively, for use in the 4-fold cross-validation model building strategy. For each modeling strategy, we applied quantile normalization (30) and fit the model to each of the four three-fourths training subsets, optimizing tuning parameters using nested cross-validation, and assessed accuracy for predicting the gold standard CMS on the left-out one-fourth subset. Tuning parameters that showed the best accuracy were selected as optimal parameters for each subset. We summarized the predictive accuracy of each modeling strategy as a function of the number of genes, allowing us to assess both which modeling strategy appears to be best and the minimum number of genes needed for accurate CMS classification. After choosing the best modeling strategy, we computed the classification accuracy in the validation dataset V2, as well as in the various subsets mentioned above, and again summarized results as a function of the number of genes in the model.
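The fold structure described above can be illustrated with a minimal sketch. The function names (`make_folds`, `train_test_splits`) and the random seed are hypothetical and are not taken from the study code; the sketch only shows how 1,329 samples divide into folds of 332, 332, 332, and 333 and how each fold is held out in turn.

```python
import random

def make_folds(n_samples, n_folds=4, seed=0):
    """Shuffle sample indices and split them into n_folds nearly equal folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    base, extra = divmod(n_samples, n_folds)
    folds, start = [], 0
    for k in range(n_folds):
        # The last `extra` folds receive one additional sample.
        size = base + (1 if k >= n_folds - extra else 0)
        folds.append(indices[start:start + size])
        start += size
    return folds

def train_test_splits(folds):
    """Yield (train_indices, test_indices), holding out one fold at a time."""
    for k, test in enumerate(folds):
        train = [i for j, fold in enumerate(folds) if j != k for i in fold]
        yield train, test

folds = make_folds(1329)
fold_sizes = [len(f) for f in folds]
split_sizes = [(len(tr), len(te)) for tr, te in train_test_splits(folds)]
```

In a nested scheme, the same splitting would be repeated inside each training subset to choose tuning parameters before scoring the held-out fold.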
Gene ranking strategy
We designed a boosting procedure on the basis of multi-class Adaboost (31) to order the genes (see Supplementary Materials and Methods S4), which amounts to a forward stage-wise additive selection in which samples were repeatedly reweighted at each step so the next best gene focused more on samples misclassified on previous steps, resulting in a list of genes ranked in descending order of classification importance. By using the same reduced gene sets for each classification method, we were able to gain a straightforward comparison of which method appears to perform better, to find the minimum gene set size yielding good classification performance, and to fairly compare the various methods at any desired model size.
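To make the idea of boosting-based gene ranking concrete, the following is a simplified sketch and not the procedure of Supplementary Materials and Methods S4. It assumes a SAMME-style multi-class AdaBoost weight update and uses a single-gene weighted nearest-centroid rule as the weak learner; both choices, and the names `stump_predict` and `rank_genes`, are illustrative assumptions. The toy data are invented.

```python
import math

def stump_predict(values, labels, weights, n_classes):
    """Weighted nearest-centroid rule on one gene; returns a prediction per sample."""
    centroids = []
    for c in range(n_classes):
        w_c = sum(w for w, y in zip(weights, labels) if y == c)
        s_c = sum(w * v for w, v, y in zip(weights, values, labels) if y == c)
        centroids.append(s_c / w_c)
    return [min(range(n_classes), key=lambda c: abs(v - centroids[c]))
            for v in values]

def rank_genes(X, y, n_classes, n_rank):
    """Forward stage-wise selection: pick the gene with the lowest weighted
    error, then up-weight the samples it misclassified (SAMME update)."""
    n, p = len(X), len(X[0])
    weights = [1.0 / n] * n
    ranked = []
    for _ in range(n_rank):
        best = None
        for g in range(p):
            if g in ranked:
                continue
            preds = stump_predict([row[g] for row in X], y, weights, n_classes)
            err = sum(w for w, yh, yt in zip(weights, preds, y) if yh != yt)
            if best is None or err < best[0]:
                best = (err, g, preds)
        err, g, preds = best
        ranked.append(g)
        err = min(max(err, 1e-10), 1 - 1e-10)   # avoid log(0)
        alpha = math.log((1 - err) / err) + math.log(n_classes - 1)
        alpha = max(alpha, 0.0)                  # ignore worse-than-chance learners
        weights = [w * math.exp(alpha) if yh != yt else w
                   for w, yh, yt in zip(weights, preds, y)]
        total = sum(weights)
        weights = [w / total for w in weights]
    return ranked

# Toy data: 6 samples, 3 classes, 3 genes (columns).
y = [0, 0, 1, 1, 2, 2]
X = [[1, 0, 0], [5, 0, 0], [1, 5, 0], [5, 5, 0], [1, 10, 10], [5, 4, 10]]
ranking = rank_genes(X, y, n_classes=3, n_rank=3)
```

In this toy example, gene 1 misclassifies only the last sample and is ranked first; the reweighting then favors gene 2, which classifies that sample correctly, over the uninformative gene 0.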
wSVM classifier
Our results revealed that the best performing classifier was the wSVM. The user calls the wSVM function with an N by P matrix of expression values for P genes for each of N samples, with the column names as Entrez IDs, and the function quantile normalizes the data and applies the wSVM to obtain class predictions. After pairwise coupling, for sample i, we obtain probabilities of each CMS, πij, such that $\sum_{j=1}^{4} \pi_{ij} = 1$, with αi = maxj {πij} indicating the highest CMS class probability for that sample, which we consider a measure of CMS classification confidence. We have two possible rules to classify a sample into a CMS group based on these measures:
(i) Most likely CMS: classify sample i into the most likely CMS, {j: πij = max (πij)}, regardless of classification confidence αi.
(ii) Most likely CMS with a confidence threshold: classify sample i into the most likely CMS as long as the classification confidence, αi, is above some threshold λ (e.g., 0.50 or higher), and otherwise consider it indeterminate {Choose CMS j: πij = max (πij) if πij > λ, otherwise CMS indeterminate}. Indeterminate samples are heterogeneous tumors containing characteristics of multiple CMS, so could also be called “mixed CMS,” as done for 13% of total samples by Guinney and colleagues (8).
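The two calling rules can be written as one short function. This is a minimal sketch; the function name `call_cms` and the example probabilities are hypothetical, and the input is assumed to be the four class probabilities πi1, ..., πi4 for one sample.

```python
def call_cms(probs, threshold=None):
    """Return (call, confidence) for one sample.

    probs: probabilities for CMS1..CMS4, summing to 1.
    threshold: None applies rule (i); a number applies rule (ii),
    returning "indeterminate" unless the confidence exceeds it.
    """
    confidence = max(probs)                 # alpha_i
    best = probs.index(confidence) + 1      # most likely CMS (first one on ties)
    if threshold is not None and confidence <= threshold:
        return "indeterminate", confidence
    return f"CMS{best}", confidence

rule_i = call_cms([0.05, 0.48, 0.02, 0.45])
rule_ii = call_cms([0.05, 0.48, 0.02, 0.45], threshold=0.50)
confident = call_cms([0.10, 0.70, 0.10, 0.10], threshold=0.50)
```

Under rule (i) the first example is called CMS2 despite a confidence of only 0.48, whereas under rule (ii) with λ = 0.50 the same sample is left indeterminate.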
Generation of gene signature classifier in colorectal cancer samples
Summary of the approach for development of NanoString classifier
We used a novel strategy to port the CMS classifier designed for Affymetrix platform on FF samples over to the NanoString/FFPE setting that efficiently utilizes the vast information available to us in the CRCSC datasets and overcomes inconsistencies in mRNA quality between FF and FFPE samples (Fig. 1B; Supplementary Fig. S1).
Affymetrix 133-2 Plus2.0– and NanoString CodeSets–based gene expression assays
FF and FFPE tumor samples from 175 randomly selected patients (95 men and 80 women) with stage I–IV colon cancer were included in the first phase of the study to build an FFPE classifier. Seventy-two of the 175 samples were included in the CRCSC; the “gold standard” CMS based on the Affymetrix expression array was therefore known, and only the NanoString assay was run on RNA extracted from FF and FFPE tissue samples of these tumors. Additional FF and FFPE samples from 103 tumors were identified from our institutional biorepository. Because the “gold standard” CMS was not known for these samples, Affymetrix 133-2 Plus2.0 was run on the FF samples to establish the “gold standard” CMS. Subsequently, the NanoString assay was run on RNA extracted from FF and FFPE samples from these tumors. All samples were derived from primary colon/rectum resection specimens without preoperative tumor-targeted therapy. The clinicopathologic features of the patient population are shown in Table 1 and Supplementary Materials and Methods S5. Briefly, tumor areas with higher than 60% tumor cellularity in gland-forming tumors and higher than 20% tumor cellularity in signet ring or mucinous tumors were manually macrodissected from FF or FFPE tissue sections to enrich for the demarcated tumor area. In more than two-thirds of samples, superficial and deep (invasive border) areas of the tumor were included in the macrodissection. RNA was extracted using Qiagen's AllPrep DNA/RNA Kits (Qiagen) per the manufacturer's instructions. Each sample was quantitated using the Qubit fluorometer, with yields ranging from 0.9 to 26 μg. Each sample was run on the Agilent Bioanalyzer (Agilent Technologies) to determine the RNA integrity number (RIN) for the FF samples and the DV200 value for the FFPE samples. In the FFPE samples included in the development of the single-sample classifier, 18%–79% of the RNA fragments were longer than 200 nucleotides.
Gene expression analysis with Affymetrix 133-2 Plus2.0 was performed as described previously (Supplementary Materials and Methods S6). We designed a custom set of NanoString CodeSets (NanoString Technologies) with 472 signature probes, selected from the top 500 genes from the boosting procedure, and 28 reference probes. The NanoString CodeSet for each gene was chosen to be the genomic region that was most highly correlated with the fRMA gene-level expression summary from the Affymetrix 133-2 Plus2.0 Probe Set (Affymetrix). The genes with at least 0.70 correlation between the CodeSets and the gene-level summary were included in the customized CodeSets. The 28 reference CodeSets were selected from the reference CodeSets on the NanoString PanCan Array Plus, choosing genes that showed no evidence of differing across CMS in our preliminary data (32–36). The assay was performed as per the NanoString guidelines (Supplementary Materials and Methods S7) with 10 patient samples, a positive control, and a negative control on each cartridge. Raw data from the nSolver software were transferred to the bioinformatics group, where the custom CMS classifier algorithm, implemented as an R script, was used to assign each sample to a CMS.
Characteristics | n (175) |
---|---|
Age | |
<50 years | 39 |
>50 years | 136 |
Gender | |
Male | 95 |
Female | 80 |
Tumor location | |
Right colon | 75 |
Left and sigmoid colon | 92 |
Rectum | 5 |
Multiple primary tumors | 3 |
pT stage | |
pT1 | 0 |
pT2 | 16 |
pT3 | 133 |
pT4 | 18 |
pT4a | 3 |
pT4b | 1 |
pN stage | |
pN0 | 76 |
pN1 | 55 |
pN1a | 2 |
pN1b | 5 |
pN1c | 0 |
pN2 | 27 |
pN2a | 4 |
pN2b | 2 |
pN3 | 0 |
pNX | 4 |
pM stage | |
pM0 | 165 |
pM1 | 5 |
pMX | 5 |
Grade | |
Low (well or moderately differentiated) | 145 |
High (poorly differentiated) | 30 |
Time between date of surgery and gene expression analysis | |
<5 years | 20 |
5–10 years | 95 |
>10 years | 60 |
Samples | |
Matched FF and FFPE | 149 |
Only FFPE | 12 |
Only FF | 4 |
Validation at the research molecular diagnostic laboratory
The NanoString CodeSets were technically validated by running two samples with “gold standard” CMS known from the CRCSC data on three different lots of CodeSets. The old and new lots of CodeSets were run together in the same run, and accuracy in identifying the CMS by these CodeSets was assessed by linear regression. We tested repeatability of the CMS assay across four different runs by the same technician in 12 samples, reproducibility with a different technician in 12 samples, and reproducibility with different input RNA quantities (50–500 ng) for six samples using the same CodeSets on the same nCounter used for prior experiments. We also tested reproducibility of CMS between colonoscopy biopsies and surgically resected primary colorectal cancer by running customized CodeSets on matched biopsy and resection samples, using the same CodeSets, nCounter, and laboratory personnel.
Assessing performance of CMS classifier assay in a CLIA-certified laboratory
The NanoString assays with the top 200 genes (colorectal cancer CMS-200) and the top 99 genes (colorectal cancer CMS-100) were further validated at our CLIA-certified MDL so that the assay could be applied as an integral biomarker in a phase II clinical trial (NCT034365630) assessing the safety and efficacy of the dual TGF-β trap/anti-PD-L1 molecule M7824 (EMD-Serono) in CMS4 subtype colorectal cancer. Thirty-five tumor samples from stage II/III primary colon cancer, previously used for validation at the research MDL, were used to validate the assay across 10 runs for a total of 120 reactions. All 35 samples were included in the CRCSC study, so the gold standard CMS was known for these samples, and the laboratory technician was blinded to it. Input for the assay was 250 ng of total RNA extracted from FFPE tumor tissue with 20% or higher tumor cellularity. Accuracy, analytic sensitivity, and analytic specificity were assessed by comparing calls from the MDL CMS panel with “gold standard” CRCSC Affymetrix calls. Reproducibility was assessed across the original run and at least three additional repeat runs without reextraction of RNA. Repeat runs were also performed with reextracted RNA and by two technicians.
Assessing performance of CMS classifier as a prognostic marker in stage IV colorectal cancer
Patients with a CMS determination from the NanoString-based gene expression score were pooled from three separate sources: clinical trial NCT03435630 (n = 91), phase II clinical trial NCT03428126 (n = 19; assessing trametinib and durvalumab in microsatellite stable colorectal cancer), and NCT01196130 [n = 235; Assessment of Targeted Therapies Against Colorectal Cancer (ATTACC); ref. 37]. The ATTACC samples and the samples from the trametinib/durvalumab study were characterized by the colorectal cancer CMS-100 assay at the research MDL, while samples from patients enrolled in the M7824 clinical trial were characterized by the colorectal cancer CMS-200 assay performed at the CLIA-compliant MDL. Overall survival was calculated from the date of stage IV diagnosis to the date of death, with patients who were alive at last follow-up censored at that date. Survival patterns were visualized with Kaplan–Meier survival curves and compared using the log-rank test. Graphs were generated using IBM SPSS Statistics 24.
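For readers unfamiliar with how a median survival time is read from a Kaplan–Meier curve in the presence of censoring, a minimal sketch follows. It is not the SPSS analysis used in this study; the function names and the toy follow-up data are invented for illustration.

```python
def kaplan_meier(times, events):
    """Return [(t, S(t))] at each distinct time with at least one death.

    times: follow-up in months from stage IV diagnosis.
    events: 1 = death observed, 0 = censored at last follow-up.
    """
    data = sorted(zip(times, events))
    at_risk = len(data)
    surv, curve, i = 1.0, [], 0
    while i < len(data):
        t = data[i][0]
        deaths = n_at_t = 0
        # Group all patients sharing this follow-up time.
        while i < len(data) and data[i][0] == t:
            deaths += data[i][1]
            n_at_t += 1
            i += 1
        if deaths:
            surv *= 1 - deaths / at_risk
            curve.append((t, surv))
        at_risk -= n_at_t   # deaths and censored patients leave the risk set
    return curve

def median_survival(curve):
    """First time at which the survival estimate drops to 0.5 or below."""
    for t, s in curve:
        if s <= 0.5:
            return t
    return None   # median not reached

times = [5, 8, 12, 12, 20, 30]
events = [1, 0, 1, 1, 0, 1]
curve = kaplan_meier(times, events)
median = median_survival(curve)
```

Censored patients contribute to the risk set up to their last follow-up but never count as deaths, which is why the estimate differs from a simple median of follow-up times.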
The study was approved by the institutional review board (IRB) with an informed consent obtained from each subject or each subject's guardian for the clinical trial samples. The work on samples from subjects not enrolled in the clinical trial was approved by the IRB with waiver of informed consent. The study has been conducted as per the ethical guidelines of U.S. Common Rule.
Results
Performance of CRCSC classifier on CRCSC datasets
We selected the wSVM model as it had the best performance in the training data V1 using 4-fold cross-validation (Supplementary Fig. S2; Supplementary Table S2; Supplementary Materials and Methods S8). The four-group classification accuracy of the wSVM model on the validation dataset V2 was 0.955 for the full model (5,973 genes), and still outstanding for models involving smaller gene numbers, with four-group classification accuracies of 0.959, 0.932, and 0.898 for models with 500, 75, and 20 genes, respectively. The performance of the wSVM classifier for the out-of-sample subset (V2o), the RNA-sequencing subset [The Cancer Genome Atlas (TCGA)], and the Affymetrix subset (V2a) was comparable with the overall validation performance (V2), suggesting the classifier was robust to platform and has good out-of-sample performance, relatively even across CMS (Supplementary Tables S3–S7; Supplementary Materials and Methods S9). We chose a wSVM classifier with 472 genes to move forward with further validation. This classifier yielded an overall classification accuracy of 0.963 in the Affymetrix subset V2a, with accuracies of 0.966, 0.967, 0.932, and 0.971 for CMS1, CMS2, CMS3, and CMS4, respectively. The CMS structure was remarkably persistent, being highly consistent in training and validation datasets (heatmap in Supplementary Fig. S3). Further comparison of our classifier with the classifiers described by Guinney and colleagues is shown in Supplementary Table S8. Performance of the 472-gene CRCSC classifier based on a single Affymetrix probe gene set and the classifier performance by classification confidence are described in Supplementary Materials and Methods S10 and S11 and Supplementary Fig. S4.
Supplementary Table S9 shows all of the misclassified samples along with the corresponding wSVM class probabilities (πij) for each CMS, classification confidence (αi), and indication of whether this sample could be considered a “CMS mixture” (i.e., πij > 0.20 for multiple CMS) and if the “gold standard” was a part of that mixture. From this, we see that most of the “misclassified samples” had lower classification confidence αi, and many had evidence of being CMS mixtures, with the “gold standard” CMS being a component of the mixtures.
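The “CMS mixture” flag used for the misclassified samples can be expressed as a short check. The sketch below assumes the cutoff of 0.20 stated above; the function name `describe_sample` and the example probabilities are hypothetical.

```python
def describe_sample(probs, gold, cutoff=0.20):
    """Flag a sample as a CMS mixture when more than one class probability
    exceeds the cutoff, and report whether the gold standard CMS is a component.

    probs: probabilities for CMS1..CMS4; gold: e.g. "CMS4".
    """
    components = [f"CMS{j + 1}" for j, p in enumerate(probs) if p > cutoff]
    is_mixture = len(components) > 1
    return {
        "confidence": max(probs),
        "components": components,
        "is_mixture": is_mixture,
        "gold_in_mixture": is_mixture and gold in components,
    }

borderline = describe_sample([0.05, 0.48, 0.02, 0.45], gold="CMS4")
clear_call = describe_sample([0.90, 0.05, 0.03, 0.02], gold="CMS1")
```

The borderline example would be called CMS2 and counted as an error against a CMS4 gold standard, yet CMS4 is one of its two mixture components.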
NanoString CRCSC classifier optimization based on correlation between FF and FFPE tumor samples
The sample-specific correlations of FF and FFPE measurements were very high for most samples (Fig. 2A; Supplementary Table S10A; Supplementary Materials and Methods S12), with a small number of samples with low correlations tending to have poorer RNA quality for their FF samples (P = 0.0077), but not FFPE samples (P = 0.28; Fig. 2B and C). A histogram of the gene-specific correlation of FF and FFPE measurements for each of the 472 classifier genes is presented in Fig. 2D and summarized in Supplementary Table S10B. This histogram demonstrates the high level of variability across genes in terms of concordance of paired FF/FFPE gene expression measurements, and the remarkable consistency of the gene-specific concordances across batches (Supplementary Fig. S5) suggests that this concordance is a consistent characteristic of the gene/probe set and not a random technical factor. This motivated us to select a subset of genes showing high FF/FFPE concordance for use in our FFPE classifier, choosing the top 100 genes in terms of FF/FFPE correlation for the CMS-100 classifier and the top 200 genes for CMS-200.
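Selecting genes by FF/FFPE concordance can be sketched as follows. The correlation measure is assumed here to be the Pearson coefficient, which the text does not specify, and the gene names, expression values, and function names are invented for illustration.

```python
import math

def pearson(x, y):
    """Pearson correlation between two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def top_concordant_genes(ff, ffpe, n_top):
    """Rank genes by correlation of paired FF and FFPE measurements.

    ff, ffpe: dict mapping gene -> expression values over the same
    paired samples, in the same order.
    """
    corr = {g: pearson(ff[g], ffpe[g]) for g in ff}
    ranked = sorted(corr, key=corr.get, reverse=True)
    return ranked[:n_top], corr

# Toy paired measurements for three hypothetical genes across four tumors.
ff = {"GENE_A": [1, 2, 3, 4], "GENE_B": [1, 2, 3, 4], "GENE_C": [1, 2, 3, 4]}
ffpe = {"GENE_A": [1.1, 2.0, 2.9, 4.2], "GENE_B": [4, 1, 3, 2], "GENE_C": [2, 1, 4, 3]}
selected, corr = top_concordant_genes(ff, ffpe, n_top=2)
```

With `n_top` set to 100 or 200 and the 472 classifier genes as input, the same ranking would yield gene lists analogous to CMS-100 and CMS-200.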
NanoString FFPE classifier performance
Figure 3 shows the classification accuracy of CMS-100 on FFPE and FF samples, CMS-472 on FF samples, and the Affy FF-100 and Affy FF-472 based on the Affymetrix validation data V2a, with accuracy split out by confidence threshold, α, and proportion of unclassified samples. The colorectal cancer CMS-100 model applied to FFPE samples had a four-group accuracy of 0.80, with 0.81 for CRCSC samples and 0.78 for non-CRCSC samples. For samples with high confidence (αi > 0.80 or 0.90), the performance was better, with four-group accuracies of 0.86 and 0.89, respectively. For FF samples, the CMS-100 had a four-group classification accuracy of 0.80, with 0.74 for the CRCSC samples and 0.88 for non-CRCSC samples, and four-group accuracies of 0.87 and 0.92 for samples classified with high confidence, αi > 0.80 or 0.90, respectively (Supplementary Materials and Methods S13). These performed comparably with the CMS-472, the 472-gene classifier on FF samples, and not much worse than CMS-100 in an idealized nonclinical setting on the basis of batch-corrected Affymetrix data from the CRCSC studies.
Supplementary Fig. S6A and S6B plot the four-class accuracy versus confidence level αi for FFPE and FF samples, demonstrating that samples classified with high confidence were more likely to be accurately classified. Supplementary Fig. S6C and S6D plot the four-class accuracy versus RNA quality, defined by %200 nt (FFPE) or RIN (FF), demonstrating that there is little, if any, association of CMS accuracy with RNA quality, suggesting that the performance of the classifier is robust to RNA quality in this study. One gene of the 100 was mistakenly left off an order of the NanoString CodeSets for some of the validation studies, so the validated CMS-100 classifier has 99 genes. We confirmed that the performance of the 100-gene and 99-gene classifiers was concordant.
The colorectal cancer CMS-100 assay with 99 genes was 100% reproducible in predicting a CMS across different runs (12 samples, 48 runs), between two laboratory personnel (12 samples), and with different RNA input concentrations (n = 6). The reproducibility between biopsy and resection samples was 88%, with 15 of 17 patients having the same CMS in matched biopsy and resection specimens (Supplementary Table S11). All 17 biopsy samples (12 from the left colon and five from the right colon) were procured from the same tumor as the surgically resected specimens. Tissue sections for RNA extraction were derived from FFPE blocks that were generated for clinical use. The two cases with discrepant CMS between biopsy and resection were sporadic colorectal cancers without any known predisposing condition or preoperative tumor-targeted therapy. To determine the impact of tumor location and histopathologic features on reproducibility of CMS, another pathologist reviewed hematoxylin and eosin–stained sections of the primary tumor from surgical resections of the cases included in the assessment of inter-run reproducibility, intertechnician reproducibility, reproducibility across different RNA concentrations, and reproducibility between biopsy and resection (n = 30). In 19 samples, both superficial and deep areas of the tumor were macrodissected; four samples had only the superficial area and seven had only the deep area of the tumor macrodissected. Five tumors were poorly differentiated, one with mucinous histology, and 25 tumors were moderately differentiated, including one with mucinous histology. Given the high reproducibility across runs, technicians, and RNA concentrations, we did not observe any difference in CMS call among samples with different areas of macrodissection or histologic parameters. We also did not observe a significant difference in the probability of a CMS in the context of histologic parameters.
However, the two (of 17) samples that showed a discrepancy in CMS between biopsy and surgical resection had only the deep area of the tumor macrodissected from the resection specimens. We also did not find histologic features unique to the seven samples that were discrepant for CMS between the research laboratory and the CLIA-certified laboratory.
Performance of colorectal cancer CMS-200 in CLIA-certified MDL
On the initial run, 32 of 35 samples were accurately assigned the CMS as compared with the gold standard based on “most likely CMS,” that is, with a confidence threshold of αi > 0.50. The three misclassified samples, which had a “most likely CMS” probability between 0.50 and 0.57 on the initial run, had a “most likely CMS” probability in the borderline range (≥0.43 and <0.57) on one of the repeat runs. They were classified as mixed CMS because they had almost equal probability of two CMSs (CMS2 and CMS4), with one of them matching the gold standard CMS. With a confidence threshold of 0.50, these three samples were considered errors when forcing a single CMS call, resulting in 91% analytic sensitivity and specificity. However, if a confidence threshold of >0.57 was used, then all three samples in all runs had CMS as per the gold standard, and the assay would have 100% analytic sensitivity and analytic specificity. Inter-run reproducibility was assessed from three separate extractions from four unique patient samples for a total of 12 cases. These 12 cases were run across three separate NanoString runs and by two technologists. There was 100% concordance for the CMS classification among all three runs, with an average SD of ±0.002 for the “most likely CMS” probability. The intertechnician reproducibility was 100% for the CMS classification between both technicians, with an average SD of ±0.002 for the CMS probability. Intrarun reproducibility was assessed among four samples run in triplicate on a single NanoString run. There was 100% concordance for the CMS classification among all three replicates, with an average SD of ±0.012 for the CMS probability. Comparison of CMS calls from the CMS-100 (99 genes) and CMS-200 assays demonstrated 97% concordance, with only one of 35 samples showing discordant CMS. The CMS-100 (99 genes) and CMS-200 test genes and the 16 housekeeping genes are listed in Supplementary Table S12.
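As a check on the arithmetic behind the two percentages reported above (32 of 35 correct single calls and 34 of 35 concordant calls between the two gene lists):

```latex
\[
\frac{32}{35} \approx 0.914 \;(91\%), \qquad
\frac{34}{35} \approx 0.971 \;(97\%)
\]
```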
These reproducibility and repeatability findings were deemed up to the level of a CLIA-certified assay to determine CMS4 versus other CMS for FFPE tumor samples from patients enrolled in the clinical trial targeting patients with CMS4 colorectal cancer, as described in the Materials and Methods.
KRAS-BRAF mutational status and prognostic relevance of CMS by the NanoString CMS classifier
To confirm the expected biologic performance of the assay, we surveyed a set of patients with metastatic colorectal cancer enrolled in clinical trials and the ATTACC protocol (Table 2). A higher frequency of KRAS mutation was observed in CMS3 (66%) and CMS4 (50%) samples. BRAF mutations were identified only in CMS1 (50%), CMS4 (8%), and mixed (12%) subtype samples (Fig. 4A). We did not find a significant difference in any of the clinicopathologic and molecular characteristics between samples that were classified as mixed and those that were classified into one of the CMSs. Using the CMS-100 (99 genes) classifier, we were able to identify significant differences in overall survival by CMS, consistent with prior studies (9, 14). Specifically, patients with a CMS2 tumor had the best survival, with a median of 46.1 months from stage IV diagnosis [95% confidence interval (CI), 36.6–58.1], and patients with a CMS1 or CMS3 tumor had the poorest survival after a stage IV diagnosis, with median survival times of 23.2 (95% CI, 19.3–59.2) and 21.4 (95% CI, 15.8–34.6) months, respectively. Patients with a CMS4 tumor had a survival pattern that was in between that of CMS2 and CMS1 or CMS3, with a median survival time of 35.3 months (95% CI, 32.2–40.0; Fig. 4B).
Characteristic | N = 345 |
---|---|
Mean age at initial diagnosis (SD) | 50.9 (11.5) |
Mean age at stage IV diagnosis (SD) | 51.5 (11.5) |
Sex | |
Male | 187 (54.2) |
Female | 158 (45.8) |
Race/ethnicity | |
NH White | 258 (74.8) |
NH African American | 32 (9.3) |
Hispanic | 27 (7.8) |
NH Asian | 22 (6.4) |
Other/unknown | 6 (1.7) |
Stage at initial diagnosis | |
I | 5 (1.4) |
II | 24 (7.0) |
III | 110 (31.9) |
IV | 204 (59.1) |
NA | 2 (0.6) |
KRAS mutation status | |
Wild-type | 107 (31.0) |
Canonical mutation | 147 (42.6) |
NA | 91 (26.4) |
NRAS mutation status | |
Wild-type | 236 (68.4) |
Canonical mutation | 17 (4.9) |
Noncanonical mutation | 1 (0.3) |
NA | 91 (26.4) |
BRAF mutation status | |
Wild-type | 229 (66.4) |
V600 | 20 (5.8) |
Other mutation | 5 (1.4) |
NA | 91 (26.4) |
MSI status | |
MSS | 177 (51.3) |
NA | 168 (48.7) |
CMS | |
1, Immune | 12 (3.5) |
2, Canonical | 117 (33.9) |
3, Metabolic | 21 (6.1) |
4, Mesenchymal | 161 (46.7) |
Mixed | 34 (9.9) |
Abbreviations: MSI, microsatellite instability; MSS, microsatellite stable; NA, not available; NH, non-Hispanic.
Discussion
CMS has great potential to reshape the landscape of colorectal cancer treatment and contribute to the development of new precision therapeutic approaches. However, to realize this potential, it is necessary to transform the CMS based on network analysis of multiple gene expression datasets into a clinical test, which requires an assay that is reproducible across platforms and tissue types, has high classification accuracy, and is able to generate CMS in a single-sample setting. We achieved this objective using a multi-step approach for building the classifier: (i) in silico testing of various classification strategies considering various gene list sizes on CRCSC data generated on the Affymetrix platform, which determined that wSVM is the optimal system; (ii) a gene reduction exercise to select genes with the best concordance across gene expression profiling platforms and tissue types to ensure optimal performance in FFPE samples; (iii) updating the wSVM using the CRCSC training set based on this reduced number of genes; and (iv) using this classifier on measurements from the NanoString assay on FFPE samples after transforming these values onto the scale of the Affymetrix data on FF samples that dominated the CRCSC training set.
Rather than choosing a classification strategy in an ad hoc fashion, we used a systematic, rigorous strategy to rank the genes on the basis of their classification value and to compare a large number of classification strategies for a wide range of model sizes. This allowed us to identify the best-performing strategy, the wSVM, and to determine how parsimonious a classifier could be without sacrificing substantial classification accuracy. Given that the classification literature clearly demonstrates that no one classification strategy is optimal for all datasets, the consideration of multiple approaches is important when building classification signatures. Moreover, findings from the Microarray Quality Control project from the FDA have shown that even slight differences in the statistical analysis led to discrepancies in biological interpretation (38). High accuracy in predicting CMS by nearly all statistical methods gives credence to the utility of our CMS assay in accurately classifying a colon cancer into one of the CMSs.
The wSVM classifier we built using this strategy was trained on data consisting largely of batch-corrected Affymetrix gene expression measurements from FF samples, and performed exceptionally well in the CRCSC validation data. Our custom design of CodeSets that best recapitulated the signal in our training data, and our strategy of starting with more genes than necessary and then narrowing to a subset with evidence of high FF/FFPE correlation, further mitigated the influence of FFPE on classification performance. Our quantile normalization strategy was sufficient to obtain reasonable performance for the small (99- or 200-gene) FFPE NanoString classifiers. This strategy allowed us to efficiently utilize our data resources, using the large body of data on FF samples to train the classifier, and collecting a smaller set of paired FF/FFPE samples to identify genes with high FF/FFPE concordance and map the FFPE NanoString expression values to the scale of the FF Affymetrix expression values, leading to our novel strategy for building the classifier. The consistency of gene-specific FF/FFPE concordance across batches provides strong support for this strategy.
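The mapping of FFPE NanoString values onto the FF Affymetrix scale can be sketched as a per-gene empirical quantile mapping. This is a minimal illustration under stated assumptions: the reference distributions, the linear interpolation, and the function name `quantile_map` are hypothetical choices and may differ from the normalization implemented for the assay.

```python
from bisect import bisect_left


def quantile_map(value, source_ref, target_ref):
    """Map one value from the source scale to the target scale.

    The value's empirical quantile within source_ref (for example, FFPE
    NanoString values of one gene) is located by linear interpolation,
    and the same quantile of target_ref (for example, FF Affymetrix
    values of that gene) is returned. Values outside the source range
    are clamped to the ends of the target range.
    """
    src = sorted(source_ref)
    tgt = sorted(target_ref)
    if value <= src[0]:
        q = 0.0
    elif value >= src[-1]:
        q = 1.0
    else:
        i = bisect_left(src, value)  # src[i - 1] < value <= src[i]
        lo, hi = src[i - 1], src[i]
        frac = (value - lo) / (hi - lo)
        q = (i - 1 + frac) / (len(src) - 1)
    pos = q * (len(tgt) - 1)
    j = int(pos)
    if j >= len(tgt) - 1:
        return tgt[-1]
    return tgt[j] + (pos - j) * (tgt[j + 1] - tgt[j])


# Toy reference distributions for a single gene (hypothetical values).
ffpe_ref = [10, 20, 30, 40, 50]
ff_ref = [1.0, 2.0, 3.0, 4.0, 5.0]
mapped = [quantile_map(v, ffpe_ref, ff_ref) for v in (5, 25, 30, 100)]
print(mapped)
```

Applied gene by gene, a mapping of this kind lets a classifier trained on the FF scale be evaluated on transformed FFPE measurements without retraining.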
The high concordance we observed in CMS for colorectal cancer samples between a research molecular testing laboratory and a CLIA-certified clinical MDL indicates robust performance of the assay. The high interlaboratory reproducibility is likely due to similarities in the preanalytic and analytic processes between the two laboratories. Another reason for the high interlaboratory reproducibility is the use of NanoString nCounter technology, which utilizes nonamplified nucleic acids without any reverse transcription step and is applicable to multiple samples. Ragulan and colleagues (39) demonstrated high classification accuracy and reproducibility of NanoString-based subtyping classification between FF and FFPE tissue samples of colorectal cancer. In spite of significant differences in the colorectal cancer classes and validation approaches, that study and ours indicate that NanoString is a reliable platform on which to develop and validate gene expression–based signatures using FFPE samples of colorectal cancer.
Guinney and colleagues (8) found that approximately 87% of colorectal cancer tumors classified cleanly into a single CMS, but approximately 13% were “mixed CMS,” not outliers or a fifth subtype, but heterogeneous samples containing characteristics of multiple CMS. We also found similar proportions of “mixed CMS” samples in our analyses. Clinically, patients with mixed CMS tumors could be treated in multiple ways. One option would be to include any “mixed CMS” sample with a high enough probability of CMSx as a potential candidate for any targeted therapy that has been validated as a precision therapeutic for CMSx (x = 1, 2, 3, or 4), which of course would require prospective validation before clinical application.
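The option described above can be made concrete with a short sketch that turns per-subtype probabilities into a CMS call plus a list of candidate subtypes. Both cutoffs (`single_threshold` and `candidate_threshold`) and the function name are hypothetical placeholders; the source does not specify the values that define a "mixed" call or a "high enough" probability, and any such rule would require prospective validation.

```python
def call_cms(probs, single_threshold=0.5, candidate_threshold=0.3):
    """Return a CMS label and the subtypes that pass a candidate cutoff.

    probs maps a subtype name (e.g., "CMS1") to its predicted probability.
    The label is the top subtype when its probability reaches
    single_threshold, and "Mixed" otherwise. Both thresholds are
    illustrative assumptions, not values taken from the study.
    """
    top = max(probs, key=probs.get)
    label = top if probs[top] >= single_threshold else "Mixed"
    candidates = sorted(c for c, p in probs.items() if p >= candidate_threshold)
    return label, candidates


# Toy probability vectors (hypothetical values).
clean = {"CMS1": 0.80, "CMS2": 0.10, "CMS3": 0.05, "CMS4": 0.05}
mixed = {"CMS1": 0.05, "CMS2": 0.40, "CMS3": 0.10, "CMS4": 0.45}
print(call_cms(clean))
print(call_cms(mixed))
```

Under these assumed cutoffs, the second sample is labeled mixed yet remains a candidate for therapies directed at either CMS2 or CMS4.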
There is increasing evidence of the prognostic and predictive utility of CMS. Lenz and colleagues (9), using a NanoString-based assay in a large cohort of patients with metastatic or advanced colorectal cancer enrolled in the CALGB/SWOG 80405 phase III clinical trial, demonstrated a significant difference in overall survival by CMS, with a median survival of 40 months in CMS2 versus 15 months in CMS1. The NanoString assays used for CALGB/SWOG 80405 and for our study differed significantly. Lenz and colleagues developed a customized NanoString-based gene panel derived from some of the large datasets with published gold standard CMS labels, including TCGA and other studies (5, 13). Only genes that are common to these three datasets and those assessed in the CALGB/SWOG 80405 panel were used. The genes included in our NanoString-based assay were all derived from the CRCSC database. Similar prognostic trends were observed by other groups, including in patients enrolled in the FIRE3 study comparing cetuximab versus bevacizumab with FOLFIRI in patients with metastatic colorectal cancer, and by Mooi and colleagues in the AGITG MAX clinical trial (14). As research-only classifiers, these methodologies are not designed for application to individual patients or suitable for use in prospective patient assignment. In contrast, our classifier, as deployed in a clinical laboratory, is suitable for classifying individual patients with the rigor needed for guiding clinical management.
Our CLIA-validated assay has potential as an integral, integrated, or exploratory biomarker. Hypotheses being explored include focused immunotherapy in CMS1, which represents a subgroup with evidence of higher immune infiltrates and activated T cells. CMS2 represents the group with the best overall survival from EGFR inhibition in a retrospective assessment of the CALGB/SWOG 80405 trial, while CMS1 benefited from VEGF inhibition (PMID 31042420). CMS4 has an active stromal signature, and an immune-modulating strategy has been proposed. For example, in the clinical trial (NCT03436563) assessing the safety and efficacy of the dual TGF-β trap/anti-PD-L1 molecule M7824 (EMD-Serono), the CMS assay was used as an integral biomarker to select patients with CMS4. Efforts in this trial or other ongoing trials could be expanded to identify other CMSs in which the efficacy of M7824 or another drug can be assessed on the basis of its mechanism of action and CMS biology. When the assay is used as an integrated biomarker, all patients can be prospectively tested to identify CMS. The interim analysis then looks at all comers and, if negative, looks at CMS-specific subgroups, with a plan to continue the second half of the randomized study. Finally, when the assay is used as an exploratory biomarker, a retrospective analysis can be done with the high-quality CLIA assay to look for a CMS signal and also to minimize the risk of inconsistent assays when designing the follow-up study. To support the goal of dissemination of a robust CMS classifier for retrospective or prospective utilization, the NanoString CodeSets and supporting bioinformatics information can be found at https://bioinformatics.mdanderson.org/apps/CMSclia/.
Unavailability of matched samples prevented us from assessing CMS accuracy between primary and metastatic tissues in our study. Fontana and colleagues (40), using publicly available data from the Khambata-Ford dataset (41), demonstrated no significant difference in CMS distribution between localized and metastatic disease. Assessing the impact of sample site on CMS classification is necessary to determine host organ influence and metastasis-associated evolution of gene expression in colorectal cancer.
In summary, we have developed and validated a colorectal cancer CMS assay using FFPE samples and demonstrated its prognostic utility. This CLIA-validated assay provides a foundation to expand its utility to assess prognosis in a standard-of-care setting and explore the assay as a predictor of response to therapy in clinical trials.
Authors' Disclosures
J.S. Morris reports a patent for WO2020/206136 A2 pending. R. Luthra reports other from EMD Serono during the conduct of the study. N.G. Reddy reports grants from EMD Serono (clinical trial funding) during the conduct of the study. V.K. Morris reports other from EMD Serono, Bristol Myers Squibb, and Immatics, and grants from NIH/NCI K12 award during the conduct of the study; V.K. Morris also reports personal fees and other from Pfizer, and personal fees from Servier, Incyte, and Boehringer-Ingelheim outside the submitted work. I.I. Wistuba reports grants and personal fees from Genentech/Roche, Merck, Asuragen, HTG Molecular, Pfizer, AstraZeneca/Medimmune, and Bayer; personal fees from MSD, Oncocyte, Guardant Health, GlaxoSmithKline, Bristol-Myers Squibb, Flame, Platform Health, and Daiichi Sankyo; and grants from Adaptive, Adaptimmune, EMD Serono, Takeda, Johnson & Johnson, Amgen, Karus, Iovance, 4D, Novartis, and Akoya outside the submitted work. S. Kopetz reports a patent for USSN 62/828,098 pending; stock and other ownership interests in MolecularMatch, Navire, and Lutris; consulting or advisory role with Roche/Genentech, EMD Serono, Merck, Karyopharm Therapeutics, Amal Therapeutics, Navire Pharma, Symphogen, Holy Stone, Amgen, Novartis, Lilly, Boehringer Ingelheim, Boston Biomedical, AstraZeneca/MedImmune, Bayer Health, Pierre Fabre, Redx Pharma, Ipsen, Daiichi Sankyo, Natera, HalioDx, Lutris, Jacobio, Pfizer, and Repare Therapeutics; and research funding (to institution) from Sanofi, Biocartis, Guardant Health, Array BioPharma, Genentech/Roche, EMD Serono, MedImmune, Novartis, Amgen, Lilly, and Daiichi Sankyo. D.M. Maru reports grants from EMD-Serono Inc. and NCI during the conduct of the study, and has a patent for colorectal cancer consensus molecular subtype classifier CodeSets and methods of use thereof, USSN 62/828,098. No disclosures were reported by the other authors.
Authors' Contributions
J.S. Morris: Conceptualization, data curation, software, formal analysis, supervision, funding acquisition, validation, investigation, visualization, methodology, writing-original draft, project administration, writing-review and editing. R. Luthra: Conceptualization, resources, data curation, formal analysis, supervision, validation, methodology, project administration. Y. Liu: Conceptualization, data curation, formal analysis, investigation, methodology. D.Y. Duose: Resources, data curation, formal analysis, supervision, validation, methodology, project administration. W. Lee: Conceptualization, data curation, software, methodology. N.G. Reddy: Resources, formal analysis, supervision, validation, methodology, project administration. J. Windham: Data curation, formal analysis, validation, methodology. H. Chen: Data curation, software, formal analysis, validation, methodology, writing-review and editing. Z. Tong: Data curation, project administration. B. Zhang: Data curation, methodology. W. Wei: Data curation, methodology. M. Ganiraju: Data curation, software, visualization. B.M. Broom: Conceptualization, supervision. H.A. Alvarez: Formal analysis, validation, methodology. A. Mejia: Data curation, project administration. O. Veeranki: Data curation, writing-review and editing. M.J. Routbort: Software, methodology. V.K. Morris: Resources, supervision, funding acquisition, validation. M.J. Overman: Resources, funding acquisition, investigation, project administration, writing-review and editing. D. Menter: Conceptualization, resources, data curation, formal analysis, supervision, funding acquisition, methodology, project administration, writing-review and editing. R. Katkhuda: Data curation, formal analysis, methodology, project administration. I.I. Wistuba: Resources. J.S. Davis: Resources, formal analysis, supervision, validation, methodology, writing-original draft, writing-review and editing. S. Kopetz: Conceptualization, resources, supervision, funding acquisition, validation, investigation, methodology, writing-original draft, project administration, writing-review and editing. D.M. Maru: Conceptualization, resources, data curation, software, formal analysis, supervision, funding acquisition, validation, investigation, visualization, methodology, writing-original draft, project administration, writing-review and editing.
Acknowledgments
This work was funded by NCI through Assay Validation For High Quality Markers For NCI-Supported Clinical Trials (UH2CA207101) and MD Anderson Cancer Center SPORE in Gastrointestinal Cancer (P50 CA221707). Part of this research was performed in MD Anderson's core facilities, which were supported, in part, by the NIH through Cancer Center Support Grant CA016672. Part of the validation work in the clinical laboratory was funded as part of a clinical trial (NCT03436563) funded by EMD-Serono. We thank Kim-Anh Vu in MD Anderson's Department of Anatomic Pathology for helping with the figures.
The costs of publication of this article were defrayed in part by the payment of page charges. This article must therefore be hereby marked advertisement in accordance with 18 U.S.C. Section 1734 solely to indicate this fact.