Abstract
Genome-wide RNA expression profiling has yielded tumor subtypes with strong predictive or prognostic value for a wide variety of cancers. Recently, two RNA expression classifiers for breast cancer have been adopted by the World Health Organization (WHO) and approved by the Food and Drug Administration (FDA). Tumor subtypes with different prognoses have also been described on the basis of DNA copy number profiles, but these have not yet led to clinical implementation. The genomic revolution caused by next generation sequencing of DNA samples simultaneously provides additional data on mutations, balanced translocations, single-nucleotide polymorphisms (SNP), and copy-neutral loss of heterozygosity. We foresee a further boost of the potential of DNA profiling in the clinic when these multidimensional DNA factors are implemented. Here we evaluate the current stratification power of DNA copy numbers. In a training and validation approach using data from 400 published breast cancer samples, we show that a DNA copy number classifier accurately classifies RNA expression subtypes. We consider this an important step forward for clinical implementation of genomic subtyping using DNA and discuss the extra dimensions that upcoming techniques will bring to the DNA palette. Clin Cancer Res; 17(15); 4959–64. ©2011 AACR.
Distinct DNA copy number profiles have been described for a wide variety of malignancies. Several copy number subtypes have different prognostic or predictive value, and some of these were validated in independent data sets. DNA is inherently stable and can be obtained from routinely collected archival material, which makes DNA attractive for diagnostic purposes. Here we evaluate the current stratification power of DNA copy numbers. To this end, we use samples previously classified using expression array analysis. DNA copy numbers allow accurate subtype classification. With the additional dimensions that next generation sequencing brings to DNA profiling, such as detection of balanced translocations, mutations, and loss of heterozygosity, expectations are high for DNA diagnostics and personalized treatment.
Introduction
Personalized treatment is dependent on biomarkers or molecular profiles with strong prognostic or predictive value. Classically, morphologic features are used in the pathology laboratory, but metabolites, proteins, DNA, RNA, noncoding RNA, and epigenetic features could make a valuable contribution. Whether used separately or in combination, accuracy, reproducibility, and robustness are the most important characteristics to warrant clinical implementation (1). If high-throughput and genome-wide techniques yield subtypes with strong predictive or prognostic value, faster, less invasive, and more cost-effective procedures, such as real-time PCR, multiplex ligation-dependent probe amplification (MLPA), immunohistochemistry, or (molecular) imaging may be implemented in the clinic. Of all genome-wide high-throughput techniques, gene-expression profiles have the strongest prognostic value in a wide variety of malignancies, including leukemias and bladder, breast, esophageal, non-small cell lung, and head and neck cancers (2). Two breast-cancer expression signatures, MammaPrint and Oncotype DX, are approved by the FDA or are in clinical trial (phase III) for outcome prediction validation (3). This rather low number of expression signatures introduced into the clinic reflects hurdles in validation, implementation, standardization, and definitions of cancer subtypes (4).
DNA copy number aberrations that define tumor subtypes with different prognoses have also been described (5–8). Specific amplifications and gene fusions such as BCR-ABL, EGFR, NRAS, and ERBB2 can be decisive predictive factors (9). DNA copy number profiles detected by array comparative genomic hybridization (CGH), however, have not yet been implemented in the clinic. For breast cancer, the prognostic value of the current DNA copy number profiles has been less pronounced and less tightly related to clinicopathologic classifiers such as estrogen receptor status or ERBB2 expression (5, 6). A significant proportion of RNA expression can be explained in terms of underlying DNA copy number aberrations (10). However, RNA expression profiles are immediately responsive to circadian rhythms, temperature, or other environmental factors, such as (pre-)operative patient treatment including drugs, chemo-radiotherapy, and anesthesia (11, 12). These factors do not directly impact DNA copy number aberrations, and stromal infiltrate can be effectively corrected for in the analysis procedures (5, 13). DNA copy number profiles can be obtained using routinely collected formalin-fixed, paraffin-embedded (FFPE) tissue blocks (14, 15).
The aim of this perspective is to evaluate the potential of molecular subtyping using DNA. We chose ductal invasive breast cancer as a test case, because it is the most thoroughly investigated and profiled solid tumor in this respect. We used a data collection of nearly 400 publicly available breast-cancer samples, obtained from 4 breast-cancer studies in which both DNA copy number (array CGH) and RNA expression arrays were carried out (5, 16–18). Here we assess the power of DNA data alone and whether individual patients can be assigned to a subclass. Using a training and validation approach, we build a DNA classifier that accurately reproduces the strongly prognostic expression subtypes. This exercise illustrates the potential of DNA as a subtype classifier. By next generation sequencing, mutation data, balanced translocations, single-nucleotide polymorphisms (SNP), and copy-neutral loss of heterozygosity can be simultaneously obtained from DNA (7, 19). We foresee a further boost of the potential of DNA profiling when these multidimensional factors are implemented.
An illustration of the potential of DNA for molecular subtyping, using copy numbers only
Chromosomal DNA aberrations underlying expression subtypes are consistent in 4 breast-cancer datasets.
By gene-expression profiling, ductal invasive breast cancers have been classified into 5 expression subtypes: ErbB2, Basal-like, Luminal A, Luminal B, and Normal-breast like, in at least 7 independent studies (20–22). Three of the 5 initial subtypes (ErbB2, Basal-like, and Normal-breast like) were relatively easy to distinguish from each other and from the other 2 subtypes (Luminal A and B). Distinguishing Luminal A from Luminal B, however, was less trivial, despite the large differences in survival between these subtypes (23). The Normal-breast like subtype was left out of our analysis, mainly because of the recent reports by Parker and colleagues (24) and Weigelt and colleagues (4), who described the subtype as a potential artifact of a high percentage of normal “contamination” in the tumor specimen.
DNA copy number and RNA expression subtype information was collected from the public domain for 511 breast tumors (5, 16–18; Table 1). For details, see supplementary information. Samples for which gene expression subtype information was missing were excluded, resulting in a total of 355 samples. The samples of these 4 datasets, further referred to as Stanford (n = 83; Ref. 17), UCSF (n = 89; Ref. 18), Cambridge (n = 98; Ref. 5), and Paris (n = 85; Ref. 16), were primarily of ductal invasive origin (Table 1). We adopted the expression subtypes as previously determined with each institution's own expression array platform, clustering algorithm, and settings. The majority of samples were characterized as the Luminal A expression subtype in all 4 datasets (mean 41%, range 36%–53%). The Basal subtype was the second largest group (mean 20%, range 14%–28%). The remaining samples were equally distributed over the ErbB2 (mean 15%, range 12%–23%) and Luminal B subtypes (mean 14%, range 12%–18%; Table 1; Refs. 5, 16–18). To make the 4 different datasets comparable, probes of the respective platforms (i.e., BACs and oligonucleotides) were remapped to the human genome assembly release 19 (NCBI build 37; ref. 14). Subsequently, data from the different cohorts were combined directly by determining the actual DNA copy number data per array sample (13). Frequencies of segmented and called DNA copy numbers were then calculated and plotted per segment and subtype for the cumulative set (Fig. 1) and for each of the 4 separate patient cohorts (Supplementary Fig. S1). DNA frequency plots per RNA expression subtype showed very similar results for the 4 datasets separately, despite demographic and technical differences as well as differences in expression subtype assignments. We therefore conclude that subtyping on the basis of DNA copy number aberrations is highly correlated with RNA expression subtypes, validating the findings of Bergamaschi and colleagues (17).
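The per-subtype frequency calculation described above can be sketched as follows. This is a minimal illustration, not the analysis pipeline used for the figures: the coding of calls as -1 (loss), 0 (normal), and 1 (gain), the function name `alteration_frequencies`, and the toy data are all assumptions made for the example.

```python
import numpy as np

def alteration_frequencies(calls, subtypes):
    """Fraction of samples per subtype with a loss or a gain at each feature.

    calls    : (n_samples, n_features) array of called copy numbers,
               coded -1 (loss), 0 (normal), 1 (gain) -- an assumed coding
    subtypes : (n_samples,) array of expression-subtype labels
    """
    freqs = {}
    for subtype in np.unique(subtypes):
        block = calls[subtypes == subtype]          # samples of this subtype
        freqs[str(subtype)] = {
            "loss": (block < 0).mean(axis=0),       # per-feature loss frequency
            "gain": (block > 0).mean(axis=0),       # per-feature gain frequency
        }
    return freqs

# Hypothetical toy data: 4 samples, 3 genomic segments.
calls = np.array([[1, 0, -1],
                  [1, -1, -1],
                  [0, 0, 1],
                  [0, 1, 1]])
subtypes = np.array(["Luminal A", "Luminal A", "Basal-like", "Basal-like"])

freqs = alteration_frequencies(calls, subtypes)
print(freqs["Luminal A"]["gain"])
print(freqs["Luminal A"]["loss"])
```

Plotting these per-feature frequencies along the genome, one panel per subtype, gives frequency plots of the kind referred to in the text.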
Figure 1. Frequencies of copy number alterations among the different breast cancer GE subtypes in the matched data set. Frequency plots of chromosomal copy number aberrations detected in the combined sample set of Stanford (n = 83; Ref. 1), UCSF (n = 89; Ref. 4), and Cambridge (n = 98; Ref. 5), for subtypes luminal A (A), luminal B (B), ERBB2+ (C), and basal-like (D). Frequencies of losses are shown in red and gains in green; amplifications are flagged in blue if present in more than 10% of the tumors. Regions altered with frequencies significantly different between subtypes are highlighted with black horizontal bars (P < 0.05, FDR < 0.10). *, regions significant due to absence of aberrations compared with the other subtypes. Significance is calculated by multivariate analysis.
Table 1. Features of training and validation data sets

Feature | Stanford (17), training | UCSF (18), training | Cambridge (5), training | Combined training data set | Paris (16), validation
---|---|---|---|---|---
Public resource | SMD | CaBIG repository | GEO | | ArrayExpress
Array CGH platform | cDNA | BAC | Oligonucleotide | | Oligonucleotide
Original freeze | March 2006 | July 2003 | March 2006 | | March 2006
Features, total | 39,632 | 3,424 | 27,801 | | 42,418
Features, included (a) | 21,694 | 2,149 | 27,801 | 27,801 | 42,418
Samples, total | 89 | 145 | 171 | 405 | 106
Samples, Luminal A | 37 | 42 | 52 | 131 | 31
Samples, Luminal B | 15 | 11 | 13 | 39 | 15
Samples, ErbB2+ | 19 | 11 | 14 | 44 | 15
Samples, Basal-like | 12 | 25 | 19 | 56 | 24
Samples, n (included) (b) | 83 | 89 | 98 | 270 | 85
(a) Features were included only when a value was present for more than 80% of the samples and genomic annotation was available.
(b) Only samples for which gene expression subtype information was given were included.
Abbreviations: SMD, Stanford Microarray Database; CaBIG, Cancer Biomedical Informatics Grid; GEO, Gene Expression Omnibus.
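The feature inclusion rule in the table footnote (a value present in more than 80% of samples, and genomic annotation available) can be illustrated with a short sketch. The representation of missing values as NaN, the function name `select_features`, and the toy matrix are assumptions for this example only.

```python
import numpy as np

def select_features(values, has_annotation, min_fraction=0.8):
    """Boolean mask of features to keep.

    values         : (n_samples, n_features) array, NaN marks a missing value
                     (an assumed encoding)
    has_annotation : (n_features,) boolean array, True if the feature has a
                     genomic position
    """
    present_fraction = (~np.isnan(values)).mean(axis=0)
    # "More than 80%" is read here as a strict inequality.
    return (present_fraction > min_fraction) & has_annotation

# Hypothetical toy data: 5 samples, 3 features.
values = np.array([[0.1, np.nan, 0.3],
                   [0.2, np.nan, 0.1],
                   [0.0, 0.5, -0.2],
                   [0.4, 0.1, 0.0],
                   [-0.1, 0.2, 0.3]])
has_annotation = np.array([True, True, False])

mask = select_features(values, has_annotation)
filtered = values[:, mask]
print(mask)
```

In the toy data, the second feature is dropped because only 3 of 5 samples have a value, and the third because it lacks annotation.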
A DNA copy number classifier can predict expression subtypes of high prognostic value with high accuracy.
To build a subtype classifier based on DNA copy numbers, we combined the first 3 datasets published (5, 17, 18). In essence, for each chromosomal location on the “Cambridge” CGH platform, the segmented DNA copy number level at the closest corresponding location in the Stanford and UCSF datasets was taken, using intCNGEan, an R package for integrating copy number and gene-expression data (25). The collective training set of 270 breast-cancer samples (Table 1) enabled the statistical analysis of distinct copy number aberrations per RNA–expression subtype by multivariate analysis (Supplementary Table S1). The significantly distinct DNA copy number aberrations found for each of the 4 subtypes reflected the aberrations detected using pair-wise univariate analysis (1 versus the rest) on the four datasets separately (Supplementary Fig. S1). The distinct aberrations per subgroup are taken as the features to construct, for each subgroup, a representative vector of DNA copy number aberration distributions over all selected features. The DNA copy number profile of a sample from the Paris data set is then compared with the representative vectors of all subgroups, i.e., the distance between the profile and each vector is calculated. Subsequently, the sample is assigned the label of the subgroup whose representative vector lies closest to the DNA copy number profile of the sample. In this way, all samples from the Paris dataset (16) are classified (Table 1). A summary table of genomic regions significantly gained or lost in the combined data set and used for classifier training is provided in Supplementary Table S1.
The classifier performed with an accuracy of ∼70% to 90% on the samples of the Paris validation set (Table 2). This level of accuracy for subtype prediction is high, particularly because expression subtypes are a weak end point, given the demographic and technical differences between the cohorts and the variations in expression subtype assignment by the different research groups (4). Strikingly, the prediction for the Luminal B subtype, which has strong prognostic value, is the most accurate (91%), whereas the distinction between Luminal A and B is the most challenging by RNA expression profiling (21, 22). The prediction accuracy was, however, lowest for the Luminal A subtype (73%), which might be attributable to its less distinct DNA copy number aberrations (Fig. 1). Overall, the exercise with this breast-cancer dataset and copy number data shows a strong potential of DNA for subtyping in the clinic. Nevertheless, unsupervised hierarchical clustering of DNA copy numbers does not yet yield the same strong prognostic value (5, 6). This discrepancy may be due to the quality of the algorithms or the dimensionality of the array CGH data (13).
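The assignment step described above amounts to a nearest-representative-vector rule. The sketch below is one possible reading, under stated assumptions: the representative vector is taken as the per-feature mean of the training calls, and the distance is Euclidean. The source specifies neither choice, and the function names and toy data are hypothetical.

```python
import numpy as np

def build_representatives(profiles, labels):
    """One representative vector per subtype.

    Assumption: the representative vector is the per-feature mean of the
    called copy numbers of the training samples in that subtype.
    """
    return {str(s): profiles[labels == s].mean(axis=0)
            for s in np.unique(labels)}

def classify(sample, representatives):
    """Assign the subtype whose representative vector is closest.

    Assumption: Euclidean distance; the source does not name the metric.
    """
    return min(representatives,
               key=lambda s: np.linalg.norm(sample - representatives[s]))

# Hypothetical training data: 4 samples, 4 selected features,
# calls coded -1 (loss), 0 (normal), 1 (gain).
train = np.array([[1, 1, 0, 0],
                  [1, 0, 0, 0],
                  [0, 0, -1, -1],
                  [0, 0, -1, 0]])
train_labels = np.array(["ErbB2", "ErbB2", "Basal-like", "Basal-like"])

representatives = build_representatives(train, train_labels)

# Two hypothetical validation profiles.
print(classify(np.array([1, 1, 0, -1]), representatives))
print(classify(np.array([0, 0, -1, -1]), representatives))
```

Because only the representative vectors need to be stored, a classifier of this kind can be applied to one new sample at a time, which is what clinical use would require.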
Prediction (rows) versus gene-expression subtype (columns), n = 85 | Luminal A | Luminal B | ErbB2 | Basal-like | Accuracy
---|---|---|---|---|---
Luminal A | 28 | 6 | 8 | 6 | 0.73
Luminal B | 1 | 11 | 0 | 1 | 0.91
ErbB2 | 1 | 0 | 3 | 0 | 0.85
Basal-like | 1 | 0 | 2 | 17 | 0.88
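The source does not define how the Accuracy column is computed. One plausible reading, sketched here as an assumption, is a one-versus-rest accuracy per subtype: the fraction of all 85 samples that are correctly called either as that subtype or as not that subtype. The counts below are copied from the table above, with rows as predictions and columns as gene-expression subtypes; the function name is hypothetical.

```python
import numpy as np

subtype_names = ["Luminal A", "Luminal B", "ErbB2", "Basal-like"]

# Rows: predicted subtype; columns: gene-expression subtype.
confusion = np.array([[28, 6, 8, 6],
                      [1, 11, 0, 1],
                      [1, 0, 3, 0],
                      [1, 0, 2, 17]])

def one_vs_rest_accuracy(confusion):
    """Per-class (TP + TN) / N, an assumed definition of accuracy."""
    total = confusion.sum()
    tp = np.diag(confusion)              # correctly predicted as the class
    fp = confusion.sum(axis=1) - tp      # predicted as the class, but other
    fn = confusion.sum(axis=0) - tp      # is the class, predicted as other
    tn = total - tp - fp - fn            # correctly predicted as other
    return (tp + tn) / total

accuracy = one_vs_rest_accuracy(confusion)
for name, value in zip(subtype_names, accuracy):
    print(f"{name}: {value:.2f}")
```

Under this assumed definition the computation gives 0.73, 0.91, 0.87, and 0.88. Three of these match the tabulated values; the ErbB2 value differs slightly from the tabulated 0.85, so the definition used for the table may not be exactly this one.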
Discussion
Multidimensional DNA profiling has a strong potential for tumor subtyping in the clinic
Profiling based on chromosomal DNA copy number aberrations alone has specified cancer subtypes with different prognosis. Here, we show that even prediction analysis of prognosis using a copy number based classifier is a promising alternative for the clinic. It is important to reevaluate the current shortcomings of DNA copy number based subtyping in terms of prognosis and prediction. For clustering and subtype annotation, the number of variables measured is important (26). The actual copy number data of the four different studies used in this exercise are of reasonably high resolution, but do not detect focal aberrations as described by recent studies using the latest array CGH platforms (14, 19, 27). In contrast, RNA expression values are strictly separate measurements, and genome-wide expression arrays thus have at least 20,000 variables. Unfortunately, the implementation and validation of RNA expression based prediction assays in the clinic have been hampered by skepticism about reproducibility between different platforms and analysis strategies (28). Weigelt and colleagues compared three array expression based single sample predictors (SSP) for the molecular classification of breast cancer; an SSP can be described as a kind of multidimensional average of sets of genes that determines a subtype (4). This elegant comparison showed that each SSP identified tumor subtypes with similar survival, but not every individual patient was assigned to the same tumor subtype. This emphasized the need for stringent method standardization and definitions before subtyping can be considered for clinical implementation. Direct combination of data from different array CGH platforms is straightforward (7), particularly if data are reduced back to the actual copy number data and copy number aberration–based centroids are built, as we show here. Even integration with quantitative next generation sequencing data would be straightforward, because such sequencing can also be applied to archival formalin-fixed, paraffin-embedded (FFPE) material (29, 30; Fig. 2).
Figure 2. Examples of copy number analysis by massive parallel sequencing (MPS) and array CGH at comparable cost and cDNA input concentration. Both analyses were carried out at VUMC on the same DNA isolated from FFPE material of a colorectal tumor. A, chromosome 18 analyzed by MPS (Illumina Genome Analyzer IIx [GAIIx]; Illumina Inc.), 0.2-fold coverage; B, chromosome 18 analyzed by array CGH using Agilent Human Genome 105K CGH microarrays (Agilent Technologies); the copy number profiles are similar, except for an additional focal gain at 18p11.32 visible in the MPS profile.
The dimensionality options of genome-wide chromosomal DNA profiling are rapidly increasing to include focal copy-number aberrations in addition to the large ones used in the breast-cancer studies described above (14, 19, 27). The preprocessing algorithms are also being customized to the discrete nature of chromosomal copy-number data (13). In fact, reanalysis of one breast-cancer data set using a preprocessing and unsupervised clustering algorithm dedicated to DNA copy number data yielded subtypes with higher prognostic value (6, 31). These developments highlight the potential of DNA profiling for the clinic in the near future. Tumors will be classified based either on chromosomal copy numbers alone or on copy numbers in combination with balanced translocation and mutation data (7). The presented exercise illustrates the added value of chromosomal DNA profiling in addition to established and powerful clinicopathologic classifiers such as ER, PR, and ERBB2 for breast cancer. Ultimately, profiles need to be implemented in accordance with the REMARK guidelines (1).
For many ongoing studies such as the ATLAS project, as well as clinical trials such as MINDACT, TransBIG, and I-SPY, biomaterials are collected and profiling is planned for both RNA expression and DNA aberrations (mutation, translocation, copy number; ref. 32). In addition, DNA methylation and noncoding RNAs will be studied. The microRNA class of noncoding RNAs is particularly promising in that respect, because the small size of microRNAs makes them well preserved in FFPE samples, so they could serve as yet another stratification marker to be applied in the clinic. The jury is still out on whether “to DNA or not to DNA?” in daily clinical practice, but we predict a boost for DNA profiling in the near future.
Disclosure of Potential Conflicts of Interest
No potential conflicts of interest were disclosed.
Acknowledgments
We wish to acknowledge J. R. Pollack and F. Andre for their collaboration and O. Krijgsman for critically reading this manuscript prior to submission.
Grant Support
This study was carried out within the framework of CTMM, the Center for Translational Molecular Medicine projects DeCoDe (grant 03O-101) and financially supported by the Dutch Cancer Foundation (KWF).