The Cancer Genome Atlas (TCGA) project has generated abundant genomic data for human cancers of various histopathology types and enabled the exploration of cancer molecular pathology through big-data approaches. We developed a new algorithm that uses the most differentially expressed genes (DEG) from pairwise comparisons to calculate correlation coefficients, which quantify similarity within and between cancer types. We systematically compared TCGA cancers, demonstrating high correlation within types and low correlation between types, thus establishing molecular specificity of cancer types and an alternative diagnostic method largely equivalent to histopathology. The coefficients differ among the cancers studied, suggesting that the degree of within-type homogeneity varies by cancer type. We also performed the same calculation using the TCGA-derived DEGs on patient-derived xenografts (PDX) of different histopathology types corresponding to the TCGA types, as well as on cancer cell lines. We, for the first time, demonstrated highly similar patterns for within- and between-type correlation between PDXs and patient samples in a systematic study, confirming the high relevance of PDXs as surrogate experimental models for human diseases. In contrast, cancer cell lines have drastically reduced expression similarity to both PDXs and patient samples. The studies also revealed high similarity between some types, for example, LUSC and HNSCC, but low similarity between certain subtypes, for example, LUAD and LUSC. Our newly developed algorithm seems to be a practical diagnostic method to classify and reclassify a disease, either human or xenograft, with better accuracy than traditional histopathology. Cancer Res; 76(16); 4619–26. ©2016 AACR.

Cancers are heterogeneous with diverse pathogenesis. Their accurate diagnosis helps to understand disease development and prognosis, thus guiding precision treatments. Clinical diagnosis is primarily based on anatomic locations (organs) and histopathology (morphology of cancerous tissues and cells) and may not be accurate. For example, a metastasis could be misdiagnosed if the morphology is insufficient to identify its origin. An improved diagnostic method is therefore needed. Transcriptome profiling [RNA sequencing (RNA-seq) or microarray] measures gene expression, which may be used to describe the molecular pathology of cancers and to diagnose disease. To do so, it is necessary to first systematically demonstrate that good correspondence exists between histopathology and molecular pathology, which has been made possible by the availability of pathology and genomic data from The Cancer Genome Atlas (TCGA) project that profiled thousands of cancer samples of various histopathology types (1–4).

Patient-derived xenografts (PDX) never manipulated in vitro are considered patient avatars and are used as an experimental model to study pathogenesis, to assess pharmaceutical effects, and to guide precision medicine (5–9). Various anecdotal reports have shown that the xenograft diseases mirror corresponding patient diseases in histopathology and molecular pathology. If such similarity is systematically verified and further quantified, the translational utility of PDXs can be immensely explored and expanded. We have built a large library of PDXs over the years (7, 8, 10–13) and also performed transcriptome sequencing and/or microarray analysis. The pathologic relevance of PDXs to patient tumors, both histologically and molecularly, can now be examined by comparing TCGA and PDX data.

We set out to establish a new diagnostic method based on pairwise comparison of cancers using transcriptome expression data, an approach different from the methods using multiple types of genomic data and complex algorithms more commonly used by other investigators (1, 2). We reasoned that our method has the advantage of being simple and unbiased in assessing and describing cancer type specificity. We systematically compared similarities of TCGA cancers both within and between histopathology types and explored the relationships of diverse types by the development of new algorithms. Our results demonstrated a molecular alternative to traditional histopathology for diagnosing human and xenograft diseases with better accuracy and precision. Our data further showed that PDXs are indeed similar to their original diseases in various cancer types, which does not hold true for cancer cell lines.

Engraftment and molecular characterization of xenograft tissues

Methods and parameters regarding xenografting of patient tissues (Crown Bioscience SPF facility) have been described previously (7, 8, 10, 11). For transcriptome sequencing of PDX tumor tissues, RNA was extracted from snap-frozen samples per the method described previously (7, 8). The purity and integrity of the RNA samples were assessed with an Agilent Bioanalyzer prior to RNA-seq. Only RNA samples with RNA integrity number >7 and 28S/18S >1 proceeded to library construction and RNA-seq. RNA samples (mouse component <50%) were used for transcriptome sequencing by certified Illumina HiSeq platform service providers (BGI). Transcriptome sequencing was generally performed at 6GB, PE125 on Illumina HiSeq2500 platform or equivalent. For Affymetrix U219 GeneChip profiling, RNA samples from tumors were processed and assayed as described previously (7, 8). Standard IHC was used to analyze selected FFPE PDX tumor tissues as described previously (7, 8). The antibodies used for IHC were anti-human mAb TTF1 (ZM-0250, mouse), CDX2 (ZA-0520, rabbit), CK7 (ZM-0071, mouse), CK20 (ZM-0075, mouse), all from Zhongshan Jinqiao.

TCGA and Cancer Cell Line Encyclopedia datasets

Level 3 TCGA RNA-seq data for 7 cancer types [colon adenocarcinoma (COAD), rectum adenocarcinoma (READ), lung adenocarcinoma (LUAD), lung squamous cell carcinoma (LUSC), head and neck squamous cell carcinoma (HNSCC), liver hepatocellular carcinoma (LIHC), and pancreatic adenocarcinoma (PAAD)] were downloaded from the TCGA data portal (February 2015 release, https://tcga-data.nci.nih.gov/tcga/). We only used the RNA-seq data generated by the Illumina HiSeq platform and processed by the RNAseqV2 pipeline, which used MapSplice for read alignment and RSEM for quantification (https://cghub.ucsc.edu/). The TCGA dataset contains 285 COADs, 94 READs, 515 LUADs, 501 LUSCs, 519 HNSCCs, 371 LIHCs, and 178 PAADs.

The cancer cell line gene expression data were downloaded from the Cancer Cell Line Encyclopedia (CCLE) data portal (October 2012 release, http://www.broadinstitute.org/ccle/home). The expression was profiled on Affymetrix U133Plus2 GeneChip. The raw Affymetrix CEL files were converted into gene expression values by the robust multiarray average (RMA) algorithm with a custom CDF file (ENTREZG v15, http://brainarray.mbni.med.umich.edu/Brainarray/Database/CustomCDF/15.0.0/entrezg.asp). A total of 210 cell lines were used, including 47 colorectal adenocarcinomas (CRAD), 52 LUADs, 28 LUSCs, 30 HNSCCs, 25 LIHCs, and 28 PAADs (Supplementary Table S1).

Bioinformatics analysis of PDX transcriptome sequencing data

Gene expression in PDXs was profiled by both Affymetrix U219 GeneChip and RNA-seq per the methods described previously (7, 12). The Affymetrix CEL files were processed using the same method as the CCLE data. The RNA-seq raw data were first cleaned up by removing mouse reads mapped to a mouse reference genome (UCSC MM9). The average mouse content is about 10%. Gene expression was estimated using the TCGA RNAseqV2 pipeline. A total of 175 PDXs with Affymetrix U219 data were used, including 58 CRADs, 11 LUADs, 40 LUSCs, 10 HNSCCs, 24 LIHCs, and 32 PAADs. A total of 241 PDXs with RNA-seq data were used, including 82 CRADs, 12 LUADs, 54 LUSCs, 14 HNSCCs, 30 LIHCs, and 49 PAADs.

Comparison of transcriptome expression datasets

The edgeR package (version 3.10.2; ref. 14) from Bioconductor (version 3.1) was used to analyze the TCGA RNA-seq data. Genes with at least one count per million in at least 94 samples (the size of the smallest of the 7 cancer cohorts) were kept. Differentially expressed genes (DEG) were identified and ranked by the exactTest function. For the 7 TCGA cancer types, we performed 21 pairwise comparisons and retained a given number of top-ranked DEGs from each comparison. Expression values of DEGs were normalized to have zero mean and unit variance and used to calculate Pearson correlation coefficients among samples. For Fig. 1, 94 samples were randomly drawn from each of the 7 TCGA cancer types. For the other 3 datasets, the expression values were normalized in the same way before calculating within-type and between-type Pearson correlation coefficients. All expression values were in logarithmic scale in the correlation calculation and heatmaps.
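The filtering, normalization, and correlation steps described above can be sketched in a few lines. This is a minimal pure-Python illustration under stated assumptions, not the edgeR-based code used in the study: the function names (`cpm`, `keep_expressed`, `zscore_rows`, `sample_correlations`) are hypothetical, and scaling each gene across samples is our reading of "zero mean and unit variance", which the text does not spell out.

```python
import math

def cpm(counts):
    """Convert a genes x samples count matrix to counts per million, per sample."""
    n_samples = len(counts[0])
    lib_sizes = [sum(row[j] for row in counts) for j in range(n_samples)]
    return [[row[j] * 1e6 / lib_sizes[j] for j in range(n_samples)] for row in counts]

def keep_expressed(counts, min_cpm=1.0, min_samples=94):
    """Indices of genes with CPM >= min_cpm in at least min_samples samples."""
    return [i for i, row in enumerate(cpm(counts))
            if sum(v >= min_cpm for v in row) >= min_samples]

def zscore_rows(matrix):
    """Scale each gene (row) to zero mean and unit variance across samples.
    Assumption: normalization is per gene; the source does not state the axis."""
    scaled = []
    for row in matrix:
        mean = sum(row) / len(row)
        sd = math.sqrt(sum((v - mean) ** 2 for v in row) / (len(row) - 1))
        scaled.append([(v - mean) / sd if sd > 0 else 0.0 for v in row])
    return scaled

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length vectors."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def sample_correlations(log_expr):
    """Pearson correlation between every pair of samples (columns) of a
    genes x samples matrix of log-scale expression values."""
    columns = [list(col) for col in zip(*zscore_rows(log_expr))]
    n = len(columns)
    return [[pearson(columns[i], columns[j]) for j in range(n)] for i in range(n)]
```

Because every gene is centered across all samples, samples from unrelated types tend to receive correlations near or below zero, which matches the sign of the between-type coefficients reported in the Results.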

Figure 1.

Comparison of gene expression of within and between cancer types when the number of pairwise DEGs is 50, where there are 564 unique genes. In the heatmaps, Pearson correlation coefficient between two samples is color coded; the length of a color bar on the top or left is proportional to sample size within a dataset. A, comparison of gene expression for TCGA patient samples profiled by RNA-seq, PDXs profiled by RNA-seq (PDX), PDXs profiled by microarray (PDXU219), and cancer cell lines profiled by microarray (CCLE). B, comparison of gene expression between TCGA and the other three datasets.


Expression similarity within and between histopathologic cancer types

We set out to inquire whether cancers of the same histopathologic diagnosis have similar expression profiles, as compared with different histopathology types. We examined 4 transcriptome expression datasets: (i) the TCGA transcriptome sequencing (RNA-seq) dataset (2–4) for patient tumor samples obtained through surgery or biopsy; (ii) the RNA-seq dataset (referred to as PDX; ref. 12); (iii) the microarray dataset (referred to as PDXU219) for PDX of various diseases (7, 8, 10); and (iv) the microarray dataset for cancer cell lines from the CCLE project (15). First of all, we aimed at establishing an algorithm to define human disease types by transcriptome expression, postulating that distinct gene expression signature is the molecular hallmark of both normal and tumor tissues (or types as defined). To this end, we performed 21 pairwise comparisons of transcriptome expression for 7 TCGA cancers: COAD, READ, LUAD, LUSC, HNSCC, LIHC, and PAAD. For each pairwise comparison, we retained the same number of the most DEGs, ranked by P values from the exactTest function in the edgeR package in R (see Materials and Methods). The total DEGs, by summing up from all pairwise comparisons with redundancy removal, were used to calculate the within-type (histopathology type) and between-type correlation coefficients for the TCGA dataset. The correlation coefficients were used to quantify cancer similarity (Fig. 1A). A total of 564 genes, which is the nonredundant set when the number of pairwise DEGs is 50, are shown in the illustration in Fig. 1. The similarity patterns hold true for other numbers of DEGs, up to whole transcriptome (Supplementary Fig. S1). This pairwise comparison approach is intended to minimize bias toward certain cancer types, as opposed to the methods that select genes by simultaneously comparing all cancer types, for example, one-way ANOVA.
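The gene-selection step (top DEGs from each pairwise comparison, pooled with redundancy removed) can be illustrated as follows. This sketch is an assumption-laden stand-in: it ranks genes by an absolute Welch t-statistic on log expression rather than by P values from edgeR's exactTest, and the names `welch_t` and `pairwise_deg_union` are hypothetical.

```python
import math
from itertools import combinations

def welch_t(a, b):
    """Absolute Welch t-statistic between two groups of values.
    Stand-in for ranking by exactTest P value (not the study's method)."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    va = sum((x - ma) ** 2 for x in a) / (len(a) - 1)
    vb = sum((x - mb) ** 2 for x in b) / (len(b) - 1)
    se = math.sqrt(va / len(a) + vb / len(b))
    if se == 0:
        return math.inf if ma != mb else 0.0
    return abs(ma - mb) / se

def pairwise_deg_union(expr, labels, m):
    """expr: dict mapping gene -> list of log-expression values, one per sample.
    labels: cancer-type label of each sample, in the same order.
    Returns the nonredundant union of the top-m genes from every type pair."""
    selected = set()
    for t1, t2 in combinations(sorted(set(labels)), 2):
        idx1 = [i for i, lab in enumerate(labels) if lab == t1]
        idx2 = [i for i, lab in enumerate(labels) if lab == t2]
        scores = {gene: welch_t([vals[i] for i in idx1], [vals[i] for i in idx2])
                  for gene, vals in expr.items()}
        top = sorted(scores, key=scores.get, reverse=True)[:m]
        selected.update(top)  # the set removes genes shared between comparisons
    return selected
```

Each pair of types contributes the same number of genes, so no single type dominates the selection; this is the property the text contrasts with a one-way ANOVA across all types.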

We observed that the within-type correlation coefficients initially decrease rapidly, then stabilize for all cancer types in TCGA as the number of DEGs increases (Fig. 2A), because relatively few new genes are added at high numbers of DEGs (Supplementary Fig. S2). When the number of pairwise DEGs reaches 7,000, there are 16,798 unique genes, about 97.1% of the 17,288 genes eligible for pairwise comparison in the TCGA dataset. The relatively high within-type coefficients (as opposed to between-type coefficients, see below) demonstrate cancer type specificity, which is largely in accordance with histopathology classification. Meanwhile, the within-type correlation coefficients at any given number of DEGs vary among cancer types, reflecting their different degrees of homogeneity. For example, LIHC seems to be much more homogeneous than other types. When the number of pairwise DEGs is 50, there are 564 unique genes. We used the Database for Annotation, Visualization, and Integrated Discovery (16) to analyze tissue-specific expression and found that these genes are significantly enriched for liver (P = 2.2E−41 in category CGAP_EST_QUARTILE, P = 3.2E−31 in category UP_TISSUE, and P = 1.4E−30 in category UNIGENE_EST_QUARTILE).

Figure 2.

Gene expression similarity within each cancer type at different numbers of pairwise DEGs in four datasets: TCGA (A), PDX (B), PDXU219 (C), and CCLE (D). For each cancer type in a dataset, Pearson correlation coefficients for all pairs of samples were calculated on the basis of the normalized gene expression values. Values, mean and SEM.


PDX diseases are largely reflective of original patient diseases per histopathology, cell types, differentiation phenotypes (5–8, 17), and also per molecular pathology as reported in a number of isolated studies (5, 6). To systematically investigate such relevance, we subsequently performed the correlation coefficient calculation for PDX (RNA-seq) and PDXU219 datasets (7, 8) using the same DEGs derived from the above-mentioned TCGA pairwise comparisons. We made several observations (Fig. 2B and C): (i) In both datasets, we also observed an initial rapid decline in correlation coefficients, parallel to TCGA, with the increase in DEGs for all cancer types. This parallelism suggests that the same DEGs can also describe the cancer type specificity in PDXs as seen in TCGA, and thus shows the similarity between TCGA and PDX; (ii) the overall values of correlation coefficient in PDXs are lower than those of TCGA, which may be attributed to three factors: PDXs have lost some tumor specificity (further discussed below), the TCGA-centric approach likely leads to lower values in PDXs, especially at low numbers of DEGs, and PDXs lack human tumor stroma; (iii) the within-type correlation coefficients at any given number of DEGs vary significantly among PDX cancer types as well, reflecting different degrees of homogeneity, as seen in TCGA. In particular, their values may not be concordant with those in TCGA. For example, HNSCC, rather than LIHC, has the highest within-type correlation in PDXs. This suggests that the same cancer type can have different homogeneity in PDXs than in humans, and such difference may be reflective of how far PDXs have drifted from human tumors, but it may also be attributed to the small sample sizes of HNSCC PDXs (10 in the PDXU219 dataset and 14 in the PDX dataset); (iv) it is worth noting that PDXU219 and PDX (RNA-seq) are almost parallel to each other with similar correlation coefficient values, implying a near equivalence of the two expression profiling approaches (Fig. 3). Overall, our observations agree with anecdotal reports that PDXs have molecular profiles similar to those of the tumors from which they were derived (5, 6).

Figure 3.

Average within-type and between-type gene expression similarity at different numbers of pairwise DEGs in four datasets. A, Pearson correlation coefficients for all pairs of samples within the same cancer type in a dataset were calculated. B, Pearson correlation coefficients for all pairs of samples belonging to different cancer types in a dataset were calculated. Normalized gene expression values were used in calculations. Values, mean and SEM.


Traditional cancer cell lines immortally grow in plastic flasks, usually clonally and with uniform morphology of undifferentiated phenotype. Many can grow in xenografts, but with compact and homogeneous morphology of little differentiation, which are all in sharp contrast to PDX. Therefore, they have been considered less relevant to human cancers, as compared with PDXs (5). Similarly, we also performed the within-type correlation coefficient calculation for the CCLE dataset. Interestingly, we barely observed any parallel decline of coefficients with the increase of DEGs for all cancer types except HNSCC, suggesting the selected DEGs from TCGA have little relevance in CCLE (Fig. 2D). Furthermore, the within-type correlation coefficients are significantly lower in CCLE than in TCGA, PDX, and PDXU219 (Fig. 3). It is unlikely that such decrease can be attributed to the TCGA-centric approach. The poor cancer type specificity observed in CCLE is consistent with the notion that cell lines deviate away from human cancers, both histopathologically and molecular pathologically. However, the within-type correlation coefficients, although low in general, do vary by types. For instance, HNSCC cell lines show relatively higher coefficients (Fig. 2D). In summary, at any number of DEGs, the within-type correlation coefficients are highest in the TCGA dataset, lowest in the CCLE dataset, and intermediate in the PDX and PDXU219 datasets.

Next, we performed the between-type correlation coefficient calculation using the same DEGs. We found that the coefficients are all negative and close to zero, reflecting that generally, little similarity exists between different cancer types in all 4 datasets. Analogous to the within-type correlation, TCGA has the largest absolute values of correlation coefficient that exhibit an initial decline, PDX and PDXU219 have the intermediate values with parallel decline, whereas in CCLE, the values are smallest and flat (Fig. 3). In conclusion, patient tumors have the most pronounced cancer type-specific gene expression profiles and, in general, have high correlation among the same histologic cancer types. PDXs (subcutaneously engrafted tumors) still maintain reasonable specificity, although not to the extent of human tumors, and are markedly better than cancer cell lines. With all the above analyses, we established a good degree of equivalence between two diagnosis methods, one based on histologic morphology and tumor origin and the other on transcriptome expression.
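The within-type and between-type averages compared here can be computed from a sample-by-sample correlation matrix and the type labels. The sketch below is illustrative only; `mean_sem` and `within_between` are hypothetical names, and the SEM treats sample pairs as independent, which is a simplification because pairs share samples.

```python
import math
from itertools import combinations

def mean_sem(values):
    """Mean and standard error of the mean of a list of numbers."""
    n = len(values)
    mean = sum(values) / n
    if n < 2:
        return mean, float("nan")
    sd = math.sqrt(sum((v - mean) ** 2 for v in values) / (n - 1))
    return mean, sd / math.sqrt(n)

def within_between(corr, labels):
    """corr: square matrix of pairwise sample correlations.
    labels: cancer type of each sample, in matrix order.
    Returns ((within_mean, within_sem), (between_mean, between_sem))."""
    within, between = [], []
    for i, j in combinations(range(len(labels)), 2):
        # Each unordered pair is counted once; the diagonal is skipped.
        (within if labels[i] == labels[j] else between).append(corr[i][j])
    return mean_sem(within), mean_sem(between)
```

Restricting the within-type list to a single label would give a per-type curve of the kind shown in Fig. 2, while pooling all types gives dataset-level averages of the kind shown in Fig. 3.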

Expression similarity between different cancer types and dissimilarity within same types

Besides the aforementioned high within-type correlation and low between-type correlation in general, we also made some other interesting observations from patient tumors and PDXs (Fig. 1). First, COAD and READ are virtually indistinguishable, suggesting that they could be essentially the same disease. Second, LUAD and LUSC have quite distinctive expression profiles even though both belong to non–small cell lung carcinoma, consistent with the fact that they have distinct morphology and pathogenesis. Third, HNSCC is highly similar to LUSC by expression profiles, in accordance with the reported results in patient samples (1). It would be interesting to investigate the shared pathogenesis between these two squamous cell carcinomas.

Such observations again demonstrate the close relevance of PDX to human tumors. In contrast, in the CCLE dataset, LUAD and LUSC are not separable from each other. In fact, they have the lowest within-type correlation coefficients, being 0.067 and 0.080 when the number of pairwise DEGs is 50. Our pathology examination of lung cell line–derived xenografts did not show morphologic correlation within LUAD cell lines (e.g., A549, NCI-H1975, LU0682, LU6912, data not shown) and within LUSC cell lines (LU0357, data not shown). In the CCLE dataset, we did not observe high similarity between HNSCC and LUSC; their between-type correlation coefficient is only 0.052 when the number of pairwise DEGs is 50, while the within-type correlation coefficient for HNSCC is 0.36. In summary, cancer cell lines have lost much tumor-specific expression existing in human cancers and PDXs.

Molecular pathology signature derived from TCGA for cancer classification

By using the DEGs derived from the pairwise comparisons between TCGA cancer types, we can classify and diagnose malignant diseases of unknown cancer type for both human tumors and PDXs, but unlikely for cell lines. Results from this molecular pathology approach are in good agreement with traditional histopathology, thus forming the basis of a new molecular diagnosis. As an example, we used 188 signature genes from the pairwise comparisons of 4 TCGA cancers (LUAD, LUSC, COAD, and READ) by setting pairwise DEGs to 50. By design and as expected, these signature genes distinguish colorectal cancers from lung cancers in TCGA (Fig. 4A). When applied to both PDX and PDXU219 datasets, we observed that the colorectal PDXs and lung PDXs are clustered with corresponding TCGA cancer samples (Fig. 4A and B). However, in the CCLE dataset, the 3 cancers (CRAD, LUAD, and LUSC) do not show good separation, and they seem to form a wide-spread cluster by themselves between the TCGA lung and colorectal cancer samples (Fig. 4C). Because both PDXU219 and CCLE were profiled by Affymetrix microarrays, it is unlikely that the dislocation of CCLE samples is a technical artifact, but rather reflective of their transcriptome expression drift from both human and PDX tumors.
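One simple way to turn the signature into a type call for an unknown sample is to correlate its signature-gene profile with the average profile of each reference type and pick the best match. The text does not specify a decision rule, so this nearest-centroid rule is purely an assumption for illustration; `type_centroids` and `classify` are hypothetical names, and inputs are assumed to be normalized log-expression values over the same ordered gene list.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length vectors."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def type_centroids(profiles, labels):
    """Average signature-gene profile of the reference samples of each type."""
    sums, counts = {}, {}
    for profile, label in zip(profiles, labels):
        acc = sums.setdefault(label, [0.0] * len(profile))
        for k, value in enumerate(profile):
            acc[k] += value
        counts[label] = counts.get(label, 0) + 1
    return {label: [v / counts[label] for v in acc] for label, acc in sums.items()}

def classify(sample, centroids):
    """Assign the type whose centroid correlates best with the sample.
    Returns (best_type, correlation per type)."""
    scores = {label: pearson(sample, c) for label, c in centroids.items()}
    return max(scores, key=scores.get), scores
```

Returning the per-type correlations alongside the call makes it easy to flag samples such as the outliers discussed below, where the best-matching type disagrees with the recorded diagnosis.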

Figure 4.

Multidimensional scaling plots of colorectal cancer and lung cancer samples in four studies: TCGA and PDX (A), TCGA and PDXU219 (B), TCGA and CCLE (C), and PDX (D). dim, dimension. In the PDX dataset, four misclassified samples are labeled. Numbers in parenthesis are sample sizes. The multidimensional scaling plots use 188 genes when the number of pairwise DEGs is 50. LogFC, log fold change. The first two leading log-fold changes were used at the two axes.
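The "leading log-fold change" axes resemble the output of the `plotMDS` function in edgeR/limma, although the text does not name the routine. Under that assumption, the distance between two samples is the root-mean-square of the largest log2 differences between them; the sketch below computes only this distance matrix (the function names are hypothetical), which could then be passed to any classical MDS routine to obtain two plotting dimensions.

```python
import math

def leading_logfc_distance(x, y, top=500):
    """Root-mean-square of the `top` largest log2 expression differences
    between two samples (x and y are log2 values over the same genes)."""
    squared = sorted(((a - b) ** 2 for a, b in zip(x, y)), reverse=True)[:top]
    return math.sqrt(sum(squared) / len(squared))

def distance_matrix(samples, top=500):
    """Symmetric matrix of leading log-fold-change distances between samples."""
    n = len(samples)
    dist = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            dist[i][j] = dist[j][i] = leading_logfc_distance(samples[i], samples[j], top)
    return dist
```

With a 188-gene signature, any `top` of 188 or more uses every signature gene, so the distance reduces to the plain root-mean-square log2 difference.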


To demonstrate the classification power of our method, we applied the signature DEGs to the PDX dataset and plotted the samples by datasets (Supplementary Table S2). Again, we observed a clear separation of cancer types (Fig. 4D). We also saw 4 outliers, a colorectal PDX model (CR2215) in the lung cancer group and 3 lung cancer PDX models (LU1207, LU1245, LU3099) in the colorectal cancer group. We performed immunohistochemical analysis using tissue-specific biomarkers (Supplementary Tables S3 and S4; Supplementary Fig. S3) to confirm their identity. The IHC results demonstrated that the 3 misclassified lung cancer models are indeed CRADs. The only misclassified CRAD is in fact PAAD. Our current interpretation is that the original hospital diagnosis was wrong. Although LU1245, LU3099, and LU1207 were derived from tumors taken from lung and with adenocarcinoma morphology, they might actually be metastases from primary CRAD. Prior histopathology might be unable to identify them correctly because they are all adenocarcinomas with similar morphology.

Our DEG-based method can be used to build machine-learning classifiers to diagnose cancers. To illustrate this, we randomly partitioned 2,463 TCGA patient samples into a training dataset and a validation dataset with an 80:20 split ratio. A support vector machine (SVM) based on the 564 DEGs was trained in the training dataset with 5-fold cross-validation and then tested in the validation dataset. The partition and subsequent processes were repeated 10 times. In both cross-validation and validation dataset evaluation, the SVM consistently achieved approximately 98% classification accuracy if COAD and READ samples were treated as the same disease.
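The evaluation protocol described above (repeated 80:20 split, 5-fold cross-validation on the training part, then scoring on the held-out part) might look like the following with scikit-learn. The kernel, the default regularization, and the stratified splitting are assumptions, since the text does not report them; `merge_colorectal` and `evaluate_svm` are hypothetical helper names.

```python
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.svm import SVC

def merge_colorectal(labels):
    """Treat COAD and READ as one disease (CRAD), as in the reported accuracy."""
    return ["CRAD" if lab in ("COAD", "READ") else lab for lab in labels]

def evaluate_svm(X, y, n_repeats=10, test_size=0.2, cv=5, seed=0):
    """X: samples x signature-gene matrix; y: cancer-type label per sample.
    Returns one (cross-validation accuracy, held-out accuracy) pair per repeat."""
    results = []
    for repeat in range(n_repeats):
        X_train, X_test, y_train, y_test = train_test_split(
            X, y, test_size=test_size, stratify=y, random_state=seed + repeat)
        clf = SVC(kernel="linear")  # assumption: the kernel is not stated in the text
        cv_accuracy = cross_val_score(clf, X_train, y_train, cv=cv).mean()
        test_accuracy = clf.fit(X_train, y_train).score(X_test, y_test)
        results.append((cv_accuracy, test_accuracy))
    return results
```

Keeping the held-out 20% out of the cross-validation loop means the second number in each pair is an estimate on samples the classifier never saw during fitting.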

We emphasize that this signature cannot be used to classify cell lines or cell line–derived xenografts. Overall, human cancers (TCGA) and PDX overlap quite well, but neither overlaps well with CCLE (Fig. 1). In other words, we do not recommend using the same signature to classify cell lines into three different categories, as was done for the PDX and human samples above.

With the available TCGA datasets from multiple genomic profiling platforms, molecular taxonomy methods have been developed and tested (1, 2). Many such methods analyze samples from multiple cancer types simultaneously and may be biased toward certain types. We developed a method based on pairwise comparisons between cancer types that can reduce such bias. For a dataset with n cancer types, we select m DEGs for each pairwise comparison between cancer types. In a global comparison involving multiple cancer types, such as the ones shown in Fig. 1, samples in any cancer type pair can be distinguished by their m DEGs, while other DEGs that are capped at (n+1)(n−2)m/2 but usually fewer due to overlapping, can be viewed as background noise, although at times, they may contribute more to certain pairs, like HNSCC and LUSC in TCGA, due to inherent expression similarity.
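The cap quoted above follows from simple counting: n types give n(n−1)/2 pairwise comparisons, each contributing m DEGs, so at most n(n−1)m/2 genes in total; removing the m genes of the pair under consideration leaves n(n−1)m/2 − m = (n+1)(n−2)m/2. A small check of this arithmetic (the function name `deg_caps` is ours):

```python
def deg_caps(n, m):
    """Return (number of type pairs, cap on total DEGs, cap on DEGs that do
    not belong to a given pair) for n cancer types and m DEGs per comparison."""
    n_pairs = n * (n - 1) // 2
    total_cap = n_pairs * m
    other_cap = total_cap - m  # algebraically equal to (n + 1) * (n - 2) * m / 2
    return n_pairs, total_cap, other_cap

print(deg_caps(7, 50))  # 7 TCGA types, 50 DEGs per comparison
print(deg_caps(4, 50))  # 4 types used for the colorectal/lung signature
```

For the 7-type analysis with m = 50 the total cap is 1,050 genes, of which 564 were unique; for the 4-type signature the cap is 300, of which 188 were unique, which illustrates how much the pairwise gene lists overlap.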

Furthermore, as disease type is traditionally determined by organ and tissue origin, the normal counterparts of disease tissues are largely distinguished by transcriptome expression (data not shown), but not by somatic mutations. We therefore reasoned that a method purely based on transcriptome expression should be sufficiently accurate to molecularly define cancer taxonomy, as well as normal tissue taxonomy. Indeed, our method is able to define cancer type specificity and establish near equivalency between the resulting molecular classification and the traditional disease classification based on tumor origin and histopathology. It could also be used for molecular diagnosis, as a complement to the existing histopathology-based diagnosis, with certain advantages. It is worth mentioning that this method, or even the algorithm developed on the basis of TCGA here, could be used to explore the normal tissue taxonomy, which has never been reported before, at least from a large-data perspective.

As there is little limitation to the level of classification achievable by our molecular pathology method, it can reach a degree significantly beyond existing histopathology-based methods and can be more accurate, reliable, and objective. Ultimately, we hope to establish refined signatures/subclassifications that can be associated with defined pathogenic mechanisms, prognosis patterns, and predictive responses to certain treatments.

Although there are reports that the PDX diseases closely mimic the original human diseases with similar histo- and molecular pathology, there has yet to be a report to comprehensively and systematically compare the molecular pathology of human and xenograft diseases on a large scale. Such studies are now possible with the availability of large collections of PDXs (7–11) with transcriptome datasets and the abundant human equivalent datasets via TCGA (2–4). By using TCGA pairwise-derived DEGs, we for the first time established a molecular equivalence between human diseases (TCGA) and xenograft diseases (PDX) of the same pathology diagnosis in a broad sense, not just anecdotal observations. The advantage of this molecular diagnosis can be exemplified by its ability to correct the wrong diagnosis made by hospitals. The similarity seen by our analysis seems to further support the notion that PDXs are more likely to be predictive of drug response in humans than cell lines.

We need to mention that there are several variables that might contribute to the difference observed in our systematic analysis of TCGA and PDX datasets. First, the method established is TCGA centric, and lower coefficients for PDX are expected. Second, all PDXs were derived from subcutaneous tumors in mice, which are similar to “metastatic tumors” (6), while human tumors of TCGA are mostly not (primary tumors are considered “orthotopic;” refs. 7, 8). It is our hypothesis that metastatic tumors of the same type would likely demonstrate lower correlation coefficients than primary tumors of the same type, just like PDXs (6). Third, because of the “orthotopic” nature of TCGA tumors, TCGA transcriptome data contain those of the normal human stromal components that are hard to remove completely. Such human stromal components are largely absent from PDXs, while the presence of the mouse stromal components in PDX is easily removed from the final PDX transcriptome datasets (7, 12). Fourth, PDXs growing in mice lack certain microenvironment, for example, the absence of human hormone and growth factors, which forces the tumor to adapt and deviate further away from human tumors. Fifth, our PDX datasets described in this report are mainly derived from Asian patients (2, 7, 8), whereas TCGA datasets are largely from Western patients (1, 2).

Failure to establish the disease-specific molecular pathology for cancer cell lines using the DEG method seems to be consistent with the morphology observations. A majority of the cancer cell lines or cell line–derived xenografts of different diseases have a homogeneous morphology (uniform cell types with little differentiation) that is very different from human diseases (18) as well as PDXs (17). All the cell lines share a similar environment, a plastic surface, which bears no resemblance to any human or even mouse tissue. This may also explain why the cell line–derived models are not particularly predictive of human response to pharmaceutical agents (19, 20). It is worth mentioning that our analyses and results do not rule out that some cell lines still maintain gene expression patterns similar to those of their original tissues. The in vitro immortalized cell lines usually go through irreversible changes, or crisis, from their original tumors, resulting in loss of tissue specificity, while PDXs do not. It would be very interesting to compare the molecular pathology of human tumors with those of other new experimental systems that do not go through crisis and share histopathology features as in human tumors, such as organoid culture (21, 22). If these systems do have similar molecular pathology, they could also function as predictive models, like PDXs.

Q.-X. Li is the vice president at Crown Bioscience. No potential conflicts of interest were disclosed by the other authors.

Conception and design: S. Guo, Q.-X. Li

Development of methodology: S. Guo

Acquisition of data (provided animals, acquired and managed patients, provided facilities, etc.): W. Qian, J. Cai

Analysis and interpretation of data (e.g., statistical analysis, biostatistics, computational analysis): S. Guo, W. Qian

Writing, review, and/or revision of the manuscript: S. Guo, J. Cai, J.-P. Wery

Administrative, technical, or material support (i.e., reporting or organizing data, constructing databases): J. Cai, J.-P. Wery

Study supervision: J.-P. Wery, Q.-X. Li

Other (raw data generation): J. Cai

Other (performed IHC study): L. Zhang

The authors thank the patients who donated their tissues for this study; the scientists from BMS who profiled some of these PDXs and helped to elucidate the genomic/genetic background of these models (Drs. Heshani DeSilva, Petra Ross-Macdonald, Aiqing He, Xiadi Zhou, Matthew Lorenzi, Marco Gottardis, and Rolf-Peter Ryseck); and the technicians at the Division of Translational Oncology and the Animal Center at Crown Bioscience for technical support of this work.

The costs of publication of this article were defrayed in part by the payment of page charges. This article must therefore be hereby marked advertisement in accordance with 18 U.S.C. Section 1734 solely to indicate this fact.

1. Hoadley KA, Yau C, Wolf DM, Cherniack AD, Tamborero D, Ng S, et al. Multiplatform analysis of 12 cancer types reveals molecular classification within and across tissues of origin. Cell 2014;158:929–44.
2. The Cancer Genome Atlas Research Network. Comprehensive molecular characterization of gastric adenocarcinoma. Nature 2014;513:202–9.
3. The Cancer Genome Atlas Research Network. Comprehensive genomic characterization defines human glioblastoma genes and core pathways. Nature 2008;455:1061–8.
4. Ge L, Shao GR, Wang HJ, Song SL, Xin G, Wu M, et al. Integrated analysis of gene expression profile and genetic variations associated with ovarian cancer. Eur Rev Med Pharmacol Sci 2015;19:2703–10.
5. Tentler JJ, Tan AC, Weekes CD, Jimeno A, Leong S, Pitts TM, et al. Patient-derived tumour xenografts as models for oncology drug development. Nat Rev Clin Oncol 2012;9:338–50.
6. Ding L, Ellis MJ, Li S, Larson DE, Chen K, Wallis JW, et al. Genome remodelling in a basal-like breast cancer metastasis and xenograft. Nature 2010;464:999–1005.
7. Yang M, Shan B, Li Q, Song X, Cai J, Deng J, et al. Overcoming erlotinib resistance with tailored treatment regimen in patient-derived xenografts from naive Asian NSCLC patients. Int J Cancer 2013;132:E74–84.
8. Zhang L, Yang J, Cai J, Song X, Deng J, Huang X, et al. A subset of gastric cancers with EGFR amplification and overexpression respond to cetuximab therapy. Sci Rep 2013;3:2992.
9. Walter AO, Sjin RT, Haringsma HJ, Ohashi K, Sun J, Lee K, et al. Discovery of a mutant-selective covalent inhibitor of EGFR that overcomes T790M-mediated resistance in NSCLC. Cancer Discov 2013;3:1404–15.
10. Jiang J, Wang DD, Yang M, Chen D, Pang L, Guo S, et al. Comprehensive characterization of chemotherapeutic efficacy on metastases in the established gastric neuroendocrine cancer patient derived xenograft model. Oncotarget 2015;6:15639–51.
11. Bladt F, Friese-Hamim M, Ihling C, Wilm C, Blaukat A. The c-Met inhibitor MSC2156119J effectively inhibits tumor growth in liver cancer models. Cancers 2014;6:1736–52.
12. Chen D, Huang X, Cai J, Guo S, Qian W, Wery JP, et al. A set of defined oncogenic mutation alleles seems to better predict the response to cetuximab in CRC patient-derived xenograft than KRAS 12/13 mutations. Oncotarget 2015;6:40815–21.
13. Yang M, Xu X, Cai J, Ning J, Wery JP, Li QX. NSCLC harboring EGFR exon-20 insertions after the regulatory C-helix of kinase domain responds poorly to known EGFR inhibitors. Int J Cancer 2016;139:171–6.
14. Robinson MD, Smyth GK. Small-sample estimation of negative binomial dispersion, with applications to SAGE data. Biostatistics 2008;9:321–32.
15. Barretina J, Caponigro G, Stransky N, Venkatesan K, Margolin AA, Kim S, et al. The Cancer Cell Line Encyclopedia enables predictive modelling of anticancer drug sensitivity. Nature 2012;483:603–7.
16. Huang da W, Sherman BT, Zheng X, Yang J, Imamichi T, Stephens R, et al. Extracting biological meaning from large gene lists with DAVID. Curr Protoc Bioinformatics 2009;Chapter 13:Unit 13.11.
17. Akashi Y, Oda T, Ohara Y, Miyamoto R, Hashimoto S, Enomoto T, et al. Histological advantages of the tumor graft: a murine model involving transplantation of human pancreatic cancer tissue fragments. Pancreas 2013;42:1275–82.
18. Daniel VC, Marchionni L, Hierman JS, Rhodes JT, Devereux WL, Rudin CM, et al. A primary xenograft model of small-cell lung cancer reveals irreversible changes in gene expression imposed by culture in vitro. Cancer Res 2009;69:3364–73.
19. Johnson JI, Decker S, Zaharevitz D, Rubinstein LV, Venditti JM, Schepartz S, et al. Relationships between drug activity in NCI preclinical in vitro and in vivo models and early clinical trials. Br J Cancer 2001;84:1424–31.
20. Voskoglou-Nomikos T, Pater JL, Seymour L. Clinical predictive value of the in vitro cell line, human xenograft, and mouse allograft preclinical cancer models. Clin Cancer Res 2003;9:4227–39.
21. Drost J, van Jaarsveld RH, Ponsioen B, Zimberlin C, van Boxtel R, Buijs A, et al. Sequential cancer mutations in cultured human intestinal stem cells. Nature 2015;521:43–7.
22. Chua CW, Shibata M, Lei M, Toivanen R, Barlow LJ, Bergren SK, et al. Single luminal epithelial progenitors can generate prostate organoids in culture. Nat Cell Biol 2014;16:951–61.