Large collections of genome-wide data can facilitate the characterization of disease states and subtypes, permitting pan-cancer analysis of molecular phenotypes and evaluation of disease context for new therapeutic approaches. We analyzed 9,544 transcriptomes from more than 30 hematologic malignancies, normal blood cell types, and cell lines, and showed that disease types could be stratified in a data-driven manner. We then identified cluster-specific pathway activity, new biomarkers, and in silico drug target prioritization through interrogation of drug target databases. Using known vulnerabilities and available drug screens, we highlighted the importance of integrating molecular phenotype with drug target expression for in silico prediction of drug responsiveness. Our analysis implicated BCL2 expression level as an important indicator of venetoclax responsiveness and provided a rationale for its targeting in specific leukemia subtypes and multiple myeloma, linked several polycomb group proteins that could be targeted by small molecules (SFMBT1, CBX7, and EZH1) with chronic lymphocytic leukemia, and supported CDK6 as a disease-specific target in acute myeloid leukemia. Through integration with proteomics data, we characterized target protein expression for pre-B leukemia immunotherapy candidates, including DPEP1. These molecular data can be explored using our publicly available interactive resource, Hemap, for expediting therapeutic innovations in hematologic malignancies.

Significance:

This study describes a data resource for researching derailed cellular pathways and candidate drug targets across hematologic malignancies.

Gene expression profiles facilitate genome-wide analyses that can stratify patient subtypes and identify the activity patterns of various cellular pathways under different biological conditions (1, 2). Even though a large number of studies have characterized hematologic malignancies and normal blood cell types at genome-wide level since the introduction of microarray technology, most include only tens to hundreds of samples and focus on one disease. Thus, understanding the complete heterogeneity and similarity of diseases states and their subtypes remains an open challenge. Moreover, many hematologic malignancies are rare on the population level, necessitating collecting data across study cohorts.

Hematologic malignancies include acute and chronic leukemias of myeloid and lymphoid lineage, B-, T-, and NK-cell lymphomas, and multiple myeloma, and a number of premalignant conditions such as myelodysplastic syndrome (MDS), and myeloproliferative neoplasms (MPN). These diseases have highly variable genetic features, unique clinical courses, and varying therapeutic approaches. There is also a marked difference in prevalence, genetic background, and prognosis between adult and pediatric blood cancers. In children, acute lymphoblastic leukemia (ALL) is the most common hematologic malignancy, while in adults, non-Hodgkin lymphomas (NHL), followed by multiple myeloma, chronic lymphocytic leukemia (CLL), and acute myeloid leukemia (AML) are the most common. Treatment is moving toward increased utilization of targeted therapies in combination with traditional chemotherapies. Targeted therapies include tyrosine kinase inhibitors such as those developed against BCR-ABL fusion found in CML and some ALL cases, or antibody therapies including CD38 targeting in multiple myeloma, and engineered CAR-T cells recognizing cell surface CD19 or CD22 antigens in relapsed ALL and NHL (3–5). Yet, current therapies to treat hematologic malignancies rely heavily on drugs that target DNA metabolism in actively proliferating cells or intracellular signaling events that are involved in proliferation (6). Although these drugs have markedly improved progression-free survival, redundancy in signaling and the failure to eradicate quiescent cells (7) can facilitate the rapid development of therapy resistance. Testing a wider portfolio of new drug targets or repurposing drugs with established clinical indications represents promising strategies (6, 7). Molecular profile–guided approaches hold promise to improve the efficiency of this process (8).

We present here a resource that organizes samples from patients with cancer, healthy donors, and those at premalignant stages for comparative analysis based on both curated annotations and data-driven clustering of molecular phenotypes. This hematologic pan-cancer analysis permits the identification of clinically relevant molecular features and the exploration of new drug-targeting approaches across the disease hierarchy. The data and analysis tools are made available as an interactive online resource Hemap (http://hemap.uta.fi/) that synthesizes the curated genome-wide data across different disease subtypes.

Dataset retrieval and extraction of sample annotation data

Transcriptome datasets for Hemap were retrieved from the NCBI Gene Expression Omnibus (GEO) database (9) and represent samples hybridized to hgu133Plus2 genome-wide microarrays. The metadata were retrieved on the basis of matching disease ontology terms for hematologic malignancies against sample annotations (R/Bioconductor GEOmetadb package, “gsm” and “gse” tables), followed by manual curation, resulting in 10,470 samples. Refer to Supplementary Methods for details. Eight leukemia types, 8 B-cell lymphoma types, 7 T/NK lymphomas, multiple myeloma, and 4 proliferative disorders are represented by primary patient samples, while in total 36 disease types are included considering also their subcategories and cell line data (Supplementary Tables S1 and S2).

Data preprocessing and quality control

Samples with a median of raw probe intensity distribution deviating more than 1.5 in log2 scale from the median of all medians were deemed outliers and filtered out as well as those with an interquartile range deviating more than 0.75 from the median of interquartile ranges. Finally, duplicate samples, as well as all disease types with less than three samples (and samples assigned to these), were removed, resulting in 9,544 samples that were processed using the RMA probe summarization algorithm (10) with probe mapping to Entrez Gene IDs (from BrainArray version 18.0.0, ENTREZG). Finally, we employed the bias-correction method (11) to correct for any remaining technical differences (Supplementary Fig. S1). BeatAML (12) clinical and mutation data were obtained from source data file 41586_2018_623_MOESM3_ESM.xlsx. RNA sequencing (RNA-seq) counts were obtained from the authors. Genes with over 1 cpm expression in over 1% of the samples were kept. Data were normalized using limma voom and quantile normalization.

Dimensionality reduction

Dimensionality reduction methods are unsupervised methods that use measures of (dis)similarity and an optimization strategy to return as output sample coordinates in a lower dimension. Metrics of continuity, trustworthiness, and k-NN error were used to assess how well the visualization in 2D preserved their relative placement in the original coordinate space. We tested Gaussian Process Latent Variable Model (GPLVM; ref. 13), Locally Linear Embedding (LLE; ref. 14), Principal Component Analysis (PCA; ref. 15), Probabilistic Principal Component Analysis (PPCA; ref. 16), Sammon Mapping (SM; ref. 17), and t-Distributed Stochastic Neighbor Embedding (t-SNE; ref. 18; see Supplementary Methods for parameters). Comparison of the different methods encouraged the selection of t-SNE maps, specifically the Barnes–Hut approximated version of t-SNE implementation (BH-SNE; ref. 19). In final analysis, 15% genes with highest variance were used in construction of t-SNE maps (see ref. 20 for justification).

Assignment of cluster centers on t-SNE maps

Kernel density–based clustering algorithm (mean-shift clustering with bandwidth parameter set to 2.5, LPCM-package in R), was used to cluster the data following the dimensionality reduction. This method allows the discovery of sample sets that share similar features without having to prespecify the number of clusters. The term “cluster” is used in the text to refer to this computational clustering result, and the term “group” is used in context of visual examination. Pairwise statistical analysis between different sample features and clusters was performed as in ref. 21, based on Spearman correlation and the Bonferroni method for multiple hypothesis testing correction (see Supplementary Methods for details).

Discretizing gene expression with mixture models

Microarray hybridization generates background signal, which we would like to distinguish from real expression signal. The large sample size of Hemap was leveraged for fitting gene-specific models to cluster the gene expression in two components (expressed and not detected, denoted by 1 and −1, respectively). Gaussian finite mixture models were fitted by expectation-maximization algorithm (R package mclust version 4.3). If the uncertainty value from the model was more than 0.1, we assigned a value of 0 to denote low level. In addition, each log2 expression value lower than 4 was assigned a value −1 and values higher than 10 a value of 1. This was done to assure minimal amount of misclassifications of data samples to wrong components. The model was chosen by fitting both equal and variable variance models and ultimately choosing the model that achieved a higher Bayesian information criterion (BIC) to avoid overfitting. For drug target analysis, we utilized an adjustment for genes where background distribution was not found (gene is always expressed), or if over 90% of the samples had uncertain expression based on the model classification. Expressed state was assigned when >60% of the uncertain samples had expression above 6. Not detected status was reevaluated similarly (60% at level below 6).

Gene set analysis

The pathway and gene set enrichment analysis available in the Hemap resource was generated on the basis of gene sets retrieved from MsigDB v5.0 (molecular signatures; ref. 22), Wikipathways (06.2015; ref. 23), Recon 1 (metabolic pathways; ref. 24), Pathway Commons 7 (25), and DSigDB v1.0 (drug targets; ref. 26). Gene sets were limited to contain between 5 and 500 expressed genes (as defined above) per gene set, resulting in 19,680 gene sets that were evaluated across the dataset. The gene set variation analysis (GSVA; ref. 27), GSVA package 1.13.0 in R, was used to calculate gene set enrichment scores (positive for increased and negative for decreased expression) for each sample (parameters mx.diff = F, tau = 0.25, rnaseq = F). Significance was evaluated on the basis of empirical P values calculated using 1,000 random permutations of genes within the gene set, separately for gene set sizes 5–20, 25, 30, 40, 50, 75, 100, 200, 300, 400, and 500 to correct for differences in gene set sizes. Hypergeometric test was used to calculate enrichment of significant scores in a specific cluster.

Data sources used for evaluating drug targeting approaches

Drugs in clinical trials for leukemias, lymphomas, or multiple myeloma were obtained from ClinicalTrials.gov (accessed March 7th, 2018) maintained by the NIH (Bethesda, MD), including ongoing and terminated trials. Leukemia clinical trials were further sorted to those with clinical indication associated with AML, pre-B-ALL, CML, CLL, or multiple leukemia types. Drugs in use based on approved status in Finland were provided by the Finnish Pharmaceutical Information Centre Ltd and drugs approved by the FDA for leukemia, lymphoma, and myeloma were queried from FDA website (fda.gov–Drugs–Information on Drugs; Supplementary Table S3). A table of gene-level details for each drug was obtained from DSigDB (DSigDBv1.0 Detailed.txt; ref. 26) and integrated to Hemap in silico drug screening analysis. The list of drugs targeting epigenetic modifiers is based on the gene list with 124 genes available from ChEMBL_20 Target Classification Hierarchy (Supplementary Table S4; ref. 28). Analysis using TTD (Therapeutics Targets Database; ref. 29), DGIdb3.0 (30) for FDA-approved drugs across a wider disease context (31) as a source database was based on a total of 11,373 unique drugs and 1,270 unique genes. Drugs in use and in clinical trial included high confidence targets that were reported in several databases or had an associated PubMed identifier. A surface marker gene list with total of 996 genes was obtained from Cell Surface Protein Atlas (32) to evaluate putative novel immunotherapy targets.

Drug target in silico analysis in hierarchical manner

A disease hierarchy: (i) All disease samples; (ii) disease combinations; (iii) leukemia, lymphoma, myeloma; (iv) AML, pre-B-ALL, T-ALL, CLL; (v) disease clusters; was used to evaluate disease or subtype-specific drug target expression. Statistical significance of binary feature enrichment (e.g., high expression state) in a particular sample group was first evaluated using the hypergeometric enrichment test, followed by Bonferroni adjustment of P values. If >90% of the samples had high expression for a gene in the disease context, Inf score was assigned instead of −log10P value (hypergeometric test would not be meaningful if the sample size was close to the whole population). Each significant gene was uniquely assigned to the disease group with the lowest P value. In the case of equal P values, a broader disease category was prioritized using the disease hierarchy. As a second filtering level, the Wilcoxon test was used to test whether the drug target gene is expressed at higher level in cancer compared with normal erythroid, myeloid, B-lymphoid, or T-lymphoid samples. One normal sample group comparison was accepted for downstream analysis (with the respective comparison annotated as failed). In silico drug analysis was benchmarked using two case studies: drugs from ref. 33 and secondly known vulnerabilities (in clinical use/trial). Success rate was reported for drug target gene expression in disease, specificity for disease/subtypes, and higher expression relative to normal cells.

BeatAML drug analysis

Spearman correlation was computed for each drug AUC values and cancer map clusters, drug target genes, or target gene mutations. Furthermore, mutations with at least 5 observations and significant correlation Padj < 0.05 to drug AUC values or significant Fisher exact test Padj < 0.05 in cancer map clusters were added as features that could improve drug sensitivity prediction.

From total of 122 drugs, 47 were excluded on the basis of three criteria. First, 25 drugs with IC50 lower quartile below 10 nmol/L were excluded as these drugs have limited efficacy. Second, 9 drugs with less than 80 samples with measured drug responses were excluded. Third, only drugs with drug target information were kept, resulting in total of 75 drugs. The elastic net implemented in glmnet (34) was trained using 10-fold cross-validation using caret (35) trainControl and repeatedcv method. Caret function train and its functionality tuneGrid were used to optimize alpha parameter denoting the L1 and L2 regularization term proportions for elastic net. Each drug had three categories of features to fit the model: clusters, drug target gene expression, or mutations. To test the importance of each category in model fitting, sample order was randomly shuffled for one category while the original order was preserved for the other categories. Therefore, if the shuffled category features were important for the model fit, model overall fit should decrease as the other features are unchanged. This process was repeated 100 times and median of RMSE and R2 values were computed. Only drugs with good fit when using all the features were kept, having R2 over 0.25 and RMSE less than 0.9.

Drug sensitivity testing using patient and healthy donor samples

Bone marrow aspirates or peripheral blood samples were obtained from patients with AML (n = 52) and healthy donors (n = 15) after informed written consent using protocols approved by a local Institutional Review Board and in accordance with the Declaration of Helsinki. Mononuclear cells (MNC) were isolated by density gradient separation (Ficoll-Paque PREMIUM; GE Healthcare) and immediately used for drug testing. Cells were maintained in Mononuclear Cell Medium (MCM; Promocell) or in a 25% HS-5 conditioned medium plus 75% RPMI1640 medium mix (CM). Palbociclib and idarubicin (from Selleck) were solvated in dimethyl sulfoxide and plated in 5 different concentrations in 10-fold dilutions on 384-well plates using an Echo acoustic dispenser (Labcyte), 1–10,000 nmol/L for palbociclib; 0.1–1,000 nmol/L for idarubicin. A total of 1 × 104 cells were added per well and incubated with the drugs for 3 days at 37°C, 5% CO2. Viability was measured using the CellTiter-Glo reagent (Promega) according to the manufacturer's instructions and using the PHERAstar (BMG LABTECH) or SpectraMax Paradigm (Molecular Devices) plate readers. Sensitivity to the drugs was quantified using a drug sensitivity score (DSS), which is a modified AUC-based metric described previously (36). A selective DSS value was calculated by subtracting the mean DSS of the healthy bone marrow controls from the DSS of individual AML samples.

IHC

Anti-DPEP1 antibody (Atlas antibodies, rabbit polyclonal IgG against human renal dipeptidase 1, product number: HPA009426, lot number: A57960) was used with the dilution 1:2,500 to stain formalin-fixed and paraffin-embedded bone marrow trephine biopsy samples from pediatric patients with pre-B-ALL from the Pirkanmaa ERVA area between the years 2000 and 2017. Diagnostic samples (n = 126; including also one Burkitt lymphoma and a T-lymphoblastic leukemia/lymphoma case) were stained with a Ventana Benchmark GX using UltraView Universal DAB Detection Kit. Cytoplasmic and membranous staining was graded negative if less than 20% of the leukemic blasts were stained, positive if 20% or less than 50% of the blasts were positively stained and strong positive if 50% or a greater proportion of the blasts were positive. The analysis was performed by two pathologists without the knowledge of the patient data or the interpretation of the other analyst. The samples and clinical data were studied with the approval of the Tampere University Hospital Ethical committee (#R16054 and #R13109) and in accordance with the Declaration of Helsinki.

Interactive web resource for data analysis

The interactive online resource and the accompanying user guide for the Hemap resource are described in more detail in the Supplementary Methods and available at http://hemap.uta.fi/.

Integrating transcriptomes to characterize molecular states across hematologic malignancies

For the comparative analysis of hematologic malignancies on molecular level, we assembled gene expression profiles from the NCBI GEO database (9), comprising patient samples representing different cancers and proliferative disorders, and including cell lines and normal blood cell types as controls. Sample annotations were curated, and each sample was assigned a disease category. After data quality control and bias correction (see Materials and Methods; Supplementary Fig. S1), 9,544 samples comprise the final dataset (denoted “Hemap” samples) for subsequent analysis, including 7,279 patient samples (mainly diagnostic) from hematologic malignancies (Fig. 1A; Supplementary Tables S1 and S2).

Figure 1.

A molecular stratification of hematologic malignancies and normal blood cell types is captured in a t-SNE visualization. A, Composition of the hematologic transcriptome dataset. Of the 9,544 samples, 6,820 represent hematologic malignancies (leukemia, lymphoma, or myeloma), and the rest consist of cancer cell lines, proliferative diseases (myeloid denoted pM and lymphoid denoted pL), normal blood cells (healthy donor or patient). See also Supplementary Table S2. B, The transcriptome data projected in 2D using t-SNE is shown. Each dot represents one of the 9,544 samples. Cluster assignment based on density estimation is shown in color for seven distinct clusters visible on the cancer map. C, The separation between annotated disease types (indicated by color) is shown. The lymphoid malignancies separate into acute lymphoid leukemias (pre-B-ALL, pink; T-ALL, blue), lymphomas (top right), multiple myeloma (adjacent to B-cell lymphomas), and CLL (bottom). The myeloid diseases (AML, CML, and myeloproliferative disease) are grouped closely. Samples representing normal cell types or cell lines are in gray color. Numbers refer to data-driven cluster assignment (see Supplementary Table S2).

Figure 1.

A molecular stratification of hematologic malignancies and normal blood cell types is captured in a t-SNE visualization. A, Composition of the hematologic transcriptome dataset. Of the 9,544 samples, 6,820 represent hematologic malignancies (leukemia, lymphoma, or myeloma), and the rest consist of cancer cell lines, proliferative diseases (myeloid denoted pM and lymphoid denoted pL), normal blood cells (healthy donor or patient). See also Supplementary Table S2. B, The transcriptome data projected in 2D using t-SNE is shown. Each dot represents one of the 9,544 samples. Cluster assignment based on density estimation is shown in color for seven distinct clusters visible on the cancer map. C, The separation between annotated disease types (indicated by color) is shown. The lymphoid malignancies separate into acute lymphoid leukemias (pre-B-ALL, pink; T-ALL, blue), lymphomas (top right), multiple myeloma (adjacent to B-cell lymphomas), and CLL (bottom). The myeloid diseases (AML, CML, and myeloproliferative disease) are grouped closely. Samples representing normal cell types or cell lines are in gray color. Numbers refer to data-driven cluster assignment (see Supplementary Table S2).

Close modal

To enable discovery and statistical comparison of previously known and novel molecular phenotypes alongside the annotated disease classes, we utilized a data-driven approach that allows discovery of sample groups and visualizes these for interpretation. First, we compared dimensionality reduction methods that allow visualization of complex data in 2D space. The data representation accuracy were quantitatively assessed using the metrics of continuity, trustworthiness, and k-nearest neighbor (k-NN) classifier error (see Materials and Methods; Supplementary Fig. S2). As a result, t-SNE (18) and its approximation, Barnes–Hut–SNE (BH-SNE; ref. 19), was selected, as it performed robustly (continuity and trustworthiness, 0.9860 and 0.9943, respectively) in two dimensions and still preserved the neighborhood structure (k-NN error 0.0668; Supplementary Fig. S2). The t-SNE map was then utilized for density-based clustering to assign each sample to a cluster (Fig. 1B, see t-SNE Materials and Methods for details) and the results were compared with annotated disease classes (Fig. 1C). We conclude that both quantitative and biological assessments confirm that our approach faithfully organizes the samples in an unsupervised manner based on their molecular phenotype and disease type. We denote the resulting data-driven sample grouping as the Hemap “cancer map” in the following text.

Comparative analysis associates clinical annotations and pathway activity to the molecular disease stratification

The 2D cancer map revealed a clinically relevant substructure (Supplementary Table S2), as exemplified by the different B-cell lymphomas and pre-B-ALL cytogenetic subtypes (colored in Fig. 2A and B, respectively), providing biological validation for separation of relevant phenotypes on the cancer map. A detailed comparison to annotations is presented in Supplementary Table S2. Next, statistically significant associations of clusters with gene expression levels, clinical annotations and pathway enrichment scores across different databases were calculated (see Materials and Methods). These results can be interactively tabulated and visualized using the online Hemap resource. We selected five most significant pathways at disease cluster level, or those matching pre-B-ALL subtype clusters (Fig. 2C) for visualization in a heatmap (see also Supplementary Table S5). In AML, the pathways for hematopoietic stem cell differentiation, pentose phosphate pathway, renin–angiotensin system, IL8/CXCR1-mediated signaling events and C-MYB transcription factor networks were most significantly enriched. These reflect well the known progenitor-like phenotype of AML cells. Pentose phosphate pathway, on the other hand, represents a recently uncovered vulnerability (37, 38) that is important for AML growth. Similar disease-relevant pathways were also uncovered from T-ALL (TCR pathway), CLL (BCR signaling pathway), lymphomas (cell adhesion molecules; ref. 39), and multiple myeloma (N-glycan biosynthesis; refs. 40, 41). In pre-B-ALL clusters, processes related to transcriptional regulation were highly significant (including histone modification, CTCF pathway, and RNA processing). WNT signaling (42, 43) was found as a cluster 29-specific (t1;19) enriched pathway, which matches its known relevance in these TCF3-PBX1 fusion carrying cases. Samples expressing a gene or pathway of interest can be visualized as shown in Fig. 2D, distinguishing the progenitor-like MLL subtype of pre-B-ALL based on the lack of expression of the differentiation marker MME (also known as CD10) that is used in clinical diagnostics (Fig. 2E). Similarly, most significant associations between disease clusters and drug signatures can be examined by e-staining their significance (in red), as illustrated by association of PI3K inhibitor BEZ235 gene set signature from DsigDB to pre-B-ALL (Fig. 2F), which validates a known association between a drug and a disease subtype. Further analysis on the BEZ235 gene set and several case studies on how to generate novel hypothesis are presented in the accompanying User guide to demonstrate different analysis (refer to “Explore” and “e-staining” examples).

Figure 2.

Comparison of molecular phenotypes based on the cancer map. Sample attribute visualizations are exemplified that allow characterizing the molecular phenotypes. Different BCL types (A) and pre-B-ALL subtypes (B) are colored on the basis of sample annotations (refer to Supplementary Table S2 for abbreviations). C, The five most significant pathways per disease cluster (top) or pre-B-ALL cluster (bottom) are shown as a heatmap [tones of red indicate significant enrichment to cluster (hypergeometric test, scaled P value)]. The pre-B-ALL cluster number and color (as in B) are indicated below the heatmap. D, The bimodal log2 gene expression signal distribution can be used to separate samples with low or nondetectable expression (N.D.; blue) from samples expressing the gene (red). Alternatively, significance of enrichment for gene sets and pathways from different databases can be selected for visualization (e-staining) on the cancer map. E, The corresponding gene expression state is shown on the cancer map for the B-lymphoid differentiation marker MME, where the color tones correspond to scaled log2 expression values (red, high; white, low; blue, not detected). F, Gene set enrichment for BEZ235 targets is e-stained, with empirical P < 0.05 shown in red. N.S., nonsignificant.

Figure 2.

Comparison of molecular phenotypes based on the cancer map. Sample attribute visualizations are exemplified that allow characterizing the molecular phenotypes. Different BCL types (A) and pre-B-ALL subtypes (B) are colored on the basis of sample annotations (refer to Supplementary Table S2 for abbreviations). C, The five most significant pathways per disease cluster (top) or pre-B-ALL cluster (bottom) are shown as a heatmap [tones of red indicate significant enrichment to cluster (hypergeometric test, scaled P value)]. The pre-B-ALL cluster number and color (as in B) are indicated below the heatmap. D, The bimodal log2 gene expression signal distribution can be used to separate samples with low or nondetectable expression (N.D.; blue) from samples expressing the gene (red). Alternatively, significance of enrichment for gene sets and pathways from different databases can be selected for visualization (e-staining) on the cancer map. E, The corresponding gene expression state is shown on the cancer map for the B-lymphoid differentiation marker MME, where the color tones correspond to scaled log2 expression values (red, high; white, low; blue, not detected). F, Gene set enrichment for BEZ235 targets is e-stained, with empirical P < 0.05 shown in red. N.S., nonsignificant.

Close modal

Pan-cancer analysis to recognize vulnerabilities across disease contexts

Parallel to molecular stratification, the diversity of patient profiles in Hemap has the potential to support the development of new therapeutic strategies by leveraging the information about the expression profiles across hematologic malignancies. We analyzed the specificity of drug target expression states across patient groups in a hierarchical manner (Materials and Methods), as illustrated in Fig. 3A (see also Supplementary Table S3 for a list of drugs and their targets and Supplementary Table S4 for significant associations listed by disease hierarchy). The corresponding significance ranking for targets of approved drugs is shown as heatmaps in Fig. 3A and B, where the columns represent different disease contexts and gene targets (in rows) are sorted according to their most significant association. The clinical indication for the drug(s) that could be used to target each gene is indicated in the panel on the right, while e-staining results for example drug targets are shown in Fig. 3C (see also Supplementary Fig. S3). Proteasome-targeting drugs bortezomib and carfilzomib are in use for lymphomas and multiple myeloma. Accordingly, 10 of 20 genes encoding the proteasome subunits are associated to this disease hierarchy level, or to the pan-cancer category, with highest significance (Fig. 3A). In comparison, for precision drugs such as the antibody drugs elotuzumab (SLAMF7, P < 1e-315) or daratumumab (CD38, P = 1e-196) approved for multiple myeloma, or rituximab (MS4A1, P < 1e-315 in LY+CLL) used in lymphomas and CLL (Fig. 3A), the specific gene targets can be examined. Among all known vulnerabilities (drugs in clinical use/trial) a gene-level analysis detected 84% of targets expressed and 69% were associated with highest specificity score (−log10) to the respective disease context (see Supplementary Table S3). This is exemplified by the comparison of genes with significant association to lymphoid leukemias (Fig. 3B). BCL2 targeted by venetoclax is shown as an example of an approved target in CLL that our analysis associates with this disease context and with potential for targeting in multiple myeloma. The genes marked with asterisk, including IL2RA, indicate targets of drugs approved for other hematologic malignancies. Our analysis associated these with repurposing potential in CLL and/or ALL. FLT3 is a recently approved target with disease cluster-specific expression in B-lymphoid and myeloid leukemias.

Figure 3.

Pan-cancer analysis associates disease contexts with therapeutic strategies. A, The in silico drug target analysis across disease hierarchy groups is illustrated schematically using proteasome and surface protein–targeting drugs as examples. On the left, the heatmap columns are organized by disease, and drug targets (in rows) are sorted on the basis of their most significant disease context association [red color tones indicate significant P value in hypergeometric test, −log(P)]. The adjacent heatmap shows the disease indications for drugs known to target the gene in question. Notice that majority of drugs target multiple genes, as illustrated by bortezomib/carfilzomib, and only some correspond to precision drugs as exemplified by antibody targets (SLAMF7, CD19, MS4A1, and CD38). B, Comparison of targets of approved drugs with significant association to lymphoid leukemias are shown as in A. The disease indications in dimmer red tone reveal potential for repurposing of drugs approved or in clinical trials in other disease indications (notice that leukemia (LE) includes ALL and CLL). C, Example genes highlighted in the heatmaps are e-stained on the t-SNE map as in Fig. 2C. N.D., nondetectable.

Figure 3.

Pan-cancer analysis associates disease contexts with therapeutic strategies. A, The in silico drug target analysis across disease hierarchy groups is illustrated schematically using proteasome and surface protein–targeting drugs as examples. On the left, the heatmap columns are organized by disease, and drug targets (in rows) are sorted on the basis of their most significant disease context association [red color tones indicate significant P value in hypergeometric test, −log(P)]. The adjacent heatmap shows the disease indications for drugs known to target the gene in question. Notice that majority of drugs target multiple genes, as illustrated by bortezomib/carfilzomib, and only some correspond to precision drugs as exemplified by antibody targets (SLAMF7, CD19, MS4A1, and CD38). B, Comparison of targets of approved drugs with significant association to lymphoid leukemias are shown as in A. The disease indications in dimmer red tone reveal potential for repurposing of drugs approved or in clinical trials in other disease indications (notice that leukemia (LE) includes ALL and CLL). C, Example genes highlighted in the heatmaps are e-stained on the t-SNE map as in Fig. 2C. N.D., nondetectable.

Close modal

Utility of molecular disease stratification for evaluating drug screen results

Next, we examined leukemias at disease subtype level from two ex vivo drug screening datasets (12, 33). Venetoclax had lower efficacy in T-ALL versus B-ALL and lowest efficacy was in t1;19 samples in the ALL drug screen (33), which agrees with venetoclax target BCL2 gene expression in Hemap (Fig. 3C). Topotecan and dasatinib had the opposite profile, also in agreement with subtype-specific expression of their targets TOP1MT and LCK (Supplementary Fig. S3). Taken together, out of 15 drugs from this ALL screen tested with our hierarchical analysis, 14 (93%) had a candidate target expressed and 12 (80%) received highest target indication in ALL (Supplementary Table S4). Using the larger beatAML dataset (12), we set out to examine in an unbiased manner what matters more in predicting drug responsiveness: target expression, genetic lesions traditionally used to stratify patients, or the molecular phenotype as defined by clustering of transcriptome states. We implemented models using elastic nets, where a model for each drug (75 in total) was fit using these three categories of features. To test their importance for model fitting, sample order was randomly shuffled for one category while the original order was preserved for the other categories. The results for 11 drugs that achieved the best overall model fit (R2 > 0.25) are shown in Fig. 4A, including venetoclax, panobinostat (HDAC inhibitor), palbociclib (CDK4/6 inhibitor), 7 kinase inhibitors (many targeting FLT3), and an ALK inhibitor. The average R2 value from 100 tests is colored in the heatmap and summarized as a boxplot next to it. If the shuffled feature was important for the model fit, a decrease in R2 is expected (shift from darker red to dimmer or blue colors) as the other features are unchanged. For venetoclax, this analysis implicated target gene expression as the main predictor (Supplementary Fig. S4). For FLT3-targeting compounds, FLT3 mutation status was implicated as the top predictive feature (Supplementary Fig. S4). However, overall, the lack of cluster features in the model resulted in lowest predictive power. The disease clusters were the best predictors for palbociclib and panobinostat, whereas mutation status had no effect on their model fit. Panobinostat and palbociclib showed opposite drug responses in clusters 13, 2, 6 compared with cluster 1 (Fig. 4B). Hemap clusters 17, 5, and 6 corresponded to these clusters (Supplementary Fig. S4) and were similarly enriched for NPM1 and FLT3 mutations or PML-RARA fusion in both datasets. Comparison of clinical phenotypes revealed that blast morphology was different between the clusters, linking maturation level to the differential drug response (Fig. 4C; Supplementary Fig. S4).

Figure 4.

Evaluation of cluster and disease specificity of drug responses. A, A heatmap comparing how well the drug response data fits different elastic net regression models is shown (color indicates R2 values; drugs with R2 > 0.25 are shown). The values are summarized as boxplot on the right. Full model included clusters (Clust), gene expression (Gexp), and mutations (Mut), while one category is omitted in the other models. B, Palbociclib and panobinostat drug response AUC values are shown as boxplots for all AML cases and for clusters correlated to differential drug response identified for palbociclib and panobinostat. High AUC values mean drug resistance and low drug sensitivity. C, Heatmap of FAB morphology markers, cluster-specific genetic aberrations, and drug target genes (CDK6 for palbociclib and HDAC4,6,10,2 for panobinostat) is shown for same clusters as in B. D, The gene expression data for TOP2A and CDK6 are e-stained on cancer map. Comparison with normal blood cell types is shown as boxplots of the log2 gene expression signal. T, T-lymphoid; B, B-lymphoid; E, erythroid; M, myeloid. For CDK6, clusters corresponding to beatAML clusters (as in C) are shown. E, Drug sensitivity in an AML patient cohort based on DSS and sDSS scores (N = 52) is shown as boxplots for palbociclib (CDK4/6 inhibitor) and the approved AML drug (idarubicin). High difference between DSS and sDSS values indicates response in the bone marrow normal mononuclear cells, whereas low difference indicates selectivity in AML cells.

Figure 4.

Evaluation of cluster and disease specificity of drug responses. A, A heatmap comparing how well the drug response data fits different elastic net regression models is shown (color indicates R2 values; drugs with R2 > 0.25 are shown). The values are summarized as boxplot on the right. Full model included clusters (Clust), gene expression (Gexp), and mutations (Mut), while one category is omitted in the other models. B, Palbociclib and panobinostat drug response AUC values are shown as boxplots for all AML cases and for clusters correlated to differential drug response identified for palbociclib and panobinostat. High AUC values mean drug resistance and low drug sensitivity. C, Heatmap of FAB morphology markers, cluster-specific genetic aberrations, and drug target genes (CDK6 for palbociclib and HDAC4,6,10,2 for panobinostat) is shown for same clusters as in B. D, The gene expression data for TOP2A and CDK6 are e-stained on cancer map. Comparison with normal blood cell types is shown as boxplots of the log2 gene expression signal. T, T-lymphoid; B, B-lymphoid; E, erythroid; M, myeloid. For CDK6, clusters corresponding to beatAML clusters (as in C) are shown. E, Drug sensitivity in an AML patient cohort based on DSS and sDSS scores (N = 52) is shown as boxplots for palbociclib (CDK4/6 inhibitor) and the approved AML drug (idarubicin). High difference between DSS and sDSS values indicates response in the bone marrow normal mononuclear cells, whereas low difference indicates selectivity in AML cells.

Close modal

Classical targets involved in DNA metabolism (TOP2A and B) and clinically interesting targets, including CDK6, BCL2, MDM2 and VEGFR2 from clinical trials, ranked highly in our disease hierarchy analysis, as shown in Fig. 3 and Supplementary Fig. S3. However, when compared with normal cell types, only 7% of the targets had higher expression in disease than in normal cells (Supplementary Table S3). Palbociclib target CDK6 is highly expressed in all acute leukemias compared with normal blood cell types, while TOP2A has high mRNA levels also in normal blood cells (Fig. 4D). To evaluate drug sensitivity that is specific to cancer cells, an experimental ex vivo screening approach is exemplified in Fig. 4D by comparing in AML patient cell responses to the CDK4/6 inhibitor palbociclib and idarubicin targeting TOP2A (see Materials and Methods). Drug sensitivity and selective drug sensitivity scores (DSS and sDSS, respectively; see Materials and Methods; ref. 36) are compared in box plots (Fig. 4E). Overall, the AML patient bone marrow ex vivo cultures were more responsive to idarubicin (refer to Supplementary Fig. S4 for AML cell line data). However, a negative score indicating higher effect on normal bone marrow cell viability was observed for idarubicin in a larger fraction of AML cases compared with palbociclib. This observation of nonspecific response, implied by negative sDSS score, is consistent with our predictions from Hemap data. Therefore, the normal samples included in Hemap could provide valuable additional information for drug target selection. Comparison of BCL2 and BCL2L1 (also known as BCL-XL) levels are presented as another example in Supplementary Fig. S4, relevant to venetoclax versus navitoclax toxicity in targeting apoptosis. The Advanced Use Case in the Hemap User guide extends this analysis using pathway activities and drug chemical screen data.

Evaluating new therapeutic strategies in a pan-hematologic cancer context

Epigenetic regulation has emerged as an important mechanism that can corrupt the gene regulatory network (44), motivating novel therapeutic approaches. Utilizing the disease spectrum in Hemap, we performed a pan-hematologic cancer analysis of epigenetic modifiers (Supplementary Table S5), focusing on genes encoding proteins that are validated targets of small-molecule drugs (available from ChEMBL; ref. 28). We found elevated expression of this set of genes significantly enriched in CLL, T-ALL, and clusters 28 (pre-B-ALL) and 32 [AML; Fig. 5A, hypergeometric test Padj values 0.0003, 0.0074, 0.0127, 0.0174, respectively; see also Supplementary Table S5 for additional mutation frequency information (45) for the genes shown]. The expression state for six most significant genes from CLL are shown on the Hemap cancer map (Fig. 5A) and from independent validation RNA-seq data (Fig. 5B; ref. 46).

Figure 5.

Connecting the map of patient gene expression states to drug target profiles. A, Significantly enriched disease clusters are colored on the map based on pan-cancer analysis of epigenetic modifiers (purple, CLL; blue, T-ALL; green, AML cluster 32; pink, pre-B-ALL cluster 28). The expression states of the most significant drug candidates for CLL (SFMTB1, CBX7, EZH1, EHMT1, KMT2B, and BAZ2A) are shown (as in Fig. 2C) on the right. B, The expression level (log10 cpm) and SD (log2 SD) of the genes shown in A is indicated on the scatter plot representing independent RNA-seq data (GSE81274, N = 10; ref. 52). C, Significance ranking of surface target candidates for pre-B-ALL (x-axis, −log10P) are plotted against protein-level detection rate. Top candidates (P < 10−250) are indicated next to the plot. D,DPEP1 e-staining is shown as in A. E, DPEP1 IHC. The sample on the left was interpreted as negative. In the samples in the middle and on the right, over 50% of the leukemic blasts are showing membranous and cytoplasmic positivity and the staining of the sample was graded as strong positive (formalin-fixed, paraffin-embedded; ×40 magnification, Leica DM 3000 microscope, Leica MC190 HD microscope camera, Leica Application Suite software).

Figure 5.

Connecting the map of patient gene expression states to drug target profiles. A, Significantly enriched disease clusters are colored on the map based on pan-cancer analysis of epigenetic modifiers (purple, CLL; blue, T-ALL; green, AML cluster 32; pink, pre-B-ALL cluster 28). The expression states of the most significant drug candidates for CLL (SFMTB1, CBX7, EZH1, EHMT1, KMT2B, and BAZ2A) are shown (as in Fig. 2C) on the right. B, The expression level (log10 cpm) and SD (log2 SD) of the genes shown in A is indicated on the scatter plot representing independent RNA-seq data (GSE81274, N = 10; ref. 52). C, Significance ranking of surface target candidates for pre-B-ALL (x-axis, −log10P) are plotted against protein-level detection rate. Top candidates (P < 10−250) are indicated next to the plot. D,DPEP1 e-staining is shown as in A. E, DPEP1 IHC. The sample on the left was interpreted as negative. In the samples in the middle and on the right, over 50% of the leukemic blasts are showing membranous and cytoplasmic positivity and the staining of the sample was graded as strong positive (formalin-fixed, paraffin-embedded; ×40 magnification, Leica DM 3000 microscope, Leica MC190 HD microscope camera, Leica Application Suite software).

Close modal

A second promising new strategy, immunotherapy, can kill cancer cells by targeting surface proteins with antibodies (47) or chimeric antigen receptors (48). However, side effects due to targeting normal blood cells along with development of resistance occur (49). To provide a rational basis for extending the target repertoire, we used disease hierarchy analysis to rank 996 candidates available in the Cell Surface Protein Atlas (Supplementary Table S6; ref. 32), resulting in broad, disease, and subtype-specific candidates. The top ranked candidate genes in our analysis correspond to those that are uniformly high expressed within the specified disease context. The stem cell antigen CD33, actively pursued for treatment of AML (50), is among highly ranked surface targets in clinical trials shown in Supplementary Fig. S3. Next, we obtained proteomics profiles from 19 patients with B-ALL (51) to compare our ranked list for pre-B-ALL (refer to Supplementary Table S6) to protein-level expression. The trend between in silico drug target rank and protein detection rate is plotted in Fig. 5C. Validation rate for top candidates was above 75%. The highly ranked surface targets CLEC14A, DPEP1, CELSR2, MME, SDK2, INSR, GPM6B, ELFN2, FLT3, SLC22A16, FLT4, and APCDD1 correspond to those with higher expression in patients with pre-B-ALL compared with normal blood cells (see also Supplementary Fig. S5). The high gene expression state of DPEP1 (Fig. 5D) in pre-B-ALL was further validated at protein level based on IHC of diagnostic bone marrow biopsies. The grading from 117 ALL bone marrows and 16 samples representing other lymphoid malignancies or normal lymphoid tissues is presented in Table 1 and illustrated in Fig. 5E.

Table 1.

DPEP1 protein expression in bone marrow biopsies based on IHC grading

DPEP1 IHC
NegativePositiveStrong positiveTotal
Pre-B-ALL 
 BCR-ABL1 
 ETV6-RUNX1 16 10 33 
 Hyperdiploid 13 16 30 
 Hypodiploid 
 MLL Rearranged 
 TCF3-PBX1 
 Other 18 16 42 
 Total 48 50 19 117 
Other disease/tissues 
 BL 
 T-lymphoblastic leukemia/lymphoma 
 MCL 
 CLL 
 PTCL 
 CHL (NSCHL) 
 Tonsils 
 Thymus 
 Spleen 
 Total 15 16 
DPEP1 IHC
NegativePositiveStrong positiveTotal
Pre-B-ALL 
 BCR-ABL1 
 ETV6-RUNX1 16 10 33 
 Hyperdiploid 13 16 30 
 Hypodiploid 
 MLL Rearranged 
 TCF3-PBX1 
 Other 18 16 42 
 Total 48 50 19 117 
Other disease/tissues 
 BL 
 T-lymphoblastic leukemia/lymphoma 
 MCL 
 CLL 
 PTCL 
 CHL (NSCHL) 
 Tonsils 
 Thymus 
 Spleen 
 Total 15 16 

Abbreviations: BL, Burkitt lymphoma; CHL, classical Hodgkin lymphoma; MCL, mantle cell lymphoma; NSCHL, nodular sclerosis classical Hodgkin lymphoma; PTCL, peripheral T-cell lymphoma.

To further facilitate the utilization of the data, precalculated results are accessible via our interactive web resource (http://hemap.uta.fi/) including the expression state for 4,277 drug target gene sets and 1,094 drug response signatures, which can be further investigated in the context of the 12,433 pathways and molecular signatures (see Materials and Methods and User guide examples). Disease hierarchy analysis for the curated list of drug to target gene associations (11,373 drugs; 1,182 genes) from the Therapeutic Target Database (TTD; ref. 29), DGIdb (30), and targets of FDA-approved drugs across disease (31) is available in Supplementary Table S4. In this manner, in silico drug target selection based on Hemap can leverage gene and pathway expression, as evaluated across cancer types and normal blood cell types.

The integration of available genome-wide data from patients allows uncovering shared disease mechanisms and new therapeutic options. Recent work has highlighted that molecular and genetic data that helps stratify patients can dramatically increase the likelihood of success during clinical development (8, 52). However, in several cancer types, including those of hematopoietic and lymphoid tissues, the majority of data have been collected by separate studies concentrating on certain cancer types, which hinders the identification of cancer type–specific features. We present an interactive online resource, Hemap (http://hemap.uta.fi/), for analysis across multicenter gene expression datasets to investigate disease subgroups and compare molecular phenotypes across 9,544 samples from hematologic malignancies.

In practice, the samples included to Hemap are inaccessible to most clinical researchers. The Hemap resource serves to repurpose data from public repositories for clinical interpretation in an intuitive manner that does not require data analysis expertise. In future versions of Hemap, we plan to include also RNA-seq studies. Currently, the resource already contains the The Cancer Genome Atlas AML dataset and the User Guide includes examples using this data. Alongside curated disease assignment, we present a data-driven approach that organizes and integrates heterogeneous sample collections in an unbiased manner. To facilitate this, we demonstrated how unsupervised clustering and dimensionality reduction methods, here by the t-SNE method, can be used for organizing the molecular profiles for further downstream analysis. The high level of performance of t-SNE has been shown in context of various data types (18, 53, 54). In this manner, genes characterizing the patient clusters can be identified for further delineation of their functional role. In CLL, our analysis implicates high expression of several polycomb group proteins (SFMBT1, CBX7, and EZH1) in CLL that could be targeted by small molecules, in line with chromatin state data (46), and their mutation (45) frequencies, highlighting the importance to consider the spectrum of genetic and epigenetic changes in these malignancies. Earlier studies have implicated epigenetic plasticity as a key driver of CLL evolution during treatment (55). Specifically, CLL cases had little to no genetic subclonal evolution, while significant recurrent DNA methylation changes were enriched for regions near Polycomb targets (55). To further elucidate the mechanisms, inclusion of posttreatment data and integrating methylation, chromatin marker, and mutation profiles represent important future directions in developing the Hemap resource.

From a therapeutic perspective, approaches for the development of treatment strategies with a broad disease focus and molecular subtype resolution are urgently needed. We used Hemap to provide a roadmap for candidate drug therapies that allows prioritizing new candidates based on disease specificity. Our analysis recapitulated known vulnerabilities, providing additional confirmation for targets in current clinical trials: several compounds targeting Bcl2 have been developed and have shown promise in treating both CLL and non-Hodgkin's lymphoma (56, 57). However, navitoclax that also targets BCL2L1 (also known as BCL-XL) displays platelet toxicity. This potential for off-target effects was visible as high gene expression level in the erythroid lineage, supporting the choice of the more selective venetoclax. The prevalent high expression also in multiple myeloma and pre-B-ALL found in our study provides a rationale for the expansion of the testing of these compounds in lymphoid malignancies. This suggestion is additionally supported by a recent study showing that these compounds have promise in MLL-rearranged ALL (58), a pre-B-ALL subtype corresponding to cluster 29 in our dataset. However, Hemap analysis predicts insensitivity in T-ALL and t1;19 subtype, matching recent ALL drug screen data (33). Similarly, the elevated expression of the p53-regulating MDM2 in pre-B-ALL fits with recent data on successful application of antagonists in clinical trials (59), and mechanisms for its high expression ETV6-RUNX1–positive leukemias (60).

Currently, no drug screens have been carried out in primary patient cells across the spectrum of hematologic malignancies in Hemap. The utility of Hemap for drug repurposing was demonstrated in our recent study that identified dasatinib as a targeted therapy for a subgroup of patients with T-ALL (61). Here, we examined drug screen datasets to examine how differential drug responsiveness could be linked to disease subclusters and drug targets identified from the cancer maps. Using the beatAML dataset, we systematically compared the importance of mutations, clusters, and drug target gene expression in predicting drug responses. For drugs with a good overall model fit, clusters were the best predictors. However, the importance of each predictor was largely influenced by drug type. Best predictor for venetoclax response was BCL2 expression level. FLT3 mutation status and other mutations were the best predictors for kinase inhibitors. In contrast, disease clusters were the best predictors for palbociclib and panobinostat responses to which mutation status had no effect on model fit. Comparison of clusters in which panobinostat and palbociclib showed opposite drug responses revealed that blast morphology was different, linking maturation level to differential drug response. Furthermore, their drug targets were differentially expressed in these clusters, pointing out the importance of integrating context and drug target expression for in silico drug screening. Surprisingly, the HDAC expression pattern revealed cytosolic members (HDAC6 and HDAC10; ref. 62) as resistance markers, while nuclear HDAC4 and HDAC9 correlated with sensitivity. Our analysis also supported CDK4/6 as disease-specific targets that are known to act as critical activators of normal and leukemic hematopoietic stem cells (63). Here, palbociclib compared favorably to idarubicin regarding patient blast sensitivity against normal bone marrow cells, reflected in mRNA data from Hemap. The selectivity over normal cells may improve further using combination therapy (63) that allows decreasing the dose. However, additional parameters such as drug target protein level, drug metabolism, and cell proliferation rate further contribute to sensitivity and therefore not all patients matching a molecular subtype or expressing the target mRNA can be expected to respond favorably.

Cancer cells display remarkable plasticity: resistance to recently approved CD19-targeting CAR-T therapy has been shown to occur via mutations or splicing defects at the CD19 locus or lineage switching (49). To combat the diversity of resistance mechanisms, there is a demand to diversify the target repertoire. In pre-B-ALL, we identified promising surface protein candidates, prioritizing targets with consistently high levels within the given disease context, and low levels in normal blood cell types. Over 75% of the highly ranked candidates were confirmed using proteomics (51), and additional literature confirmation was found for five candidates. Moreover, we validated DPEP1 as a potential surface target in pre-B-ALL by IHC staining of diagnostic bone marrow biopsies. Positive staining was found in each subtype for majority of cases, except in MLL where both the Hemap gene expression data and protein staining indicated low or undetectable levels. The validation cohort consisted of pediatric cases, while Hemap analysis included also adult samples. DPEP1 is a zinc-dependent metalloproteinase that is expressed aberrantly in several cancers, and has been implicated as a potential therapeutic target in colon and pancreatic cancers (64, 65). In future, increased availability of protein-level data from different hematologic malignancies will allow confirming additional targets.

In conclusion, the interactive Hemap resource facilitates comparative analyses across multiple hematologic malignancies. We envision that the mechanistic insight gained by concomitant identification of molecular subtypes, genetic aberrations, and derailed cellular pathways will expedite therapeutic innovations and clinical utility.

No potential conflicts of interest were disclosed.

Conception and design: P. Pölönen, K. Porkka, M. Nykter, M. Heinäniemi

Development of methodology: P. Pölönen, J. Mehtonen, J. Lin, S. Häyrynen, M. Nykter

Acquisition of data (provided animals, acquired and managed patients, provided facilities, etc.): A. Mäkinen, D. Malani, C.A. Heckman, O. Lohi

Analysis and interpretation of data (e.g., statistical analysis, biostatistics, computational analysis): P. Pölönen, J. Mehtonen, J. Lin, T. Liuksiala, S. Häyrynen, A. Mäkinen, A. Kumar, V. Pohjolainen, K. Porkka, K.J. Granberg, O. Lohi, M. Nykter, M. Heinäniemi

Writing, review, and/or revision of the manuscript: P. Pölönen, J. Mehtonen, J. Lin, S. Häyrynen, S. Teppo, A. Mäkinen, K. Porkka, C.A. Heckman, P. May, K.J. Granberg, O. Lohi, M. Nykter, M. Heinäniemi

Administrative, technical, or material support (i.e., reporting or organizing data, constructing databases): S. Teppo, A. Mäkinen, P. May, K.J. Granberg

Study supervision: V. Hautamäki, K.J. Granberg, O. Lohi, M. Nykter, M. Heinäniemi

We would like to thank the Finnish Pharmaceutical Information Centre Ltd for information provided regarding approved drugs in hematologic malignancies, Dr. Aik Choon Tan for the advice with drug signature gene sets, staff of the FIMM Technology Center High-Throughput Biomedicine Unit and Sequencing Lab, members of the T. Aittokallio, C.A. Heckman, O. Kallioniemi, S. Mustjoki, K. Porkka, and K. Wennerberg groups for their help to analyze drug response profiles, and Prof. I. Shmulevich, Dr. Sheila Reynolds, and Roger Kramer for assistance with feature matrix analysis. The work was supported by grants from the Academy of Finland (project no. 312043 to M. Nykter; 310829 to M. Nykter; 259038 to K.J. Granberg; 276634 to M. Heinäniemi 277816 to O. Lohi), The Finnish Cultural Foundation (Interdisciplinary Science Workshops, to M. Heinäniemi), Sigrid Juselius Foundation (to M. Nykter), and Cancer Society of Finland (to M. Nykter, M. Heinäniemi, O. Lohi, K.J. Granberg), Paulo Foundation (to O. Lohi), Foundation for Pediatric Research (to O. Lohi), Jane and Aatos Erkko Foundation (to O. Lohi), Nokia Foundation (to V. Hautamäki), and University of Eastern Finland (to M. Heinäniemi). We thank CSC–IT Center for Science and UEF bioinformatics center for providing computational resources.

The costs of publication of this article were defrayed in part by the payment of page charges. This article must therefore be hereby marked advertisement in accordance with 18 U.S.C. Section 1734 solely to indicate this fact.

1.
Orr
MS
,
Scherf
U
. 
Large-scale gene expression analysis in molecular target discovery
.
Leukemia
2002
;
16
:
473
7
.
2.
Ylipää
A
,
Yli-Harja
O
,
Zhang
W
,
Nykter
M
. 
Characterization of aberrant pathways across human cancers
.
BMC Syst Biol
2013
;
7
:
S1
.
3.
Kumar
SK
,
Rajkumar
V
,
Kyle
RA
,
van Duin
M
,
Sonneveld
P
,
Mateos
MV
, et al
Multiple myeloma
.
Nat Rev Dis Primers
2017
;
3
:
17046
.
4.
Nangalia
J
,
Green
AR
. 
Myeloproliferative neoplasms: from origins to outcomes
.
Blood
2017
;
130
:
2475
83
.
5.
June
CH
,
O'Connor
RS
,
Kawalekar
OU
,
Ghassemi
S
,
Milone
MC
. 
CAR T cell immunotherapy for human cancer
.
Science
2018
;
359
:
1361
65
.
6.
McCabe
B
,
Liberante
F
,
Mills
KI
. 
Repurposing medicinal compounds for blood cancer treatment
.
Ann Hematol
2015
;
94
:
1267
76
.
7.
Corces-Zimmerman
MR
,
Hong
WJ
,
Weissman
IL
,
Medeiros
BC
,
Majeti
R
. 
Preleukemic mutations in human acute myeloid leukemia affect epigenetic regulators and persist in remission
.
Proc Natl Acad Sci U S A
2014
;
111
:
2548
53
.
8.
Iorio
F
,
Knijnenburg
TA
,
Vis
DJ
,
Bignell
GR
,
Menden
MP
,
Schubert
M
, et al
A landscape of pharmacogenomic interactions in cancer
.
Cell
2016
;
166
:
740
54
.
9.
Edgar
R
,
Domrachev
M
,
Lash
AE
. 
Gene Expression Omnibus: NCBI gene expression and hybridization array data repository
.
Nucleic Acids Res
2002
;
30
:
207
10
.
10.
Irizarry
RA
,
Bolstad
BM
,
Collin
F
,
Cope
LM
,
Hobbs
B
,
Speed
TP
. 
Summaries of Affymetrix GeneChip probe level data
.
Nucleic Acids Res
2003
;
31
:
e15
.
11.
Eklund
AC
,
Szallasi
Z
. 
Correction of technical bias in clinical microarray data improves concordance with known biological information
.
Genome Biol
2008
;
9
:
R26
.
12.
Tyner
JW
,
Tognon
CE
,
Bottomly
D
,
Wilmot
B
,
Kurtz
SE
,
Savage
SL
, et al
Functional genomic landscape of acute myeloid leukaemia
.
Nature
2018
;
562
:
526
31
.
13.
Lawrence
ND
. 
Gaussian process latent variable models for visualization of high dimensional data
.
Adv Neural Inf Process Syst
2004
;
16.3
:
329
36
.
14.
Roweis
ST
,
Lawrence
KS
. 
Nonlinear dimensionality reduction by locally linear embedding
.
Science
2000
;
290
:
2323
6
.
15.
Hotelling
H
. 
Analysis of a complex of statistical variables into principal components
.
J Educ Psychol
1933
;
24
:
417
.
16.
Tipping
ME
,
Bishop
CM
. 
Probabilistic principal component analysis
.
J Roy Stat Soc B
1999
;
61
:
611
22
.
17.
Sammon
JW
. 
A nonlinear mapping for data structure analysis
.
IEEE T Comput
1969
;
5
:
401
9
.
18.
van der Maaten
L
,
Hinton
G
. 
Visualizing data using t-SNE
.
J Mach Learn Res
2008
;
9
:
2579
605
.
19.
van der Maaten
L
. 
Accelerating t-SNE using Tree-Based Algorithms
.
J Mach Learn Res
2014
;
15
:
1
21
.
20.
Mehtonen
J
,
Pölönen
P
,
Häyrynen
S
,
Dufva
O
,
Lin
J
,
Liuksiala
T
, et al
Data-driven characterization of molecular phenotypes across heterogenous sample collections
.
Nucleic Acids Research, gkz281
. .
21.
Cancer Genome Atlas Research Network
. 
Comprehensive molecular characterization of gastric adenocarcinoma
.
Nature
2014
;
513
:
202
9
.
22.
Subramanian
A
,
Tamayo
P
,
Mootha
VK
,
Mukherjee
S
,
Ebert
BL
,
Gillette
MA
, et al
Gene set enrichment analysis: a knowledge-based approach for interpreting genome-wide expression profiles
.
Proc Natl Acad Sci U S A
2005
;
102
:
15545
50
.
23.
Kutmon
M
,
Riutta
A
,
Nunes
N
,
Hanspers
K
,
Willighagen
EL
,
Bohler
A
, et al
WikiPathways: capturing the full diversity of pathway knowledge
.
Nucleic Acids Res
2015
;
44
:
488
94
.
24.
Duarte
NC
,
Becker
SA
,
Jamshidi
N
,
Thiele
I
,
Mo
ML
,
Vo
TD
, et al
Global reconstruction of the human metabolic network based on genomic and bibliomic data
.
Proc Natl Acad Sci U S A
2007
;
104
:
1777
82
.
25.
Cerami
EG
,
Gross
BE
,
Demir
E
,
Rodchenkov
I
,
Babur
O
,
Anwar
N
, et al
Pathway Commons, a web resource for biological pathway data
.
Nucleic Acids Res
2011
;
39
:
685
90
.
26.
Yoo
M
,
Shin
J
,
Kim
J
,
Ryall
KA
,
Lee
K
,
Lee
S
, et al
DSigDB: drug signatures database for gene set analysis
.
Bioinformatics
2015
;
31
:
3069
71
.
27.
Hänzelmann
S
,
Castelo
R
,
Guinney
J
. 
GSVA: gene set variation analysis for microarray and RNA-Seq data
.
BMC Bioinformatics
2013
;
14
:
7
.
28.
Bento
AP
,
Gaulton
A
,
Hersey
A
,
Bellis
LJ
,
Chambers
J
,
Davies
M
, et al
The ChEMBL bioactivity database: an update
.
Nucleic Acids Res
2014
;
42
:
D1083
90
.
29.
Yang
H
,
Qin
C
,
Li
YH
,
Tao
L
,
Zhou
J
,
Yu
CY
, et al
Therapeutic target database update 2016: enriched resource for bench to clinical drug target and targeted pathway information
.
Nucleic Acids Res
2016
;
44
:
D1069
74
.
30.
Cotto
KC
,
Wagner
AH
,
Feng
YY
,
Kiwala
S
,
Coffman
AC
,
Spies
G
, et al
DGIdb 3.0: a redesign and expansion of the drug–gene interaction database
.
Nucleic Acids Res
2018
;
46
:
D1068
73
.
31.
Santos
R
,
Ursu
O
,
Gaulton
A
,
Bento
AP
,
Donadi
RS
,
Bologa
CG
, et al
A comprehensive map of molecular drug targets
.
Nat Rev Drug Discov
2017
;
16
:
19
34
.
32.
Bausch-Fluck
D
,
Hofmann
A
,
Bock
T
,
Frei
AP
,
Cerciello
F
,
Jacobs
A
, et al
A mass spectrometric-derived cell surface protein atlas
.
PLoS One
2015
;
10
:
e0121314
.
33.
Frismantas
V
,
Dobay
MP
,
Rinaldi
A
,
Tchinda
J
,
Dunn
SH
,
Kunz
J
, et al
Ex vivo drug response profiling detects recurrent sensitivity patterns in drug-resistant acute lymphoblastic leukemia
.
Blood
2017
;
129
:
e26
37
.
34.
Friedman
J
,
Hastie
T
,
Tibshirani
R
. 
Regularization paths for generalized linear models via coordinate descent
.
J Stat Softw
2010
;
33
:
1
22
.
35.
Kuhn
M
. 
Building predictive models in R using the caret package
.
J Stat Softw
2008
;
28
:
1
26
.
36.
Yadav
B
,
Pemovska
T
,
Szwajda
A
,
Kulesskiy
E
,
Kontro
M
,
Karjalainen
R
, et al
Quantitative scoring of differential drug sensitivity for individually optimized anticancer therapies
.
Sci Rep
2014
;
4
:
5193
.
37.
Bhanot
H
,
Weisberg
EL
,
Reddy
MM
,
Nonami
A
,
Neuberg
D
,
Stone
RM
, et al
Acute myeloid leukemia cells require 6-phosphogluconate dehydrogenase for cell growth and NADPH-dependent metabolic reprogramming
.
Oncotarget
2017
;
8
:
67639
50
.
38.
Mizuno
H
,
Kagoya
Y
,
Koya
J
,
Masamoto
Y
,
Kurokawa
M
. 
Activated pentose phosphate pathway mediated by Fbp-1 upregulation supports progression of acute myeloid leukemia with high EVI-1 expression
.
Blood
2018
;
132
:
757
.
39.
Drillenburg
P
,
Pals
ST
. 
Cell adhesion receptors in lymphoma dissemination
.
Blood
2000
;
95
:
1900
10
.
40.
Mittermayr
S
,
Leê
GN
,
Clarke
C
,
Millán Martín
S
,
Larkin
AM
,
O'Gorman
P
, et al
Polyclonal immunoglobulin GN-glycosylation in the pathogenesis of plasma cell disorders
.
J Proteome Res
2016
;
16
:
748
62
.
41.
Pang
X
,
Li
H
,
Guan
F
,
Li
X
. 
Multiple roles of glycans in hematological malignancies
.
Front Oncol
2018
;
8
:
364
.
42.
Diakos
C
,
Xiao
Y
,
Zheng
S
,
Kager
L
,
Dworzak
M
,
Wiemels
JL
. 
Direct and indirect targets of the E2A-PBX1 leukemia-specific fusion protein
.
PLoS One
2014
;
9
:
e87602
.
43.
Karvonen
H
,
Perttilä
R
,
Niininen
W
,
Hautanen
V
,
Barker
H
,
Murumägi
A
, et al
Wnt5a and ROR1 activate non-canonical Wnt signaling via RhoA in TCF3-PBX1 acute lymphoblastic leukemia and highlight new treatment strategies via Bcl-2 co-targeting
.
Oncogene
2019
Jan 10 [Epub ahead of print]
.
44.
Huether
R
,
Dong
L
,
Chen
X
,
Wu
G
,
Parker
M
,
Wei
L
, et al
The landscape of somatic mutations in epigenetic regulators across 1,000 paediatric cancer genomes
.
Nat Commun
2014
;
5
:
3630
.
45.
Ramsay
AJ
,
Martínez-Trillos
A
,
Jares
P
,
Rodríguez
D
,
Kwarciak
A
,
Quesada
V
. 
Next-generation sequencing reveals the secrets of the chronic lymphocytic leukemia genome
.
Clin Transl Oncol
2013
;
15
:
3
8
.
46.
Rendeiro
AF
,
Schmidl
C
,
Strefford
JC
,
Walewska
R
,
Davis
Z
,
Farlik
M
, et al
Chromatin accessibility maps of chronic lymphocytic leukaemia identify subtype-specific epigenome signatures and transcription regulatory networks
.
Nat Commun
2016
;
7
:
11938
.
47.
Robak
T
,
Blonski
JZ
,
Robak
P
. 
Antibody therapy alone and in combination with targeted drugs in chronic lymphocytic leukemia
.
Semin Oncol
2016
;
43
:
280
90
.
48.
Maude
SL
,
Frey
N
,
Shaw
PA
,
Aplenc
R
,
Barrett
DM
,
Bunin
NJ
, et al
Chimeric antigen receptor T cells for sustained remissions in leukemia
.
N Engl J Med
2014
;
371
:
1507
17
.
49.
Jacoby
E
,
Nguyen
SM
,
Fountaine
TJ
,
Welp
K
,
Gryder
B
,
Qin
H
, et al
CD19 CAR immune pressure induces B-precursor acute lymphoblastic leukaemia lineage switch exposing inherent leukaemic plasticity
.
Nat Commun
2016
;
7
:
12320
.
50.
Laszlo
GS
,
Estey
EH
,
Walter
RB
. 
The past and future of CD33 as therapeutic target in acute myeloid leukemia
.
Blood Rev
2014
;
28
:
143
53
.
51.
Mirkowska
P
,
Hofmann
A
,
Sedek
L
,
Slamova
L
,
Mejstrikova
E
,
Szczepanski
T
, et al
Leukemia surfaceome analysis reveals new disease-associated features
.
Blood
2013
;
121
:
e149
59
.
52.
Nelson
MR
,
Tipney
H
,
Painter
JL
,
Shen
J
,
Nicoletti
P
,
Shen
Y
, et al
The support of human genetic evidence for approved drug indications
.
Nat Genet
2015
;
47
:
856
60
.
53.
Amir
el-AD
,
Davis
KL
,
Tadmor
MD
,
Simonds
EF
,
Levine
JH
,
Bendall
SC
, et al
viSNE enables visualization of high dimensional single-cell data and reveals phenotypic heterogeneity of leukemia
.
Nat Biotechnol
2013
;
31
:
545
52
.
54.
Shekhar
K
,
Brodin
P
,
Davis
MM
,
Chakraborty
AK
. 
Automatic classification of cellular expression by nonlinear stochastic embedding (ACCENSE)
.
Proc Natl Acad Sci U S A
2014
;
111
:
202
7
.
55.
Smith
EN
,
Ghia
EM
,
DeBoever
CM
,
Rassenti
LZ
,
Jepsen
K
,
Yoon
KA
, et al
Genetic and epigenetic profiling of CLL disease progression reveals limited somatic evolution and suggests a relationship to memory-cell development
.
Blood Cancer J
2015
;
5
:
e303
.
56.
Anderson
MA
,
Huang
D
,
Roberts
A
. 
Targeting BCL2 for the treatment of lymphoid malignancies
.
Semin Hematol
2014
;
51
:
219
27
.
57.
Roberts
AW
,
Davids
MS
,
Pagel
JM
,
Kahl
BS
,
Puvvada
SD
,
Gerecitano
JF
, et al
Targeting BCL2 with venetoclax in relapsed chronic lymphocytic leukemia
.
N Engl J Med
2016
;
374
:
311
22
.
58.
Benito
JM
,
Godfrey
L
,
Kojima
K
,
Hogdal
L
,
Wunderlich
M
,
Geng
H
, et al
MLL-rearranged acute lymphoblastic leukemias activate BCL-2 through H3K79 methylation and are sensitive to the BCL-2-specific antagonist ABT-199
.
Cell Rep
2015
;
13
:
2715
27
.
59.
Andreeff
M
,
Kelly
KR
,
Yee
K
,
Assouline
S
,
Strair
R
,
Popplewell
L
, et al
Results of the phase I trial of RG7112, a small-molecule MDM2 antagonist in leukemia
.
Clin Cancer Res
2015
;
22
:
868
76
.
60.
Kaindl
U
,
Morak
M
,
Portsmouth
C
,
Mecklenbräuker
A
,
Kauer
M
,
Zeginigg
M
, et al
Blocking ETV6/RUNX1-induced MDM2 overexpression by Nutlin-3 reactivates p53 signaling in childhood leukemia
.
Leukemia
2014
;
28
:
600
8
.
61.
Laukkanen
S
,
Grönroos
T
,
Pölönen
P
,
Kuusanmäki
H
,
Mehtonen
J
,
Cloos
J
, et al
In silico and preclinical drug screening identifies dasatinib as a targeted therapy for T-ALL
.
Blood Cancer J
2017
;
7
:
e604
.
62.
Boyault
C
,
Sadoul
K
,
Pabion
M
,
Khochbin
S
. 
HDAC6, at the crossroads between cytoskeleton and cell signaling by acetylation and ubiquitination
.
Oncogene
2007
;
26
:
5468
76
.
63.
Yang
C
,
Boyson
CA
,
Di Liberto
M
,
Huang
X
,
Hannah
J
,
Dorn
DC
, et al
CDK4/6 inhibitor PD 0332991 sensitizes acute myeloid leukemia to cytarabine-mediated cytotoxicity
.
Cancer Res
2015
;
75
:
1838
45
.
64.
Zhang
G
,
Schetter
A
,
He
P
,
Funamizu
N
,
Gaedcke
J
,
Ghadimi
BM
, et al
DPEP1 inhibits tumor cell invasiveness, enhances chemosensitivity and predicts clinical outcome in pancreatic ductal adenocarcinoma
.
PLoS One
2012
;
7
:
e31507
.
65.
Eisenach
PA
,
Soeth
E
,
Röder
C
,
Klöppel
G
,
Tepel
J
,
Kalthoff
H
, et al
Dipeptidase 1 (DPEP1) is a marker for the transition from low-grade to high-grade intraepithelial neoplasia and an adverse prognostic factor in colorectal cancer
.
Br J Cancer
2013
;
109
:
694
703
.