Background:

Transcriptome studies are gaining momentum in genomic epidemiology, and the need to incorporate these data in multivariable models alongside other risk factors brings demands for new approaches.

Methods:

Here we describe SPECTRA, an approach to derive quantitative variables that capture the intrinsic variation in gene expression of a tissue type. We applied the SPECTRA approach to bulk RNA sequencing from malignant cells (CD138+) in patients from the Multiple Myeloma Research Foundation CoMMpass study.

Results:

A set of 39 spectra variables were derived to represent multiple myeloma cells. We used these variables in predictive modeling to determine spectra-based risk scores for overall survival, progression-free survival, and time to treatment failure. Risk scores added predictive value beyond known clinical and expression risk factors and replicated in an external dataset. Spectrum variable S5, a significant predictor for all three outcomes, showed pre-ranked gene set enrichment for the unfolded protein response, a mechanism targeted by proteasome inhibitors which are a common first line agent in multiple myeloma treatment. We further used the 39 spectra variables in descriptive modeling, with significant associations found with tumor cytogenetics, race, gender, and age at diagnosis; factors known to influence multiple myeloma incidence or progression.

Conclusions:

Quantitative variables from the SPECTRA approach can predict clinical outcomes in multiple myeloma and provide a new avenue for insight into tumor differences by demographic groups.

Impact:

The SPECTRA approach provides a set of quantitative phenotypes that deeply profile a tissue and allows for more comprehensive modeling of gene expression with other risk factors.

Transcriptomes represent the combined effects of inherited, somatic, and epigenetic variation and can provide insight into genetic and environmental risk factors. Gene expression studies are gaining momentum in genomic epidemiology (1–5). The need to incorporate transcriptome data in multivariable models with other risk factors, and for improved representation of disease complexity for insight into effects of multiple risk factors, brings new demands for approaches to describe transcriptomes.

Many current approaches are focused on immediate biological interpretation and constrained to biological expectations or reduce the rich and complex data to a single categorical variable (e.g., intrinsic subtypes). Although these have advanced knowledge of disease mechanisms (6–8) and identified important high-level differences in disease (9–11), complementary approaches are needed to go deeper and increase flexibility and application. In common disease, the sources of heterogeneity are many, complex, and often poorly understood. Latent variables, focused on capturing signal and not immediate interpretability, provide the potential for new discoveries. Transcriptome characterization using multiple intrinsic quantitative variables has the potential to provide a deeper and more informative dive into the transcriptome.

Principal component analysis (PCA) is a classic approach to capture variance in high-dimensional data. Our previous work with well-curated gene-panel expression data illustrated that PCA holds substantial promise to provide informative expression variables (12). In Madsen and colleagues, we used a population-based dataset of breast tumors and PCA to reduce the 50-gene space of the PAM50 gene panel to five quantitative multi-gene expression variables (12). When implemented as predictor variables in an independent clinical trial dataset, PAM50 PCA variables were able to predict prognosis and response to paclitaxel, adding value beyond clinical risk factors and outperforming intrinsic subtypes (13). When utilized as quantitative tumor phenotypes, PAM50 PCA variables were superior to the standard PAM50 subtypes for gene mapping (12, 14).

The timely extension to whole transcriptomes described here requires careful data curation or spurious variance will be incorporated in the derived variables. We address this limitation through stringent data culling, quality control, batch correction, zero-handling, and normalization. Here, we describe our approach, called SPECTRA, to derive a quantitative data framework for whole transcriptomes.

We illustrate our whole transcriptome SPECTRA approach using bulk RNA-sequencing (RNA-seq) data from CD138+ sorted myeloma cells from the Multiple Myeloma Research Foundation (MMRF) CoMMpass Study (15). Multiple myeloma is a malignancy of the plasma cells with one of the poorest 5-year survival (55.6%) for adult-onset hematologic malignancies (https://seer.cancer.gov/statfacts/html/mulmy.html). It is most frequently diagnosed at ages 65 to 74 years (median 69 years). Incidence is higher in men (8.7 men vs. 5.6 women per 100,000) and particularly high in patients self-reporting as African American (AA men 16.3 and AA women 11.9 per 100,000).

We use the SPECTRA approach to derive quantitative CD138+ transcriptome variables, hereafter referred to as spectra, which we use in predictive and descriptive regression models to predict clinical endpoints, derive latent risk groups, compare with established expression-based molecular risk scores, and describe associations with clinical and demographic characteristics.

SPECTRA approach

The motivation is the derivation of informative quantitative phenotypes from RNA-seq data that can be used as independent variables for a variety of outcomes (Supplementary Materials and Methods; Supplementary Fig. S1). As an agnostic technique seeking to identify intrinsic characteristics of a tissue, the goal is to retain only those aspects of the RNA-seq data that can represent meaningful variance. Accordingly, the SPECTRA approach encompasses stringent quality control, normalization, and batch correction prior to derivation and dimensions reduction using PCA (Fig. 1; Supplementary Materials and Methods). Genes likely to lack precision are removed and only coding genes with sufficient coverage across the dataset are considered (Supplementary Materials and Methods). Internal (single sample) normalization accounts for gene length, sequencing depth (library size), and RNA composition (Supplementary Materials and Methods). This normalization avoids the need for reference samples, real or synthetic, and provides the potential for spectra to be ported to follow-up samples and external datasets. Skew and outliers are accounted for (Supplementary Materials and Methods). Using the normalized data, we then adjust for sequence batch using ComBat (16), with covariates included for patient characteristics that are unbalanced by batch (Supplementary Materials and Methods). Now that spurious variation has been removed, we can derive SPECTRA variables using PCA to describe intrinsic gene expression phenotypes. PCA is implemented with the covariance matrix—the expression values are centered on the mean across individuals for each gene (Supplementary Materials and Methods). Eigenvectors contain the gene loadings that define the linear transformation used for each spectrum. A corresponding eigenvalue, |$\lambda $|⁠, indicates the proportion of the global variance represented by the transformed value for an eigenvector. We use a scree test (inflection point of the rank-ordered plot of the |$\lambda $|⁠, or elbow method; ref. 17) to select the |$k$| components to retain. The proportion of variance explained by this |$k$|-dimensional space ( |$\mathop \sum \nolimits_{i\ = \ 1}^k {\lambda }_i/\mathop \sum \nolimits_{\forall i} {\lambda }_i)$| indicates the depth of the dive into the transcriptome data. Spectra variables are the standardized retained components (S1, …, Sk).

Figure 1.

Overview of SPECTRA framework to derive spectra variables and example applications using spectra variables.

Figure 1.

Overview of SPECTRA framework to derive spectra variables and example applications using spectra variables.

Close modal

Spectra variables are orthogonal, each subsequently going deeper into the global variance, providing additional coverage of the transcriptome. The PCA rotation matrix describes the multigene linear transformations that produce the quantitative SPECTRA variables, or spectra, that together provide a new data framework for the expression space. As unsupervised variables, they are intrinsic and can be used as predictors for any outcome and as novel phenotypes. The PCA rotation matrix can be applied to the original data to produce spectra values for the derivation samples (transformed data matrix), to subsequent samples to identify changes over time, and to external data for replication.

Myeloma CD138+ spectra

Data from the MMRF CoMMpass Study (release IA14; ref. 15) were downloaded from the MMRF web portal (https://research.themmrf.org/). Clinical data and CD138+ RNA-seq were available for 768 patients prior to treatment at study entry (baseline) and 119 follow-up bone marrow samples. Transcript-based expression estimates processed by Salmon (version 0.7.2) were used. We applied the SPECTRA approach on baseline samples (n = 768) to derive a framework of SPECTRA variables for CD138+ expression. Although not used for spectra derivation, follow-up samples were batch-corrected alongside the baseline samples. Covariates included in batch correction were age, gender, overall survival (OS), progression-free survival (PFS), time to first-line treatment failure(TTF), and treatment status. The first 39 components were selected based on the screen test (17) and standardized to produce spectra variables S1, …, S39. Spectra values for all CoMMpass samples are provided in Supplementary Materials and Methods (see Data and Code Availability).

Descriptive modeling of clinical and demographic factors

We used descriptive modeling to illustrate associations between the CD138+ spectra and clinical and demographic factors with elevated myeloma risk: tumor aberrations [high risk del(17p), t(14;16), gain 1q, and t(4;14), and standard risk t(11;14); ref. 18], age, gender, race. Linear or logistic regression was used with all 39 CD138+ spectra entered in the model as independent variables. Spectra were noted as significant if P < 0.05.

Predictive modeling of clinical outcomes

Predictive modeling for OS, PFS, and TTF used Cox proportional hazards regression, implemented in the survival package in R (19, 20). An overview of predictive modeling is provided in Supplementary Fig. S2. For each model, we used single step selection for retention of predictor variables. All 39 spectra were entered in the model initially, but only retained if significant (P < 0.05). The linear predictor from the reduced model was used to define a quantitative spectra risk score for the outcome (i.e., the weighted sum of the spectra retained in the model based on their coefficients: |$\mathop \sum \nolimits_j {\beta }_j{S}_j$|⁠).

To determine if a quantitative spectra risk score held evidence of separate risk groups, the risk score distribution and normal Quantile-Quantile plots were inspected. The mclust R package was used to statistically assess evidence for separate groups using density estimation by Gaussian finite Mixture Modeling (GMM; refs. 21, 22) assuming equal variances. Bayesian information content (BIC) and bootstrap likelihood ratio tests (LRT) were used to determine the number of groups (mclustBIC, mclustBootstrapLRT). The HR between risk groups was calculated using relative event rates (23) from survdiff in the survival package in R (19).

Internal validation

To address overfitting and lack of a replication sample, bootstrap internal validation was used (24). Bootstrap internal validation involves replicating the entire modeling process on bootstrap resamples of the data to determine and adjust for overfitting. Classical bootstrapping was used, with resampling to the original data size in each permutation. Here, the process includes single step variable selection, GMM density estimation for risk group designation, and adjustment of HRs, C-index, and bootstrap confidence limits (25). Initial risk estimates from a model are referred to as “apparent.” The overfitting metric is called “optimism.” The optimism-corrected estimates are referred to as “adjusted” values and are a more reasonable assessment of effect size. Adjusted confidence intervals that do not contain 1.0 indicate validation.

Added predictive value

We calculated the added predictive value (APV; refs. 24, 26, 27) of each quantitative risk score beyond that predicted from established clinical predictors. The APV compares the LRT statistic for a model including established clinical predictors to a model which also includes each expression risk score. With larger independent predictive contributions, APV increases, to a maximum of 1.0.

Comparison to the UAMS risk score

We compared spectra-based risk to the most widely adopted supervised expression risk score in myeloma, from the University of Arkansas for Medical Sciences (UAMS; ref. 11). The UAMS risk score was developed in microarray data and tested 54,657 probes for association with disease-related survival (11). A total of 68 genes were selected (18 under and 51 overexpressed prognostic genes). The UAMS risk score is the ratio of mean expression of the upregulated to downregulated genes, with k-means clustering to determine a cutoff for “high-risk” classification (11). We calculated each patient's UAMS risk score and their risk status (low or high) using the CoMMpass RNA-seq data.

Replication

We used the microarray gene expression data from Shaughnessy and colleagues 2007 (NCBI GEO accession no.: GSE2658) for external replication of the OS results. We retrieved the expression and survival data for the 559 patients in this dataset using the GEOquery R package (28). To calculate the 39 established spectra variables in the GSE2658 data, we mapped probesets from the microarray data to genes in the CoMMpass spectra rotation matrix by gene name which resulted in expression values for 7,366 of the 7,448 unique gene names. Genes that were missing from GSE2658 and NA values were replaced with zeros. Because probesets do not directly correspond to genes or transcripts, we had to adapt our normalization approach slightly. We summed all expression values for all probesets for each gene, then normalized that sum as the expression for that gene as described in the Supplementary Materials and Methods using the gene length calculated from the CoMMpass annotation file. The normalized expression values were then centered using the gene means from the GSE2658 data set and matrix multiplied by the CoMMpass spectra rotation matrix. We used the spectra gene weights derived from CoMMpass to calculate spectra in GSE2658. The resulting spectra scores for GSE2658 were scaled by the standard deviations from the CoMMpass spectra. We used the coefficients from the CoMMpass OS Cox model to assign an OS spectra-based risk score to each patient in GSE2658 and we used the threshold from the CoMMpass GMM to separate the patients into high- and low-risk groups. The UAMS risk score was calculated based on the gene expression values from GSE2658. All HRs were calculated as described above.

Gene set enrichment analysis

Each spectra variable is a linear transformation based on gene weights from its eigenvector. Gene weights can be positive or negative and order the importance and direction of effect of genes in a spectrum, and thus are an ideal ranking metric for gene enrichment. Pre-ranked gene set enrichment analysis (GSEA; RRID:SCR_003199) for Hallmark pathways was performed using fast-GSEA (bioRxiv 2021.10.1101.060012; ref.29) with ranking based on spectra genes weights.

CD138+ spectra in follow-up samples

To illustrate the potential to track changes over time, OS spectra risk scores were calculated in longitudinal follow-up samples. For these samples, normalized and batch-corrected gene expression values were centered to the derivation set and spectra values (Sj) determined using the previously established linear transformations and scaling. The OS risk score was calculated using the established prediction model ( |$\mathop \sum \nolimits_j {\beta }_j{S}_j$|⁠).

Data availability

Processed RNA-seq data from the CoMMpass Study can be downloaded from https://research.themmrf.org/. Code used to derive the CD138+ transcriptome spectra and generate the myeloma results is freely available on GitHub: https://github.com/njcamp-lab/MM_spectra. Details of the QC process and the transcriptome framework (linear equations for the gene transformations) necessary to calculate the 39-spectra variables in other studies are also provided. Spectra for individuals in the IA14 CoMMpass data can be downloaded from https://github.com/njcamp-lab/MM_spectra/tree/main/SpectraData.

Thirty-nine spectra describe the gene expression landscape of multiple myeloma tumors

We applied the SPECTRA approach to RNA-seq from treatment naïve CD138+ cells collected at diagnosis from 768 patients in the MMRF CoMMpass study (15). After quality control, gene expression values for 7,449 genes (56,339 transcripts) in 767 patients were normalized and batch corrected. After PCA, the first 39 components were selected based on a screen test (Supplementary Fig. S3) and standardized to create spectra S1 to S39. This 39-dimension variable framework explains 65% of the global expression variation (Fig. 2A). As linearly uncorrelated variables each of the 39 CD138+ spectra capture a different source of variance with potential to describe patient differences (Fig. 2B).

Figure 2.

Spectra barcodes in 4 CoMMpass patients. A, Percent of transcriptome-wide variance between samples captured by each spectrum. Together the spectra capture 65% of the transcriptome-wide variance. B, For each patient, all 39 spectra are illustrated with the value represented by the bar intensity. The color indicates if the patient's spectra value is positive (blue) or negative (green). Each patient has a unique profile across the 39 spectra. At a high-level patients 2497 and 1854 may appear most similar (mostly green) and 2394 and 1392 (mostly blue). However, at a finer resolution, similarities vary. For example, for spectrum S1 and S2, patients 2497 and 1854 are quite different.

Figure 2.

Spectra barcodes in 4 CoMMpass patients. A, Percent of transcriptome-wide variance between samples captured by each spectrum. Together the spectra capture 65% of the transcriptome-wide variance. B, For each patient, all 39 spectra are illustrated with the value represented by the bar intensity. The color indicates if the patient's spectra value is positive (blue) or negative (green). Each patient has a unique profile across the 39 spectra. At a high-level patients 2497 and 1854 may appear most similar (mostly green) and 2394 and 1392 (mostly blue). However, at a finer resolution, similarities vary. For example, for spectrum S1 and S2, patients 2497 and 1854 are quite different.

Close modal

Individual spectra associate with multiple myeloma demographic characteristics and tumor aberrations

Figure 3 illustrates associations between the CD138+ spectra and tumor aberrations or demographic groups with elevated multiple myeloma risk in multivariate descriptive models. A single full 39 spectra model was fit for each characteristic; only significant spectra in a model are shown in Fig. 3. In logistic regression models, each tumor aberration showed different significantly associated spectra, with some spectra unique to only one aberration. Linear regression with age at diagnosis as a quantitative outcome highlighted associations with thirteen spectra. Seven spectra were significantly associated with gender and 13 spectra significant with race (self-reported black or white; other racial categories too small to consider).

Figure 3.

Overview of CD138+ spectra associations. Multivariable modeling results for demographic characteristics, tumor FISH aberrations, and outcomes. All 39 spectra (x-axis) were included as dependent variables for each independent variable (y-axis). Linear regression model for age (in years). Logistic regression models for gender (male as baseline), race (white as baseline), chromosomal aberrations (lack of aberration as baseline). Cox proportional regression models for clinical outcomes (in years). The significance level (intensity) and direction of each spectrum are indicated: green negative β coefficient, purple positive β coefficient.

Figure 3.

Overview of CD138+ spectra associations. Multivariable modeling results for demographic characteristics, tumor FISH aberrations, and outcomes. All 39 spectra (x-axis) were included as dependent variables for each independent variable (y-axis). Linear regression model for age (in years). Logistic regression models for gender (male as baseline), race (white as baseline), chromosomal aberrations (lack of aberration as baseline). Cox proportional regression models for clinical outcomes (in years). The significance level (intensity) and direction of each spectrum are indicated: green negative β coefficient, purple positive β coefficient.

Close modal

A nine spectra model predicts multiple myeloma OS

In the CoMMpass data, nine spectra significantly predicted OS in multivariate Cox regression (apparent C-index = 0.70; Supplementary Table S1). Internal validation (1,000 bootstraps) estimated C-optimism = 0.04, leading to an adjusted C-index (Cadj) of 0.66 (0.65–0.74). Gaussian mixture modeling on the OS spectra risk scores identified two risk groups (P = 0.001, Fig. 4A) with 88 patients in the high-risk group (58 events, median OS 20.1 months) and 676 patients in the low-risk group (121 events, median OS not reached after 78 months, Fig. 4B). After internal validation, the optimism-adjusted hazard ratio (HRadj) for the high- compared with low-risk patients was 4.27 (2.31–12.69; Table 1). For comparison, in the CoMMpass data UAMS OS HRadj = 3.89 (2.53–5.45; Table 1).

Figure 4.

OS spectra predictive modeling. CoMMpass derivation data. A, Gaussian mixture modeling identified two distributions in the spectra score. Each patient was assigned into an OS risk group based on distribution probability. The blue bars on the x-axis represent patients in the low-risk group and the orange bars are patients in the high-risk group. B, Kaplan–Meier curve of the high-/low-risk groups. High-risk patients (n = 88, 58 events) had median survival of 20.1 months. Low-risk patients (n = 679, 121 events) did not reach median survival in the study timeframe. In the GSE2658 replication data: C, Two risk groups were identified using the OS spectra scores. D, High-risk patients (n = 43, 20 events) had median survival of 20.1 months. Low-risk patients (n = 516, 80 events) did not reach median survival in the study timeframe.

Figure 4.

OS spectra predictive modeling. CoMMpass derivation data. A, Gaussian mixture modeling identified two distributions in the spectra score. Each patient was assigned into an OS risk group based on distribution probability. The blue bars on the x-axis represent patients in the low-risk group and the orange bars are patients in the high-risk group. B, Kaplan–Meier curve of the high-/low-risk groups. High-risk patients (n = 88, 58 events) had median survival of 20.1 months. Low-risk patients (n = 679, 121 events) did not reach median survival in the study timeframe. In the GSE2658 replication data: C, Two risk groups were identified using the OS spectra scores. D, High-risk patients (n = 43, 20 events) had median survival of 20.1 months. Low-risk patients (n = 516, 80 events) did not reach median survival in the study timeframe.

Close modal
Table 1.

OS HRs between high- and low-risk patients using spectra or UAMS to define patient groups.

Risk scoreApparent HRAdjusted HROptimismReplication HR
OS spectra 6.44 (4.47–14.86) 4.27 (2.31–12.69) 2.16 4.43 (1.85–10.61) 
UAMS 3.98 (2.62–5.54) 3.89 (2.53–5.45) 0.09 3.61 (1.98–6.61) 
Risk scoreApparent HRAdjusted HROptimismReplication HR
OS spectra 6.44 (4.47–14.86) 4.27 (2.31–12.69) 2.16 4.43 (1.85–10.61) 
UAMS 3.98 (2.62–5.54) 3.89 (2.53–5.45) 0.09 3.61 (1.98–6.61) 

The OS spectra score provided added predictive value (APV) beyond known clinical risk factors: Revised International Staging System (R-ISS) and age at diagnosis (APV = 0.612; LRT P = 2.36 × 10−20; Table 2), indicating the OS spectra risk score contains predictive information beyond clinical factors. In parallel analysis, UAMS APV was 0.576 (Table 2). OS spectra score also showed APV beyond a model with R-ISS, age at diagnosis and UAMS (APV = 0.109; P = 7.39 × 10−5). These indicate new value beyond the previously established UAMS risk score.

Table 2.

OS Cox models including covariates and added predictive values.

VariableBase modelΔAICΔBICAPVLRT P value
OS spectra R-ISS + Age −83.5 −80.7 0.612 2.36 × 10−20 
UAMS R-ISS + Age −71.7 −68.9 0.576 9.20 × 10−18 
OS spectra R-ISS + Age + UAMS −13.7 −10.9 0.109 7.39 × 10−5 
VariableBase modelΔAICΔBICAPVLRT P value
OS spectra R-ISS + Age −83.5 −80.7 0.612 2.36 × 10−20 
UAMS R-ISS + Age −71.7 −68.9 0.576 9.20 × 10−18 
OS spectra R-ISS + Age + UAMS −13.7 −10.9 0.109 7.39 × 10−5 

Abbreviations: R-ISS, Revised international staging system, age at diagnosis; ΔAIC, change in Akaike's information criterion by adding the variable; ΔBIC, change in Bayesian information criterion by adding the variable; APV, variable added predictive value of the variable; LRT P Value, LRT P-value between the base model with and without the variable.

Spectra identified high-risk patients with short OS that UAMS classified as low risk (Supplementary Fig. S4). Although both spectra and UAMS identified 61 patients as high-risk with median survival of 16.2 months (44 events, spectra identified 27 additional high-risk patients (median survival 26.7 months, 14 events) that UAMS identified as low risk. In contrast, UAMS identified 35 patients as high-risk that spectra identified as low risk: these patients remained above median survival in the study timeframe (Supplementary Fig. S4).

In external data (GSE2658), the OS spectra risk score retained predictive value. The same linear transformations for spectra calculation were applied to GSE2658, and the nine spectra prediction model used to calculate OS spectra risk scores. Using the GMM threshold from the CoMMpass data again identified two groups in GSE2658 (Fig. 4C); the high-risk group with significantly worse OS (Fig. 4D). The OS spectra risk HRs for the high-risk compared with low-risk patients in the GSE2658 data was 4.43, 95% CI, 1.85–10.61, out-performing UAMS (HR, 3.61; 95% CI, 1.98–6.61) even though UAMS was originally developed on these data (Table 1).

Spectra can be used to track multiple myeloma tumor changes over time to predict OS

To illustrate a potential to track changes over time, the OS spectra risk score was graphed for both initial and longitudinal follow-up samples (Fig. 5). Eleven patients had at least three follow-up CD138+ samples, spanning up to 6 years. Spectra scores were calculated in each of the follow-up samples and tracked over time (Fig. 5). Patients with tumors that acquired an OS spectra risk score above the high-risk threshold had worse survival (Fig. 5).

Figure 5.

Aggregate effect of CD138+ spectra for OS over time. Spectra scores for OS are shown in 11 patients with RNA-seq data at multiple time points. Diamonds indicate sequencing events and show the spectra score at that timepoint. The final narrow rectangle indicates timepoint the patient died (filled) or was last known alive (open). Gray horizontal line shows the high-/low-risk group threshold.

Figure 5.

Aggregate effect of CD138+ spectra for OS over time. Spectra scores for OS are shown in 11 patients with RNA-seq data at multiple time points. Diamonds indicate sequencing events and show the spectra score at that timepoint. The final narrow rectangle indicates timepoint the patient died (filled) or was last known alive (open). Gray horizontal line shows the high-/low-risk group threshold.

Close modal

As unsupervised and quantitative variables, spectra can predict additional multiple myeloma outcomes

The same 39 spectra variables can be used to model any outcome. We repeated the predictive modeling procedure for PFS and TTF. Cox regression of PFS (392 events) selected nine spectra in a model with Cadj = 0.60 (0.60–0.67). Four of the retained spectra were distinct from the OS spectra model (Supplementary Table S1). Gaussian mixture modeling identified two risk groups (P = 0.001; Supplementary Figs. S5A and S5B). Median time to progression was 9.7 months in the high-risk patients (n = 60, 50 events) and 35.7 months in the low-risk patients (n = 707, 342 events; Supplementary Fig. S5C). Between high- and low-risk groups, spectra PFS HRadj = 3.08 (1.67–8.44) whereas UAMS PFS HRadj = 2.40 (1.70–3.31; Supplementary Table S2). Spectra added information beyond R-ISS and age at diagnosis (APV = 0.67; LRT P = 4.3 × 10−19) in predicting PFS (Supplementary Table S3).

In Cox regression modeling to predict TTF (369 events), 10 spectra were selected with the model Cadj = 0.60 (0.59–0.66). Four of the retained spectra were distinct from the OS and PFS models (Supplementary Table S1). Two risk groups were identified in the TTF spectra score with GMM (P = 0.008; Supplementary Figs. S6A and S6B). Patients in the spectra high-risk TTF (n = 31, 25 events) had median TTF of 9.2 months compared with low-risk patients (n = 736, 344 events) with median TTF of 32.8 months (Supplementary Fig. S6C). HRs between high- and low-risk groups using TTF spectra risk groups [HRadj, 3.10 (1.31–5.46)] outperformed UAMS [TTF HRadj, 1.98 (1.48–2.68); Supplementary Table S4]. Spectra provided additional information in predicting TTF beyond R-ISS, and age at diagnosis (APV = 0.60, LRT P = 4.1×10−14; Supplementary Table S5). Together these results illustrate spectra capture the multiple myeloma tumor expression landscape and can be used in subsequent modeling with many clinical outcomes.

Spectrum S5 is enriched for genes in the unfolded protein response pathway

We implemented pre-ranked GSEA (30) for all spectra using the Hallmark pathways (29) to gain biological insight. Pathways with significant enrichment, after multiple testing correction, are shown in Supplementary Fig. S7 for each spectrum. Broadly, we observed enrichment in seven of the Hallmark Process Categories (29) with immune, signaling, and proliferation pathways most often significant (Supplementary Fig. S7). As a specific example, spectrum S5 was associated with all three clinical outcomes and significantly enriched for genes in the unfolded protein response (UPR) pathway (normalized enrichment score, NES = 2.25; Padj = 2.5×10−8). The UPR pathway is activated when incorrectly folded proteins create stress in the endoplasmic reticulum (ER; ref. 31). These degenerate proteins are targeted for degradation by the proteasome (31, 32). Bortezomib is a proteasome inhibitor and hence causes build-up of misfolded proteins, increasing ER stress leading to cell-death (32). Myeloma cells secrete large amounts of incorrectly folded immunoglobulins and therefore are near capacity for UPR (31). Because of this, CD138+ myeloma cells are particularly sensitive to proteasome inhibitors, such as Bortezomib, a common agent in the initial treatment for multiple myeloma (31, 32). An intriguing hypothesis consistent with these findings is that spectrum S5 may tag a patient's sensitivity to proteasome inhibition agents, thus influencing OS, PFS, and TTF. Future functional studies are needed to validate this relationship.

The goal of this study was to develop an approach and provide agnostic quantitative molecular phenotypes (spectra variables) to deeply profile transcriptome data. We described the SPECTRA approach in multiple myeloma. We identified 39 quantitative, orthogonal variables to capture sources of multiple myeloma transcriptome variation and used these spectra in subsequent descriptive and predictive modeling and as quantitative phenotypes. We demonstrated the utility of spectra in predicting OS beyond clinical risk factors, existing risk scores, and showed novel associations with demographic risk factors.

The SPECTRA approach provides a measured dive into the transcriptome. Each spectrum iteratively moves quantifiably deeper into the variance of the data. Methods that iteratively find independent components (PCA and independent component analysis) have previously been shown to provide superior coverage of transcriptome data (8). The importance of a deeper dive is illustrated in the multiple myeloma predictive modeling. Spectra representing variance deeper in the data were important in predicting OS; for example, S19 and S26 (both with variance <1%) are in the OS spectra risk score. In multiple myeloma, retention of components deep in the data, representing small variances (i.e., deep dives) improved prediction. Deep transcriptome characterization may be valuable as a new tool with the potential to identify the few patients that respond to a drug, or small groups of individuals with large effects in outcome studies, both relevant to precision medicine.

In multiple myeloma, we illustrated the implementation of 39 CD138+ spectra in predictive modeling for OS, PFS, and TTF. We showed added value beyond the most established risk score (UAMS) and clinical risk factors (R-ISS, age at diagnosis). The framework of 39 quantitative variables has the potential to provide a bridge to compare many patient or tissue characteristics and could be used to compare existing categorizations (such as subtypes) of patients, even when no genes overlap in their signatures, or they predict different outcomes (33). In this way, spectra provide an alternate to categorical intrinsic subtyping, a well-established practice for many cancers (34).

The potential for increased power using spectra variables is illustrated by the discovery of novel associations between spectra and patient demographic risk groups with known differences in incidence (age, gender, race). A prior CoMMpass study whose goal was to identify differences by race for myeloma tumors used the UAMS score to compare transcriptomes by race and did not identify significant differences (P = 0.662; ref. 35). In contrast, our framework identified 13 spectra that differed significantly by race. Our multivariable results demonstrate that significant differences do exist, but also illustrate that the diseased cells in these demographic groups are not distinct entities; fewer than half the spectra variables differ significantly by these patient demographic groups. Focusing on the spectra that do show differences by demographics provides new avenues to explore why incidence varies in these groups, a key to disease prevention, intervention, and control. Because transcriptomes capture both the effects of internal (inherited genetics) and external factors (lifestyle, exposures, consequences of access to care), these results support epidemiology and biosociology investigations into such differences. We have provided the variable framework (gene transformations) and the spectra variables for the CoMMpass patients (see Data and Code Availability) to enable further study of spectra in other CoMMpass studies, as well as in other myeloma studies.

As for any approach, there are limitations. An investigator should consider if a derivation dataset has adequate representation for their study goals. We note that the goal of the MMRF CoMMpass study was designed to represent myeloma patients from diagnosis through treatment and is the largest existing cohort of treatment naïve CD138+ transcriptomes, with sampling continuing over time. To limit overfitting in spectra derivation, we used dimension reduction to focus only on the first k spectra (largest k components of variation), selected using a screen test (17). Data quality and processing are paramount to derive informative agnostic variables. PCA is a procedure that provides linear transformations of the data to represent variance. If the data have technical artifacts, batch effects, unstable or non-comparable expression measures, the noise can overwhelm authentic variance. Accordingly, the SPECTRA approach intentionally includes strict quality control, careful zero-handling, robust normalization, and batch correction. Without these steps, PCA can fail to provide informative variables. These stringent rules will undoubtably also remove some genes that may be biologically meaningful. For example, driver genes may be completely suppressed in a substantial proportion of the patients’ tumors and thus removed. However, such biologically meaningful genes likely generate a cascade of changes in other genes retained the gene set. In this way, spectra can retain the signal of removed genes through their downstream effects. The impetus is to only retain features that can contribute to meaningful variance and provide informative variables for modeling. The limitation of an agnostic approach is reduced biological interpretation or mechanistic insight of the variables themselves, prior to modeling. However, there are already many approaches that take this alternate goal of intermediate interpretation (30), whose limitations are instead the reduced flexibility of the variables they produce. Hence, SPECTRA is a complementary approach to the current toolset available for all fields.

In conclusion, we present a new approach, SPECTRA, to derive an agnostic transcriptome framework of quantitative, orthogonal variables for a dataset. These multigene expression variables are designed specifically to capture transcriptome variation, providing phenotypes for flexible modeling, along with other covariates, to better differentiate individuals for any outcome. Applied to CD138+ transcriptomes for patients with myeloma, we defined CD138+ spectra and implemented these in many different outcome models. We illustrated an ability to predict prognostic outcomes and provide new insight into potential differences between tumors and patients from demographic groups. Fundamentally, the technique shifts from characterizing a transcriptome using categories to multiple quantitative measures. The SPECTRA approach provides a new paradigm and tool for exploring transcriptomes that hold promise for discoveries to advance precision screening, prevention, intervention, and survival studies.

H.A. Hanson reports grants from NIH Academic Career Development Awards during the conduct of the study. B.J. Avery reports grants from NCI during the conduct of the study. D.W. Sborov reports personal fees from GSK, Pfizer, Arcellx, AbbVie, Janssen, and Sanofi outside the submitted work. N.J. Camp reports grants from NCI during the conduct of the study. No disclosures were reported by the other authors.

R. Griffin: Formal analysis, funding acquisition, validation, investigation, visualization, methodology, writing–original draft, project administration, writing–review and editing. H.A. Hanson: Methodology, writing–review and editing. B.J. Avery: Software, formal analysis, validation, visualization, writing–review and editing. M.J. Madsen: Software, methodology, writing–review and editing. D.W. Sborov: Resources, writing–review and editing. N.J. Camp: Conceptualization, resources, data curation, supervision, funding acquisition, investigation, methodology, writing–original draft, project administration, writing–review and editing.

The research reported in this publication was supported by the NCI [award Nos. F99CA234943 (to R. Griffin), K00CA234943 (to R. Griffin), K07CA230150 (to H.A. Hanson), and P30CA042014–29S9], the National Center for Advancing Translational Sciences (award No. UL1TR002538), and the National Library of Medicine (award No. T15LM007124, to R. Griffin) of the NIH. The content is solely the responsibility of the authors and does not necessarily represent the official views of the NIH. We acknowledge the data resources utilized from the Multiple Myeloma Research Foundation (MMRF) and are grateful to the participants in the MMRF CoMMpass Study.

The publication costs of this article were defrayed in part by the payment of publication fees. Therefore, and solely to indicate this fact, this article is hereby marked “advertisement” in accordance with 18 USC section 1734.

Note: Supplementary data for this article are available at Cancer Epidemiology, Biomarkers & Prevention Online (http://cebp.aacrjournals.org/).

1.
Allott
EH
,
Shan
Y
,
Chen
M
,
Sun
X
,
Garcia-Recio
S
,
Kirk
EL
, et al
.
Bimodal age distribution at diagnosis in breast cancer persists across molecular and genomic classifications
.
Breast Cancer Res Treat
2020
;
179
:
185
95
.
2.
Lopez
C
,
Kleinheinz
K
,
Aukema
SM
,
Rohde
M
,
Bernhart
SH
,
Hubschmann
D
, et al
.
Genomic and transcriptomic changes complement each other in the pathogenesis of sporadic Burkitt lymphoma
.
Nat Commun
2019
;
10
:
1459
.
3.
Stopsack
KH
,
Ebot
EM
,
Downer
MK
,
Gerke
TA
,
Rider
JR
,
Kantoff
PW
, et al
.
Regular aspirin use and gene expression profiles in prostate cancer patients
.
Cancer Causes Control
2018
;
29
:
775
84
.
4.
Sweeney
C
,
Bernard
PS
,
Factor
RE
,
Kwan
ML
,
Habel
LA
,
Quesenberry
CP
Jr
, et al
.
Intrinsic subtypes from PAM50 gene expression assay in a population-based breast cancer cohort: differences by age, race, and tumor characteristics
.
Cancer Epidemiol Biomarkers Prev
2014
;
23
:
714
24
.
5.
Zhang
M
,
Lykke-Andersen
S
,
Zhu
B
,
Xiao
W
,
Hoskins
JW
,
Zhang
X
, et al
.
Characterising cis-regulatory variation in the transcriptome of histologically normal and tumour-derived pancreatic tissues
.
Gut
2018
;
67
:
521
33
.
6.
Brunet
JP
,
Tamayo
P
,
Golub
TR
,
Mesirov
JP
.
Metagenes and molecular pattern discovery using matrix factorization
.
Proc Natl Acad Sci U S A
2004
;
101
:
4164
9
.
7.
Tamayo
P
,
Slonim
D
,
Mesirov
J
,
Zhu
Q
,
Kitareewan
S
,
Dmitrovsky
E
, et al
.
Interpreting patterns of gene expression with self-organizing maps: methods and application to hematopoietic differentiation
.
Proc Natl Acad Sci U S A
1999
;
96
:
2907
12
.
8.
Way
GP
,
Zietz
M
,
Rubinetti
V
,
Himmelstein
DS
,
Greene
CS
.
Compressing gene expression data using multiple latent space dimensionalities learns complementary biological representations
.
Genome Biol
2020
;
21
:
109
.
9.
Lapointe
J
,
Li
C
,
Higgins
JP
,
van de Rijn
M
,
Bair
E
,
Montgomery
K
, et al
.
Gene expression profiling identifies clinically relevant subtypes of prostate cancer
.
Proc Natl Acad Sci U S A
2004
;
101
:
811
6
.
10.
Perou
CM
,
Sorlie
T
,
Eisen
MB
,
van de Rijn
M
,
Jeffrey
SS
,
Rees
CA
, et al
.
Molecular portraits of human breast tumours
.
Nature
2000
;
406
:
747
52
.
11.
Shaughnessy
JD
Jr.
,
Zhan
F
,
Burington
BE
,
Huang
Y
,
Colla
S
,
Hanamura
I
, et al
.
A validated gene expression model of high-risk multiple myeloma is defined by deregulated expression of genes mapping to chromosome 1
.
Blood
2007
;
109
:
2276
84
.
12.
Madsen
MJ
,
Knight
S
,
Sweeney
C
,
Factor
R
,
Salama
M
,
Stijleman
IJ
, et al
.
Reparameterization of PAM50 expression identifies novel breast tumor dimensions and leads to discovery of a genome-wide significant breast cancer locus at 12q15
.
Cancer Epidemiol Biomarkers Prev
2018
;
27
:
644
52
.
13.
Camp
NJ
,
Madsen
MJ
,
Herranz
J
,
Rodriguez-Lescure
A
,
Ruiz
A
,
Martin
M
, et al
.
Re-interpretation of PAM50 gene expression as quantitative tumor dimensions shows utility for clinical trials: application to prognosis and response to paclitaxel in breast cancer
.
Breast Cancer Res Treat
2019
;
175
:
129
39
.
14.
Hanson
HA
,
Leiser
CL
,
Madsen
MJ
,
Gardner
J
,
Knight
S
,
Cessna
M
, et al
.
Family study designs informed by tumor heterogeneity and multi-cancer pleiotropies: the power of the Utah population database
.
Cancer Epidemiol Biomarkers Prev
2020
;
29
:
807
15
.
15.
Keats
JJ
,
Craig
DW
,
Liang
W
,
Venkata
Y
,
Kurdoglu
A
,
Aldrich
J
, et al
.
Interim analysis of the MMRF Commpass Trial, a longitudinal study in multiple myeloma relating clinical outcomes to genomic and immunophenotypic profiles
.
Blood
2013
;
122
:
532
-.
16.
Johnson
WE
,
Li
C
,
Rabinovic
A
.
Adjusting batch effects in microarray expression data using empirical Bayes methods
.
Biostatistics
2007
;
8
:
118
27
.
17.
Cattell
RB
.
The Scree test for the number of factors
.
Multivariate Behav Res
1966
;
1
:
245
76
.
18.
Rajkumar
SV
,
Kumar
S
.
Multiple myeloma current treatment algorithms
.
Blood Cancer J
2020
;
10
:
94
.
19.
Therneau
TM
.
A package for survival analysis in R
2023 Mar 11. Available from
: https://cran.r-project.org/web/packages/survival/vignettes/survival.pdf.
20.
Therneau
TM
,
Grambsch
PM
.
Modeling survival data: extending the Cox model
.
New York, NY
:
Springer New York
;
2000
.
21.
Fraley
C
,
Raftery
AE
.
Model-based clustering, discriminant analysis, and density estimation
.
J Am Statist Assoc
2002
;
97
:
611
31
.
22.
Scrucca
L
,
Fop
M
,
Murphy
TB
,
Raftery
AE
.
mclust 5: clustering, classification and density estimation using Gaussian finite mixture models
.
R J
2016
;
8
:
289
317
.
23.
Armitage
P
,
Berry
G
,
Matthews
JNS
,
ProQuest
.
Statistical methods in medical research
.
Oxford, Boston
:
Blackwell Scientific Publications
;
2002
.
24.
Harrell
FE
.
Regression modeling strategies : with applications to linear models, logistic and ordinal regression, and survival analysis
.
Cham
:
Springer
;
2015
.
25.
Noma
H
,
Shinozaki
T
,
Iba
K
,
Teramukai
S
,
Furukawa
TA
.
Confidence intervals of prediction accuracy measures for multivariable prediction models based on the bootstrap-based optimism correction methods
.
Stat Med
2021
;
40
:
5691
701
.
26.
Al-Radi
OO
,
Harrell
FE
Jr
,
Caldarone
CA
,
McCrindle
BW
,
Jacobs
JP
,
Williams
MG
, et al
.
Case complexity scores in congenital heart surgery: a comparative study of the aristotle basic complexity score and the Risk Adjustment in Congenital Heart Surgery (RACHS-1) system
.
J Thorac Cardiovasc Surg
2007
;
133
:
865
75
.
27.
Califf
RM
,
Phillips
HR
, 3rd,
Hindman
MC
,
Mark
DB
,
Lee
KL
,
Behar
VS
, et al
.
Prognostic value of a coronary artery jeopardy score
.
J Am Coll Cardiol
1985
;
5
:
1055
63
.
28.
Davis
S
,
Meltzer
PS
.
GEOquery: a bridge between the Gene Expression Omnibus (GEO) and BioConductor
.
Bioinformatics
2007
;
23
:
1846
7
.
29.
Liberzon
A
,
Birger
C
,
Thorvaldsdottir
H
,
Ghandi
M
,
Mesirov
JP
,
Tamayo
P
.
The Molecular Signatures Database (MSigDB) hallmark gene set collection
.
Cell Syst
2015
;
1
:
417
25
.
30.
Subramanian
A
,
Tamayo
P
,
Mootha
VK
,
Mukherjee
S
,
Ebert
BL
,
Gillette
MA
, et al
.
Gene set enrichment analysis: a knowledge-based approach for interpreting genome-wide expression profiles
.
Proc Natl Acad Sci U S A
2005
;
102
:
15545
50
.
31.
Ri
M
.
Endoplasmic-reticulum stress pathway-associated mechanisms of action of proteasome inhibitors in multiple myeloma
.
Int J Hematol
2016
;
104
:
273
80
.
32.
Fricker
LD
.
Proteasome inhibitor drugs
.
Annu Rev Pharmacol Toxicol
2020
;
60
:
457
76
.
33.
Szalat
R
,
Avet-Loiseau
H
,
Munshi
NC
.
Gene expression profiles in myeloma: ready for the real world?
Clin Cancer Res
2016
;
22
:
5434
42
.
34.
Dai
X
,
Li
T
,
Bai
Z
,
Yang
Y
,
Liu
X
,
Zhan
J
, et al
.
Breast cancer intrinsic subtype classification, clinical use and future trends
.
Am J Cancer Res
2015
;
5
:
2929
43
.
35.
Manojlovic
Z
,
Christofferson
A
,
Liang
WS
,
Aldrich
J
,
Washington
M
,
Wong
S
, et al
.
Comprehensive molecular profiling of 718 multiple myelomas reveals significant differences in mutation frequencies between African and European descent cases
.
PLoS Genet
2017
;
13
:
e1007087
.
This open access article is distributed under the Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International (CC BY-NC-ND 4.0) license.

Supplementary data