Abstract
We assessed the predictive potential of positron emission tomography (PET)/CT-based radiomics, lesion volume, and routine blood markers for early differentiation of pseudoprogression from true progression at 3 months.
112 patients with metastatic melanoma treated with immune checkpoint inhibition were included in our study. Median follow-up duration was 22 months. 716 metastases were segmented individually on CT and 2[18F]fluoro-2-deoxy-D-glucose (FDG)-PET imaging at three timepoints: baseline (TP0), 3 months (TP1), and 6 months (TP2). Response was defined on a lesion-individual level (RECIST 1.1) and retrospectively correlated with FDG-PET/CT radiomic features and the blood markers LDH/S100. Seven multivariate prediction model classes were generated.
Two-year (median) overall survival, progression-free survival, and immune progression–free survival were 69% (not reached), 24% (6 months), and 42% (16 months), respectively. At 3 months, 106 (16%) lesions had progressed, of which 30 (5%) were identified as pseudoprogression at 6 months. Patients with pseudoprogressive lesions and without true progressive lesions had a similar outcome to responding patients and a significantly better 2-year overall survival of 100% (30 months), compared with 15% (10 months) in patients with true progressions/without pseudoprogression (P = 0.002). Patients with mixed progressive/pseudoprogressive lesions were in between at 53% (25 months). The blood prediction model (LDH+S100) achieved an AUC = 0.71. Higher LDH/S100 values indicated a low chance of pseudoprogression. Volume-based models: AUC = 0.72 (TP1) and AUC = 0.80 (delta-volume between TP0/TP1). Radiomics models (including/excluding volume-related features): AUC = 0.79/0.78. Combined blood/volume model: AUC = 0.79. Combined blood/radiomics model (including volume-related features): AUC = 0.78. The combined blood/radiomics model (excluding volume-related features) performed best: AUC = 0.82.
Noninvasive PET/CT-based radiomics, especially in combination with blood parameters, are promising biomarkers for early differentiation of pseudoprogression, potentially avoiding added toxicity or delayed treatment switch.
Immune checkpoint inhibitors (ICI) have revolutionized the treatment of patients with metastatic melanoma. However, more than 50% of patients do not respond to ICI. ICI response assessment is challenging, as novel response patterns, such as pseudoprogression (PP), are not considered in the response evaluation criteria in solid tumors (RECIST 1.1). An increase in tumor volume can reflect either true progressive disease (TPD) or an influx of immune-competent cells (PP). Early differentiation of PP and TPD is highly relevant in daily clinical decision-making, and predictive biomarkers are needed for better patient selection. We identified 2[18F]fluoro-2-deoxy-D-glucose–positron emission tomography/CT-based radiomic and delta-radiomic features as novel imaging markers for early differentiation of PP from TPD. In addition, we showed that the routine blood markers LDH and S100 can contribute to PP prediction. A multimodality approach combining radiomics and blood markers in a prediction model at an early timepoint of 3 months yielded the best performance. Added toxicity or a delayed treatment switch in patients with metastatic melanoma treated with ICI might thereby be avoided.
Introduction
Immune checkpoint inhibitors (ICI) have revolutionized the treatment of patients with metastatic melanoma and are guideline-recommended treatment standards (1–3). However, more than 50% of patients do not respond to ICI (4). Predictive biomarkers are needed for better patient selection. Lactate dehydrogenase (LDH) is one of the few established melanoma biomarkers with prognostic value for overall survival (OS; ref. 5). S100 is associated with OS and response to ipilimumab, whereas other blood markers and clinical markers yielded mixed results (6–10). PD-L1 expression is insufficient for patient selection, as both PD-L1–positive and negative patients benefit from ICI (3). Tumor mutational burden (TMB), T-cell clonality, circulating tumor DNA (ctDNA), immune gene signatures, as well as T-cell–inflamed gene expression profile are promising, but currently unavailable in routine practice (11–19). Most of these factors do not have an application in response assessment and require at least minimally invasive diagnostics, rendering them challenging for repeated analysis.
ICI response assessment is especially challenging, as the response evaluation criteria in solid tumors (RECIST 1.1) do not consider novel response patterns, such as pseudoprogression (PP; refs. 20, 21). In the setting of ICI, a lesion enlargement could be caused by either true progressive disease (TPD) because of tumor growth or an influx of immune-competent cells, representing an effective antitumor immune response or pseudoprogression (22). Early differentiation of PP and TPD is highly relevant in daily clinical decision-making. The immune-related response criteria (irRC; ref. 23) were the first to include these phenomena, followed by iRECIST (24). Both rely on confirmation follow-up imaging with potentially negative consequences. A treatment might be changed early because PP could be misinterpreted as TPD using only RECIST criteria or a switch to an effective treatment alternative could be delayed while waiting for confirmatory follow-up imaging. With newly proposed PECRIT (25) and PERCIMT (26) criteria, lesions increasing in size or the appearance of new lesions do not necessarily imply true progression, but may also be attributed to PP. However, these criteria have not been validated in larger cohorts.
Noninvasive imaging is performed repetitively during the treatment course for continuous response assessment. Manual image assessment is, however, characterized by suboptimal accuracy as well as intraobserver and interobserver variability (27, 28). Consequently, there is a strong rationale for quantitative medical imaging analysis (i.e., radiomics). Recently, Sun and colleagues demonstrated a correlation between CT radiomic signatures and the molecular CD8 cell expression in patients with different solid tumors treated with ICI, discriminating inflamed tumors from noninflamed tumors, which was associated with higher response rates at 3 and 6 months, as well as higher OS (29).
Our study aimed to identify novel imaging markers and blood markers for early differentiation of PP from TPD in patients with metastatic melanoma treated with anti–PD-1 antibodies. The predictive potential of early single-timepoint radiomics and lesion volume as well as multi-timepoint delta radiomics and delta volume was assessed on a lesion-individual level using positron emission tomography (PET)/CT imaging. In addition, the routine blood markers LDH and S100 were assessed for PP prediction on a patient-individual level.
Materials and Methods
Patient cohort
This is a single institution analysis of a deeply characterized cohort of 190 patients with metastatic melanoma treated with either single checkpoint inhibition (anti–PD-1) or dual checkpoint inhibition (anti–PD-1/anti–CTLA-4) between 2013 and 2019. Written informed consent was obtained from all patients and the study was approved by the local ethics committee (Kantonale Ethikkommission Zürich, approval number 2019-01012) in accordance with “good clinical practice” (GCP) guidelines and the Declaration of Helsinki.
The following exclusion criteria were applied to allow for a standardized radiomics and outcome analysis: lack of follow-up/baseline imaging; patients with only contrast-enhanced CT imaging (as most patients were staged/followed with nonenhanced PET/CT imaging); patients with only brain metastases; and patients presenting with only very small metastases at baseline (all baseline lesions <0.5 cc). The last exclusion criterion is based on multiple factors. Statistical comparisons between individual voxels of a defined lesion (e.g., radiomics analysis) can only be performed if an adequate number of voxels is present in the defined volume. This number is limited by the resolution of the underlying imaging technology and by the voxel size itself. In addition, all imaging-based analysis methods require a minimum lesion size to perform measurements with sufficient accuracy and reliability.
Endpoints
On a lesion-individual level, response was defined using RECIST 1.1 criteria, comparing lesion diameter at three different timepoints (TP): baseline (TP0), first follow-up at 3 months (TP1), and second follow-up at 6 months (TP2). PP was defined as a diameter increase of ≥20% at TP1 compared with TP0, followed by a return to an increase of <20% at TP2 compared with TP0. TPD was defined as an increase of ≥20% at both TP1 and TP2 compared with TP0.
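The lesion-level definitions above can be written as a small decision rule. The following is a minimal sketch, assuming diameters are given in the same unit at all three timepoints; the function names are hypothetical and not part of the study software.

```python
from typing import Optional


def relative_change(baseline: float, followup: float) -> float:
    """Relative diameter change versus baseline as a fraction (0.2 means +20%)."""
    if baseline <= 0:
        raise ValueError("baseline diameter must be positive")
    return (followup - baseline) / baseline


def classify_progressive_lesion(d_tp0: float, d_tp1: float, d_tp2: float,
                                threshold: float = 0.20) -> Optional[str]:
    """Return 'PP', 'TPD', or None if the lesion was not progressive at TP1.

    Both follow-up diameters are compared with the baseline (TP0) diameter,
    as in the definitions in the text.
    """
    if relative_change(d_tp0, d_tp1) < threshold:
        return None  # not progressive at TP1, so neither PP nor TPD
    if relative_change(d_tp0, d_tp2) >= threshold:
        return "TPD"  # still >=20% above baseline at TP2
    return "PP"  # back below the 20% threshold at TP2
```

For example, a lesion measuring 10 mm, 13 mm, and 10 mm at TP0, TP1, and TP2 would be labeled PP under this rule, whereas 10 mm, 13 mm, and 14 mm would be labeled TPD.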
The distribution of PP and TPD lesions was analyzed on a patient-level and all patients were classified into: (i) patients with ≥1 PP lesion and no TPD lesions (PP only); (ii) patients with ≥1 TPD and no PP lesions (TPD only); (iii) patients presenting with both PP lesions and TPD lesions (mixed PP and TPD); and (iv) all other patients, who did not have a progressive lesion at TP1. For the lesion-level analysis and radiomics analysis, the appearance of new lesions was not taken into account in the patient stratification, as only lesions available at all three timepoints were considered for these analyses. OS was defined as the time from ICI treatment initiation to the date of death. Progression-free survival (PFS) and immune PFS (iPFS) were defined as the time from ICI treatment initiation to the date of first progression/appearance of new lesions (PFS, RECIST; ref. 20), or confirmed progression/appearance of new lesions (iPFS, iRECIST; ref. 24).
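The patient-level stratification can be illustrated as follows. This is a sketch only: it assumes each patient is represented by the list of labels of his or her lesions ("PP", "TPD", or anything else for nonprogressive lesions), and the function name and group strings are hypothetical.

```python
from typing import Iterable, Optional


def classify_patient(lesion_labels: Iterable[Optional[str]]) -> str:
    """Assign a patient to one of the four groups described in the text."""
    labels = set(lesion_labels)
    has_pp = "PP" in labels
    has_tpd = "TPD" in labels
    if has_pp and has_tpd:
        return "mixed PP and TPD"
    if has_pp:
        return "PP only"
    if has_tpd:
        return "TPD only"
    # No lesion was progressive at TP1 (responding or stable lesions only).
    return "no progressive lesion at TP1"
```

As in the lesion-level analysis, new lesions are not represented in this input and therefore do not influence the group assignment.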
Blood markers
LDH and S100 levels were measured at TP0 and TP1 and their relative change between TP1 and TP0 was calculated.
Imaging and lesion delineation
All imaging was performed at a single institution using standardized imaging protocols. All subjects were injected with a body weight–dependent and/or BMI-adapted 2[18F]fluoro-2-deoxy-D-glucose (FDG) dose (2.0–3.5 MBq per kg). Scanning was performed on different scanners, partly with time-of-flight acquisition. PET image reconstructions used ordered subset expectation maximization together with point spread function modeling where available. The CT acquisition parameters were almost identical for all scanners and have been described previously in detail (30).
On the basis of exclusion criteria, 112 patients with CT imaging for all three timepoints were included. PET imaging for all timepoints was available for 90 of these patients (80%). All lesions were manually segmented by two experienced clinicians based on a common protocol and consistent quality control at all timepoints (Fig. 1). A validation was performed for about 10% of all lesions, visually comparing the independently contoured lesions, demonstrating reproducibility across all lesion locations.
CT and PET images were coregistered rigidly and CT-based contours were propagated to the PET images. Any spatial geographic mismatch was manually corrected via shifting the CT-based contours to the corresponding lesion location in PET images. All lesions were classified into liver, lung, bone, or soft tissue (including lymph nodes, cutaneous/subcutaneous, and muscular lesions) metastases.
Extraction of radiomic features
The in-house developed radiomics software Z-Rad (31), written in Python, was used for preprocessing and extraction of radiomic features from medical images in agreement with the image biomarker standardization initiative (32). CT and PET images were resampled to isotropic voxels of 3.75 mm and 5 mm, respectively, corresponding to the lowest image resolution (image slice thickness) of the whole cohort. Three types of features, describing shape, intensity, and texture, were extracted. No Hounsfield unit range limits were applied. As a preprocessing step before texture feature extraction, image intensity values were discretized using a quantization step (bin width) of 5 HU for CT and 0.25 SUV for PET. In total, 172 features per lesion were extracted for each imaging modality. The features were extracted for images taken at TP0 and TP1. The reproducibility of radiomic features was independently confirmed for a subset of lesions. Next, the relative (%) change in feature values between these two timepoints was calculated, yielding the delta radiomic features.
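Two of the preprocessing steps described here, fixed-bin-width discretization and the relative change between timepoints, can be sketched in a few lines. This is not the Z-Rad implementation. The bin index follows the usual fixed-bin-size rule, floor((x − x_min) / w) + 1, and using the lesion minimum as the lower bound is an assumption made for this example.

```python
import numpy as np


def discretize_fixed_bin_width(values, bin_width, minimum=None):
    """Map intensities to 1-based bin indices with a fixed bin width.

    bin_width would be 5 (HU) for CT or 0.25 (SUV) for PET in the text.
    If no lower bound is given, the minimum of the input is used (assumption).
    """
    values = np.asarray(values, dtype=float)
    lower = values.min() if minimum is None else float(minimum)
    return np.floor((values - lower) / bin_width).astype(int) + 1


def delta_feature(value_tp0, value_tp1):
    """Relative (%) change of a feature between TP0 and TP1.

    Undefined when the baseline value is zero; such features would need
    separate handling, which is not described in the text.
    """
    return 100.0 * (value_tp1 - value_tp0) / value_tp0
```

With a bin width of 5, the intensities 0, 4.9, 5, and 12 fall into bins 1, 1, 2, and 3, and a feature rising from 2.0 to 3.0 has a delta value of 50%.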
Statistical analysis
To differentiate progressive lesions at TP1 into PP and TPD, a total of seven model classes were considered, including multimodality approaches:
(i) Blood: models based only on blood markers (LDH and S100).
(ii) Volume: models based only on the volume of individual metastases.
(iii) Radiomics: (including volume-related features)—models based on radiomic features.
(iv) Radiomics: (excluding volume-related features)—models based on radiomic features. Lesion volume and radiomic features correlated to lesion volume above Pearson r = 0.5 were excluded.
(v) Blood and volume: models based on blood markers and lesion volume.
(vi) Blood and radiomics: (excluding volume-related features)—models based on blood markers and on radiomic features. Lesion volume and radiomic features correlated to lesion volume above Pearson r = 0.5 were excluded.
(vii) Blood and radiomics: (including volume-related features)—models based on blood markers and on radiomic features.
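For the model classes that exclude volume-related features, the exclusion rule can be sketched as below. The example assumes that features are stored as equally long per-lesion arrays in a dictionary and that the threshold is applied to the absolute value of Pearson r; the text only states "above Pearson r = 0.5", so the absolute value is an assumption, and all names are hypothetical.

```python
import numpy as np


def drop_volume_related(features, volume_key="volume", r_max=0.5):
    """Remove lesion volume and every feature with |Pearson r| > r_max to it.

    features: dict mapping feature name -> sequence of per-lesion values.
    Returns a new dict with the remaining features.
    """
    volume = np.asarray(features[volume_key], dtype=float)
    kept = {}
    for name, values in features.items():
        if name == volume_key:
            continue  # lesion volume itself is always excluded
        r = np.corrcoef(volume, np.asarray(values, dtype=float))[0, 1]
        if abs(r) <= r_max:
            kept[name] = values
    return kept
```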
The model building procedure consisted of five steps. First, an unsupervised feature selection with the Pearson correlation coefficient was performed. In this procedure, pairwise correlation coefficients were calculated for all features and correlated features were removed. Second, all features were scaled by subtracting the mean and dividing by the SD. Third, a univariate supervised feature selection was performed. For each feature, an F-test was performed; only features that remained significant after false discovery rate correction with the Benjamini–Hochberg procedure (33) were kept. Fourth, features were selected on the basis of feature weights from a fitted model: a logistic regression model regularized with elastic net was fit to the data, and the features with the highest weights were selected. Finally, the classifier was fit to the data. This final model was a logistic regression regularized with an L2 penalty. The tuned hyperparameters are listed in Supplementary Table S1.
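The five steps can be approximated with a Scikit-learn pipeline. This is a sketch under stated assumptions, not the authors' code: the correlation threshold (0.9), the FDR level (0.05), the number of retained features, and the values of `C` and `l1_ratio` are placeholders (the tuned values are in Supplementary Table S1), and `CorrelationFilter` is a hypothetical helper written for this example.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.feature_selection import SelectFdr, SelectFromModel, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler


class CorrelationFilter(TransformerMixin, BaseEstimator):
    """Step 1 (unsupervised): greedily drop a feature if its |Pearson r|
    with an already kept feature reaches the threshold."""

    def __init__(self, threshold=0.9):
        self.threshold = threshold

    def fit(self, X, y=None):
        X = np.asarray(X, dtype=float)
        corr = np.abs(np.atleast_2d(np.corrcoef(X, rowvar=False)))
        keep = []
        for j in range(X.shape[1]):
            if all(corr[j, k] < self.threshold for k in keep):
                keep.append(j)
        self.keep_ = np.array(keep, dtype=int)
        return self

    def transform(self, X):
        return np.asarray(X, dtype=float)[:, self.keep_]


def build_pp_pipeline(max_features=2, C=1.0, l1_ratio=0.5):
    """Assemble the five steps; all default values are illustrative."""
    elastic_net = LogisticRegression(penalty="elasticnet", solver="saga",
                                     l1_ratio=l1_ratio, C=C, max_iter=5000)
    return Pipeline([
        ("decorrelate", CorrelationFilter(threshold=0.9)),        # step 1
        ("scale", StandardScaler()),                              # step 2
        ("f_test_fdr", SelectFdr(f_classif, alpha=0.05)),         # step 3 (BH)
        ("weights", SelectFromModel(elastic_net, threshold=-np.inf,
                                    max_features=max_features)),  # step 4
        ("classifier", LogisticRegression(C=C, max_iter=5000)),   # step 5, L2 by default
    ])
```

Placing every step inside one pipeline means that feature selection and scaling are refit on each training fold, which keeps information from the test folds out of model building during cross-validation.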
Our models were trained, optimized, and tested in a setting of a nested cross-validation. The inner loop, used for model tuning, was a three times repeated stratified 10-fold cross-validation. In total, 96 randomly generated hyperparameter samples were evaluated in model tuning with random search optimization (34). The outer loop, used for model testing, was a three times repeated stratified 5-fold cross-validation. The performance metric used was the area under the receiver operating characteristic curve (AUC).
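The nested cross-validation scheme can be sketched with Scikit-learn as follows. The loop structure (inner 3 × 10-fold, outer 3 × 5-fold, 96 random-search samples, AUC as the metric) follows the text. The estimator and the search space are simplified placeholders for this example: a scaled logistic regression with a log-uniform range for `C`.

```python
from scipy.stats import loguniform
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import (RandomizedSearchCV,
                                     RepeatedStratifiedKFold, cross_val_score)
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler


def nested_cv_auc(X, y, n_iter=96, seed=0):
    """Return the 15 outer-loop AUC values (3 repeats x 5 folds)."""
    inner = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=seed)
    outer = RepeatedStratifiedKFold(n_splits=5, n_repeats=3, random_state=seed + 1)
    model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
    # Inner loop: random search over hyperparameters, scored by AUC.
    search = RandomizedSearchCV(
        model,
        param_distributions={"logisticregression__C": loguniform(1e-3, 1e3)},
        n_iter=n_iter, scoring="roc_auc", cv=inner, random_state=seed,
    )
    # Outer loop: each fold refits the whole search on its training part
    # and evaluates the tuned model on the held-out part.
    return cross_val_score(search, X, y, scoring="roc_auc", cv=outer)
```

The mean and SD of the returned values correspond to the kind of "AUC (SD)" summary reported for each model class.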
The optimal cutoff was estimated for every model from the corresponding averaged ROC curve. The averaged ROC curves were calculated on the basis of 15 ROC curves (3 × 5) from the outer loop of the nested cross-validation. The optimal cutoff was defined as the sensitivity and specificity at which the Youden index (sensitivity + specificity − 1) was maximal (35). For the optimal cutoff, sensitivity, specificity, predictive values, and likelihood ratios with corresponding confidence intervals (CI) were estimated.
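The averaging of the outer-loop ROC curves and the Youden criterion can be illustrated as follows. How the 15 curves were averaged is not specified in the text; the sketch assumes vertical averaging, that is, interpolating the sensitivity of each curve onto a common false-positive-rate grid. Function names are hypothetical.

```python
import numpy as np


def average_roc(curves, grid=None):
    """Average ROC curves on a common false-positive-rate grid.

    curves: iterable of (fpr, tpr) pairs, each sorted by increasing fpr.
    """
    grid = np.linspace(0.0, 1.0, 101) if grid is None else np.asarray(grid, dtype=float)
    tprs = [np.interp(grid, fpr, tpr) for fpr, tpr in curves]
    return grid, np.mean(tprs, axis=0)


def youden_optimum(fpr, tpr):
    """Return (sensitivity, specificity, J) at the maximal Youden index.

    J = sensitivity + specificity - 1 = tpr - fpr.
    """
    fpr = np.asarray(fpr, dtype=float)
    tpr = np.asarray(tpr, dtype=float)
    j = tpr - fpr
    best = int(np.argmax(j))  # first maximum if several points tie
    return tpr[best], 1.0 - fpr[best], j[best]
```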
OS, PFS, and iPFS were assessed and compared among the PP-only, TPD-only, and mixed PP and TPD groups, and all patients who did not progress at TP1. The landmark analysis method (36) was used to avoid guarantee-time bias. The landmark was set at the determination timepoint of PP and true progression (TP2). To avoid excluding patients who had their TP2 follow-up examination slightly earlier than 6 months after the onset of treatment, we chose 5 months for the landmark. Family-wise error rate (FWER) was controlled at the 0.05 level with the Holm–Bonferroni method for each type of survival (37).
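The landmark selection and the Holm–Bonferroni adjustment can be sketched as below. Two assumptions are made for the example: patients whose event or censoring time does not exceed the landmark are excluded, and the remaining times are measured from the landmark onward (whether the clock was reset in the study is not stated). The record layout and function names are hypothetical.

```python
import numpy as np


def apply_landmark(records, landmark_months=5.0):
    """Keep patients still under observation after the landmark.

    records: list of dicts with at least a 'time' key (months since ICI start).
    Returned times are measured from the landmark (assumption).
    """
    return [
        {**r, "time": r["time"] - landmark_months}
        for r in records
        if r["time"] > landmark_months
    ]


def holm_adjust(p_values):
    """Holm-Bonferroni adjusted P values, returned in the input order."""
    p = np.asarray(p_values, dtype=float)
    m = p.size
    order = np.argsort(p)
    # The k-th smallest P value is multiplied by (m - k + 1) ...
    scaled = (m - np.arange(m)) * p[order]
    # ... and the adjusted values are forced to be nondecreasing and <= 1.
    adjusted_sorted = np.minimum(1.0, np.maximum.accumulate(scaled))
    adjusted = np.empty(m)
    adjusted[order] = adjusted_sorted
    return adjusted
```

A comparison is then declared significant at the 0.05 family-wise level if its adjusted P value is at most 0.05.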
Tumor burden, approximated by total tumor volume of metastatic lesions in a patient, was estimated at TP0, TP1, and TP2. New lesions that appeared at TP1 or TP2 were not considered in the estimation. Statistical significance of the differences between the groups was estimated with Mann–Whitney U test.
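The tumor burden comparison can be reproduced in outline with SciPy. This is an illustrative sketch: tumor burden is taken as the sum of the lesion volumes of a patient, and a two-sided test is assumed because the sidedness is not stated in the text. Function names are hypothetical.

```python
from scipy.stats import mannwhitneyu


def tumor_burden(lesion_volumes_cc):
    """Total tumor volume (cc) of the lesions of one patient."""
    return float(sum(lesion_volumes_cc))


def compare_burden(group_a, group_b):
    """Two-sided Mann-Whitney U test between two groups of tumor burdens."""
    result = mannwhitneyu(group_a, group_b, alternative="two-sided")
    return result.statistic, result.pvalue
```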
Software
For visualization, statistical analysis, model building, and model testing, the following open-source Python packages were used: HoloViews, Lifelines, Matplotlib, NumPy & SciPy, Pandas, and Scikit-learn.
Results
Patient characteristics
On the basis of our exclusion criteria, 112 patients with a total of 716 metastases were included. A total of 2,061 metastases for both CT and PET imaging were individually segmented and analyzed. 645 (90%) of the 716 baseline lesions were either visible at all three timepoints or had a complete remission at TP1 or TP2. The remaining 71 lesions had to be excluded due to either surgical removal or lack of follow-up imaging.
The analysis of the lesion locations, as well as the patient characteristics analysis, was performed at baseline (TP0) and is therefore based on the 716 baseline metastases. The most frequent location was in soft tissue with 378 (52.8%) of all 716 baseline lesions. Other locations included 161 (22.5%) lung, 128 (17.9%) liver/spleen, 47 (6.6%) bone, and 2 (0.3%) myocardial lesions. Table 1 describes the patient characteristics for all groups.
Patient characteristics | All patients | PP-only (progressive lesions at TP1) | TPD-only (progressive lesions at TP1) | Mixed PP & TPD (progressive lesions at TP1) |
---|---|---|---|---|
General | ||||
Total (n, %) | 112 (100%) | 9 (8%) | 16 (14%) | 11 (10%) |
Female (n, %) | 34 (30.4%) | 3 (33.3%) | 8 (50%) | 6 (54.5%) |
Male (n, %) | 78 (69.6%) | 6 (66.7%) | 8 (50%) | 5 (45.5%) |
Median age (years, IQR) | 69 (55–76) | 74 (64–78) | 57.5 (50.5–69) | 61 (51.5–73) |
Treatment information | ||||
Single checkpoint inhibition (n, %) | 95 (84.8%) | 7 (77.8%) | 14 (87.5%) | 10 (90.9%) |
Dual checkpoint inhibition (n, %) | 17 (15.2%) | 2 (22.2%) | 2 (12.5%) | 1 (9.1%) |
Prior treatments (cumulative): | ||||
Total | 69 (61%) | 5 (55%) | 10 (62%) | 8 (72%) |
Ipilimumab | 53 (76.8%) | 4 (80%) | 7 (70%) | 6 (75%) |
Ipilimumab + Nivolumab | 3 (4.4%) | 0 (0%) | 0 (0%) | 0 (0%) |
BRAF Inhibitor | 2 (2.9%) | 0 (0%) | 1 (10%) | 0 (0%) |
MEK Inhibitor | 4 (5.8%) | 0 (0%) | 0 (0%) | 1 (12.5%) |
Chemotherapy | 7 (10.1%) | 1 (20%) | 2 (20%) | 1 (12.5%) |
Lesion details | ||||
Total number of lesions (baseline) | 716 (100%) | 40 (5.6%) | 93 (13.0%) | 101 (14.1%) |
Mean number of lesions per patient | 6.4 | 4.5 | 5.8 | 9.2 |
1 = soft tissue | 378 (52.8%) | 29 (72.5%) | 66 (71.0%) | 50 (49.5%) |
2 = lung | 161 (22.5%) | 6 (15.0%) | 12 (12.9%) | 23 (22.8%) |
3 = liver/spleen | 128 (17.9%) | 5 (12.5%) | 9 (9.7%) | 15 (14.9%) |
4 = bone | 47 (6.6%) | 0 (0%) | 6 (6.5%) | 13 (12.9%) |
5 = heart | 2 (0.3%) | 0 (0%) | 0 (0%) | 0 (0%) |
Patients with new lesions at TP1 | 46 (41%) | 5 (55%) | 12 (75%) | 8 (73%) |
Patients with new lesions at TP2 | 29 (26%) | 2 (22%) | 10 (62%) | 7 (64%) |
Note: Patient characteristics of all included patients (n = 112), as well as the defined groups: PP-only, TPD-only, and mixed PP and TPD.
Abbreviation: IQR, interquartile range.
The number of lesions per patient and the lesion locations were similarly distributed between PP-only and TPD-only patients. Patients with mixed PP/TPD had a higher mean number of metastases per patient (9.2) and a lower proportion of soft tissue lesions than the other two groups, resembling a distribution closer to that of all patients combined. The percentage of patients receiving dual checkpoint inhibition and prior anti–CTLA-4 treatment was evenly distributed among all groups. The overall percentage of patients having new lesions at TP1 and TP2 was 41% and 26%, respectively. PP-only patients presented with comparable numbers of 55% and 22%, respectively. TPD-only patients and mixed PP and TPD patients had a higher percentage of new lesions with 75%/73% (TP1) and 62%/64% (TP2), respectively.
Lesion-level analysis
To classify a single lesion into either pseudoprogression or true progression, a total of 3 timepoints are necessary: baseline lesion diameter (TP0), diameter increase by ≥20% at 3 months (TP1), and confirmation of either true progressive disease, or pseudoprogression at 6 months (TP2). Therefore, the analysis of the individual lesion response per timepoint was limited to the 645 lesions that were available at all 3 timepoints.
Of these 645 lesions, 82 (13%) showed complete remission at the first follow-up at 3 months (TP1). 122 (19%) showed partial remission and 335 (52%) lesions were stable. Importantly, 106 (16%) lesions showed a progression by ≥20% at TP1, of which 30 lesions (4.7% of all lesions, 28.3% of progressive lesions) were defined as PP at TP2. 76 lesions (11.8% of all lesions, 71.7% of progressive lesions) remained progressive throughout TP1 and TP2 and were classified as TPD. The development of PP was not associated with metastasis location (P = 0.40) or treatment type (single vs. dual ICI, P = 0.12). Figure 2 illustrates the changes in response between TP1 and TP2.
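The proportions in this paragraph follow directly from the lesion counts. The short check below recomputes them; it uses only the numbers given in the text, and the helper name is hypothetical.

```python
def percent(part, whole, digits=1):
    """Percentage of part in whole, rounded to the given number of digits."""
    return round(100.0 * part / whole, digits)


# Lesion counts at TP1 as reported in the text.
counts = {"complete remission": 82, "partial remission": 122,
          "stable": 335, "progressive": 106}
total = sum(counts.values())  # the four categories add up to 645 lesions

pp, tpd = 30, 76  # progressive lesions split at TP2
pp_of_all = percent(pp, total)                              # 4.7
pp_of_progressive = percent(pp, counts["progressive"])      # 28.3
tpd_of_all = percent(tpd, total)                            # 11.8
tpd_of_progressive = percent(tpd, counts["progressive"])    # 71.7
```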
PP prediction models
A total of seven model classes for the prediction of pseudoprogression were considered, including multimodality approaches. Figure 3 shows the best performing ROC curves for all model modalities. The feature weights of the individual models, as well as partial dependence plots are provided in Supplementary Figs. S1 and S2. Table 2 provides an overview of all model metrics including AUC, sensitivity [true positive rate (TPR)], specificity [true negative rate (TNR)], positive/negative predictive value, and positive/negative likelihood ratio.
Model classes | Included parameters/features | AUC (SD) | Sensitivity (TPR; 95% CI) | Specificity (TNR; 95% CI) | Positive predictive value (PPV; 95% CI) | Negative predictive value (NPV; 95% CI) | Positive likelihood ratio (LR+; 95% CI) | Negative likelihood ratio (LR−; 95% CI) |
---|---|---|---|---|---|---|---|---|
Blood | LDH, S100 | 0.71 (±0.07) | 0.69 (0.52–0.85) | 0.67 (0.56–0.77) | 0.45 (0.30–0.59) | 0.84 (0.75–0.93) | 2.06 (1.38–3.07) | 0.47 (0.27–0.82) |
Volume (TP1) | CT volume | 0.72 (±0.13) | 0.76 (0.60–0.91) | 0.60 (0.49–0.71) | 0.42 (0.29–0.56) | 0.86 (0.77–0.95) | 1.87 (1.33–2.63) | 0.41 (0.21–0.79) |
Delta volume (TP1 vs. TP0) | ΔCT volume | 0.80 (±0.10) | 0.81 (0.67–0.95) | 0.67 (0.56–0.77) | 0.49 (0.35–0.63) | 0.90 (0.82–0.98) | 2.43 (1.69–3.49) | 0.28 (0.13–0.60) |
Radiomics (TP1 - CT) | CT center of mass shift | 0.69 (±0.12) | 0.71 (0.55–0.87) | 0.60 (0.49–0.71) | 0.41 (0.28–0.54) | 0.84 (0.74–0.94) | 1.76 (1.23–2.51) | 0.48 (0.27–0.88) |
Radiomics (TP1 - PET) | PET information measures of correlation 2, PET large zone high grey level emphasis | 0.68 (±0.13) | 0.48 (0.30–0.66) | 0.80 (0.71–0.89) | 0.48 (0.30–0.66) | 0.79 (0.70–0.89) | 2.37 (1.32–4.24) | 0.65 (0.46–0.94) |
Delta-radiomics (TP1 vs. TP0) (including volume-related features) | ΔCT volume, ΔCT fractal dimension | 0.79 (±0.09) | 0.81 (0.67–0.95) | 0.67 (0.56–0.77) | 0.49 (0.35–0.63) | 0.90 (0.82–0.98) | 2.43 (1.69–3.49) | 0.28 (0.13–0.60) |
Delta-radiomics (TP1 vs. TP0) (excluding volume-related features) | ΔCT coarseness | 0.78 (±0.08) | 0.89 (0.78–1.00) | 0.53 (0.41–0.64) | 0.42 (0.30–0.55) | 0.92 (0.84–1.00) | 1.87 (1.43–2.45) | 0.21 (0.08–0.60) |
Blood and Volume | LDH, S100, ΔCT volume | 0.79 (±0.08) | 0.80 (0.66–0.94) | 0.67 (0.56–0.77) | 0.49 (0.35–0.63) | 0.89 (0.81–0.97) | 2.40 (1.67–3.46) | 0.30 (0.14–0.62) |
Blood and Radiomics (including volume-related features) | LDH, ΔCT volume, ΔCT fractal dimension, ΔCT 10th percentile | 0.78 (±0.09) | 0.84 (0.71–0.97) | 0.67 (0.56–0.77) | 0.50 (0.36–0.64) | 0.92 (0.84–0.99) | 2.53 (1.78–3.61) | 0.23 (0.10–0.55) |
Blood and Radiomics (excluding volume-related features) | LDH, ΔCT coarseness | 0.82 (±0.09) | 0.81 (0.67–0.95) | 0.73 (0.63–0.83) | 0.54 (0.39–0.69) | 0.91 (0.83–0.98) | 2.97 (1.98–4.46) | 0.26 (0.12–0.55) |
Note: Overview of all model metrics including AUC (± SD), sensitivity (TPR; 95% CI), specificity (TNR, 95% CI), positive/negative predictive value (PPV/NPV, 95% CI) and positive/negative likelihood ratio (LR+/LR−, 95% CI).
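The predictive values and likelihood ratios in the table are tied to sensitivity, specificity, and the share of PP among progressive lesions (30 of 106). The sketch below recomputes them for the best model (sensitivity 0.81, specificity 0.73) using the standard formulas; it is a consistency check on rounded inputs, not the authors' calculation, and the function name is hypothetical.

```python
def metrics_from_rates(sensitivity, specificity, prevalence):
    """Derive PPV, NPV, LR+, and LR- from sensitivity, specificity, prevalence."""
    tp = sensitivity * prevalence               # true-positive fraction
    fn = (1.0 - sensitivity) * prevalence       # false-negative fraction
    tn = specificity * (1.0 - prevalence)       # true-negative fraction
    fp = (1.0 - specificity) * (1.0 - prevalence)  # false-positive fraction
    return {
        "ppv": tp / (tp + fp),
        "npv": tn / (tn + fn),
        "lr_pos": sensitivity / (1.0 - specificity),
        "lr_neg": (1.0 - sensitivity) / specificity,
    }


# Best model: blood and radiomics, excluding volume-related features.
best = metrics_from_rates(0.81, 0.73, 30 / 106)
```

Up to rounding of the inputs, the results agree with the reported PPV of 0.54, NPV of 0.91, LR+ of 2.97, and LR− of 0.26.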
Blood-based models
The first model was only based on the conventional blood markers LDH and S100 and achieved an AUC of 0.71 (sensitivity 0.69, specificity 0.67). Higher values of LDH and S100 indicated a low chance of PP.
Volume-based models
As a second step, two models based solely on lesion volume (metastatic size) were generated. First, we created a model correlating PP with the lesion volume at 3 months (TP1). This prediction model achieved an AUC of 0.72 (sensitivity 0.76, specificity 0.60). Smaller lesions were more likely to be PPs. The second model was based on the relative (%) difference (delta volume) between the lesion volume at 3 months (TP1) compared with baseline (TP0) and achieved an AUC of 0.80 (sensitivity 0.81, specificity 0.67). A large increase in lesion volume at TP1 was associated with a reduced likelihood of PP.
Radiomics-based models
Four types of radiomics-based prediction models were generated, based on either single-timepoint radiomics (TP1) or multi-timepoint delta radiomics (relative difference between TP1 and TP0), with inclusion or exclusion of volume-related features. The best performing single-timepoint radiomics models achieved an AUC of 0.69 (CT, sensitivity 0.71, specificity 0.60) and 0.68 (PET, sensitivity 0.48, specificity 0.80), both performing worse than either the blood-based or the volume-based models. The delta radiomics models performed better and achieved an AUC of 0.79 (including volume-related features; sensitivity 0.81, specificity 0.67) based on the features mc-volume and fractal dimension and an AUC of 0.78 (excluding volume-related features; sensitivity 0.89, specificity 0.53) based on the feature CT coarseness. Both an increase in fractal dimension and a decrease in coarseness indicate a change from a uniform/homogeneous to a nonuniform/heterogeneous texture; lesions with such changes were more likely to be true progressions.
Combined blood- and volume-based model
The combined blood- and volume-based prediction model, based on lesion volume, LDH, and S100, achieved an AUC of 0.79 (sensitivity 0.80, specificity 0.67). It performed similarly to the delta-volume model (AUC 0.80) and the delta radiomics models (AUC 0.79/0.78), and better than the single-timepoint radiomics models (AUC 0.69/0.68).
Combined blood- and radiomics-based model
A combined blood- and radiomics-based prediction model (including volume-related features) achieved an AUC of 0.78 (sensitivity 0.84, specificity 0.67). The best performing model was, however, a combination of blood- and radiomics-based prediction excluding volume-related features, which achieved an AUC of 0.82 (sensitivity 0.81, specificity 0.73). This prediction model was based on the LDH level at TP1 and the relative change of CT coarseness between TP1 and TP0. Larger values of LDH and a larger decrease in CT coarseness indicated a lower chance of PP.
Distribution of PP and TPD lesions within patients
The 30 lesions with PP were distributed across 20 patients with ≥1 PP lesion (median 1, range 1–5). The 76 lesions with TPD were distributed across a total of 27 patients. An overlap of 11 patients presented with mixed PP and TPD lesions (Fig. 4).
Patient-level analysis
Median follow-up was 22 (range 14–32.5) months. For the whole cohort of 112 patients, median OS, PFS, and iPFS were “not reached,” 6, and 16 months, respectively. Two-year OS, PFS, and iPFS were 69%, 24%, and 42%, respectively. Supplementary Table S2 summarizes the outcome of the landmark analysis for all groups. Additional versions of Fig. 4 and Supplementary Table S2 without the use of the landmark analysis are provided in Supplementary Fig. S3 and Supplementary Table S3.
In the landmark analysis, PP-only patients had a significantly longer median OS than TPD-only patients (30 versus 10 months; P = 0.002, FWER = 0.01), with a 2-year OS of 100% versus 15%. Their iPFS was also longer (“not reached” versus 7 months), although this difference was not statistically significant after correction (P = 0.014, FWER = 0.058). Importantly, PP-only patients did not differ significantly in OS (P = 0.934), PFS (P = 0.500), or iPFS (P = 0.557) from all other patients, who had only responding or stable lesions at TP1. Patients with mixed PP and TPD had a shorter median OS than PP-only patients (25 versus 30 months; P = 0.127) but a longer median OS than TPD-only patients (25 versus 10 months; P = 0.058, FWER = 0.174). These differences were, however, not statistically significant at an FWER threshold of 0.05.
As the difference between the TPD-only and mixed PP and TPD cohorts was quite striking, although both presented with truly progressive lesions, we compared the overall tumor burden of both cohorts at all three timepoints. The mean baseline (TP0) tumor burden was higher in the mixed PP and TPD group (121 ± 177 cc) compared with the TPD-only group (86 ± 90 cc). This reversed at TP1, where the TPD-only group presented with a higher mean tumor burden of 280 ± 417 cc, compared with 133 ± 175 cc in the mixed PP and TPD group, a pattern that was confirmed at TP2 with 383 ± 542 cc (TPD-only) and 153 ± 188 cc (mixed PP and TPD). These differences were, however, not statistically significant (P = 0.639, P = 0.786, and P = 0.415 for TP0, TP1, and TP2, respectively).
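The FWER values above are P values adjusted for multiple comparisons. This section does not name the correction procedure, so the sketch below uses the Holm step-down method purely as an assumed example of how a family-wise adjustment works. The function name and the input P values are hypothetical and do not reproduce the study's calculation.

```python
def holm_adjust(p_values):
    """Holm step-down adjusted P values, returned in the input order."""
    m = len(p_values)
    # Process the raw P values from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, i in enumerate(order):
        # The smallest P value is multiplied by m, the next by m - 1, and so on.
        candidate = min(1.0, (m - rank) * p_values[i])
        # Adjusted values must not decrease along the sorted order.
        running_max = max(running_max, candidate)
        adjusted[i] = running_max
    return adjusted


# Hypothetical raw P values for three group comparisons (not study data).
raw = [0.01, 0.04, 0.03]
adjusted = holm_adjust(raw)
significant = [p <= 0.05 for p in adjusted]
```

A comparison is then called significant only if its adjusted value stays at or below the chosen family-wise threshold, here 0.05.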
Discussion
Incidence of PP
The incidence of PP has been described to be 4%–10% in most studies of patients with metastatic melanoma treated with ICI (3, 38–41). In our analysis, 8% of patients presented with PP only. Our study is, however, the first to include a lesion-individual analysis of PP, showing an overall rate of 4.7%. It is especially noteworthy that the 30 PP lesions in our analysis constitute almost one third (28.3%) of the 106 progressive lesions at TP1 and that the 9 PP-only patients constitute almost one third of patients with PD at TP1. This suggests that, although PP might be a rare phenomenon overall, it appears to be quite common among patients with progressive disease.
The improved outcome of PP-only patients, with a 2-year OS of 100%, was comparable with that of patients who did not have any progressive lesion at TP1 and is in line with the improved OS of patients with PP in other studies (15, 39). TPD-only patients had a significantly shorter OS, a shorter iPFS, and a drastically lower 2-year OS of only 15%, indicating the need for an early change in treatment strategy. Patients with mixed PP and TPD had a survival that was worse than that of PP-only patients but better than that of TPD-only patients, although both the mixed and the TPD-only groups had lesions with true progression and presented with comparable rates of new lesions at TP1 and TP2. This suggests that a lesion-individual response assessment might provide advantages for patient stratification. In addition, we assessed whether the overall tumor burden could be responsible for the difference between the two groups. Interestingly, the baseline tumor burden was actually higher in the mixed PP and TPD group compared with the TPD-only group, suggesting that overall tumor burden may not be the only critical factor.
An explanation might be that mixed PP and TPD patients could have had a slowly developing and/or resolving PP of lesions that were still classified as PD at 6 months. Some studies described continued benefit from immunotherapy, despite confirmed progression according to iRECIST criteria (40–43). It could therefore be possible that some lesions that were defined as confirmed progression in our analysis may respond much later during treatment, as these delayed tumor response dynamics have been described for both patients with melanoma and NSCLC by Nishino and colleagues (43, 44) and Hodi and colleagues (39).
In addition, Kong and colleagues described potentially prolonged residual disease following anti–PD-1 treatment with metabolically inactive lesions on PET/CT imaging (45). This indicates that even iRECIST/irRC have limitations and that it might be necessary to further increase confirmation follow-up time, include additional imaging information (e.g., radiomics), analyze response more deeply on a lesion-individual level, and finally to search for additional biomarkers.
Radiomics
The potential of radiomics to predict treatment response and outcome has been recognized for many tumor types and imaging modalities (46, 47). Tang and colleagues demonstrated that certain radiomic features (low CT intensity/high heterogeneity) predict infiltrating CD3+ lymphocytes and PD-L1 expression, associated with outcome in patients with non–small cell lung cancer (NSCLC; ref. 48). As mentioned earlier, Sun and colleagues could predict ICI responders by baseline CT radiomic signatures (29). Trebeschi and colleagues also used baseline CT radiomics for response prediction of patients with metastatic melanoma and NSCLC treated with ICI (49). While their model worked well in patients with NSCLC, it performed poorly in patients with melanoma, which the authors attributed to the large variety of first-line treatments prior to ICI. In our analysis, approximately half of the patients received first-line anti–PD-1 antibodies, while the other half was mainly pretreated with anti–CTLA-4 antibodies, representing a much more homogeneous patient cohort. In addition, our analysis included a 50% larger number of 716 analyzed lesions (112 patients) compared with 483 lesions (80 patients) and an additional third timepoint at 6 months.
A potential explanation for why radiomic signatures are able to predict response or differentiate between PP and TPD lies in the previously described biological differences between lesions with true progression and lesions with PP (22). In the case of TPD, a lesion could consist mainly of tumor cells. For PP, the picture could be much more heterogeneous on a cellular level, with a plethora of different involved immune cells, such as dendritic cells, cytotoxic CD8+ T cells, macrophages, natural killer cells, and others. This could potentially also lead to differences in radiomic features on CT and PET imaging. Trebeschi and colleagues reported that lesions with a greater morphologic heterogeneity were more likely to respond to immunotherapy (49). Our analysis shows that lesions with an increase in fractal dimension and lesions with a decrease in coarseness are more likely to be true progressions. Both indicate a change from a homogeneous to a heterogeneous texture. It might therefore be necessary to assess not only the heterogeneity of a lesion at a single timepoint but also the relative change of heterogeneity between different timepoints (delta radiomics) to adequately capture the dynamic biological processes of immune cell infiltration into the tumor. Together, this could provide a basis for the detection of differences in an in-depth radiomics analysis, although these results will need to be confirmed in future prospective clinical trials. Interestingly, the solely volume-based models showed a slightly higher performance compared with the solely radiomics-based models in our analysis. As response assessment is already based on tumor volume, volume-based prediction models may therefore provide a simple and practical approach.
Biomarkers
Surprisingly, biomarkers for ICI response assessment have not evolved substantially over the past decade. PD-L1 expression is one of the few clinically established biomarkers for predicting the response to ICI. However, some studies have also presented contradictory results, as both PD-L1–positive and -negative patients were shown to benefit from ICI (50, 51). Tumor-infiltrating lymphocytes (TIL) and neoantigens are other promising biomarkers, and a higher number of TILs or neoantigens was associated with better outcomes in a variety of cancers (11, 51–53). Serum biomarkers provide multiple advantages over pathology-based biomarkers, as they can be obtained noninvasively and repeatedly throughout the treatment course (54). ctDNA was used to discriminate PP from TPD in 2 patients with lung adenocarcinoma treated with ICI (55). Yoshimura and colleagues described a decrease in serum cytokeratin 19 fragment levels in the setting of PP under ICI (56). Lee and colleagues performed an analysis in 125 patients with metastatic melanoma treated with ICI, differentiating patients with PP from TPD via favorable ctDNA and unfavorable ctDNA profiles (15). One-year OS was significantly higher in patients with favorable ctDNA (82%) compared with patients with unfavorable ctDNA (39%). Their numbers and percentages of patients with TPD were almost identical to our cohort, and also a third (31%) of their patients with TPD at TP1 had PP. TMB has also been assessed as a predictive biomarker, although two recent studies have shown that TMB was not associated with the efficacy of pembrolizumab in patients with NSCLC (57, 58). Other advanced biomarkers are being investigated, including high-dimensional single-cell mass cytometry, which was shown to be able to predict the response to anti–PD-1 immunotherapy in patients with metastatic melanoma (59).
In our analysis, the solely blood-based model was the worst performing of all model classes with an AUC of 0.71, but blood markers in conjunction with either tumor volume or radiomics led to the best performing models, suggesting that multimodality models may represent the best approach.
Multimodality
It has been shown that combined multimodality approaches for outcome prediction could perform better than single-modality clinical, radiomics, or genomic data (60). We did not identify a study focusing on the noninvasive differentiation of PP and true progression via radiomics and routine blood markers. The radiomics models of Sun and colleagues and Trebeschi and colleagues were solely based on CT imaging without PET imaging, and neither included a multi-timepoint delta radiomics analysis nor a combined multimodality model.
The predictive performance of FDG-PET/CT and blood markers for general response assessment in melanoma has only been compared in the setting of chemotherapy, surgery, or radiotherapy. Aukema and colleagues reported a modest 50% positive predictive value of S100 for recurrent disease and recommended FDG-PET/CT for confirmation of recurrences (61). Wieder and colleagues concluded that the diagnostic accuracy and prognostic power of PET/CT is superior to S100 (62). Strobel and colleagues reported that S100 alone was not suitable for response assessment in 37% of patients, because S100 values were normal prior and after treatment, whereas PET/CT was suitable for all patients (63). Together, these results suggest that PET/CT imaging and the routine blood markers LDH and S100 could be combined for improved response assessment, but their use as a combined biomarker in immunotherapy has not been examined systematically.
We could show that the combination of delta radiomics and LDH was the best performing model, performing better than all other model classes. The multimodality model indicated that high levels of LDH and lesions with a large decrease in CT coarseness have a low chance of being a PP.
Limitations
There are some limitations of our study, including the retrospective character of the analysis and the overall limited number of PP lesions and patients in the individual groups (PP-only, TPD-only, PP and TPD).
No external validation of our model has been performed. The most straightforward way to select and test the optimal model is to split the dataset into three parts: the training set, the validation set, and the test set. The training set is used to fit the models to the data, the validation set is used to select the optimal model, and the test set is used to estimate the expected predictive performance in an independent dataset, that is, generalization performance. This is a recommended approach in large datasets. However, in small datasets, the way the dataset is split can significantly affect the performance scores in both validation and testing, contributing to a high variance of the performance estimate. A common approach to reduce this variance is cross-validation. In cross-validation, the process of splitting the data into training and validation parts is repeated multiple times, reducing variance by averaging the performance scores (64). However, it has been shown by Stone (65) that cross-validation can lead to optimistically biased performance estimates when it is used for both model selection and predictive performance assessment (64). A solution to this problem was proposed by Varma and Simon (66), who showed that nesting two cross-validation loops, where the inner loop selects the optimal model and the outer loop performs the model assessment, mitigates this issue, resulting in an estimation of the true generalization performance with very low bias. In small datasets, nested cross-validation is therefore superior to a training/validation/test split or a single cross-validation, as it provides an almost unbiased estimate of true generalization performance with low variance (67).
However, the small sample size may have prevented the multivariate models from reaching full capacity. A larger patient cohort could therefore lead to an improved performance of the multivariate models. Another limitation is the heterogeneity of the PET/CT acquisition and reconstruction techniques, which is a consequence of the retrospective nature of our study. Despite these limitations, our study is by far the largest lesion-level analysis of patients with metastatic melanoma treated with ICI and the only one to include three separate timepoints, blood biomarkers, delta radiomics, and PET/CT imaging.
In conclusion, this study reports the first multimodality radiomics analysis using FDG-PET/CT imaging in the setting of immune checkpoint inhibition. It is also the first to include a multiple-timepoint delta radiomics analysis in a multimodality prediction model in conjunction with the blood markers LDH and S100. Noninvasive PET/CT-based radiomics and LDH/S100 are promising biomarkers for differentiation of PP from true progression at an early timepoint of 3 months, which might help reduce added toxicity or delayed treatment switch in patients with metastatic melanoma treated with ICIs.
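The two nested loops described above can be sketched in plain Python. This is a simplified stand-in and not the study's pipeline: each candidate "model" is a single feature whose direction is learned on the training rows, the fold counts are arbitrary, and all function names and the toy data are assumptions made for illustration.

```python
import random
from statistics import mean


def auc(scores, labels):
    """Rank-based AUC: chance that a positive case scores above a negative one."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def stratified_folds(rows, labels, k, rng):
    """Split row indices into k folds, spreading each class across all folds."""
    folds = [[] for _ in range(k)]
    for cls in (0, 1):
        members = [i for i in rows if labels[i] == cls]
        rng.shuffle(members)
        for position, i in enumerate(members):
            folds[position % k].append(i)
    return folds


def evaluate(X, y, train, test, feature):
    """Fit the feature direction on train rows, return the AUC on test rows."""
    raw = auc([X[i][feature] for i in train], [y[i] for i in train])
    sign = 1.0 if raw >= 0.5 else -1.0  # the only "fitted" parameter
    return auc([sign * X[i][feature] for i in test], [y[i] for i in test])


def cv_auc(X, y, folds, feature):
    """Mean validation AUC of one candidate model over the given folds."""
    rows = [i for fold in folds for i in fold]
    results = []
    for fold in folds:
        held_out = set(fold)
        train = [i for i in rows if i not in held_out]
        results.append(evaluate(X, y, train, fold, feature))
    return mean(results)


def nested_cv_auc(X, y, n_features, outer_k=5, inner_k=4, seed=0):
    """Outer loop estimates performance; inner loop selects the model."""
    rng = random.Random(seed)
    all_rows = list(range(len(y)))
    outer_scores = []
    for test in stratified_folds(all_rows, y, outer_k, rng):
        held_out = set(test)
        train = [i for i in all_rows if i not in held_out]
        # Inner loop: model selection sees the outer-training rows only.
        inner_folds = stratified_folds(train, y, inner_k, rng)
        best = max(range(n_features), key=lambda f: cv_auc(X, y, inner_folds, f))
        # Outer loop: assess the selected model on rows it has never seen.
        outer_scores.append(evaluate(X, y, train, test, best))
    return mean(outer_scores)


# Toy data (not study data): feature 0 separates the classes by construction,
# feature 1 is an arbitrary pattern unrelated to how the labels were built.
y = [i % 2 for i in range(40)]
X = [[10 * y[i] + (i * 7) % 5, (i * 13) % 11] for i in range(40)]
estimate = nested_cv_auc(X, y, n_features=2)
```

The key design point is that the outer test rows never enter the inner selection step, so the outer score is not inflated by the model choice. The toy data are trivially separable, so the estimate here is far higher than anything expected on real lesions.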
Disclosure of Potential Conflicts of Interest
M. W. Huellner reports receiving commercial research grants—to him and his institution—from GE Healthcare; reports receiving speakers bureau honoraria from GE Healthcare; and reports receiving other remuneration in the form of grants from Alfred and Annemarie von Sick and the Artificial Intelligence in Oncological Imaging Network. R. Dummer is a paid consultant or advisory board member for Novartis, Merck Sharp & Dohme, Bristol-Myers Squibb, Roche, Amgen, Takeda, Pierre Fabre, Sun Pharma, Sanofi, Catalym, Second Genome, Regeneron, and Alligator. No potential conflicts of interest were disclosed by the other authors.
Authors' Contributions
Conception and design: L. Basler, S.A. Hogan, M. Bogowicz, S. Tanadini-Lang, R. Förster, R. Dummer, M. Guckenberger, M.P. Levesque
Development of methodology: L. Basler, M. Bogowicz, S. Tanadini-Lang, R. Förster, R. Dummer, M. Guckenberger, M.P. Levesque
Acquisition of data (provided animals, acquired and managed patients, provided facilities, etc.): L. Basler, S.A. Hogan, M. Pavic, R. Dummer, M. Guckenberger, M.P. Levesque
Analysis and interpretation of data (e.g., statistical analysis, biostatistics, computational analysis): L. Basler, H.S. Gabryś, S.A. Hogan, S. Tanadini-Lang, R. Dummer, M. Guckenberger, M.P. Levesque
Writing, review, and/or revision of the manuscript: L. Basler, H.S. Gabryś, M. Pavic, M. Bogowicz, D. Vuong, S. Tanadini-Lang, R. Förster, K. Kudura, M.W. Huellner, R. Dummer, M. Guckenberger, M.P. Levesque
Administrative, technical, or material support (i.e., reporting or organizing data, constructing databases): L. Basler, H.S. Gabryś, S.A. Hogan, M.W. Huellner, R. Dummer, M. Guckenberger, M.P. Levesque
Study supervision: R. Dummer, M. Guckenberger, M.P. Levesque
Acknowledgments
This work was supported by Cancer Research Center, Comprehensive Center Zurich, University Hospital Zurich (CRC_13, to L. Basler); Swiss National Fund (SNF 310030_170159, to H.S. Gabryś); and European Training Network MELGEN funded consortium (no. 641458, to S.A. Hogan).
The costs of publication of this article were defrayed in part by the payment of page charges. This article must therefore be hereby marked advertisement in accordance with 18 U.S.C. Section 1734 solely to indicate this fact.