Purpose: With the approval of immunotherapies for a variety of indications, methods to assess treatment benefit addressing the response patterns observed are important. We evaluated RECIST criteria–based overall response rate (ORR) and progression-free survival (PFS) as potential surrogate endpoints of overall survival (OS), and explored a modified definition of PFS by altering the threshold percentage determining disease progression to assess the association with survival benefit in immunotherapy trials.
Experimental Design: Thirteen randomized, multicenter, active-control trials containing immunotherapeutic agents submitted to the FDA were analyzed. Associations between treatment effects of ORR, PFS, modified PFS, and OS were evaluated at individual and trial levels. Patient-level responder analysis was performed for PFS and OS.
Results: The coefficient of determination (R²) measured the strength of associations, where values near 1 imply surrogacy and values close to 0 suggest no association. At the trial level, the association between hazard ratios (HR) of PFS and OS was R² = 0.1303, and between the odds ratio (OR) of ORR and HR of OS was R² = 0.1277. At the individual level, the Spearman rank correlation coefficient between PFS and OS was 0.61. Trial-level associations between modified PFS and OS ranged between 0.07 and 0.1, and individual-level correlations were approximately 0.6. HRs of PFS and OS for responders versus nonresponders were 0.129 [95% confidence interval (CI), 0.11–0.15] and 0.118 (95% CI, 0.11–0.13), respectively.
Conclusions: Although responders exhibited longer survival and PFS than nonresponders, the trial-level and individual-level associations were weak between PFS/ORR and OS. Modifications to PFS did not improve associations. Clin Cancer Res; 24(10); 2268–75. ©2018 AACR.
See related commentary by Korn and Freidlin, p. 2239
In recent times, many immunotherapies, in particular checkpoint inhibitors, have demonstrated superior overall survival (OS) compared with existing standard of care. Future clinical trials (CT) may take much longer to complete in order to demonstrate incremental improvement in survival, and OS may not be a feasible endpoint for the evaluation of new drug products seeking marketing approval. This research evaluated whether intermediate endpoints such as overall response rate, progression-free survival (PFS), or modified PFS can replace OS as a surrogate endpoint in determining the treatment effect of immunotherapies in randomized CTs. The retrospective analyses of past CTs suggest that at the individual level, a patient whose tumor responds or whose time to disease progression is longer lives longer. However, the analyses also suggest that in a randomized CT where an improvement in tumor response or improvement in PFS over existing standard of care is seen, the chance of demonstrating an improvement in OS is small.
Immunotherapies, which utilize and enhance the distinctive powers of the immune system to fight cancer, are proving to be the most promising cancer treatments today. Although every cancer type is unique, immunotherapies have demonstrated various levels of clinical benefit across different cancer types. As a class, immunotherapies encompass several different types of treatments, such as checkpoint inhibitors and cancer therapeutic vaccines. Each of these treatments differs in its mechanism of action, and immune checkpoint inhibitors in particular represent a strong future for cancer treatment.
One of the first immunotherapies, ipilimumab, approved by the FDA in 2011, was the first checkpoint inhibitor to block the cytotoxic T-lymphocyte–associated antigen 4 (CTLA-4) pathway and was shown to extend survival among patients with advanced melanoma. Pembrolizumab and nivolumab, both checkpoint inhibitors that block the interaction between PD-1 and its ligands PD-L1 and PD-L2, received accelerated approval (1, 2) from the FDA in the last quarter of 2014 based on overall response rate (ORR). Similarly, atezolizumab was approved in 2016 under accelerated approval regulation (3) based on ORR for both urothelial carcinoma and non–small cell lung cancer (NSCLC). In addition, avelumab and durvalumab, both PD-L1 agents, received accelerated approval in 2017 based on ORR for the treatment of urothelial carcinoma. Avelumab is also indicated for Merkel cell carcinoma. The ultimate goal of these agents is to prolong the life span of patients with cancer; however, some issues arise in considering the traditional efficacy endpoints for evaluating these drugs.
In evaluating efficacy, overall survival (OS) provides one of the most direct measures of true clinical benefit. OS, defined as the time from random assignment to the date of death from any cause, is a precise, objectively measured, and easy-to-interpret clinical outcome, and has been considered the most reliable and clinically meaningful endpoint for evaluating drug efficacy in oncology trials. However, due to the long-term survival advantages observed in patients treated with immunotherapies, OS may not be a realistic primary endpoint for future trials. In addition, OS may be confounded by subsequent treatments, such as a crossover to immunotherapy after disease progression. It is due to these limitations of the OS endpoint and the intent to provide patients with access to promising agents as early as possible that surrogate endpoints such as ORR and progression-free survival (PFS) have been useful in evaluating new treatments. These endpoints are primarily defined by changes in tumor burden and are assessable earlier, require shorter study durations and follow-up time, and may not need the larger sample sizes required in studies with OS as the primary endpoint. In addition, in single-arm trials, ORR is the only endpoint that can be meaningfully interpreted.
The most commonly used criteria to define progression and tumor response to treatment are RECIST (4, 5). However, the RECIST criteria were developed based on experience with cytotoxic agents, and it has been observed that the mechanisms of action of immunotherapeutic agents are markedly different from cytotoxic drugs. There is a need to evaluate if the RECIST criteria are appropriate to define ORR and PFS in patients receiving immunotherapies.
Atypical response patterns in patients receiving immunotherapies have led to challenges in evaluating clinical benefit using the traditional RECIST criteria. Patients may exhibit delay in response, immediate reduction to target tumor burden accompanied by presence of a new lesion, or even an initial increase in tumor burden (tumor flare or pseudo-progression) followed by a decrease in tumor burden. At the initial stage of treatment, a pseudo-progression may be observed, leading to discontinuation of treatment despite the potential of a delayed clinical response. To address the challenges observed in evaluating disease progression based on a RECIST-defined PFS endpoint, we evaluated ORR, PFS, and a modified RECIST criteria–defined PFS endpoint as potential surrogate endpoints for OS.
Several methods have been proposed in the literature by Prentice (6) and Buyse and colleagues (7–9) to validate surrogate endpoints (10–12). The associations can be characterized using a simple Spearman method when censoring is not an issue or evaluated using multitrial joint modeling methods, such as a copula bivariate survival model (13). In part 1 of our analysis, we present evaluations of RECIST criteria–based ORR and PFS as surrogate endpoints for OS using the individual-level and trial-level surrogacy described by Oba and colleagues (14) irrespective of site and histology of disease. Furthermore, using these same methods, we explore modified RECIST criteria for PFS as a surrogate endpoint of OS in part 2.
Materials and Methods
We conducted a meta-analysis using the data from immunotherapy trials of anti–PD-1/PD-L1 agents submitted to the FDA as either initial or supplemental biological license applications for marketing approval across multiple indications between 2014 and 2016. For our analysis, we selected randomized, multicenter, active-controlled trials that were designed with head-to-head comparisons or add-on designs. Tumor assessments were made using RECIST v1.1 criteria for patients in the intent-to-treat population (as randomized) who had measurable lesions at baseline and at least one post-baseline visit.
The outcome measures considered in this analysis were OS, PFS, and ORR. OS was defined as the time since randomization to death. OS was censored at the last follow-up date for patients who were alive at the time of data cutoff. ORR was defined as the proportion of patients who achieved a complete or partial response as their best overall response based on RECIST v1.1 criteria. Patients with nonevaluable or unknown response status were considered nonresponders. PFS was defined as the time since randomization to progression or death, whichever occurred first. The censoring rules for PFS followed the censoring rules described in the FDA guidance on clinical trials endpoints (15). Disease progression (PD) was determined using criteria as described by RECIST v1.1, where a significant increase (≥20%) in size of target lesions as compared with nadir with an absolute increase in size of at least 5 mm, development of new lesion(s), or unequivocal progression in an existing nontarget tumor was defined as progression.
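The progression rule described above can be written as a small predicate. The following Python sketch is illustrative only and is not the code used in the analysis; the function name, argument names, and the handling of a zero nadir are assumptions introduced here.

```python
def is_progression(sum_mm, nadir_mm, new_lesion=False, nontarget_pd=False,
                   pct_threshold=20.0, abs_threshold_mm=5.0):
    """Illustrative RECIST v1.1-style progression check for one assessment.

    sum_mm       -- current sum of target-lesion diameters (mm)
    nadir_mm     -- smallest sum recorded so far, including baseline (mm)
    new_lesion   -- True if a new lesion was recorded at this assessment
    nontarget_pd -- True if unequivocal nontarget progression was recorded
    """
    # New lesions or unequivocal nontarget progression count on their own.
    if new_lesion or nontarget_pd:
        return True

    increase_mm = sum_mm - nadir_mm
    if nadir_mm <= 0:
        # Assumption: with a zero nadir the relative increase is unbounded,
        # so only the absolute criterion is checked.
        return increase_mm >= abs_threshold_mm

    pct_increase = 100.0 * increase_mm / nadir_mm
    # Both the relative (>=20%) and absolute (>=5 mm) criteria must hold.
    return pct_increase >= pct_threshold and increase_mm >= abs_threshold_mm


# Hypothetical assessments (values invented for illustration).
print(is_progression(60, 50))                   # 20% and 10 mm increase
print(is_progression(24, 20))                   # 20% but only 4 mm increase
print(is_progression(50, 50, new_lesion=True))  # new lesion alone
```

Keeping the percentage as a parameter makes it possible to reuse the same check with a threshold other than 20%.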
In part 1 of the analysis, data and results based on ORR, PFS, and OS as submitted by the applicant were used. In part 2 of the analysis, a consistent approach was implemented using investigator-evaluated tumor assessments to define progression using the standard RECIST v1.1 criteria across all trials. The threshold percentage for defining PD due to target lesions, based on percent increase in sum of the longest diameters of all target lesions from nadir, was modified to range from the RECIST definition of 20% up to 45% in 5% increments. In this modified PFS analysis, the date of unequivocal progression of nontarget tumors and existence of new lesions were utilized as recorded in the database to indicate progression.
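As a rough illustration of the modified thresholds, the Python sketch below finds the first visit at which a hypothetical patient's target lesions meet each cutoff from 20% to 45%. It covers only the target-lesion component; the function name, the example measurements, and the retention of the 5 mm absolute-increase requirement at every threshold are assumptions made here, not details confirmed by the analysis.

```python
# Thresholds explored in part 2: 20% (RECIST) up to 45% in 5% increments.
THRESHOLDS = list(range(20, 50, 5))


def first_progression_visit(target_sums_mm, pct_threshold, abs_threshold_mm=5.0):
    """Return the index of the first post-baseline visit meeting target-lesion
    progression at the given threshold, or None if it is never met.

    target_sums_mm[0] is the baseline sum of longest diameters (mm).
    """
    nadir = target_sums_mm[0]
    for visit, current in enumerate(target_sums_mm[1:], start=1):
        increase = current - nadir
        if (nadir > 0
                and 100.0 * increase / nadir >= pct_threshold
                and increase >= abs_threshold_mm):
            return visit
        # The nadir is the smallest sum seen so far, including baseline.
        nadir = min(nadir, current)
    return None


# Hypothetical patient followed beyond initial progression (values invented).
sums = [50, 40, 49, 54, 60]
visits = {t: first_progression_visit(sums, t) for t in THRESHOLDS}
for threshold, visit in visits.items():
    print(f"PFS_{threshold}: first target-lesion progression at visit {visit}")
```

In this invented series the 20% cutoff is met earlier than the higher cutoffs, which are reached only because assessments continued after the initial progression.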
Surrogacy evaluations were performed at the trial and individual levels. The Spearman rank correlation coefficient was used to assess individual-level associations between PFS and OS considering the individual patient data from all the clinical trials. A patient-level responder analysis was performed to compare PFS and OS between responders and nonresponders irrespective of the treatment assignment using the pooled data set. In this analysis, the hazard ratios (HR) of PFS and OS were calculated using the Cox proportional hazards models stratified by study. Of note, this responder analysis was a nonrandomized comparison between the treatment groups among the responders.
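For readers unfamiliar with the individual-level measure, the following Python sketch computes a Spearman rank correlation between paired times using only the standard library. It is a simplified illustration: it treats every time as fully observed and does not handle censoring, and the function names and example values are hypothetical.

```python
def average_ranks(values):
    """Rank values from 1..n, assigning tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over a run of tied values.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical PFS and OS times in months for five patients (invented values).
pfs_months = [2.1, 5.6, 3.3, 11.0, 7.4]
os_months = [6.0, 9.5, 14.2, 20.1, 12.3]
rho = spearman(pfs_months, os_months)
print(f"Spearman rho = {rho:.2f}")
```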
For trial-level analyses, the associations between treatment effect based on ORR, PFS, and OS were evaluated using a weighted linear regression model with weights equal to the sample size of each randomized comparison. The treatment effects based on PFS and OS were measured by the log of HR estimated using an unstratified Cox proportional hazards model with treatment as the covariate. The odds ratio (OR) was considered when assessing the relative effect of ORR. The coefficient of determination (R²) value computed from the weighted linear regression model was used to measure the strength of association between PFS/ORR and OS. These associations are presented graphically using bubble plots. The trial-level association defined at the population level evaluates the ability of a surrogate endpoint to predict the effect of treatment on the true endpoint. An R² value close to 1 suggests a perfect surrogate endpoint, and an R² equal to 0 suggests no association between the surrogate endpoint and the true benefit endpoint. Because achieving R² = 1 is nearly impossible, we prospectively chose R² = 0.80 as the cutoff value to establish PFS and ORR as validated surrogate endpoints for OS in this specific context. A surrogate endpoint is considered to be validated for use in phase III clinical trials when a strong association is demonstrated at both individual and trial levels.
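The trial-level calculation can be sketched as a sample-size-weighted least-squares fit of log HR for OS on log HR for PFS. The Python example below is a minimal illustration under that reading of the method; the function name and all input values are hypothetical and do not reproduce the trial data.

```python
import math


def weighted_r2(x, y, w):
    """Weighted least-squares fit of y on x (with intercept).

    Returns (slope, intercept, r2), where r2 is the weighted coefficient
    of determination.
    """
    sw = sum(w)
    mx = sum(wi * xi for wi, xi in zip(w, x)) / sw
    my = sum(wi * yi for wi, yi in zip(w, y)) / sw
    sxx = sum(wi * (xi - mx) ** 2 for wi, xi in zip(w, x))
    syy = sum(wi * (yi - my) ** 2 for wi, yi in zip(w, y))
    sxy = sum(wi * (xi - mx) * (yi - my) for wi, xi, yi in zip(w, x, y))
    slope = sxy / sxx
    intercept = my - slope * mx
    r2 = sxy ** 2 / (sxx * syy)
    return slope, intercept, r2


# Hypothetical randomized comparisons (all values invented for illustration).
pfs_hr = [0.45, 0.60, 0.75, 0.90]
os_hr = [0.70, 0.55, 0.80, 0.65]
n_patients = [400, 300, 500, 250]

log_pfs = [math.log(h) for h in pfs_hr]
log_os = [math.log(h) for h in os_hr]
_, _, example_r2 = weighted_r2(log_pfs, log_os, n_patients)

R2_CUTOFF = 0.80  # prespecified cutoff stated in the text
print(f"R^2 = {example_r2:.4f}; meets cutoff: {example_r2 >= R2_CUTOFF}")
```

For the ORR analysis the same function could be applied with the log OR of ORR as the predictor, assuming the OR is also analyzed on a logarithmic scale.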
We identified 13 randomized clinical trials consisting of 6,722 patient records. Of note, there were four trials with three treatment arms (two experimental arms and one control arm). For the purpose of this analysis, these three-arm trials were treated as two separate trials each, duplicating the control arm in both trials. As a result, there were 17 randomized comparisons available for analysis from the 13 clinical trials selected. These trials included both head-to-head (treatment A vs. treatment B) comparisons, where an experimental monotherapy (treatment A) is compared directly against an active control (treatment B), and add-on (treatment A vs. treatment A + B) treatment comparisons, where an experimental agent is added to a standard-of-care background.
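The count of 17 comparisons follows from splitting each three-arm trial into two comparisons that share a control arm. The Python sketch below illustrates this bookkeeping with placeholder trial identifiers and arm labels, which are hypothetical.

```python
def expand_comparisons(trials):
    """Return one (trial_id, experimental_arm, control_arm) tuple per
    experimental arm, reusing the control arm for multi-arm trials."""
    comparisons = []
    for trial in trials:
        for arm in trial["experimental_arms"]:
            comparisons.append((trial["id"], arm, trial["control_arm"]))
    return comparisons


# Nine two-arm trials and four three-arm trials (placeholder identifiers).
trials = [
    {"id": f"T{i:02d}", "experimental_arms": ["exp"], "control_arm": "ctl"}
    for i in range(1, 10)
]
trials += [
    {"id": f"T{i:02d}", "experimental_arms": ["exp_a", "exp_b"], "control_arm": "ctl"}
    for i in range(10, 14)
]

comparisons = expand_comparisons(trials)
print(len(trials), "trials ->", len(comparisons), "randomized comparisons")
```

Because the two comparisons from a three-arm trial reuse the same control patients, they are not statistically independent of each other.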
Table 1 summarizes the trials included in the analyses (16–28). For each study, the disease indication, primary or coprimary and secondary endpoints, total number of randomized patients and corresponding randomization allocation ratio, unstratified HR of OS, PFS as defined by RECIST criteria v1.1, unstratified OR for ORR, and response rates in the experimental versus control arm are listed.
Diseases represented in these trials include melanoma (6), NSCLC (5; 1 for squamous and 4 for nonsquamous), renal cell carcinoma (1), and head and neck cancer (1). In two of these trials, patients in the control arm were allowed to cross over to the experimental arm upon documented progression. The HRs ranged between 0.42 and 0.86 for OS and 0.41 and 0.93 for PFS, and the minimum OR for ORR was 1.67.
Part 1: Responder analysis between PFS and OS
Based on the pooled analysis of 6,722 patients, the ORR was 32% in the experimental groups and 12.7% in the control group. Figure 1A and B present the Kaplan–Meier plots for OS and PFS, respectively, for the responders and nonresponders in the experimental and control arms. The responders had longer survival times compared with nonresponders [HR = 0.14; 95% confidence interval (CI), 0.12–0.16], irrespective of treatment group. Similarly, the responders had longer PFS (HR = 0.12; 95% CI, 0.11–0.14).
Part 1: Individual-level and trial-level association results
The individual-level association based on rank correlation between PFS and OS was not strong (ρ = 0.61). The trial-level association between PFS and OS is shown in Fig. 2A. In this figure, each dot in the plot represents a single trial, with the size of each dot proportional to the size of the trial and color of the dot referring to the disease indication. The line represents the regression line fitted using the HR of PFS as a predictor of OS HR on a logarithmic scale. The trial-level R² value computed using the weighted linear regression model was 0.1303, demonstrating a weak association between PFS and OS. The trial-level R² value measuring the association between ORR and OS was only 0.1277, representing a weak association between ORR and OS (Fig. 2B). The OR for ORR in one study was an outlier (13.56; Table 1), and excluding this from the analysis resulted in an R² value of 0.1213. This suggests that, although responders lived longer without disease progression and death, neither PFS nor ORR is a good surrogate endpoint for OS.
Part 2: Modified PFS analysis
Using data from the 13 randomized trials, the typical disease progression assessment based on the standard RECIST v1.1 criteria in the treatment and control arms is summarized in Table 2. The primary component of tumor progression was further characterized based on target, nontarget, new lesion, or progression based on new and nontarget lesions together at the same assessment. A majority of the progression events (52%) were due to a significant increase (≥20% per RECIST) in target tumor lesions.
By altering the threshold for progression from a 20% increase in sum of longest diameters in the target lesion from nadir to an increase of 25%, 30%, 35%, 40%, and 45%, we explored whether the higher burden threshold was more relevant in evaluating immunotherapeutic agents and as such, potentially better associated with OS. Table 3 presents the number of PFS events recalculated using the standard 20% cutoff as well as each modified threshold for target lesions. The first row in Table 3 contains the total PFS events from the pooled data. Beneath, results are further broken down into PFS events by study.
Due to the nature of this modification, the impact on assessment of PD and the number of PFS events decreased as the threshold values increased. The PFS events calculated using a higher threshold value were a subset of the PFS events calculated using a lower threshold value, because the criteria for PD due to target tumors were defined to meet a “minimum” percentage increase in the sum of target lesions. Thus, the PD defined using the 45% cutoff will also be included in the PD based on 20%, 25%, 30%, 35%, and 40%. The differences that were observed in number of PFS events by varying the cutoff values were due solely to the target tumor response. For example, when applying the 20% cutoff criteria (PFS_20), almost 52% of the progressions were based on target tumors, as shown in Table 2. When the cutoff value was increased to 25% (PFS_25), only 50% of progressions were due to increase in target tumor burden. For n = 128 patients, the target lesion increase in tumor size was ≥20% but <25%; thus, they were omitted when computing PFS_25. It is important to note that many of these trials allowed for treatment beyond initial RECIST-defined progression; thus, there is the potential for a patient whose initial value was between the 20% and 25% cut points to appear as a progression at a later date if his or her tumor burden increased to that higher threshold value.
Part 2: Individual-level association results between PFS and OS
The rank correlations between the observed values of PFS at different thresholds for progression and survival times of each patient were calculated to assess the individual-level associations and are presented in Fig. 3. A correlation value close to 0.6 suggests that the associations between each of the modified PFS endpoints and OS were not strong at the individual level, and the consistent correlation value suggests that the different thresholds had no impact on the associations. It is unclear what impact, if any, the nontarget and new lesion progressions played in contributing to the lack of association between PFS and OS; however, >45% of progressions were based on these types of response.
Part 2: Trial-level association results between PFS and OS
The trial-level association between OS and the traditional PFS endpoint, derived using investigator-evaluated tumor assessments to define progression based on standard RECIST criteria, was R² = 0.0559. Figure 3 shows the trial-level associations between each of the modified PFS and OS. The R² values are low, ranging between 0.07 and 0.1 for each of the different cutoff values considered. Similar to the results observed with the RECIST-defined PFS, the modified PFS did not result in an improved surrogate endpoint for OS.
Intermediate endpoints based on tumor measurement, such as ORR and PFS, have been routinely used to evaluate new therapies in oncology trials and have served as the endpoints likely to predict clinical benefit for the consideration of accelerated approval of drugs. In recent immunotherapy studies, such as in nonsquamous NSCLC (23), despite a weak-to-moderate PFS treatment effect, significant improvement in the OS treatment effect was observed. In cases such as this, promising products may not have been approved if PFS was the primary endpoint of the study evaluating the product. Given the differing mechanisms of action of immunotherapy products compared with standard chemotherapy agents, the question arises whether the definitions of disease response and progression using standard RECIST v1.1 criteria are appropriate for evaluating this new and promising class of products. In trying to answer this question, we first evaluated whether ORR and PFS as currently defined using RECIST v1.1 were reasonable surrogate endpoints for OS, and second, in order to improve the degree of surrogacy, we used modified criteria for progression by imposing more stringent criteria.
We investigated if there were strong associations between PFS and OS, and ORR and OS based on 13 trials with 17 randomized comparisons. Based on the responder analysis, it was observed that responders were associated with longer survival and PFS durations compared with nonresponders. The analyses showed that the surrogacy evaluations at the trial level resulted in weak associations between the RECIST criteria–based endpoints of PFS and ORR with respect to OS.
Taking into account that there may be delayed response preceded by actual disease progression, modifying the burden level at which we determine progression could potentially address this issue. Even with higher burden levels, the surrogacy evaluations both at the trial and individual levels resulted in weak associations between the modified PFS and OS, similar to the RECIST-defined PFS. Thus, using a more stringent criterion to define PFS did not result in a better association between PFS and OS. Of note, the endpoints based on tumor response may not always capture the drug's effect on OS and hence may result in weak associations. Demonstrating stronger associations between the surrogate endpoints (ORR and PFS) and OS at the individual level is a necessary but not sufficient condition to conclude that ORR and PFS can replace OS in assessing the clinical benefit in immuno-oncology trials.
The challenges noted in using RECIST to evaluate immunotherapies have not gone unnoticed in the oncology community. There have been multiple attempts to modify the RECIST algorithm, from the Immune-Related Response Criteria (irRC 2009; ref. 29) to immune-related RECIST (irRECIST 2013–2014; refs. 30, 31), and finally the most current Immune RECIST (iRECIST 2017; ref. 32) guidelines. Many developers have also defined their own set of modifications (33). Most of these efforts have centered on how to include new lesions either in the tumor burden totals or as an element of progression, whether to use unidimensional or bidimensional measurements of target lesions, and how to proceed after conventional progression is documented, that is, confirmation of PD or continuation until confirmation can be achieved. Creating a consensus on how to evaluate objective response and progression in immunotherapy trials in a consistent and effective manner is needed.
There are a few limitations noted in this analysis. First, the evaluations were based solely on the anti–PD-1/PD-L1 agents submitted to the FDA for approval, and hence all trials considered were positive and no negative trials were available to assess their impact on surrogacy. Second, in some studies, the appearance of new lesions in the patients treated with immunotherapeutic agents was no longer considered as unequivocal progression, and hence the therapy was continued for a longer duration and patients were followed past initial progression. However, the control arms were not followed for similar durations, resulting in a difference in follow-up durations between the treatment and control arms. On the other hand, given the issues in assessing progression using standard RECIST criteria, an extended duration of immunotherapy treatments has facilitated the exploration of the treatment effects of these agents. Adherence to this new model of data collection will give an opportunity to make potential changes to RECIST, which was not possible before due to the nonavailability of assessments.
We also acknowledge that we have pooled all diseases together due to the small number of studies within each disease. The degree of association between PFS and OS may vary among different diseases. In the exploratory analysis, when assessed separately, associations between PFS and OS in patients with melanoma were weaker than associations in patients with NSCLC (at PFS_45 R² = 0.08 in melanoma studies vs. R² = 0.59 in NSCLC studies), although sample sizes among specific disease types are small for these methods and both values are still below any that would indicate surrogacy-level association (34). This hypothesis of differences in degree of association among the diseases needs further exploration with the accumulation of more data and experience.
The lack of association could be due to a subgroup of patients benefitting from the therapy. However, at this time, a well-defined subgroup of patients based on molecular biomarkers or clinical features who are more likely to respond has not been identified. In addition, the experience and length of follow-up in these studies are limited, as some of the trial level results included in this study are based on interim analyses. Extended follow-up with enhanced experience in multiple products can provide further exploration of intermediate endpoints as surrogate endpoints in the future.
Given the uncertainty that surrounds the use of PFS as an endpoint to assess survival benefit (35, 36) and the weak associations of PFS, ORR, and modified PFS endpoints with OS, as shown in this analysis, there is a need to further explore novel intermediate endpoints (37) that would better serve as a surrogate endpoint for OS in assessing the clinical benefit of immunotherapeutic agents and accelerate the drug approval process.
Disclosure of Potential Conflicts of Interest
No potential conflicts of interest were disclosed.
Authors' Contributions
Conception and design: S.L. Mushti, F. Mulkey, R. Sridhara
Development of methodology: S.L. Mushti, R. Sridhara
Acquisition of data (provided animals, acquired and managed patients, provided facilities, etc.): S.L. Mushti, R. Sridhara
Analysis and interpretation of data (e.g., statistical analysis, biostatistics, computational analysis): S.L. Mushti, F. Mulkey, R. Sridhara
Writing, review, and/or revision of the manuscript: S.L. Mushti, F. Mulkey, R. Sridhara
Administrative, technical, or material support (i.e., reporting or organizing data, constructing databases): S.L. Mushti, F. Mulkey, R. Sridhara
Study supervision: S.L. Mushti, R. Sridhara