Abstract
Randomized Phase II oncology trial endpoints for decision making include both progression-free survival (PFS) and change in tumor burden as measured by the sum of longest diameters (SLD) of the target lesions. In addition to observed SLD changes, tumor shrinkage and growth parameters can be estimated from the patient-specific SLD profile over time. The ability of these SLD analyses to identify an active drug is contrasted with that of a PFS analysis through the simulation of Phase II trials via resampling from each of 6 large, Phase II and III trials, 5 of which were positive and one negative. From each simulated Phase II trial, a P value was obtained from 4 analyses—a log-rank test on PFS, a Wilcoxon rank-sum test on the minimum observed percentage change from baseline in SLD, and 2 nonlinear, mixed-effects model analyses of the SLD profiles. All 4 analyses led to approximately uniformly distributed P values in the negative trial. The PFS analysis was the best or nearly the best analysis in the other 5 trials. In only one of the positive studies did the modeling analysis outperform the analysis of the minimum SLD. In conclusion, for the decision to start a Phase III trial based on the results of a randomized Phase II trial of an oncology drug, PFS appears to be a better endpoint than does SLD, whether analyzed through simple SLD endpoints, such as the minimum percentage change from baseline, or through the modeling of the SLD time course to estimate tumor dynamics. Clin Cancer Res; 19(2); 314–9. ©2012 AACR.
Introduction
There are many possible designs of a Phase II oncology trial (1), and randomized Phase II trials are often recommended (2–5). In such randomized trials, patients' tumors are assessed through radiographic imaging, and the Response Evaluation Criteria in Solid Tumors (RECIST; 6) are applied to determine tumor burden, tumor response, and progression-free survival (PFS), defined as the time to the earlier of progressive disease or death. Even though overall survival is the gold standard for demonstration of efficacy in Phase III, it is infrequently the primary endpoint in Phase II owing to large sample size and long follow-up requirements (7). Rather, a decision to continue to Phase III with the experimental therapy typically relies on the comparison of response rates and PFS between the randomized treatment arms (7, 8).
There have been recent suggestions that the percentage change from baseline in tumor burden, as measured by the sum of longest diameters (SLD) of the target lesions, can be used for the assessment of comparative efficacy in randomized trials (9–18). The proposals fall into 2 categories. One (9, 10) suggests use of the percentage change from baseline in SLD, denoted as SLD%, as a better endpoint than response rate, which is essentially a dichotomized minimum SLD%. The other category (11–18) models SLD as a function of time. Buyse and colleagues (11) suggest that “a model that uses all tumor size measurements for each patient may be preferable to a model that uses PFS, given that the latter design makes less efficient use of these data.” Modeling of SLD has also been used to
predict overall survival as a function of early SLD% and other variables, which is then used to predict Phase III outcomes (12–14),
assess the response of tumors to time-varying dose levels of an experimental drug (15), and
evaluate the relative efficacy of study treatments through the comparison of fitted tumor growth parameters between treatment arms (16–18).
Tumor burden analyses have been compared with PFS analyses in their ability to lead to the correct decision about the initiation of a Phase III trial. Through the simulation of Phase II studies from 6 large completed trials, it was shown (19) that the analysis of PFS in a randomized Phase II trial generally leads to better decisions about starting a Phase III trial than does the comparison between treatment groups of simple, per-patient tumor burden endpoints, such as the minimum or the last SLD%. However, in the simulation of Phase II studies from 1 positive Phase III trial (20), an endpoint equivalent to SLD% at the first postbaseline assessment performed better than PFS in leading to the correct decision to continue to Phase III, but at the cost of a larger false-positive rate for a negative trial.
Modeling of the SLD time course in a Phase II trial might lead to better decisions than the analysis of these simple SLD endpoints, with perhaps even a consistent advantage over PFS. The present manuscript evaluates this possibility.
Materials and Methods
Phase II trials were simulated from the 6 completed trials listed in Table 1 (19, 21–26). Tumors were assessed at intervals of 6, 8, 9, or 12 weeks, depending on the study. Progressive disease was determined using RECIST criteria for the bevacizumab and erlotinib studies and using World Health Organization (WHO) criteria (27) for the capecitabine trial and modified WHO for the trastuzumab trial. The sum of the longest diameters across the target lesions was determined for the bevacizumab and erlotinib studies and across the marker lesions for the trastuzumab and capecitabine studies. All studies but AVF2119g were positive for PFS, and all studies were positive or nearly so for overall response rate. All studies but AVF2119g and AVF2192g were positive for overall survival. Study AVF2119g did not lead to the successful registration of the drug, and is a negative study for the present analysis.
Study^a | Indication | Treatments | Tumor assessment frequency | Response rates (%): control, experimental | PFS HR |
---|---|---|---|---|---|
AVF2107g (21), n = 813/750 | First line CRC | IFL +/− bevacizumab | Every 6 weeks for 24 weeks, then every 12 weeks | 34.8, 44.8; P = 0.004 | 0.54; P < 0.0001 |
AVF2119g (22), n = 462/412 | Second line BC | Capecitabine +/− bevacizumab | Every 6 weeks for 24 weeks, then every 9 weeks | 9.1, 19.8; P = 0.001 | 0.98; P = 0.86 |
AVF2192g (23), n = 209/189 | First line CRC | 5-FU/LV +/− bevacizumab | Every 8 weeks | 15.2, 26.0; P = 0.055 | 0.50; P = 0.0002 |
SO14999 (24), n = 511/477 | BC, 33% first line, otherwise later lines | Docetaxel +/− capecitabine | Every 6 weeks for 48 weeks, then every 12 weeks | 30, 42; P = 0.006 | 0.65^b; P = 0.0001 |
BR21 (25), n = 731/525 | Second line NSCLC | Erlotinib vs. placebo | Every 8 weeks | <1, 8.9; P < 0.001 | 0.61; P < 0.001 |
H0648g (26), n = 469/361 | First line HER2+ BC | AC +/− trastuzumab; pac +/− trastuzumab | An 8-week assessment, then every 12 weeks | 32, 50; P < 0.001 | 0.51^b; P < 0.001 |
Abbreviations: CRC, colorectal cancer; BC, breast cancer; IFL, irinotecan, 5-FU, leucovorin; LV, leucovorin; AC, anthracycline, cyclophosphamide; pac, paclitaxel.
^a n = the number of patients randomized in the study over the number of patients remaining for the analysis after data processing. For study AVF2107g, n = 813 excludes 110 patients randomized to a 5-FU/LV + bevacizumab arm that was dropped from the study. Other patient exclusion reasons are described in the text.
^b HR for analysis of time to progression.
Patients were included in the analyses if they had a baseline tumor assessment, received at least 1 dose of study medication, and had at least 1 postbaseline tumor assessment. The effect of excluding patients with no postbaseline tumor assessment is addressed in the discussion. While there is large variability in the size, enrollment duration, and patient follow-up in published randomized Phase II trials, the simulated Phase II trials here had intermediate characteristics of 100 patients enrolled uniformly over 1 year, with 50 of these patients selected at random from the control arm of the parent study and 50 from the experimental arm. Each simulated trial included all SLD data available on the selected patients through 6 months after the last patient's enrollment time. If a patient's PFS in a simulated trial was after this 6-month cutoff, then the value was censored at the time of this cutoff.
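The resampling scheme described above can be sketched as follows. This is a minimal illustration, not the author's code: the patient record layout (`pfs_days`, `event`, `sld`), the function name, sampling without replacement, and the use of 182.5 days for "6 months" are all assumptions made for the example.

```python
import random

def simulate_phase2(control, experimental, n_per_arm=50,
                    accrual_days=365.0, followup_days=182.5, rng=None):
    """Draw one simulated Phase II trial from a parent study.

    Each patient is a dict (hypothetical layout) with
      'pfs_days': time from enrollment to progression, death, or censoring
      'event'   : True if progression or death was observed
      'sld'     : list of (days_from_enrollment, sld_cm) tuples
    """
    rng = rng or random.Random()
    trial = []
    for arm, pool in ((0, control), (1, experimental)):
        # Assumption: patients are drawn without replacement within each arm.
        for patient in rng.sample(pool, n_per_arm):
            # Copy the record and assign a uniform enrollment time.
            trial.append({**patient, "arm": arm,
                          "enroll": rng.uniform(0.0, accrual_days)})
    # Data cutoff: fixed follow-up after the last patient's enrollment.
    cutoff = max(p["enroll"] for p in trial) + followup_days
    for p in trial:
        window = cutoff - p["enroll"]  # follow-up available to this patient
        if p["pfs_days"] > window:     # PFS beyond the cutoff is censored
            p["pfs_days"], p["event"] = window, False
        # Keep only tumor assessments made on or before the cutoff.
        p["sld"] = [(t, s) for t, s in p["sld"] if t <= window]
    return trial
```

Because each selected record is copied before it is modified, the parent-study pools are left unchanged and can be reused across replicates.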
The comparison of PFS between treatment groups within each replicate was via a log-rank test. The minimum SLD% was compared between treatments with a Wilcoxon rank-sum test. The SLD modeling comparisons followed from a nonlinear, mixed-effects model of SLD in cm (12, 16):
$$\mathrm{SLD}_{ijk} = A_{ij}\left\{\exp(-S_{ij} t_{ijk}) + \exp(G_{ij} t_{ijk}) - 1\right\} + e_{ijk}, \qquad [1]$$
where
$i$ = treatment group, either 0 for control or 1 for experimental,
$j$ = patient within treatment group,
$k$ = observation within patient $j$ and treatment group $i$,
$A_{ij}$ = baseline SLD (cm) for patient $ij$,
$t_{ijk}$ = time of the $k$th tumor assessment for patient $ij$, with $t_{ij1} = 0$,
$S_{ij}$ = a shrinkage parameter (1/time) for patient $ij$,
$G_{ij}$ = a growth parameter (1/time) for patient $ij$,
$e_{ijk}$ = a random error (cm).
Distributional assumptions are as follows: $\log(A_{ij})$, $\log(S_{ij})$, and $\log(G_{ij})$ are independent between patients and distributed as multivariate normal with means $\log(\alpha)$, $\log(\xi_i)$, and $\log(\gamma_i)$, respectively, and with an arbitrary variance-covariance matrix, except that the covariance between $\log(A_{ij})$ and each of $\log(S_{ij})$ and $\log(G_{ij})$ is assumed to be 0. This 0 covariance setting is consistent with the findings that SLD% is roughly independent of baseline SLD (19) and that the estimated shrinkage and growth parameters are independent of baseline SLD (18). The $e_{ijk}$ are independent and identically distributed as normal, with 0 mean and common variance. Parameter estimation was conducted with Monolix 3.2 (28, 29).
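To make the shape of Model [1] concrete, the sketch below evaluates its mean curve and draws one patient profile. It is illustrative only: the function names, the standard deviations, and the simplification that the log-scale shrinkage and growth parameters are drawn independently (the model in the text allows them to be correlated) are assumptions of this example. The nadir formula follows from setting the derivative of the mean curve to zero and applies only when the shrinkage parameter exceeds the growth parameter.

```python
import math
import random

def sld_mean(a, s, g, t):
    """Mean SLD (cm) at time t under Model [1]: a * (exp(-s t) + exp(g t) - 1)."""
    return a * (math.exp(-s * t) + math.exp(g * t) - 1.0)

def nadir_time(s, g):
    """Time at which the mean curve is smallest; requires s > g > 0."""
    return math.log(s / g) / (s + g)

def simulate_profile(times, alpha, xi, gamma,
                     sd_log=(0.5, 0.5, 0.5), sd_err=0.3, rng=None):
    """Draw one patient's SLD values at the given assessment times.

    Assumption: log A, log S, and log G are sampled independently here,
    which is a special case of the covariance structure in the text.
    """
    rng = rng or random.Random()
    a = math.exp(rng.gauss(math.log(alpha), sd_log[0]))
    s = math.exp(rng.gauss(math.log(xi), sd_log[1]))
    g = math.exp(rng.gauss(math.log(gamma), sd_log[2]))
    # Additive normal error on the cm scale, as in Model [1].
    return [sld_mean(a, s, g, t) + rng.gauss(0.0, sd_err) for t in times]
```

At $t = 0$ the two exponentials sum to 2, so the mean curve equals the baseline SLD, as the definition of $A_{ij}$ requires.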
One test for a treatment effect in this model is a Wald test, with 2 degrees of freedom, of $H_0$: $\xi_0 = \xi_1$ and $\gamma_0 = \gamma_1$. A treatment that resulted in generally smaller SLD values after baseline should be better detected with a test with 1 degree of freedom, and for this the following test was used. The shrinkage terms were constrained to be equal between treatment arms, and the difference between arms in growth parameters was tested with a Wald test. Other SLD-based treatment arm comparisons conducted but not presented are described in the discussion.
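The arithmetic of the two Wald tests can be sketched as below. The estimated differences and their (co)variances would come from the fitted mixed-effects model; the function names and the idea of testing differences on the log scale are assumptions of this example, not details given in the text.

```python
import math

def wald_test_1df(diff, std_error):
    """Two-sided Wald test of H0: diff = 0; returns (chi-square statistic, P value)."""
    z = diff / std_error
    # erfc(|z| / sqrt(2)) equals 2 * (1 - Phi(|z|)) for a standard normal.
    return z * z, math.erfc(abs(z) / math.sqrt(2.0))

def wald_test_2df(d, v):
    """Joint Wald test of H0: d = (0, 0).

    d = (d1, d2) are the estimated differences (e.g., shrinkage and growth);
    v = ((v11, v12), (v12, v22)) is their estimated covariance matrix.
    """
    (v11, v12), (_, v22) = v
    det = v11 * v22 - v12 * v12
    # Quadratic form d' V^{-1} d written out for the 2 x 2 case.
    w = (d[0] * d[0] * v22 - 2.0 * d[0] * d[1] * v12 + d[1] * d[1] * v11) / det
    # The chi-square survival function with 2 degrees of freedom is exp(-w / 2).
    return w, math.exp(-w / 2.0)
```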
Because nonlinear, mixed-effects modeling is time consuming per simulated trial, the assessment was limited to 100 simulated Phase II trials per parent study. For the visual assessment of the treatment comparisons via the SLD analyses and the PFS analysis, the empirical cumulative distribution function (CDF) of the 100 2-sided P values was plotted for each analysis method and each parent clinical trial. The empirical distribution function provides the percentage of P value results less than or equal to any given value.
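The empirical CDF of the replicate P values can be computed in a few lines. The sketch below uses a hypothetical helper name and made-up P values purely for illustration.

```python
import bisect

def ecdf(p_values):
    """Return a function F with F(x) = fraction of p_values <= x."""
    xs = sorted(p_values)
    n = len(xs)

    def F(x):
        # bisect_right counts how many sorted values are <= x.
        return bisect.bisect_right(xs, x) / n

    return F

# Illustrative values only, not results from the paper.
F = ecdf([0.01, 0.20, 0.50, 0.04])
share_below_005 = F(0.05)  # fraction of replicates with P <= 0.05
```

Evaluated at a chosen P value cutoff, this fraction estimates the proportion of simulated trials that would be declared positive by that analysis.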
Results
Figure 1 plots the observed SLD versus the model-predicted SLD for 1 Phase II replicate per parent trial. The fit of the model to the data looks satisfactory, and in particular, the choice of an additive error term is supported.
Figure 2 contains the empirical CDFs for the 100 P values from each test for each of the parent studies. For a positive study, a better test for treatment effect has an empirical CDF that rises quickly from the origin, indicating many replicates with small P values, and a poorer test has an empirical CDF closer to the 45° line, which corresponds to a uniform distribution. As an example of reading these curves, approximately 30% of the P values from the minimum SLD% analysis in study AVF2107g were less than or equal to 0.10. Also, approximately 75% of the P values from the PFS analysis in study SO14999 were less than or equal to 0.10. Depending on the P value cutoff chosen for determining a positive study, these graphs can be used to assess the specificity of the analysis method for study AVF2119g and the sensitivity of the method for the other studies.
In the negative AVF2119g study the distribution of the P values from each test is approximately uniform, so that none of the analyses would be expected to lead to greatly increased Type I error rates. However, there was a significant effect of treatment on the response rate in the entire AVF2119g study (Table 1), so the analysis of the minimum SLD% would lead to a slight reduction in specificity for this study. In study AVF2107g, the PFS and modeling analyses performed similarly, with the minimum SLD% performing the worst. Across the other 4 studies, the PFS comparison is either clearly the best (AVF2192g and SO14999) or among the best methods (BR21 and H0648g). Further, in these 4 studies, the minimum SLD% is better than the model-based analyses, although the difference is not great in SO14999.
Discussion
Other tests for treatment effect evaluated, but not presented here, were the following:
A Wald test was conducted of equality between the growth parameters in Model [1] with shrinkage parameters allowed to differ between treatment arms.
Shrinkage and growth term estimates were obtained for each patient in each simulated Phase II trial via a simple nonlinear regression model (16–18) of SLD% on exp(-St) + exp(Gt) − 1. The growth terms were compared between treatment groups with a Wilcoxon rank-sum test.
These 2 tests tended to perform worse than the 2 modeling approaches presented here.
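A per-patient fit of the simple shrinkage-and-growth curve might look like the sketch below. It substitutes a plain grid search for a nonlinear least-squares routine to stay dependency-free, and it assumes the response is the ratio of SLD to baseline (equal to 1 at time 0); the function name, the grid, and the example data are all hypothetical.

```python
import math

def fit_shrink_growth(times, sld_ratio, grid):
    """Grid-search least-squares fit of sld_ratio ~ exp(-S t) + exp(G t) - 1.

    Returns the (S, G) pair from the grid with the smallest squared error.
    """
    best = None
    for s in grid:
        for g in grid:
            sse = sum((y - (math.exp(-s * t) + math.exp(g * t) - 1.0)) ** 2
                      for t, y in zip(times, sld_ratio))
            if best is None or sse < best[0]:
                best = (sse, s, g)
    return best[1], best[2]

# Noise-free example: data generated from grid values are recovered exactly.
grid = [0.005 * k for k in range(1, 41)]
weeks = [0.0, 6.0, 12.0, 18.0, 24.0]
s_true, g_true = grid[9], grid[1]
ratios = [math.exp(-s_true * t) + math.exp(g_true * t) - 1.0 for t in weeks]
s_hat, g_hat = fit_shrink_growth(weeks, ratios, grid)
```

With only a handful of assessments per patient, estimates of this kind can be unstable, which is one motivation for the mixed-effects formulation of Model [1].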
Other versions of Model [1] have been proposed. For example, in ref. 12 the growth term appears linearly, as $G_{ij} t_{ijk}$ instead of $\exp(G_{ij} t_{ijk}) - 1$. Because the linear term is the first-order Taylor series approximation to the exponential term, these 2 models would be expected to perform similarly for data, such as here, where the tumor assessments stop at patient progression.
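The size of the gap between the two growth terms follows from the series expansion, written here for a single patient with the subscripts dropped:

```latex
\exp(Gt) - 1 = Gt + \frac{(Gt)^2}{2} + \frac{(Gt)^3}{6} + \cdots
\quad\Longrightarrow\quad
A\{\exp(Gt) - 1\} - A\,Gt = A\left\{\frac{(Gt)^2}{2} + O\!\left((Gt)^3\right)\right\}.
```

The two versions therefore differ only at second order in $Gt$, and the difference stays small as long as $Gt$ remains small over the period during which tumor assessments are collected.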
It is surprising that the modeling approach is as good as a PFS analysis in study AVF2107g but has power only equal to or slightly greater than a Type I error rate in study AVF2192g. The analysis of SLD would tend to improve with more tumor assessments per patient, but AVF2192g was not an outlier in this regard. Specifically, the mean number of tumor assessments in the simulated trials from AVF2192g was 4.2, between the mean values of 3.0 assessments for BR21 and 5.3 assessments for AVF2107g. From Fig. 1, it does appear that the agreement between actual and predicted SLD is the worst in study AVF2192g. However, variability in tumor assessment would also be expected to affect the quality of PFS. In the end, no reason for the poor performance of the modeling approach in AVF2192g was found.
One possible criticism of the assessment here is that the overall PFS analysis was positive for 5 of the 6 studies evaluated, leading to a bias against the SLD analysis. However, these studies were not selected from among many studies because of the strength of their PFS results, but because of the ready availability of the per-patient tumor assessments. In spite of the positive PFS results, nothing prevented the SLD analyses from performing even better.
The most common reasons for patient exclusion from the analyses here (see Table 1) are being randomized but not treated, having nonmeasurable disease at baseline, and having no PFS value and no postbaseline tumor assessment data. These are patients who would reasonably be excluded from a Phase II study analysis. Another reason for patient exclusion, accounting for approximately 40% of excluded patients, is having a PFS value but no postbaseline tumor assessment data. It is difficult to glean the reasons for these outcomes from the databases, but frequently a new lesion was noted, progressive disease recorded, and target lesion measurements left missing. This highlights the importance, when SLD is a key study endpoint, of obtaining complete lesion assessments through progression. Regardless, a repeat of the PFS versus minimum SLD% analysis as in Fig. 2 with the inclusion of these 40% of excluded patients led to little difference in the results.
Although modeling of tumor burden appears not to be as useful as a PFS analysis for the decision to start a Phase III trial, modeling has been used for other purposes. It would be interesting to know how a corresponding PFS analysis for these uses might fare in comparison. For example, the model-predicted SLD% at 8 weeks was used along with other patient characteristics to predict overall survival (OS) for second-line non–small cell lung carcinoma (NSCLC) patients (12). However, PFS could also be used in a model to predict OS, and an approach with an independent postprogression survival term added to PFS (30) might be competitive. Modeling was used to evaluate dose dependence of the growth parameter in a study where dose reductions were applied in some patients (15). PFS could also have been used to evaluate dose dependence by conducting a proportional hazards regression of PFS with recent dose as a time-dependent covariate.
The estimates of the individual patient growth terms were shown to be related to overall survival in renal cell carcinoma (16) and in breast cancer (18). For the 6 studies evaluated here, PFS and its censoring indicator were jointly stronger predictors of overall survival than were the shrinkage and growth parameter estimates from the simple nonlinear regression of SLD% on exp(-St) + exp(Gt) − 1 (details not presented).
A potential advantage of an SLD analysis is that it can be conducted at an early assessment, leading to a faster decision whether to start Phase III. However, enrollment into Phase II trials takes time, and by the time the early assessment is available in the last patient enrolled, PFS will have been determined for many of the patients enrolled early, largely eliminating this potential advantage (19).
One reason why PFS appears to be a better endpoint than SLD in the evaluation of a Phase II trial may be that a patient can progress because of growth of nontarget lesions or the appearance of new lesions (31). Neither of these latter 2 events would be reflected in changes in the SLD. Thus, PFS makes greater use of the information captured in the serial tumor assessments than does the SLD. There has been a recent suggestion (32) for a “longitudinal rank-based randomized phase II design, ranking a patient's risk of death, differentially weighting (disease progressions) by type and time of (progressive disease), and percentage change in tumor burden.” The assessment of such an approach awaits further details on its implementation.
In conclusion, for the decision to start a Phase III trial based on the results of a randomized Phase II trial of an oncology drug, PFS appears from assessments performed to date to be a better endpoint than SLD, whether SLD is analyzed through simple endpoints such as the minimum SLD% or through the modeling of its time course. It would be useful for these endpoints and analysis approaches to be assessed in further completed trials to achieve a more definitive overall conclusion or else to define those situations where an SLD analysis might be preferred.
Disclosure of Potential Conflicts of Interest
No potential conflicts of interest were disclosed.
Acknowledgments
The author thanks the National Cancer Institute of Canada for permission to use the BR21 data.