Abstract
In recent years, several clinical studies have investigated deintensified treatments in human papillomavirus (HPV)-associated head and neck squamous cell carcinoma. Two large phase III trials, RTOG 1016 and De-ESCALaTE, attempted to reduce toxicity by replacing cisplatin with cetuximab in combination with radiotherapy. Both recently suggested that radiotherapy + cetuximab leads to inferior survival compared with standard therapy (observed HRs of 1.45 and 5 in RTOG 1016 and De-ESCALaTE, respectively), as well as increased rates of locoregional failure. These unexpected results should prompt a careful examination of deintensification trials, both in HPV-associated oropharyngeal cancer and in other contexts. Statistical designs for deintensification studies should be consistent with the study aims of reducing toxicities while maintaining survival nearly identical to the standard of care. We suggest criteria to design future deintensification trials and discuss important operating characteristics, including tradeoffs between power and stringent early stopping rules to reduce the number of patients exposed to inferior treatments. Using retrospective analyses of previous clinical studies, we compared designs with different operating characteristics. As an example, using outcomes data from RTOG 1016 and De-ESCALaTE, we conducted analyses to determine the advantages of (i) stringent futility early-stopping rules and (ii) study designs that leverage both toxicity and efficacy endpoints for interim analyses. We show that increasing the frequency of interim-futility analyses has little impact on power, but the average study duration and number of subjects enrolled before the trial is closed for inferiority can decrease substantially (from 57.8 to 18 months, and from 764 to 645 subjects). Moreover, the number of observed deaths during the study can be reduced by up to 68%.
In recent years, several clinical studies have investigated deintensified treatments in human papillomavirus (HPV)-associated head and neck squamous cell carcinoma (HNSCC). Statistical designs for deintensification studies should be consistent with the study aims of reducing toxicities while maintaining survival nearly identical to the standard of care. We suggest criteria to design future deintensification trials and discuss important operating characteristics, including tradeoffs between power and stringent early stopping rules to reduce the number of patients exposed to inferior treatments. Using retrospective analyses of two large phase III clinical studies in HPV-associated HNSCC, we evaluate designs for future deescalation studies under realistic scenarios, including enrollment rates and survival distributions for the standard of care and the deintensified treatment. These retrospective analyses illustrate how the suggested criteria can be relevant for future studies.
Introduction
The incidence of human papillomavirus (HPV)-associated head and neck squamous cell carcinoma (HNSCC) is increasing rapidly in many high-income countries (1–4). HPV-associated HNSCC cancers are a distinct subtype as compared with HPV-negative HNSCC (5, 6). In most countries, intensity-modulated radiotherapy in combination with cisplatin (RT+CP) is a current standard of care for advanced oropharyngeal squamous cell carcinoma (7). HPV-associated HNSCC affects a younger and healthier cohort of patients compared with tobacco- and alcohol-associated cancers, and treatment with RT+CP has a high success rate in HPV-associated cancers, with 3-year survival rates close to 90% (8, 9). However, the addition of cisplatin to radiotherapy is associated with a substantial increase in acute and late toxicity compared with radiotherapy alone (10). Young patients with HPV-associated HNSCC therefore experience significant morbidity from treatment with RT+CP, which can adversely impact their quality of life for decades (11).
There is general agreement on the need for alternative, possibly deescalated, therapies with reduced toxicity profiles for patients with HPV-associated oropharyngeal cancers, without sacrificing response to treatment relative to RT+CP (12). In recent years, several clinical studies have investigated deintensified treatments in HPV-associated HNSCC, for instance OPTIMA (13), E1308 (14), RTOG 1016 (9), and De-ESCALaTE (15). Results of several more studies (PATHOS, ECOG3311, ADEPT, Quarterback trial, and NCT02048020) are eagerly awaited (12, 16).
Unfortunately, both RTOG 1016 and De-ESCALaTE, two large phase III trials that attempted to reduce toxicity by replacing concurrent RT+CP with the use of the EGFR inhibitor cetuximab in combination with radiotherapy (RT+cetuximab), showed inferior overall survival and progression-free survival (OS and PFS) for RT+cetuximab compared with standard therapy, as well as increased rates of locoregional failure. These results were unexpected and should encourage a reevaluation of ongoing deintensification HNSCC trials enrolling similar populations, as well as statistical designs of future deescalation trials.
The Design of Future Deintensification Studies
Deintensification trials may expose study subjects to inferior and toxic treatments. The reported HRs for OS for cetuximab+RT compared with RT+CP were 1.45 and 5 in RTOG 1016 and De-ESCALaTE, respectively, and neither trial provided evidence that the experimental treatment decreases the rates of acute and late toxicities.
Here, we suggest a few relevant characteristics for future deintensification trial designs and desirable operating characteristics.
(i) The study design should balance a large probability of detecting noninferior treatments (i.e., survival similar to the standard of care) with the aim of minimizing the risk of exposing a large number of patients on the trial and future patients to inferior and/or toxic treatments.
(ii) De-escalation trials can consider the use of both toxicity and efficacy coprimary outcomes to evaluate reductions of adverse events and to test noninferior survival of the deintensified treatment. Explicit early stopping rules for both early evidence of inferior survival and insufficient reduction of toxicities should be included in the study design.
(iii) Early stopping rules for inferior survival should be sufficiently stringent (i.e., terminate inferior treatments during the trial with high probability, >60%) in settings where large OS inferiority margins (i.e., the cutoff to define inferior and noninferior survival) are used. Conservative stopping rules for inferiority and/or insufficient reduction of toxicity may preserve power but can expose a large number of patients, during the study, to toxic and/or inferior treatments.
(iv) Explicit and reproducible early stopping and testing procedures should be provided with the study publication and should be included in the study protocol. This is necessary to set good practices, to define guidelines for future designs, and for informed discussions and comparisons of trial designs.
These criteria need to be considered together with other important factors relevant to clinical trial design, such as the resources available to conduct multiple interim analyses, the variability of treatment effect estimates, the use of appropriate noninferiority margins, and the importance of secondary endpoints. Although beyond the scope of this perspective, these aspects are more comprehensively discussed elsewhere (17–20).
Retrospective Analyses
We use retrospective analyses to evaluate designs for future deescalation studies, with realistic scenarios, including enrollment rates, OS and PFS distributions for the standard of care, and the deintensified treatment. These retrospective analyses illustrate how the above criteria (i–iv) can be relevant for future studies.
The recent De-ESCALaTE study used conservative tests for toxicity (primary endpoint) at interim analyses (with P value thresholds at 0.001 for the null hypothesis of equal toxicity between both arms), and reported a HR between RT+cetuximab and RT+CP of 5.0 for OS (15). No explicit stopping rules for inferior survival have been reported (15). Moreover, the authors reported in the Supplementary Materials and Methods of the study article (15) that there have been several amendments made to the toxicity testing rules, with an extension of the overall sample size of the study due to an unexpectedly high rate of toxicities and a change of statistical testing procedure for toxicity from a Poisson test to a Mann–Whitney U test, although a t test was most recently used in the study article to report results. The publication indicates that the study protocol and analysis plan will be shared (upon request) from January 1st 2020 onwards.
RTOG 1016 was stopped early due to insufficient evidence of noninferior survival. Interestingly, the estimated HR between RT+cetuximab and RT+CP was 1.45 (nearly identical to the specified noninferiority margin in RTOG 1016), and the upper limit of the HR's confidence interval did not cross the original per-protocol stopping boundary, which in retrospect appeared to be too conservative (9). An adjusted post hoc stopping rule was instead applied to terminate the study.
Early Stopping Criteria for Futility
Would more stringent early futility stopping criteria prevent patients from being exposed to inferior treatments in future deescalation trials, while maintaining appropriate power of trial designs? Using outcomes data from RTOG 1016 (sampling from digitized Kaplan–Meier curves extracted with DigitizeIt; ref. 21), we conducted a simulation analysis to determine the advantages of stringent futility stopping rules.
We considered the original design of RTOG 1016 (see Gillison and colleagues, 2019; ref. 9), including the null (inferior OS) and alternative (noninferior OS) hypotheses specified using HRs for OS ($\mathrm{HR}_{\mathrm{OS}}$) equal to 1.45 and 1.00, early stopping rules, observed accrual rate, and the size of 800 randomized patients. By protocol, futility interim analyses (early stopping for the null hypothesis) using PFS started after 2 years and were conducted every 12 months thereafter. Toxicity data are not considered in this initial set of simulations.
We compared this design with an alternative Bayesian design, with identical null and alternative hypotheses ($H_0: \mathrm{HR}_{\mathrm{OS}} = 1.45$ and $H_A: \mathrm{HR}_{\mathrm{OS}} < 1.45$), similar noninferiority (alternative hypothesis) interim monitoring rules (as in RTOG 1016, OS interim and final analyses after 45, 90, 135, and 180 events to stop the study for the alternative hypothesis), but with more stringent early stopping criteria for inferiority (null hypothesis). Specifically, we used a Bayesian Cox model (22) for PFS to estimate the HR between the experimental and control arm. Inferiority analyses (futility early stopping) using PFS, as described below, are repeated every 1, 3, 6, or 12 months.
We considered two summaries to define these stopping rules for inferiority, either the Bayesian posterior probability (i) of $\mathrm{HR}_{\mathrm{PFS}} > 1$ or (ii) of $\mathrm{HR}_{\mathrm{PFS}} > 1.2$. The choice of an appropriate threshold $h$, such as $h = 1$ or $h = 1.2$, depends on the tradeoff between the acceptable survival reduction and the expected improvement in toxicity compared with standard of care. The trial is stopped at interim analyses if the probability of $\mathrm{HR}_{\mathrm{PFS}} > 1$ ($\mathrm{HR}_{\mathrm{PFS}} > 1.2$) exceeds a predefined futility parameter.
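The stopping statistic described above can be illustrated with a minimal sketch. It is not the Bayesian Cox model (22) used in the simulations; instead it assumes a normal approximation to the likelihood of the log HR, with variance 4/d for d observed events under 1:1 randomization, combined with a normal prior on log(HR) (mean zero, unit variance). The function names and the example inputs are hypothetical.

```python
import math
from statistics import NormalDist


def posterior_prob_hr_exceeds(log_hr_hat, n_events, h=1.0,
                              prior_mean=0.0, prior_var=1.0):
    """Approximate posterior probability that HR > h.

    Assumption (not from the article): the log-HR estimate is treated as
    normal with variance 4 / n_events, and the prior on log(HR) is normal.
    """
    like_var = 4.0 / n_events
    # Conjugate normal-normal update on the log-HR scale.
    post_var = 1.0 / (1.0 / prior_var + 1.0 / like_var)
    post_mean = post_var * (prior_mean / prior_var + log_hr_hat / like_var)
    posterior = NormalDist(post_mean, math.sqrt(post_var))
    return 1.0 - posterior.cdf(math.log(h))


def stop_for_inferiority(log_hr_hat, n_events, h, futility_threshold):
    """Stop if the posterior probability of HR > h exceeds the threshold."""
    return posterior_prob_hr_exceeds(log_hr_hat, n_events, h) > futility_threshold


# Hypothetical interim look: estimated HR of 1.45 after 100 PFS events.
p_h1 = posterior_prob_hr_exceeds(math.log(1.45), 100, h=1.0)
p_h12 = posterior_prob_hr_exceeds(math.log(1.45), 100, h=1.2)
```

For the same data, the probability of HR > 1.2 is always smaller than the probability of HR > 1, so the two summaries require different futility parameters to achieve comparable stopping behavior.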
Assuming HRs (for OS and PFS) equal to the noninferiority margin in RTOG 1016 (1.45), we determined (separately for the posterior probability of $\mathrm{HR}_{\mathrm{PFS}} > 1$ and for $\mathrm{HR}_{\mathrm{PFS}} > 1.2$) futility thresholds such that the trial would stop early for inferiority with probability $P_0 = 0.8$ (or 0.9). In practice, the probability $P_0$ (e.g., 0.8 or 0.9) should be selected considering the tradeoff between protecting patients from inferior treatments and maintaining power, and thus may be selected using prior data regarding the expected toxicity reduction associated with the experimental treatment. Larger $P_0$ values will lead to faster termination of inferior experimental therapies but will also have a negative impact on power. If large noninferiority margins are used (for instance, HR = 1.45 as in RTOG 1016), investigators may consider large values of $P_0$ (0.8–0.9), whereas smaller values ($P_0$ = 0.6 or 0.7) may be sufficient if smaller noninferiority margins are applied.
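One way to picture this calibration is by simulation: generate many trials under HR = 1.45, record the largest interim posterior probability of HR > h in each trial, and take the futility threshold as the quantile that is exceeded in a fraction P0 of trials. The sketch below is a simplification under stated assumptions. It uses a normal approximation with independent increments for the score statistic rather than the Bayesian Cox model, and it schedules looks by hypothetical event counts rather than by calendar time. All names and default values are illustrative.

```python
import math
import random
from statistics import NormalDist


def simulate_max_posterior(true_hr, events_at_looks, h, rng, prior_var=1.0):
    """Largest interim posterior probability of HR > h in one simulated trial."""
    theta = math.log(true_hr)
    score, info_prev, best = 0.0, 0.0, 0.0
    for d in events_at_looks:
        info = d / 4.0  # approximate information for log(HR), 1:1 allocation
        d_info = info - info_prev
        # Independent-increments approximation for the score statistic.
        score += rng.gauss(theta * d_info, math.sqrt(d_info))
        info_prev = info
        # Normal prior with mean zero on log(HR); conjugate update.
        post_var = 1.0 / (1.0 / prior_var + info)
        post_mean = post_var * score
        p = 1.0 - NormalDist(post_mean, math.sqrt(post_var)).cdf(math.log(h))
        best = max(best, p)
    return best


def calibrate_threshold(p0, true_hr=1.45, events_at_looks=(30, 60, 90, 120),
                        h=1.0, n_sims=5000, seed=1):
    """Threshold exceeded by the maximum posterior probability in ~p0 of trials."""
    rng = random.Random(seed)
    maxima = sorted(
        simulate_max_posterior(true_hr, events_at_looks, h, rng)
        for _ in range(n_sims)
    )
    return maxima[round((1.0 - p0) * n_sims)]


threshold_80 = calibrate_threshold(0.8)
threshold_90 = calibrate_threshold(0.9)
```

Because a larger P0 requires more simulated trials to cross the boundary, the calibrated threshold for P0 = 0.9 cannot exceed the one for P0 = 0.8, which is the mechanism behind the loss of power at larger P0.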
We used 10,000 simulations of the two (Bayesian and RTOG 1016) designs across four scenarios. In scenario 1, the OS and PFS distributions of the control and experimental arm are identical to the observed distributions in RTOG 1016. In the remaining three scenarios, control OS and PFS distributions are as observed in RTOG 1016, and HRs between both OS and PFS distributions of the control and experimental arms are equal to HR = 1.45, 1.1, or 1. The Supplementary Materials and Methods provide additional details on the Bayesian design, including the control of type I error rates, and the implementation of the simulation study.
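As a rough illustration of how event times can be generated for such scenarios, the sketch below draws times from a digitized survival curve by inverse-transform sampling and imposes a given HR through the proportional-hazards relation S_exp(t) = S_ctrl(t)^HR. The curve coordinates are hypothetical placeholders, not values from RTOG 1016, and the linear interpolation between digitized points is an assumption of this sketch; the Supplementary Materials and Methods describe the actual implementation.

```python
import random

# Hypothetical digitized control-arm curve: (months, survival probability).
KM_CONTROL = [(0, 1.00), (12, 0.96), (24, 0.93), (36, 0.90), (48, 0.87), (60, 0.85)]


def sample_event_time(km_points, hr, rng):
    """Draw one event time in months, or None if the event falls beyond
    the last digitized time point (treated as censored)."""
    times = [t for t, _ in km_points]
    # Proportional hazards: experimental survival is control survival ** hr.
    surv = [s ** hr for _, s in km_points]
    u = rng.random()
    if u <= surv[-1]:
        return None
    # Locate the segment where the curve crosses u and interpolate linearly.
    for i in range(len(times) - 1):
        s0, s1 = surv[i], surv[i + 1]
        if s1 < u <= s0:
            return times[i] + (times[i + 1] - times[i]) * (s0 - u) / (s0 - s1)
    return times[0]  # only reached if the curve does not start at 1


rng = random.Random(2020)
control_times = [sample_event_time(KM_CONTROL, 1.0, rng) for _ in range(20000)]
inferior_times = [sample_event_time(KM_CONTROL, 1.45, rng) for _ in range(20000)]
```

With HR = 1.45, fewer simulated patients remain event-free at the end of the curve than with HR = 1, which is how the inferiority scenarios differ from the noninferiority scenarios in this sketch.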
Table 1 and Supplementary Table S1 report selected operating characteristics from the simulation study. Importantly, while increasing the frequency of interim-futility analyses has little impact on power, the average study duration and number of subjects enrolled before the trial is closed for inferiority can decrease (from 57.8 to 30.5–18 months, and from 764 to 645–690 subjects, scenario 1). The average number of deaths during the study is reduced up to 68% in scenario 1. The simulations also show the importance of selecting a suitable statistic (e.g., the posterior probability of HR > h, with h = 1 as in Table 1 or h = 1.2 as in Supplementary Table S1) for inferiority interim analyses. While these statistics perform similarly well in inferiority scenarios, using h = 1 instead of h = 1.2 reduces the probability of stopping for inferiority significantly when the deintensified therapy is noninferior (scenario 4). Supplementary Table S2 presents additional simulation analyses using different sample sizes, which ensure (approximately) 85% and 90% power. Also in these simulations, increasing the frequency of interim analyses has minimal effects on power (<1% power reduction), but can substantially reduce the study duration when the experimental treatment is inferior.
Futility-interim analyses (Bayesian designa; target proportion of trials that should be stopped early under HR = 1.45) | 80%: every 1 month | 80%: every 3 months | 80%: every 6 months | 80%: every 12 months | 90%: every 1 month | 90%: every 3 months | 90%: every 6 months | 90%: every 12 months | Per-protocol RTOG 1016 design |
---|---|---|---|---|---|---|---|---|---|
Scenario 1: OS and PFS curves as observed in RTOG 1016 | ||||||||||
H0 not rejected, stopped early for inferiority (%)b | 97.3 | 97.2 | 97.1 | 97.1 | 98.8 | 98.8 | 98.8 | 98.8 | 78.8 | |
H0 not rejected, not stopped early (%)c | 2.4 | 2.5 | 2.6 | 2.6 | 1 | 1 | 1 | 1 | 19.9 | |
H0 rejected (type I error, %)d | 0.2 | 0.3 | 0.3 | 0.3 | 0.1 | 0.1 | 0.2 | 0.2 | 1.3 | |
H0 rejected at final analysis (%)e | <0.01 | <0.01 | <0.01 | <0.01 | <0.01 | <0.01 | <0.01 | <0.01 | <0.01 | |
H0 rejected at interim analyses (%)f | 0.2 | 0.3 | 0.3 | 0.3 | 0.1 | 0.1 | 0.2 | 0.2 | 1.3 | |
Average study duration (months)g | 32.1 | 33.4 | 35.1 | 38 | 26.4 | 27.6 | 29 | 31.8 | 57.8 | |
Patients randomized (average) | 579.8 | 599.5 | 622.6 | 662.8 | 499.4 | 520.5 | 545 | 591.1 | 763.9 | |
Deaths on study (average) | 43.6 | 46.6 | 50.3 | 57.4 | 31.2 | 33.6 | 36.7 | 43.1 | 103.1 | |
Scenario 2: OS (and PFS) HR = 1.45 between experimental and control arm | ||||||||||
H0 not rejected, stopped early for inferiority (%)b | 80.8 | 80.5 | 80.2 | 80.2 | 89.8 | 89.9 | 89.8 | 89.8 | 38.7 | |
H0 not rejected, not stopped early (%)c | 18 | 18.2 | 18.5 | 18.5 | 9.5 | 9.3 | 9.3 | 9.3 | 57.9 | |
H0 rejected (type I error)d | 1.2 | 1.3 | 1.3 | 1.4 | 0.7 | 0.8 | 0.9 | 0.9 | 3.4 | |
H0 rejected at final analysis (%)e | 0.4 | 0.4 | 0.5 | 0.4 | 0.2 | 0.2 | 0.2 | 0.2 | 1.3 | |
H0 rejected at interim analysis (%)f | 0.8 | 0.8 | 0.8 | 0.9 | 0.6 | 0.6 | 0.7 | 0.7 | 2.1 | |
Average study duration (months)g | 57.8 | 59.2 | 60.9 | 63 | 44.7 | 45.8 | 47.6 | 50 | 97 | |
Patients randomized (average) | 689.8 | 702.8 | 716.8 | 734.8 | 617.1 | 632 | 652.6 | 680.3 | 763.9 | |
Deaths on study (average) | 89.5 | 92.7 | 96.3 | 101.3 | 66.1 | 68.6 | 72.5 | 78.1 | 103.1 | |
Scenario 3: OS (and PFS) HR = 1.1 between experimental and control arm | ||||||||||
H0 not rejected, stopped early for inferiority (%)b | 13.5 | 12.8 | 12.5 | 12.3 | 25.5 | 24.8 | 24.4 | 23.6 | 0.9 | |
H0 not rejected, not stopped early (%)c | 36.3 | 36.7 | 36.7 | 36.7 | 30.5 | 30.8 | 31.1 | 31.4 | 41.7 | |
H0 rejected (power)d | 50.2 | 50.5 | 50.8 | 51 | 44 | 44.4 | 44.6 | 45 | 57.4 | |
H0 rejected at final analysis (%)e | 19.5 | 19.5 | 19.6 | 19.6 | 16.9 | 16.9 | 16.8 | 16.9 | 22.9 | |
H0 rejected at interim analysis (%)f | 30.8 | 31 | 31.2 | 31.5 | 27.2 | 27.6 | 27.8 | 28.1 | 34.6 | |
Average study duration (months)g | 119.6 | 120.5 | 121 | 121.6 | 108 | 109.3 | 110.3 | 111.6 | 130.4 | |
Patients randomized (average) | 784.3 | 786.8 | 788.6 | 790.1 | 756.8 | 763.2 | 769.3 | 777.2 | 791.2 | |
Deaths on study (average) | 146.9 | 148.2 | 148.8 | 149.9 | 133 | 135 | 136.4 | 138.7 | 152.8 | |
Scenario 4: Identical OS (and PFS) distributions in both arms, HR = 1 | ||||||||||
H0 not rejected, stopped early for inferiority (%)b | 4.1 | 4 | 3.7 | 3.3 | 9.9 | 9.5 | 9.1 | 8.6 | 0.1 | |
H0 not rejected, not stopped early(%)c | 19.7 | 19.7 | 19.8 | 19.8 | 18.6 | 18.7 | 18.8 | 18.8 | 22.7 | |
H0 rejected (power)d | 76.1 | 76.3 | 76.5 | 76.9 | 71.5 | 71.8 | 72.1 | 72.6 | 77.2 | |
H0 rejected at final analysis (%)e | 21.5 | 21.6 | 21.6 | 21.8 | 20.2 | 20.1 | 20.1 | 20.2 | 21.2 | |
H0 rejected at interim analysis (%)f | 54.6 | 54.7 | 54.8 | 55.1 | 51.3 | 51.7 | 52 | 52.5 | 56 | |
Average study duration (months)g | 118.4 | 118.7 | 119 | 119.4 | 113.4 | 113.8 | 114.3 | 114.9 | 121.9 | |
Patients randomized (average) | 792.8 | 794.1 | 795.2 | 796.7 | 778.8 | 781.8 | 784.8 | 787.9 | 799.6 | |
Deaths on study (average) | 140.9 | 141.3 | 141.6 | 142.2 | 134.9 | 135.5 | 136.2 | 137.2 | 145.3 |
NOTE: The Bayesian design conducts interim analyses for inferiority (using the probability of HR > 1) every 1, 3, 6, or 12 months, starting from months 1, 3, 6, or 12. The far-right column reports operating characteristics for the RTOG 1016 protocol design.
aThe Bayesian design uses a normal prior with mean zero and unit variance.
bPercentage of simulated trials that were stopped early for inferiority.
cPercentage of simulated trials that were not stopped early and declared the experimental treatment inferior to the control arm at final analysis.
dPercentage of simulated trials that declared the experimental treatment noninferior to the control arm at interim or final analyses (sum of e and f).
ePercentage of simulated trials that declared the experimental treatment noninferior to the control arm at final analysis.
fPercentage of simulated trials that declared the experimental treatment noninferior to the control arm at interim analyses.
gStudy duration is defined as the period between trial activation and either time of early stopping or final analyses.
We repeated the simulations using the reported Kaplan–Meier curves of De-ESCALaTE (15). Supplementary Table S3 provides (similar to Table 1) a summary of the operating characteristics of the two (Bayesian and RTOG 1016) designs with outcome data generated from digitized Kaplan–Meier curves of De-ESCALaTE (15). The results are in strong concordance with those in Table 1.
The Use of Efficacy and Toxicity Coprimary Endpoints
We considered the utility of using both efficacy and toxicity endpoints for early stopping (inferior survival or insufficient reduction in toxicity). This is relevant because De-ESCALaTE and RTOG 1016 did not demonstrate reductions in toxicity events. We used a Bayesian model (see Supplementary Materials and Methods) to estimate the average number of adverse events (grade 3–5 events) per patient, $E_E$ and $E_C$, on the experimental and control arm, and we computed the posterior probability of a reduction in toxicities ($E_E - E_C < 0$) at futility-interim analyses. If this probability falls below a predefined threshold, then the trial is stopped for futility. In addition, the study may be stopped early for inferior survival (as previously described) or noninferiority (see Supplementary Materials and Methods). The availability of sufficient information to compare toxicities is a prerequisite for early stopping for noninferiority (see Supplementary Materials and Methods).
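The toxicity monitoring rule can be sketched as follows. The model in the Supplementary Materials and Methods is not reproduced here; this illustration assumes a conjugate Gamma-Poisson model for the number of grade 3–5 events per patient in each arm and estimates the posterior probability of a reduction by Monte Carlo. The prior parameters, function names, and example counts are hypothetical.

```python
import random


def prob_toxicity_reduction(events_exp, n_exp, events_ctl, n_ctl,
                            prior_shape=1.0, prior_rate=1.0,
                            n_draws=20000, seed=0):
    """Monte Carlo estimate of the posterior probability that E_E - E_C < 0.

    Assumption (not from the article): event counts are Poisson with mean
    n * E per arm, and E has a Gamma(prior_shape, prior_rate) prior.
    """
    rng = random.Random(seed)
    reduced = 0
    for _ in range(n_draws):
        # random.gammavariate takes (shape, scale); scale = 1 / posterior rate.
        e_exp = rng.gammavariate(prior_shape + events_exp, 1.0 / (prior_rate + n_exp))
        e_ctl = rng.gammavariate(prior_shape + events_ctl, 1.0 / (prior_rate + n_ctl))
        reduced += e_exp < e_ctl
    return reduced / n_draws


def stop_for_toxicity_futility(prob_reduction, threshold):
    """Stop if the posterior probability of a reduction falls below the threshold."""
    return prob_reduction < threshold


# Hypothetical interim data: 100 patients per arm.
p_halved = prob_toxicity_reduction(150, 100, 300, 100)  # about 50% fewer events
p_equal = prob_toxicity_reduction(300, 100, 300, 100)   # no reduction
```

When the two arms have the same number of events, the posterior probability of a reduction stays near 0.5, so the choice of threshold determines how quickly a treatment without toxicity benefit is dropped.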
We considered all combinations of three efficacy and three toxicity scenarios. The average number of toxicities per patient is set either identical in the control and experimental arm, or reduced by 25% or 50% in the experimental arm (toxicity scenarios 1–3). The control and experimental OS and PFS distributions are either as observed in RTOG 1016 (scenario 1), or the HR (for OS and PFS) between the control and the experimental arms are set equal to HR = 1.45 or 1 (scenarios 2–3).
Figure 1 and Supplementary Figs. S1–S6 show selected operating characteristics from the simulation study. The additional stopping rules for toxicity have only a moderate impact on power (Supplementary Fig. S3 for RTOG 1016 and Supplementary Fig. S6 for De-ESCALaTE) if the deintensified therapy reduces toxicity as intended. In contrast, when the deintensified therapy fails to reduce toxicity, the study is terminated for futility (insufficient reduction of toxicities or inferiority) in >99% of the simulations (Fig. 1). This yields a reduction of up to 54% in the average number of enrollments compared with a Bayesian design without toxicity monitoring (Fig. 1B).
Discussion
In summary, the recently published RTOG 1016 and De-ESCALaTE studies indicate the need for prospective studies before deintensification strategies can be adopted in clinical practice and showed that deintensified experimental treatments in HPV-associated HNSCC can lead to inferior outcomes. With the benefit of hindsight, we do not suggest these impactful studies were flawed. Rather, these studies offer opportunities to evaluate statistical designs for future deintensification trials. Flexible statistical designs, such as Bayesian designs with multiple primary outcomes (23–25), and well-tuned stopping criteria may be beneficial to make clinical studies more consistent with their primary aims, including the control of toxicity events (23) and the control of the potential number of patients that receive an inferior treatment. There is a rich literature on statistical designs that leverage data on multiple endpoints to improve early decisions and minimize the risk of exposing patients to inferior or toxic treatments (23–26).
We used the recently published RTOG 1016 and De-ESCALaTE studies to describe the impact of early stopping parameters in deintensification trials. Evidence of noninferiority and reduced toxicities are captured by established Bayesian models (22, 27). The Bayesian stopping rules are specified using threshold parameters on easy to interpret probabilities of inferiority and toxicity reductions (27). We illustrated the use of simulation studies (such as those summarized in Table 1 and Fig. 1) to choose early stopping parameters and to balance the tradeoff between power and the potential number of patients exposed to a treatment that is inferior or that does not reduce toxicities. Interpretable and well-justified early stopping rules are valuable in the setting of deintensification trials where standard treatment often leads to a favorable outcome in a high proportion of patients. The arguments set forth here could be applied in principle to any deescalation trial or noninferiority trial where overall survival needs to be monitored to expose a minimum number of patients to a potentially inferior treatment.
Disclosure of Potential Conflicts of Interest
L. Trippa is an employee/paid consultant for Galera Therapeutics Inc. J.D. Schoenfeld is an employee/paid consultant for Tilos, LEK, and Catenion, and reports receiving other commercial research support from Merck and Bristol-Myers Squibb. No potential conflicts of interest were disclosed by the other author.
Authors' Contributions
Conception and design: S. Ventz, L. Trippa, J.D. Schoenfeld
Development of methodology: S. Ventz, L. Trippa, J.D. Schoenfeld
Acquisition of data (provided animals, acquired and managed patients, provided facilities, etc.): S. Ventz, L. Trippa
Analysis and interpretation of data (e.g., statistical analysis, biostatistics, computational analysis): S. Ventz, L. Trippa, J.D. Schoenfeld
Writing, review, and/or revision of the manuscript: S. Ventz, L. Trippa, J.D. Schoenfeld
Administrative, technical, or material support (i.e., reporting or organizing data, constructing databases): S. Ventz
Study supervision: S. Ventz, J.D. Schoenfeld