Purpose:

We discuss designs and interpretable metrics of bias and statistical efficiency of “externally controlled” trials (ECT) and compare ECT performance to randomized and single-arm designs.

Experimental Design:

We specify an ECT design that leverages information from real-world data (RWD) and prior clinical trials to reduce bias associated with interstudy variation of the enrolled populations. We then use a collection of clinical studies in glioblastoma (GBM) and RWD from patients treated with the current standard of care to evaluate ECTs. Validation is based on a "leave-one-out" scheme, with iterative selection of a single arm from one of the studies, for which we estimate treatment effects using the remaining studies as external control. This produces interpretable and robust estimates of ECT bias and type I error rates.

Results:

We developed a model-free approach to evaluate ECTs based on collections of clinical trials and RWD. For GBM, we verified that the inflated false positive error rates of standard single-arm trials can be reduced considerably (by up to 30%) by using external control data.

Conclusions:

The use of ECT designs in GBM, with adjustments for the clinical profiles of the enrolled patients, should be preferred to single-arm studies with fixed efficacy thresholds extracted from published results on the current standard of care.

Translational Relevance

The use of existing data (real-world data or prior trials) to design and analyze clinical studies has the potential to accelerate drug development processes and can contribute to rigorous examination of new treatments.

Randomized controlled trials (RCT) have been the gold standard for clinical experimentation since the Medical Research Council trial of streptomycin for tuberculosis in 1948 (1). Randomization is the foundation for many statistical analyses and provides a method for limiting systematic bias related to patient selection and treatment assignment (2). Indeed, many failures in phase III drug development may be attributed to overestimating treatment effects from previous early-stage uncontrolled trials (3). Although RCTs reduce the risk of bias compared with single-arm trials, they tend to require larger sample sizes to achieve the targeted power (4), take longer to complete enrollment, and patients typically have a lower propensity to enroll in an RCT than in a single-arm trial (5–7).

Many methods have been suggested as a compromise between uncontrolled trials and RCTs (8–11). Recently, the availability of data collected from electronic health records (EHR) at scale has increased interest in using real-world data (RWD; ref. 12) as a "synthetic" or "external" control (12–14). In addition, data from prior clinical trials can be integrated into the design and analysis of single-arm trials (15), rather than using a single published estimate of the standard-of-care primary outcome distribution to specify a benchmark. Leveraging RWD and prior clinical trials has the potential to control for known prognostic factors that cause intertrial variability of outcome distributions. This can reduce bias in single-arm studies and, ultimately, could lead to better decision making by sponsors and regulators.

In this article, we illustrate the design and validation of an externally controlled trial (ECT) design to test for therapeutic impact on overall survival (OS) using both RWD and data from prior clinical trials for patients with newly diagnosed glioblastoma (GBM). We compare the ECT design to single-arm trial designs and RCTs and show the benefits and limitations of the ECT approach.

General approach to design and evaluate ECTs

To design an ECT, estimate the sample size for a targeted power, and evaluate relevant operating characteristics, our approach is as follows. First, define the patient population for the ECT (in our case, GBM). Next, identify a set of prognostic factors associated with the outcome of interest. Finally, specify the control therapy, identify available datasets (trials and RWD) for the control treatment, and extract relevant outcomes and patient characteristics.

As described below, to evaluate the ECT design, the control arm of each study is compared (using adjustment methods) to an external control, defined by the remaining available data for patients who received the same control treatment. In these comparisons the treatment effect is zero by construction, which facilitates interpretability and produces summaries of bias and variability for the ECT's treatment effect estimates, as well as estimates of type I error rates.

If the ECT design maintains (approximately) the targeted type I error rate, we can then determine the sample size required for ECT, single-arm trial, and RCT designs for a targeted probability of treatment discovery at a predefined treatment effect.

The binary variable A indicates the assignment of a patient to the experimental treatment, A = 1, or to the control arm, A = 0; Y denotes the outcome. We focus on binary endpoints, such as survival at 12 months from enrollment (OS-12), and expand the discussion to time-to-event outcomes in the Supplementary Material. The vector X indicates a set of pretreatment patient characteristics. We evaluate whether the characteristics X are sufficient to obtain (nearly) unbiased treatment effect estimates.

ECT design

The ECT is a single-arm clinical study that uses the trial data (experimental treatment) and external data (control) to conduct inference on treatment effects. More specifically, for a hypothetical randomized study, we estimate the unknown average treatment effect

$TE = \sum_x [\Pr(Y = 1|A = 1, x) - \Pr(Y = 1|A = 0, x)] \Pr_X(x), \quad (1)$

which is a weighted average of the conditional outcome probabilities, weighted with respect to a distribution $\Pr_X(x)$ of patient characteristics X. Possible definitions of $\Pr_X(x)$ used by existing adjustment methods are the distribution of patient characteristics X in the single-arm study, $\Pr_{SAT}(x)$, or the distribution of X in the external (historical) control, $\Pr_{HC}(x)$. The unknown probabilities $\Pr(Y = 1|A, x)$ do not refer to a particular parametric statistical model but are unknown model-free quantities. To estimate the average treatment effect TE (1), we considered four adjustment methods, all based on the usual assumption of no unmeasured confounders (16): direct standardization, matching, inverse probability weighting, and marginal structural methods (Supplementary Material; refs. 16, 17).

Datasets

To develop an ECT design for newly diagnosed GBM, we used data from patients receiving standard temozolomide in combination with radiation (TMZ+RT) from both prior clinical trials and RWD (Table 1). Clinical trial data were from the phase III AVAglio trial (NCT00943826; ref. 18) and two phase II trials (PMID: 22120301 and NCT00441142; refs. 19, 20). RWD were abstracted from patients undergoing treatment for newly diagnosed GBM at the Dana-Farber Cancer Institute (Boston, MA) and UCLA (Los Angeles, CA), and from a previously published RWD dataset (21).

Table 1.
Distribution of pretreatment patient characteristics for the TMZ+RT arm of three clinical studies and three RWE studies.

| | DFCI cohort (RWE) | UCLA cohort (RWE) | PM21135282 (RWE) | NCT01013285, PM25910950 (phase II) | NCT00441142, PM22120301 (phase II) | AVAglio, NCT00943826, PM24552318 (phase III) |
| --- | --- | --- | --- | --- | --- | --- |
| Arm | TMZ+RT | TMZ+RT | TMZ+RT | TMZ+RT | TMZ+RT | TMZ+RT |
| Enrollment period | 8/06–11/08 | 2/09–6/11 | 8/05–2/11 | — | — | 6/09–3/11 |
| Enrollments to SOC | 378 | 305 | 110 | 29 | 16 | 460 |
| OS events | 269 | 265 | 89 | 24 | 15 | 344 |
| Age: median | 58 | 57 | 59 | 58 | 59 | 57 |
| Age: range | 18–91 | 20–84 | 20–90 | 26–73 | 36–69 | 18–79 |
| Age: SD | 13 | 13 | 14 | 11 | 11 | 10 |
| Sex: females | 0.43 | 0.36 | 0.36 | 0.45 | 0.50 | 0.36 |
| Sex: males | 0.57 | 0.64 | 0.64 | 0.55 | 0.50 | 0.64 |
| KPS ≤80 | 0.55 | 0.39 | 0.32 | 0.24 | 0.44 | 0.31 |
| KPS >80 | 0.45 | 0.61 | 0.68 | 0.76 | 0.56 | 0.69 |
| RPA class 3 | NA | 0.22 | 0.25 | NA | 0.12 | 0.16 |
| RPA class 4 | NA | 0.42 | 0.41 | NA | 0.75 | 0.61 |
| RPA class 5 | NA | 0.34 | 0.33 | NA | 0.13 | 0.23 |
| RPA class 6 | NA | 0.02 | 0.01 | NA | — | — |
| Resection: biopsy | 0.14 | 0.22 | 0.21 | 0.21 | — | 0.09 |
| Resection: subtotal | 0.47 | 0.47 | 0.36 | 0.48 | 0.31 | 0.49 |
| Resection: gross total | 0.39 | 0.31 | 0.43 | 0.31 | 0.69 | 0.42 |
| MGMT: unmethylated | 0.43 | 0.71 | 0.60 | 0.86 | 0.43 | 0.67 |
| MGMT: methylated | 0.57 | 0.29 | 0.40 | 0.14 | 0.56 | 0.32 |
| IDH1: wild-type | 0.91 | 0.91 | 0.98 | 0.83 | NA | NA |
| IDH1: mutant | 0.09 | 0.09 | 0.02 | 0.17 | NA | NA |

Missing data, as reported: KPS, 27 and 17; RPA, 378 and 29; resection, 12 and 15; MGMT, 194, 128, 40, and 0.23; IDH1, 188, 0.46, 52, 16, and 344.

Abbreviations: IDH1, isocitrate dehydrogenase 1; KPS, Karnofsky performance status; MGMT, O6-methylguanine-DNA methyltransferase; RPA, recursive partitioning analysis; RWE, real-world evidence; SOC, standard of care.

Model-free evaluation

We evaluated the ECT design by mimicking the comparison of an ineffective experimental arm to an external control. Hypothetical ECT experimental arms were generated from data of the TMZ+RT arm of one of the studies in Table 1 using the following model-free procedure. For each study, we iterated the following steps:

• (i) We randomly selected n patients (without replacement) from the TMZ+RT arm of the study and used the clinical profiles X and outcomes Y of these patients as the experimental arm of the ECT. Here, n is the number of enrolled patients.

• (ii) The TMZ+RT arms of the remaining five studies (Table 1) were used as external control.

• (iii) We estimated the treatment effect TE comparing the experimental arm (step i) and the external control (step ii) using one of the adjustment methods (Supplementary Material), and tested the null hypothesis of no benefit, $H_0: TE \le 0$, at a targeted type I error rate of 10%.

We repeated steps (i–iii) with different sets of n randomly selected patients. Here, n is less than or equal to the size of the TMZ+RT arm of each GBM study.
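The subsampling scheme in steps (i)–(iii) can be sketched as follows. This is a minimal illustration with a single binary covariate and simulated data; the study names, sample sizes, the `ipw_estimate` helper, and all numbers are hypothetical, not the paper's datasets or implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate_study(n, p_x, p_y_given_x=(0.5, 0.7)):
    """Simulate a control arm: X ~ Bernoulli(p_x); binary outcome Y
    (e.g., OS-12) with Pr(Y=1|X=x) = p_y_given_x[x], identical across studies."""
    x = rng.binomial(1, p_x, n)
    y = rng.binomial(1, np.take(p_y_given_x, x))
    return x, y

# Six hypothetical control arms that differ only in covariate prevalence.
studies = {f"study_{k}": simulate_study(n, p)
           for k, (n, p) in enumerate([(378, 0.45), (305, 0.61), (110, 0.68),
                                       (29, 0.76), (160, 0.56), (460, 0.69)])}

def ipw_estimate(x_e, y_e, x_c, y_c):
    """IPW-style estimate of TE: reweight the external control so its covariate
    distribution matches the experimental arm, then compare outcome means."""
    p_e, p_c = x_e.mean(), x_c.mean()
    w = np.where(x_c == 1, p_e / p_c, (1 - p_e) / (1 - p_c))
    return y_e.mean() - np.average(y_c, weights=w)

def leave_one_out(studies, n_sub=46, reps=200):
    """Steps (i)-(iii): each study in turn supplies a null 'experimental' arm
    of n_sub subsampled patients; the other arms form the external control."""
    out = {}
    for name, (x, y) in studies.items():
        if len(x) < n_sub:
            continue  # arm too small to subsample n_sub patients
        x_c = np.concatenate([s[0] for k, s in studies.items() if k != name])
        y_c = np.concatenate([s[1] for k, s in studies.items() if k != name])
        tes = [ipw_estimate(x[idx], y[idx], x_c, y_c)
               for idx in (rng.choice(len(x), n_sub, replace=False)
                           for _ in range(reps))]
        out[name] = float(np.mean(tes))
    return out

estimates = leave_one_out(studies)
# With no true treatment effect, adjusted estimates should hover near zero.
```

Because every arm shares the same conditional outcome distribution, the adjusted leave-one-out estimates concentrate near zero, mirroring the null-by-construction comparisons described above.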

A similar model-free procedure allows one to evaluate the operating characteristics of ECTs in the presence of positive treatment effects, by randomly reclassifying in step (i), with fixed probability, negative individual outcomes Y as positive outcomes. Section S1 of the Supplementary Material presents a detailed description of this procedure.
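The reclassification device can be sketched in a few lines; `inject_effect` and the flip probability q are illustrative names, not the paper's notation:

```python
import numpy as np

rng = np.random.default_rng(2)

def inject_effect(y, q):
    """Flip each negative outcome (Y=0) to positive (Y=1) with probability q;
    positive outcomes are left unchanged, mimicking a positive treatment effect."""
    flips = rng.binomial(1, q, size=len(y))
    return np.where(y == 0, flips, y)

y = np.array([0, 0, 1, 1, 0, 0, 1, 0])
y_boosted = inject_effect(y, q=0.3)  # survival proportion rises on average
```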

Comparison of ECT, single-arm, and RCT designs

We compared the ECT to single-arm and RCT designs. We used the following criteria:

• (i)  bias and variability of treatment effect estimates,

• (ii)  deviations of type I error rates from targeted control of false positive results, and

• (iii)  the sample size to achieve a targeted power.

In a single-arm trial, an estimate of the proportion $\pi_E$ of patients surviving at a specific time point, say 12 months (OS-12), is compared with a historical estimate $\pi_{HC}$ for the standard of care, typically the result of a prior trial. Here, $\pi_{HC}$ can be expressed as a weighted average of the OS-12 probability for a patient with profile X, averaged over the study-specific distribution $\Pr_{HC}(x)$ of patient characteristics,

$\pi_{HC} = \sum_x \Pr(Y = 1|A = 0, x) \Pr_{HC}(x).$

Similarly, the parameter $\pi_E$ can be written as a weighted average of the probability $\Pr(Y = 1|A = 1, x)$ over the distribution $\Pr_{SAT}(x)$ of X in the single-arm trial. If the distributions $\Pr_{SAT}(x)$ and $\Pr_{HC}(x)$ differ substantially, then $\pi_{HC}$ and $\pi_E$ are not comparable, treatment effect estimates can be biased, and type I error rates can deviate from the targeted value. If the prognostic profiles of patients in the single-arm study are favorable compared with the study used as benchmark, the type I error probability tends to be above the targeted α-level, and vice versa; in the latter case, the power decreases. In an RCT, patients are randomized to the control and experimental arms, so patient characteristics are, on average, equally distributed between arms, reducing the risk of bias compared with single-arm trial designs.

Limitations of the single-arm design

We illustrate the bias and type I error deviation associated with single-arm trials using an example for a hypothetical ineffective experimental treatment in a disease with one known prognostic biomarker X. We assume, for each patient, identical outcome probabilities under the experimental and control treatments. Figure 1A shows the difference $(\pi_E - \pi_{HC})$ when the prevalence of the biomarker varies between $\Pr_{SAT}(X = 1) = 0.1$ and 0.9, for different levels of correlation between the outcome Y and the biomarker X. Even with a moderate association between the biomarker and the outcome, the differences between the distributions $(\Pr_{HC}, \Pr_{SAT})$ result in bias and departures from the intended 10% type I error rate (Fig. 1B).
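The single-biomarker bias described above reduces to a two-line calculation. A minimal sketch (the probabilities mirror the association levels considered in Fig. 1, but the `bias` function is our illustration, not the authors' code):

```python
p_hc = 0.5  # Pr(X=1) in the historical control arm

def bias(p_sat, pr_y_x1, pr_y_x0):
    """pi_SAT - pi_HC when Pr(Y=1|X) is identical under both treatments, so any
    difference comes purely from the covariate-mix mismatch, not the drug."""
    pi_sat = p_sat * pr_y_x1 + (1 - p_sat) * pr_y_x0
    pi_hc = p_hc * pr_y_x1 + (1 - p_hc) * pr_y_x0
    return pi_sat - pi_hc

strong = bias(0.9, 0.9, 0.3)  # favorable SAT mix, strong X-Y link: ~0.24
none = bias(0.9, 0.6, 0.6)    # no X-Y association: ~0.0
```

With a strong X–Y association, a favorable covariate mix alone makes an ineffective treatment appear to improve survival by roughly 24 percentage points; with no association, the mismatch is harmless.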

Figure 1.

Bias $(\pi_{SAT} - \pi_{HC})$ and deviations from a targeted type I error rate of 10%. Bias is due to different patient populations in the single-arm trial (SAT) and in the historical study. A single binary characteristic (X = 1 or X = 0) correlates with the binary outcome Y, and the experimental treatment has no therapeutic effect, $\Pr(Y|X, A = 1) = \Pr(Y|X, A = 0)$. The characteristic X = 1 was present in 50% of the patients in the historical control arm, $P_{HC}(X = 1) = 0.5$. A, The difference $(\pi_{SAT} - \pi_{HC})$ for a range of probabilities $P_{SAT}(X = 1)$. We consider four levels of association between X and Y: $\Pr(Y|X = 1, A = a)$ and $\Pr(Y|X = 0, A = a)$ equal to (0.3, 0.9), (0.4, 0.8), (0.5, 0.7), or (0.6, 0.6). B, For a SAT (with a standard z-test for proportions, $H_0: \pi_{SAT} \le \pi_{HC}$), the deviation of the false positive rate (y-axis), without adjustments, from the targeted type I error rate of 10% when the prevalence $P_{SAT}(X = 1) = 0.3, 0.5, 0.6$, or 0.8. We consider different sample sizes (x-axis) of the SAT. In B, we assume the parameter $\pi_{HC}$ is known.


TMZ+RT in newly diagnosed GBM

The standard of care of TMZ+RT for newly diagnosed GBM was established in 2004 based on results from the EORTC-NCIC CE.3 trial (22). Subsequently, nine additional trials enrolled patients on TMZ+RT control arms between 2005 and 2013 (Supplementary Table S1). The majority of single-arm studies used the reported results of EORTC-NCIC CE.3 as historical benchmark (Supplementary Table S1). Sample sizes of the TMZ+RT control arms in the RCTs varied between 16 patients (19) and 463 patients (18). Supplementary Figure S1 shows reported Kaplan–Meier estimates, median OS, and OS-12 for the TMZ+RT arms. Point estimates varied across studies between 0.56 and 0.81 for OS-12, and between 13.2 and 21.2 months for median OS.

Prognostic variables

Through a literature review, we identified prognostic factors associated with survival in newly diagnosed GBM (23–26). A Cox regression analysis, stratified by trial and treatment arm, was used to quantify the association of covariates with OS (Table 2). On multivariable analysis, age (HR 1.03; P < 0.001), male sex (HR 1.15; P = 0.012), KPS >80 (HR 0.78; P < 0.001), gross total resection versus biopsy (HR 0.62; P < 0.001), subtotal resection versus biopsy (HR 0.82; P = 0.028), MGMT promoter methylation (HR 0.46; P < 0.001), and IDH1 mutation (HR 0.52; P = 0.010) showed association with OS.

Table 2.

Pretreatment patient characteristics associated with OS: estimated HRs from univariable and multivariable stratified Cox regression models. The baseline hazard was stratified by study and treatment arm.

| Variable | Univariable HR | P | Multivariable HR | P |
| --- | --- | --- | --- | --- |
| Age (linear) | 1.02 | <0.001 | 1.03 | <0.001 |
| Sex: male (ref. female) | 1.17 | 0.004 | 1.15 | 0.012 |
| KPS >80 (ref. ≤80) | 0.64 | <0.001 | 0.78 | <0.001 |
| RPA class 4 (ref. class 3) | 1.50 | <0.001 | 0.90 | 0.327 |
| RPA class 5 | 2.29 | <0.001 | 1.04 | 0.734 |
| RPA class 6 | 7.10 | <0.001 | 2.20 | 0.059 |
| Resection: subtotal (ref. biopsy) | 0.78 | 0.001 | 0.82 | 0.028 |
| Resection: gross total | 0.56 | <0.001 | 0.62 | <0.001 |
| MGMT: methylated (ref. unmethylated) | 0.47 | <0.001 | 0.46 | <0.001 |
| IDH1: mutant (ref. wild-type) | 0.35 | <0.001 | 0.52 | 0.010 |

The prevalence of these factors varied across studies: from 0.50 to 0.64 for male sex, from 0.45 to 0.76 for KPS >80, and from 0.14 to 0.57 for MGMT-methylated status. Minimum (maximum) age varied between 18 and 36 (68 and 91) years across trials, and extent of resection also showed noticeable variation. We selected all five variables (age, sex, KPS, MGMT, and extent of resection) for adjustment in the ECT design.

Figure 2.

Treatment effect estimates of the ECT design. For each study, the TMZ+RT arm was used as the ECT's experimental arm and (after adjustment for patients' characteristics) compared with the TMZ+RT arms of the remaining four studies. A, For each study, covariate-adjusted treatment effect estimates (point estimates and 90% CI; n equal to the arm-specific size). B, Treatment effect estimates (average value, 5th and 95th percentiles) across 10,000 subsamples of n = 46 patients using different adjustment methods. We consider direct standardization, matching, inverse probability weighting, and marginal structural models (DSM, PS-M, IPW, MSM). For IPW and MSM, we use different reference distributions $\Pr_X(x)$ (see expression 1) of pretreatment characteristics X. C, The distribution of treatment effect estimates of the ECT (blue line) and RCT (black line) across subsamples of n = 46 patients.

Figure 3.

Model-based evaluation of the type I error and power for RCT, ECT, and single-arm trial (SAT) designs, for an overall study sample size of n = 20, …, 160 patients. In the model-based approach (Supplementary Material), we sampled baseline characteristics X from the five studies in Table 1 and generated outcomes Y from models $\Pr(Y|X,A)$. A, For all studies, the type I error rates of RCT, ECT, and SAT designs at different overall sample sizes. Different line types (solid, dashed, dotted, etc.) indicate different studies (Table 1). B–F, For each study, the power of RCT, SAT, and ECT designs, and the sample size needed to achieve 80% power (dotted vertical lines). In A, the SAT experimental outcomes were generated as in the ECT simulations, but outcomes Y are directly compared with the EORTC-NCIC CE.3 study estimates, without adjustment for different distributions of patients' characteristics. For RCTs, half of the randomly selected profiles X are used to define the experimental arm and the remaining half defines the control arm. Two-group (RCT) and single-group (SAT) z-tests for proportions were used for testing. To compute the power of the SAT in B–F, we assumed that the historical control benchmark $\pi_{HC}$ was correctly specified.

Model-free ECT evaluation

Figure 2A shows ECT treatment effect estimates for each of the five studies. Treatment effect estimates were, in all cases, close to zero. In comparison, a single-arm trial design with EORTC-NCIC CE.3 used as historical benchmark (Supplementary Table S1) would lead to overestimation of treatment efficacy. Next, we generated ECTs with a fixed sample size of n = 46.
The sample size was selected for a targeted power of 80% and a 10% type I error rate for a single-arm trial (one-sided binomial test) with an OS-12 improvement from $\hat{\pi}_{HC} = 61\%$ to 76%, with $\hat{\pi}_{HC}$ taken from the EORTC-NCIC CE.3 study (22). Because the TMZ+RT arms of PM22120301 and NCT00441142 had only 16 and 29 patients, we could not use these studies to generate ECTs of size n = 46. Figure 2B and C show the results of nearly identical analyses as in Fig. 2A across 10,000 subsamples of n = 46 randomly selected patients, using four adjustment methods: direct standardization (DSM), matching (M), inverse probability weighting (IPW), and marginal structural models (MSM). For IPW and MSM, we used multiple reference distributions $\Pr_X(x)$ (see expression 1; refs. 27–29). We used IPW-T in the analyses for Fig. 2A and C. Supplementary Fig. S3 shows ECT treatment effect estimates obtained by adjusting for different sets of prognostic characteristics.
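A sanity check of this kind of sample-size calculation can be run with an exact one-sided binomial test. This is a generic sketch of the standard search under the stated design parameters, not the authors' software; the exact minimum n depends on the sawtooth behavior of the binomial test.

```python
from math import comb

def binom_sf(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

def single_arm_n(p0=0.61, p1=0.76, alpha=0.10, power=0.80, n_max=200):
    """Smallest n (with critical value c) for a one-sided exact binomial test
    of H0: pi <= p0 at level alpha, with the targeted power at pi = p1."""
    for n in range(10, n_max + 1):
        # Smallest critical value whose exact type I error is <= alpha.
        c = next(k for k in range(n + 1) if binom_sf(k, n, p0) <= alpha)
        if binom_sf(c, n, p1) >= power:
            return n, c
    raise ValueError("no sample size found below n_max")

n, c = single_arm_n()
# The paper reports n = 46 for this design; the exact search lands in the
# same neighborhood.
```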

Figure 2C shows the distribution of treatment effect estimates across the generated trials for the ECT (blue) and RCT (black). The RCT data were obtained by randomly dividing the simulated single-arm trial dataset into two groups of 23 patients, labeled as control and experimental arms. With identical sample size (n = 46), assigning all patients in the ECT to the experimental arm results in lower variability of the treatment effect estimates compared with the RCT. The empirical type I error rate (targeted value 10%) across generated ECTs (model-free analysis with n = 46) was 9.1%, 6.1%, and 8.6% for the TMZ+RT arms of AVAglio, the DFCI cohort, and the UCLA cohort, respectively, compared with 40.7%, 21.9%, and 40.5% for the single-arm trial design [historical benchmark: 0.611 reported in EORTC-NCIC CE.3 (22)]. The latter estimates, well above the 10% target, are consistent with the different outcome distributions under TMZ+RT observed in these three studies compared with the EORTC-NCIC CE.3 study. Indeed, underestimation of TMZ+RT's efficacy in a single-arm trial translates into inflated type I error rates.

Model-based ECT evaluation

Figure 3A and B–F report model-based type I error rates and power for hypothetical RCTs, single-arm trials, and ECTs with sample sizes ranging between 20 and 160. We used the pretreatment characteristics X from the five studies to evaluate designs with a model-based approach (Supplementary Material), which consists of sampling baseline characteristics X from the studies and generating outcomes Y from models $\Pr(Y|X,A)$. We specified $\Pr(Y|X,A)$ with a logistic model, fitted to the TMZ+RT data from all five studies combined, with or without the addition of a positive treatment effect.
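A minimal sketch of such an outcome-generating model, with illustrative coefficients (not the fitted values from the paper) and the treatment effect entering on the log-odds scale:

```python
import numpy as np

rng = np.random.default_rng(1)

def sample_outcomes(X, a, beta=(0.3, -0.03, 0.8), log_or_treat=0.0):
    """Draw binary OS-12 outcomes from a logistic model Pr(Y=1|X,A).
    X columns: (age centered at the median, MGMT-methylation indicator);
    a: 0/1 treatment indicator; log_or_treat: treatment effect on the
    log-odds scale (0 under the null, log(2.6) for the alternative)."""
    b0, b_age, b_mgmt = beta
    logits = b0 + b_age * X[:, 0] + b_mgmt * X[:, 1] + log_or_treat * a
    return rng.binomial(1, 1.0 / (1.0 + np.exp(-logits)))
```

Simulating many trials from this model with `log_or_treat = 0` yields type I error rates; simulating with a positive effect yields power curves like those in Fig. 3B–F.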

Both the RCT and ECT designs have false positive rates, across simulated trials, close to the targeted value of 10% for all five studies (Fig. 3A). The single-arm trial design (historical benchmark: EORTC-NCIC CE.3) overestimates treatment effects and presents inflated type I error rates: 21%–59% for n = 30 and 30%–83% for n = 60 patients (Fig. 3A).

The reported power in Fig. 3B–F corresponds to a scenario with an improvement in OS-12 equal to an odds ratio of 2.6. For example, for X corresponding to a male patient, age 59 (the median age in the studies' TMZ+RT arms), with biopsy, KPS ≤ 80, and unmethylated MGMT status, $\Pr(Y = 1|A = 1, X) - \Pr(Y = 1|A = 0, X) = 0.15$. The RCT requires more than 139 patients (139, 140, 137, 154, and 150 patients for the X-distributions of the DFCI cohort, UCLA cohort, AVAglio, PM22120301, and NCT00441142) to achieve 80% power at a 10% type I error rate. In contrast, the ECT requires between 34 and 40 patients (34, 34, 34, 40, and 37 patients).

Clinical researchers have discussed and debated the relative merits of single-arm versus randomized trial designs (4, 30–36). Single-arm trials have obvious attractions for patients, can be smaller, and are logistically easier to employ as pragmatic trials. The associated increased risk of bias, however, can lead to poor therapeutic development and regulatory decision making. Overly optimistic analysis and interpretation can result in large negative phase III trials, and negatively biased results can cause discontinuation of the development of therapies with positive effects. The potential for bias is less pronounced for endpoints with minimal variation under the control; for example, single-arm designs for monotherapies using tumor response as an endpoint have low risk of inflated type I error (37). Evaluation of therapeutic combinations and the use of endpoints such as PFS and OS are more complicated, with increased risk of biased results. Randomization is the best way to limit this bias, but alternative methodologies could improve on single-arm designs without the burden of setting up a randomized control. Historical benchmarks in single-arm trial designs have two major problems.
The first problem is ignoring discrepancies in the estimated survival functions across trials due to population differences. Controlling for known prognostic factors has been shown to mitigate this issue somewhat (15). The second is that single-arm trials by design compare a single point of a PFS or OS curve, for example OS-12, to a benchmark. Such approaches do not leverage the power of statistical analyses that incorporate all time-to-event data, including censoring. The use of external control arms can address both limitations.

Clearly, ECTs can increase power compared with RCTs by leveraging additional information from outside the trial rather than committing resources to an internal control. In our analysis in GBM, the ECT reached nearly the same power as a single-arm trial that specifies the correct historical response rate (zero bias) of the standard of care. This efficiency gain of ECTs will not necessarily be the same in other disease settings; it will depend on the size of the external control, the number of patient characteristics X required to control for confounding, and the variability of these characteristics across and within studies. By standard statistical arguments (38–40), in settings favorable to the ECT design (a large external control cohort, few relevant covariates, and no unmeasured confounders), the sample size for an RCT to match uncertainty summaries of an ECT, such as the variance of a treatment effect estimate $\widehat{TE}$ or the length of a CI for TE, is approximately four times larger than the sample size of the ECT. The major question is whether this comes with the downside of increased type I error rates. In our analysis in newly diagnosed GBM, the ECT's type I error rates were comparable to those of RCTs, whereas single-arm trials showed significantly inflated type I error rates.
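The approximate factor-of-four argument can be checked with elementary variance algebra; this sketch assumes a binary outcome with variance bounded by 0.25 and an effectively unlimited external control, and is an illustration of the standard argument rather than the cited derivations.

```python
def var_te_rct(n, sigma2=0.25):
    """Variance of a difference in means in a balanced RCT: n/2 per arm."""
    return sigma2 / (n / 2) + sigma2 / (n / 2)

def var_te_ect(n, n_external=10**9, sigma2=0.25):
    """All n trial patients on the experimental arm; very large external
    control contributes negligible variance."""
    return sigma2 / n + sigma2 / n_external

ratio = var_te_rct(100) / var_te_ect(100)  # approaches 4 as the control grows
```

Equivalently, a balanced RCT needs roughly 4n patients for its treatment effect estimate to match the variance of an ECT that enrolls n patients, all on the experimental arm.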
These results are in concordance with the findings of a meta-analysis in GBM by Vanderbeek and colleagues, 2018 (41), which associated single-arm phase II trials in GBM during a 10-year period with underestimation of TMZ+RT efficacy. The majority of single-arm studies used the EORTC-NCIC CE.3 trial as historical benchmark, and underestimation of the control efficacy translates into inflated type I error rates.

Another key question is whether our results are limited to newly diagnosed GBM or are generalizable. Three questions are relevant when evaluating the use of an ECT for a given experimental treatment: (i) are there time trends in the outcome under the control; (ii) are the available prognostic factors sufficient to explain most of the variation in outcome distributions across trials; and (iii) is there evidence of significant latent (unobserved) confounding after controlling for known prognostic factors? Our evaluation of ECTs in GBM required an entire collection of datasets to address these questions, and there is no simple strategy to determine how other diseases or indications might compare.

The first step of our validation analyses was the selection of potential confounders. Following previous recommendations (see, for example, Greenland, 2008; ref. 42), we identified a list of potential confounders through a review of the GBM literature before data collection and analysis. In selecting the set of patient characteristics, we tried to be as comprehensive as possible, because the exclusion of confounders can compromise ECT performance. Sensitivity summaries of the validation analysis, similar to Supplementary Fig. S2, can then be computed to illustrate variation in the estimated ECT performance when smaller sets of variables are used for adjustment.

A strength of our evaluation was the use of both data from prior clinical trials and RWD. Most discussions of external controls tend to focus on one or the other, but each has strengths and weaknesses. Clinical trial data are more meticulously collected, resulting in more standardized definitions and data entry. This was evident in our dataset, where several covariates in the RWD datasets had substantial missing data. This problem can be mitigated through the use of RWD datasets at scale rather than from a single institution. Furthermore, we initially found erroneous treatment effects due to differences in the definitions of the time-to-event variables (index dates) in our RWD compared with the remaining RCTs. Although this was easily corrected in our example, care must be taken when defining endpoints in RWD (43). Conversely, RWD have the advantage of being generated during routine clinical care, which is less costly, potentially available at larger scale, and more contemporary. Because each kind of data has benefits and limitations, leveraging both has value for ECT generation.

A limitation of our study is the relatively small number of datasets that we used to evaluate the ECT design. The more trials and cohorts that are available, the more precisely ECT type I error rates and other key operating characteristics can be estimated. The extension of our results for GBM and the generation of ECTs for other diseases will undoubtedly be aided by clinical trial data sharing efforts like Project Data Sphere (44), Vivli (45), and YODA (46), and by the availability of RWD at scale from groups like Flatiron Health, Tempus, and ASCO through CancerLinQ (47).

The validation procedure that we used builds on clinical studies that share the same control arm (in our case TMZ+RT), with the aim of evaluating the use of ECTs for future trials. A limitation of the procedure is that all available studies are used identically: potentially relevant differences, such as the year in which each study began enrollment, are not considered. Nonetheless, a simple and interpretable validation scheme has the advantage of being robust to selection bias. We included all studies for which we could access patient-level data. Once the utility of the ECT design has been rigorously evaluated for a specific disease setting with interpretable and robust procedures, however, it becomes appropriate to refine the set of studies used as external control, for example through multistudy analyses that identify studies likely to induce bias in the ECT results if used as external control. Moreover, Bayesian models that incorporate differences across studies (48, 49) could be used to compute treatment effect estimates and credible intervals for ECTs.
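The leave-one-out logic of the validation scheme can be sketched in a few lines. In this toy version — with synthetic binary outcomes standing in for the patient-level GBM data, and a simple thresholded difference in proportions standing in for the actual test statistic — each control arm is held out in turn as a null "experimental" arm and compared against the pooled remaining studies; the rejection rate across folds estimates the ECT type I error:

```python
import random

random.seed(0)

# Hypothetical study library: each study is a list of 12-month survival
# indicators for patients on the common control (TMZ+RT).
studies = [[1 if random.random() < 0.55 else 0 for _ in range(80)]
           for _ in range(6)]

def leave_one_out_type1(studies, threshold=0.10):
    """Hold out each control arm as a null 'experimental' arm, compare it
    with the pooled remaining studies, and return the rejection rate,
    an estimate of the ECT type I error."""
    rejections = 0
    for i, held_out in enumerate(studies):
        external = [y for j, s in enumerate(studies) if j != i for y in s]
        effect = sum(held_out) / len(held_out) - sum(external) / len(external)
        if effect > threshold:  # declare (spurious) efficacy
            rejections += 1
    return rejections / len(studies)

type1_estimate = leave_one_out_type1(studies)
```

Because every held-out arm received the control treatment, any rejection is a false positive by construction; the actual procedure additionally applies the covariate adjustments before each comparison.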

An extension of our validation framework could include the evaluation of procedures that use external control data in the analysis and interpretation of RCTs. In some cases, external data may contribute to more accurate treatment effect estimates in RCTs. Statistical methods for using external control data in RCTs, for example early-stopping rules, will require validation studies, similar to those for ECT designs, before their implementation in clinical trials.

Conclusions

ECT designs have the potential to improve the evaluation of novel experimental agents in clinical trials and to accelerate the drug development process by leveraging external data. Challenges in the use of external data, compared with standard RCTs, include (i) the identification of a comprehensive list of potential confounders X for adjustment; (ii) access to a large set of RWD datasets or completed RCTs to create a library of studies for robust validation analyses; (iii) availability of patient-level data and possible missing-data problems; (iv) coherent definitions and consistent measurements of patient characteristics and outcomes across datasets; (v) possible calendar-time trends in the outcome distributions under the control treatment due to improved clinical practice; and (vi) the use of robust statistical procedures to evaluate ECT designs in comparison with traditional single-arm and RCT designs.

Here, we introduced a simple algorithm to evaluate operating characteristics of ECT designs, such as bias, variability of treatment effect estimates, and type I error rates. We considered different ECT designs that use distinct adjustment methods. Our results indicate that ECTs constitute a useful alternative to standard single-arm trials and RCTs in GBM and could significantly reduce the current false positive rates (41) of single-arm phase II GBM trials.

A. Lai reports receiving commercial research grants from Genentech/Roche and is a consultant/advisory board member for Merck. T. F. Cloughesy holds ownership interest (including patents) in Notable Labs; is listed as a co-inventor on a patent regarding the composition of matter for cancer therapy owned by UCLA; and is a consultant/advisory board member for Bayer, Del Mar Pharmaceuticals, Tocagen, Kryopharm, GW Pharma, Klyatec, Abbvie, Boehringer Ingelheim, VBI, Decephere, VBL, Agios, Merck, Roche, Genocea, Celgene, Puma, Lilly, Bristol-Myers Squibb, Cortice, Wellcome Trust, Novocure, Novogen, Boston Biomedical, Sunovion, Human Longevity, Insys, ProNai, Pfizer, Notable Labs, and MedQia. B. M. Alexander is an employee of Foundation Medicine, Inc., and holds ownership interest (including patents) in Hoffman La-Roche. No potential conflicts of interest were disclosed by the other authors.

Conception and design: S. Ventz, P.Y. Wen, L. Trippa, B.M. Alexander

Development of methodology: S. Ventz, L. Trippa, B.M. Alexander

Acquisition of data (provided animals, acquired and managed patients, provided facilities, etc.): S. Ventz, A. Lai, T.F. Cloughesy, P.Y. Wen, B.M. Alexander

Analysis and interpretation of data (e.g., statistical analysis, biostatistics, computational analysis): S. Ventz, A. Lai, T.F. Cloughesy, L. Trippa, B.M. Alexander

Writing, review, and/or revision of the manuscript: S. Ventz, A. Lai, T.F. Cloughesy, P.Y. Wen, L. Trippa, B.M. Alexander

Administrative, technical, or material support (i.e., reporting or organizing data, constructing databases): S. Ventz, B.M. Alexander

Study supervision: S. Ventz, B.M. Alexander

This work was supported by the Burroughs Wellcome Innovations in Regulatory Science Award.

The costs of publication of this article were defrayed in part by the payment of page charges. This article must therefore be hereby marked advertisement in accordance with 18 U.S.C. Section 1734 solely to indicate this fact.

References

1. Streptomycin treatment of pulmonary tuberculosis: a Medical Research Council investigation. Br Med J 1948;2:769–82.
2. Armitage P. Fisher, Bradford Hill, and randomization. Int J Epidemiol 2003;32:925–8.
3. Seruga B, Ocana A, Amir E, Tannock IF. Failures in phase III: causes and consequences. Clin Cancer Res 2015;21:4552–60.
4. Gan HK, Grothey A, Pond GR, Moore MJ, Siu LL, Sargent D. Randomized phase II trials: inevitable or inadvisable? J Clin Oncol 2010;28:2641–7.
5. Eborall HC, Stewart MCW, Cunningham-Burley S, Price JF, Fowkes FGR. Accrual and drop out in a primary prevention randomised controlled trial: qualitative study. Trials 2011;12:7.
6. Donovan J, Mills N, Smith M, Brindle L, Jacoby A, Peters T, et al. Quality improvement report: improving design and conduct of randomised trials by embedding them in qualitative research: ProtecT (prostate testing for cancer and treatment) study. Commentary: presenting unbiased information to patients can be difficult. BMJ 2002;325:766.
7. Featherstone K, Donovan JL. "Why don't they just tell me straight, why allocate it?" The struggle to make sense of participating in a randomised controlled trial. Soc Sci Med 2002;55:709–19.
8. Berry DA. The brave new world of clinical cancer research: adaptive biomarker-driven trials integrating clinical practice with clinical research. Mol Oncol 2015;9:951–9.
9. Pocock SJ. The combination of randomized and historical controls in clinical trials. J Chronic Dis 1976;29:175–88.
10. Viele K, Berry S, Neuenschwander B, Amzal B, Chen F, Enas N, et al. Use of historical control data for assessing treatment effects in clinical trials. Pharm Stat 2014;13:41–54.
11. Thall PF, Simon R. Incorporating historical control data in planning phase II clinical trials. Stat Med 1990;9:215–28.
12. Corrigan-Curay J, Sacks L, Woodcock J. Real-world evidence and real-world data for evaluating drug safety and effectiveness. JAMA 2018;320:867–8.
13. Agarwala V, Khozin S, Singal G, O'Connell C, Kuk D, Li G, et al. Real-world evidence in support of precision medicine: clinico-genomic cancer data as a case study. Health Aff 2018;37:765–72.
14. Khozin S, Blumenthal GM, Pazdur R. Real-world data for clinical evidence generation in oncology. J Natl Cancer Inst 2017;109:1–5.
15. Korn EL, Liu PY, Lee SJ, Chapman JA, Niedzwiecki D, Suman VJ, et al. Meta-analysis of phase II cooperative group trials in metastatic stage IV melanoma to determine progression-free and overall survival benchmarks for future phase II trials. J Clin Oncol 2008;26:527–34.
16. Imbens GW, Rubin DB. Causal inference for statistics, social, and biomedical sciences: an introduction. New York, NY: Cambridge University Press; 2015.
17. Robins JM, Hernán MA, Brumback B. Marginal structural models and causal inference in epidemiology. Epidemiology 2000;11:550–60.
18. Chinot OL, Wick W, Mason W, Henriksson R, Saran F, Nishikawa R, et al. Bevacizumab plus radiotherapy–temozolomide for newly diagnosed glioblastoma. N Engl J Med 2014;370:709–22.
19. Cho DY, Yang WK, Lee HC, Hsu DM, Lin HL, Lin SZ, et al. Adjuvant immunotherapy with whole-cell lysate dendritic cells vaccine for glioblastoma multiforme: a phase II clinical trial. World Neurosurg 2012;77:736–44.
20. Lee EQ, Kaley TJ, Duda DG, Schiff D, Lassman AB, Wong ET, et al. A multicenter, phase II, randomized, noncomparative clinical trial of radiation and temozolomide with or without vandetanib in newly diagnosed glioblastoma patients. Clin Cancer Res 2015;21:3610–8.
21. Lai A, Tran A, Nghiemphu PL, Pope WB, Solis OE, Selch M, et al. Phase II study of bevacizumab plus temozolomide during and after radiation therapy for patients with newly diagnosed glioblastoma multiforme. J Clin Oncol 2011;29:142–8.
22. Stupp R, Mason WP, van den Bent MJ, Weller M, Fisher B, Taphoorn MJ, et al. Radiotherapy plus concomitant and adjuvant temozolomide for glioblastoma. N Engl J Med 2005;352:987–96.
23. Thakkar JP, Dolecek TA, Horbinski C, Ostrom QT, Lightner DD, Barnholtz-Sloan JS, et al. Epidemiologic and molecular prognostic review of glioblastoma. Cancer Epidemiol Biomarkers Prev 2014;23:1985–96.
24. Lamborn KR. Prognostic factors for survival of patients with glioblastoma: recursive partitioning analysis. Neuro Oncol 2004;6:227–35.
25. Curran WJ, Scott CB, Horton J, Nelson JS, Weinstein AS, Fischbach AJ, et al. Recursive partitioning analysis of prognostic factors in three Radiation Therapy Oncology Group malignant glioma trials. J Natl Cancer Inst 1993;85:704–10.
26. Franceschi E, Tosoni A, Minichillo S, Depenni R, Paccapelo A, Bartolini S, et al. The prognostic roles of gender and O6-methylguanine-DNA methyltransferase methylation status in glioblastoma patients: the female power. World Neurosurg 2018;112:e342–7.
27. Hirano K, Imbens GW. Estimation of causal effects using propensity score weighting: an application to data on right heart catheterization. Health Serv Outcomes Res Methodol 2001;2:259–78.
28. Li L, Greene T. A weighting analogue to pair matching in propensity score analysis. Int J Biostat 2013;9:215–34.
29. Li F, Morgan KL, Zaslavsky AM. Balancing covariates via propensity score weighting. J Am Stat Assoc 2018;113:390–400.
30. Grayling MJ, Mander AP. Do single-arm trials have a role in drug development plans incorporating randomised trials? Pharm Stat 2016;15:143–51.
31. Grossman SA, Schreck KC, Ballman K, Alexander B. Point/counterpoint: randomized versus single-arm phase II clinical trials for patients with newly diagnosed glioblastoma. Neuro Oncol 2017;19:469–74.
32. Pond GR, Abbasi S. Quantitative evaluation of single-arm versus randomized phase II cancer clinical trials. Clin Trials 2011;8:260–9.
33. Ratain MJ, Sargent DJ. Optimising the design of phase II oncology trials: the importance of randomisation. Eur J Cancer 2009;45:275–80.
34. Rubinstein L, LeBlanc M, Smith MA. More randomization in phase II trials: necessary but not sufficient. J Natl Cancer Inst 2011;103:1075–7.
35. Sambucini V. Comparison of single-arm vs. randomized phase II clinical trials: a Bayesian approach. J Biopharm Stat 2015;25:474–89.
36. Sharma MR, Stadler WM, Ratain MJ. Randomized phase II trials: a long-term investment with promising returns. J Natl Cancer Inst 2011;103:1093–100.
37. Seymour L, Ivy SP, Sargent D, Spriggs D, Baker L, Rubinstein L, et al. The design of phase II clinical trials testing cancer therapeutics: consensus recommendations from the clinical trial design task force of the National Cancer Institute investigational drug steering committee. Clin Cancer Res 2010;16:1764–9.
38. Hirano K, Imbens GW, Ridder G. Efficient estimation of average treatment effects using the estimated propensity score. Econometrica 2003;71:1161–89.
39. Abadie A, Imbens GW. Large sample properties of matching estimators for average treatment effects. Econometrica 2006;74:235–67.
40. Crump RK, Hotz VJ, Imbens GW, Mitnik OA. Dealing with limited overlap in estimation of average treatment effects. Biometrika 2009;96:187–99.
41. Vanderbeek AM, Rahman R, Fell G, Ventz S, Chen T, Redd R, et al. The clinical trials landscape for glioblastoma: is it adequate to develop new treatments? Neuro Oncol 2018;20:1034–43.
42. Greenland S. Invited commentary: variable selection versus shrinkage in the control of multiple confounders. Am J Epidemiol 2008;167:523–9.
43. Curtis MD, Griffith SD, Tucker M, Taylor MD, Capra WB, Carrigan G, et al. Development and validation of a high-quality composite real-world mortality endpoint. Health Serv Res 2018;53:4460–76.
44. Bertagnolli MM, Sartor O, Chabner BA, Rothenberg ML, Khozin S, Hugh-Jones C, et al. Advantages of a truly open-access data-sharing model. N Engl J Med 2017;376:1178–81.
45. Bierer BE, Li R, Barnes M, Sim I. A global, neutral platform for sharing trial data. N Engl J Med 2016;374:2411–3.
46. Krumholz HM, Waldstreicher J. The Yale Open Data Access (YODA) Project: a mechanism for data sharing. N Engl J Med 2016;375:403–5.
47. Miller RS, Wong JL. Using oncology real-world evidence for quality improvement and discovery: the case for ASCO's CancerLinQ. Future Oncol 2018;14:5–8.
48. Kaizer AM, Koopmeiners JS, Hobbs BP. Bayesian hierarchical modeling based on multisource exchangeability. Biostatistics 2018;19:169–84.
49. Hobbs BP, Carlin BP, Mandrekar SJ, Sargent DJ. Hierarchical commensurate and power prior models for adaptive incorporation of historical information in clinical trials. Biometrics 2011;67:1047–56.