Abstract
Purpose: Phase II trials aim to determine whether a cancer treatment is sufficiently promising to justify phase III study. Whether an agent is declared promising in a phase II trial depends on prespecified “null” and “alternative” rates of an outcome of interest such as tumor response. In some cases, the null must be determined with reference to historical data. We sought to determine the proportion of phase II trials that require historical data to establish the null and to determine how these historical estimates were derived.
Experimental Design: We conducted a systematic review of phase II trials published in the Journal of Clinical Oncology or Cancer in the 3 years to June 2005. Data were extracted following a prespecified protocol.
Results: We retrieved 251 papers, of which 117 were found to be ineligible; 70 of 134 included trials (52%) were defined as requiring historical data for design. Nearly half (32, 46%) of these papers did not cite the source of the historical data used, and just 9 (13%) clearly gave a single historical estimate as the rationale for the null. Trials that failed to cite prior data appropriately were significantly more likely to declare an agent to be active (82% versus 33%; P = 0.005). No study incorporated statistical methods to account for either sampling error or possible differences in case mix between the phase II sample and the historical cohort.
Conclusions: Many phase II trials require historical data to determine null response rates. Simple guidelines may improve design and reporting of such trials.
It is widely accepted that a prospective randomized trial is the optimal method for determining the clinical benefit of any cancer treatment. Often described as “phase III,” these trials are costly and typically require large numbers of patients to be followed for many years. As such, the number that can be conducted is limited and is far outweighed by the number of current and potential treatment approaches that could be tested. We have observed that phase III trials often fail to find benefit of a novel treatment, and this led us to explore the process of determining when a treatment approach has shown sufficient promise to justify a definitive evaluation. Because this decision is typically based on the results obtained in the phase II setting, we focused on aspects of the design of these trials and how they ultimately affect the phase II to phase III transition.
In a typical phase II study, a cohort of patients is treated, and the outcomes are related to a prespecified target or “bar.” If the results meet or exceed the target, the treatment is declared worthy of further study; otherwise, further development is stopped. This has been referred to as the “go”/“no go” decision (1). Most often, the outcome specified is a measure of tumor response, e.g., complete or partial response using Response Evaluation Criteria In Solid Tumors, expressed as the proportion of the total number of patients treated (2). “Response” can also be defined in terms of the proportion of patients who have not progressed or who are alive at a predetermined time (e.g., 1 year) after treatment is started. One advantage of this approach is that, as the majority of phase III studies use survival as the primary outcome of interest, it ensures similarity of the phase II and phase III end points (3). Regardless of the end point, the target is determined by specifying a “null” and an “alternative” proportion of responses. For example, in the widely used “Simon optimal” design (4), a null of 5% and an alternative of 25% require a target of four responses in 30 patients (given α = 0.05 and β = 0.1). Many phase II designs are “two stage,” allowing investigators to stop a trial early if it is clear that treatment is not of benefit.
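The "four responses in 30 patients" target can be checked numerically. The sketch below computes the exact operating characteristics of a two-stage rule using only the binomial distribution. The final rule (declare the agent active with at least four responses in 30) comes from the text. The first-stage rule (stop if there are no responses in the first nine patients) is not stated in this article and is an assumption taken from the commonly tabulated Simon optimal design for these error rates. The function and variable names are illustrative.

```python
from math import comb


def binom_pmf(k, n, p):
    """P(X = k) for X ~ Binomial(n, p)."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


def prob_declare_active(p, r1, n1, r, n):
    """Probability that a two-stage design declares the agent active.

    The trial stops for futility if stage 1 yields r1 or fewer responses
    in n1 patients; otherwise it continues to n patients and the agent is
    declared active if total responses exceed r.
    """
    n2 = n - n1
    total = 0.0
    for x1 in range(r1 + 1, n1 + 1):          # stage 1 outcomes that continue
        need = max(r + 1 - x1, 0)             # responses still needed in stage 2
        tail2 = sum(binom_pmf(x2, n2, p) for x2 in range(need, n2 + 1))
        total += binom_pmf(x1, n1, p) * tail2
    return total


# Assumed design: stop if 0/9 respond; declare active if >= 4/30 respond.
R1, N1, R, N = 0, 9, 3, 30

alpha = prob_declare_active(0.05, R1, N1, R, N)   # false-positive rate at the null
power = prob_declare_active(0.25, R1, N1, R, N)   # power at the alternative
pet_null = sum(binom_pmf(x, N1, 0.05) for x in range(R1 + 1))  # early stop at null

print(f"alpha = {alpha:.3f}, power = {power:.3f}, P(early stop | null) = {pet_null:.2f}")
```

Under these assumptions the computed type I error stays at or below 0.05 and the power at or above 0.90, consistent with the error rates quoted above.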
The choice of null and alternative response rates is based on considerations such as the therapies available for the patient population under study and the question being addressed. A lower bar might be chosen when there is no standard therapy, whereas a higher bar is required when there are approaches of known effectiveness. The null and alternative response rates have an important impact on whether an agent is declared worthy of further study. An agent associated with, say, a 20% response rate would likely be declared worthy of further study if the null and alternative were set at 5% and 20%; if rates of 10% and 30% were chosen instead, the agent would be declared worthy of further study only about half the time. We are therefore interested in how investigators choose the null for phase II trials.
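The effect of the chosen bar on the "go" decision can be illustrated with a small calculation. The article does not specify the designs behind the "about half the time" figure, so this sketch assumes the simplest case: a single-stage exact binomial design with α = 0.05 and β = 0.10, using the smallest sample size that satisfies both error constraints. The helper names are hypothetical.

```python
from math import comb


def upper_tail(r, n, p):
    """P(X >= r) for X ~ Binomial(n, p)."""
    return sum(comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(r, n + 1))


def smallest_single_stage(p0, p1, alpha=0.05, beta=0.10, n_max=200):
    """Smallest n (and cutoff r) such that declaring 'active' when X >= r
    has type I error <= alpha at p0 and power >= 1 - beta at p1."""
    for n in range(1, n_max + 1):
        for r in range(n + 1):
            if upper_tail(r, n, p0) <= alpha:
                # r is the smallest cutoff that controls alpha for this n;
                # any larger cutoff would only reduce power.
                if upper_tail(r, n, p1) >= 1 - beta:
                    return n, r
                break
    raise ValueError("no design found up to n_max")


TRUE_RATE = 0.20  # response rate of the hypothetical agent in the text

n_low, r_low = smallest_single_stage(0.05, 0.20)     # low bar: 5% vs 20%
n_high, r_high = smallest_single_stage(0.10, 0.30)   # high bar: 10% vs 30%

go_low_bar = upper_tail(r_low, n_low, TRUE_RATE)
go_high_bar = upper_tail(r_high, n_high, TRUE_RATE)

print(f"5%/20% design:  n={n_low}, need >= {r_low}; P(go) = {go_low_bar:.2f}")
print(f"10%/30% design: n={n_high}, need >= {r_high}; P(go) = {go_high_bar:.2f}")
```

With the low bar, the probability of a "go" decision is at least 0.90 by construction, because the true rate equals the alternative. With the high bar, the same agent clears the target roughly half the time, in line with the statement above.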
A key issue is whether historical data are required to determine the null. For most diseases, second- and third-line cytotoxics have not shown a survival benefit, and response rates are typically very low. In this setting, it is highly unlikely that a tumor will shrink in the absence of treatment, and appropriately, the null is often set at slightly above zero, with 5% or 10% being typical. Such trials need not make reference to historical data. In contrast, when a novel agent is added to an existing standard in the hope of increasing response rates over and above those expected from the standard regimen alone, historical data on the response rates to the standard regimen are required. Similarly, some agents are thought to slow disease progression, rather than lead to rapid tumor regression, necessitating an end point such as progression-free or overall survival at 1 year. A target for 1-year survival clearly needs to be developed by reference to historical data on typical survival rates in the absence of the experimental agent.
This study focused on the use of historical data in the design of phase II trials. We reviewed a sample of published phase II trials to estimate the proportion of trials that required historical data, the source of the historical estimates, and whether adjustments were made for sampling error or differences in case mix between the historical and contemporary patient populations.
Materials and Methods
Search strategy. We searched Medline for papers with the term “phase II” in the title or abstract that were published in either the Journal of Clinical Oncology or Cancer between June 1, 2002, and June 1, 2005. These two journals were selected because they publish the largest number of phase II trials in which a statistical design is reported (5). There are no data to suggest that trials published in these two journals are importantly unrepresentative of phase II oncology trials in general.
Eligibility. To be included in the review, the paper had to be an original report of a trial in cancer patients that was either a single-arm study or a multiarm study in which a hypothesis for each arm was tested independently. Using these criteria, a study that accrued patients with two distinct tumor types and defined a specific outcome for each was included, whereas a randomized trial comparing which of two doses of an agent was superior was not. The primary end point had to be a binary measure of an antitumor effect, such as tumor response or 1-year survival. Trials examining only toxicity or quality of life end points were excluded. The final requirement was that both a null and an alternative response rate had to be specified. Trials in which rates were not explicitly defined were included, if rates could be inferred from the stated design by analyzing the number of patients accrued in each stage, and the number of responses required to declare the agent active. Reasons for excluding a trial were defined as not a clinical trial; end point not oncologic; comparative trial; primary end point not binary; either null or alternative not specified. If a trial met more than one exclusion criterion (e.g., a randomized symptom control trial), only the first in the above list was recorded (in this case, end point not oncologic).
Data extraction. For each trial included, we documented the intervention (single agent or combination treatment); the primary end point (tumor response, tumor marker, overall survival, progression free survival, or others); and the null and alternative response rates. We also documented the results as null rejected (agent declared active and worthy of further study); alternative rejected (agent inactive); or unclear. For trials with multiple arms, we classified the result as null rejected if this was true for any arm.
We also documented whether the trial required historical data to determine the target response rate. Because this is somewhat subjective, depending on the patient group under study, the end point of interest, and the alternative treatments available, we used three objective criteria. First, we included trials where the end point was survival, progression, or recurrence rate, on the grounds that expected event rates can only be assessed in the light of prior data. We also included trials where the null or alternative were explicitly justified with reference to historical data. Finally, we included trials where the specified null was a tumor response rate of more than 10%. Our rationale was that a high tumor response rate for a null hypothesis suggests a historical level of activity that the investigators wish to supersede. For example, it would be unusual to declare a treatment inactive where, say, 30% of patients experienced tumor response, if the expected response rate in the absence of the investigational agent was close to zero, or if there were no standard therapies of demonstrated effectiveness.
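The three criteria can be written as a simple decision rule. The sketch below is only an illustration of the classification logic described above; the field names, end point labels, and function name are hypothetical and were not part of the review protocol. The threshold is left as a parameter so that the third criterion can be varied.

```python
# End points for which expected event rates can only be judged against prior data.
TIME_TO_EVENT_END_POINTS = {
    "overall survival",
    "progression-free survival",
    "recurrence",
}


def requires_historical_data(end_point, null_rate, justified_by_history, threshold=0.10):
    """Return True if a trial is classified as requiring historical data.

    end_point            -- primary end point label (hypothetical vocabulary)
    null_rate            -- null response rate as a proportion, e.g. 0.20
    justified_by_history -- True if the null or alternative was explicitly
                            justified with reference to historical data
    threshold            -- null tumor response rate above which prior
                            activity is implied (10% in the main analysis)
    """
    if end_point in TIME_TO_EVENT_END_POINTS:       # criterion 1
        return True
    if justified_by_history:                        # criterion 2
        return True
    return end_point == "tumor response" and null_rate > threshold  # criterion 3


# Example: a tumor response trial with a 20% null and no stated justification.
print(requires_historical_data("tumor response", 0.20, False))  # True
```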
For trials defined as requiring historical data, we recorded the justification for the null as none; reference to institutional experience; reference to historical experience without citation; or specific citations given. Where a justification was given, we also recorded the number of patients in the historical comparison and whether these were from phase I or II trials, phase III trials, or cohort studies. Statistical adjustments made to address potential differences in patient characteristics of the historical and contemporary groups were also considered, as were adjustments made to address statistical imprecision in the estimate from the historical cohort. Eligible trials with multiple arms typically used identical designs for each arm. However, in a small number of studies, different null or alternative response rates were used. In these cases, one arm was selected at random for inclusion in the study.
Results
We retrieved 251 papers, of which 117 (47%) were found to be ineligible (Table 1). Of the 134 included trials, 64 (48%) studied a single agent and 70 (52%) studied a regimen that included more than one agent. The most common primary end point was tumor response (106 trials; 79%), followed by progression-free survival (11; 8%), tumor marker (10; 7%), and overall survival (6; 4%). One trial used an idiosyncratic definition of therapeutic success that included tumor response. The null and alternative response rates are shown in Table 2. The null was 10% or less in about one-half of trials. In 78 trials (58%), the null was rejected; for 44 (33%) and 12 (9%) of the trials, respectively, the alternative was rejected, or results were unclear.
Category | Number (%)
---|---
Total reports evaluated | 251
Excluded from analysis | 117 (47%)
Reasons for exclusion |
Not a clinical trial | 7 (6%)
End point not oncologic | 9 (8%)
Comparative trial | 21 (18%)
End point not binary | 13 (11%)
Either null or alternative unspecified | 67 (57%)
Eligible for analysis | 134 (53%)
Null response rate (%) | Number of trials (% of total) | Alternative response rate* (%) | Number of trials
---|---|---|---
<5 | 8 (6%) | 10 | 2
 | | 15 | 5
 | | 20 | 1
5-10 | 64 (48%) | 15 | 3
 | | 20 | 32
 | | 25 | 13
 | | 30 | 16
11-20 | 25 (19%) | 30 | 9
 | | 35 | 4
 | | 40 | 11
 | | 45 | 1
21-30 | 15 (11%) | 35 | 1
 | | 40 | 1
 | | 45 | 3
 | | 50 | 10
31-40 | 7 (5%) | 50 | 1
 | | 55 | 1
 | | 60 | 5
41-50 | 10 (7%) | 70 | 7
 | | 75 | 2
 | | 80 | 1
51-60 | 3 (2%) | 80 | 3
61-70 | 2 (1%) | 80 | 1
 | | 85 | 1
Total | 134 (100%) | | 134
*Rounded to the nearest 5%.
We defined 70 trials (52%) as requiring historical data for design. Of these, 53 (76%) had a tumor response end point: 51 trials had a null >10%, and 2 trials had a null of 10% or less but explicitly specified that historical data were used. The remaining 17 (24%) trials had a time-to-event end point, such as survival, progression, or recurrence rate. Prior published data were used to establish the null in 34 (49%) of the trial reports. These sources were phase I or phase II trials in 25 reports, phase III trials in 4, and both phase I or II and phase III trials in 3; 1 report referred to a cohort study and 1 to a general review. The total number of patients in the historical studies cited ranged from 40 to 6,026, with a median (interquartile range) of 207 (103, 355). Notably, 32 reports (46%) cited no historical data, 3 (4%) cited historical experience without a specific citation, and 1 cited “institutional experience” not available for peer review.
For this review, we initially interpreted “citation of prior published data” liberally, including, for example, a brief mention in the discussion. We then reanalyzed the data using more stringent criteria. These criteria were developed before any analysis of the association between citation of historical data and study outcome was conducted. Our first requirement was that there should be a single explicit estimate of a response rate. For example, a report stating that “combining data from the prior phase IIs, the response rate was 23%” was acceptable, whereas a statement that “response rates in prior trials have been in the range 15% to 34%” was not. Moreover, we specified that a trial must either give an explicit justification for the null or cite prior studies in the methods section. Using these criteria, only nine trials (13% of the total requiring historical estimates) were deemed to have appropriately cited prior data.
We also conducted an exploratory analysis to investigate the association between study results and how prior data were cited. As shown in Table 3, there was a statistically significant association (P = 0.032 by Fisher's exact test) between result (reject null, reject alternative, unclear) and citation type (appropriate, not appropriate, none). Excluding studies with unclear results, trials that cited prior data appropriately were significantly less likely to reject the null hypothesis and declare the approach worthy of further study than trials that either cited prior data inappropriately or did not cite prior data. This relationship remained after controlling for the difference (δ) between null and alternative in a multivariable logistic regression (odds ratio, 0.10; 95% confidence interval, 0.02-0.52; P = 0.006).
Citation of historical data | Number | Conclusions: unclear | Conclusions: clear | Results: reject alternative (agent not worthy of further study) | Results: reject null (agent worthy of further study)
---|---|---|---|---|---
No historical data cited | 32 | 3 | 29 | 6 (21%) | 23 (79%)
Historical data cited: did not meet criteria | 29 | 2 | 27 | 4 (15%) | 23 (85%)
Historical data cited: met criteria | 9 | 0 | 9 | 6 (67%) | 3 (33%)
NOTE: By Fisher's exact test, studies that met the criteria for appropriate citation of prior data were less likely to reject the null [3/9 (33%)] than those that cited prior data but did not meet the criteria [23/27 (85%); P = 0.006], those that cited no prior data [23/29 (79%); P = 0.016], or both of these groups combined [46/56 (82%); P = 0.005].
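The pairwise comparisons in the note to Table 3 can be recomputed from the counts in the table. The sketch below implements a two-sided Fisher's exact test from the hypergeometric distribution, using one common convention for the two-sided P value (summing the probabilities of all tables no more likely than the observed one). The article does not state which software or two-sided convention was used, so that choice is an assumption; the function name is illustrative.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact P value for the 2x2 table [[a, b], [c, d]].

    Sums the hypergeometric probabilities of every table with the same
    margins whose probability does not exceed that of the observed table.
    """
    row1, col1, total = a + b, a + c, a + b + c + d
    denom = comb(total, row1)

    def prob(k):
        # Probability that the top-left cell equals k given fixed margins.
        return comb(col1, k) * comb(total - col1, row1 - k) / denom

    k_min = max(0, row1 - (total - col1))
    k_max = min(row1, col1)
    p_obs = prob(a)
    # Small relative tolerance guards against floating-point ties.
    return sum(prob(k) for k in range(k_min, k_max + 1) if prob(k) <= p_obs * (1 + 1e-9))


# Rows: (reject null, reject alternative), studies with unclear results excluded.
met = (3, 6)            # cited prior data and met criteria
not_met = (23, 4)       # cited prior data but did not meet criteria
none_cited = (23, 6)    # cited no prior data
either = (46, 10)       # the two groups above combined

p_vs_not_met = fisher_exact_two_sided(*met, *not_met)
p_vs_none = fisher_exact_two_sided(*met, *none_cited)
p_vs_either = fisher_exact_two_sided(*met, *either)

print(f"{p_vs_not_met:.3f} {p_vs_none:.3f} {p_vs_either:.3f}")
```

Under this convention the three P values round to the figures reported in the note.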
We then repeated the analysis changing the third criterion (null response rate >10%) for classifying a trial as requiring historical data. The results did not change: the proportion of trials giving no rationale for the null was 46%, 43%, and 42% if the criterion for requiring historical data was a null tumor response rate of >10%, >15%, and >20%, respectively.
Noteworthy was that not a single study in our analysis incorporated any statistical method to account for the possibility of sampling error or for differences in case mix between the phase II sample and the historical cohort.
Discussion
The choice of the target response rate is a key aspect of phase II design. Poorly chosen targets reduce the ability of phase IIs to determine which agents or approaches should be considered for testing in definitive phase III trials and which should not be evaluated further. The consequences are that patients may be exposed to treatments that are unlikely to be effective, while jeopardizing the development of therapies that are more likely to be beneficial and to improve standards of care. The cost in resources, both in dollars and in time, is difficult to overestimate.
The phase II studies we evaluated include both those that measure tumor regression and those that assess the proportion of patients who had not progressed or who were alive at a fixed time after treatment. We found that although a high proportion (52%) of phase II trials required historical data to determine the null, few justified the choice of null by clearly explaining the results of prior studies. We were, furthermore, unable to find a single study that incorporated statistical adjustments for either sampling error or case mix. Trials that failed to report a rationale for the historical bar, which the new therapy had to exceed, were much more likely to conclude that the new therapy was “active” and worthy of further study. For this analysis, we considered explicit reference to historical data necessary when the null tumor response rate exceeded 10% or when a time-to-event end point such as survival at 1 year was used. Both of these design characteristics imply some level of activity for the historically treated group.
Although a potential limitation of this analysis was that it was restricted to trials reported in the Journal of Clinical Oncology and Cancer, we have no reason to believe that trials published in other journals would differ in terms of the need for historical data. However, it is possible that by focusing on journals that publish a higher-than-average proportion of phase II trials with statistical designs, our estimates of design shortcomings are conservative. It is also possible that details on the historical data used for study design may have been included in the study protocols, but not the published reports we analyzed. However, reports that omitted details of historical data were more likely to be interpreted as positive (P = 0.005). Moreover, researchers attempting to analyze phase II results should not have to take on faith a critical design decision such as the choice of the null. It may also be the case that an approach approved for one disease is tested in a phase II for a different disease. Physicians considering the clinical use of this approach for the new indication should be able to evaluate key trial characteristics critically.
To our knowledge, this is the first report on the use of historical data in phase II design. There are, however, some estimates based on prior reviews that are comparable to those presented here. For example, in a systematic review of phase II study design (5), just over half of the phase II trials published in Cancer and the Journal of Clinical Oncology reported a study design, and the null was rejected in 74% of trials; both estimates are close to what we report.
With the exception of single-agent trials in patients with untreatable tumors, we believe that the rationale for the null level of response must be made explicit. We have the following recommendations for phase II design and reporting (summarized in Table 4). First, if the null is based on historical data, these should be cited and described in the methods. The description should include the dates when the patients were treated, the type of study (phase II, phase III, cohort study), and details of the therapy. With respect to the dates of accrual, it is well recognized that as therapies are accepted, they are used earlier in the natural history of disease in patients with an inherently better prognosis independent of treatment. A single estimate should be derived from the historical data: specifying only a range should be avoided. For instance, take the case where three prior studies had been reported with sample sizes of 1,000, 100, and 20 and response rates of 33%, 22%, and 15%. This is a total of 355 responses in 1,120 patients (32%). It is preferable to give this single historical response rate of 32% than to say only that “response rates in prior studies varied from 15% to 33%,” on the grounds that the latter offers no guidance as to the appropriate null: investigators tempted to pick the middle of the range would underestimate the true response rate and inflate the risk of a false positive.
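The pooled estimate in the three-study example is a patient-weighted average, which the following short sketch reproduces. It uses only the sample sizes and response rates given in the text; the variable names are illustrative.

```python
# (sample size, reported response rate) for the three prior studies in the example
studies = [(1000, 0.33), (100, 0.22), (20, 0.15)]

# Recover the number of responders in each study, then pool across studies.
responses = sum(round(n * rate) for n, rate in studies)
patients = sum(n for n, _ in studies)
pooled_rate = responses / patients

# Midpoint of the reported range, which ignores study size.
rates = [rate for _, rate in studies]
range_midpoint = (min(rates) + max(rates)) / 2

print(f"pooled: {responses}/{patients} = {pooled_rate:.0%}")
print(f"midpoint of range: {range_midpoint:.0%}")
```

Because the largest study dominates the pooled estimate, the midpoint of the range (24%) falls well below the pooled rate (32%), which is why a null chosen from the range alone would be too lenient.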
Describe historical cohort |
Type of study |
Diagnoses of patients (disease and stage) |
Dates of accrual |
Treatment received |
Number of patients |
Give single estimate, rather than a range, for historical response/survival rate |
Give explicit justification for why null is higher, lower, or equal to the historical estimate |
Consider adjusting null to account for sampling variation |
Consider adjusting phase II results to account for differences in case mix, when possible |
The relationship between the null and the historical data should be detailed clearly. For example, in the case of a novel chemotherapy agent added to single-agent cisplatin for non–small cell lung cancer, the historical data should include the response rates to cisplatin alone. In this case, the null might rationally be set close to or slightly higher than the historical response rate. Alternatively, if the intervention was less toxic or more convenient than the treatment in the historical cohort, it would be reasonable to set the null at or slightly lower than the historical response rate.
An additional consideration is the use of statistical methods to adjust for imprecision in historical estimates. Such imprecision can have an important effect on study design. For instance, if the historical response rate in 50 patients receiving standard therapy is 50%, the 95% confidence interval around this proportion is ∼35% to 65%. Imagine that investigators set the null at 50% for a trial of standard therapy plus novel agent. If the true response rate were in fact close to 65%, there is a high probability that an ineffective novel agent would be deemed worthy of further study. A statistical approach to this problem has been proposed by Fazzari et al., who suggest using the upper bound of a one-sided 75% confidence interval for the historical data as the null response rate (3).
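The interval in this example, and an upper confidence bound of the kind suggested for setting the null, can be computed directly from the binomial distribution. The sketch below uses exact (Clopper-Pearson) limits found by bisection. The article does not say which interval method was used for the ∼35% to 65% figure or by Fazzari et al., so the choice of the exact method is an assumption, and the function names are illustrative.

```python
from math import comb


def binom_cdf(x, n, p):
    """P(X <= x) for X ~ Binomial(n, p)."""
    return sum(comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(x + 1))


def _solve_decreasing(f, target, tol=1e-10):
    """Find p in [0, 1] with f(p) = target, for f decreasing in p."""
    lo, hi = 0.0, 1.0
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if f(mid) > target:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


def exact_upper_bound(x, n, conf):
    """One-sided exact upper confidence bound for a proportion."""
    if x == n:
        return 1.0
    return _solve_decreasing(lambda p: binom_cdf(x, n, p), 1 - conf)


def exact_lower_bound(x, n, conf):
    """One-sided exact lower confidence bound for a proportion."""
    if x == 0:
        return 0.0
    return _solve_decreasing(lambda p: binom_cdf(x - 1, n, p), conf)


# Historical cohort from the example: 25 responses in 50 patients (50%).
x, n = 25, 50

# Two-sided 95% interval = pair of one-sided 97.5% bounds.
ci_low = exact_lower_bound(x, n, 0.975)
ci_high = exact_upper_bound(x, n, 0.975)

# Candidate null: upper bound of a one-sided 75% interval.
null_candidate = exact_upper_bound(x, n, 0.75)

print(f"95% CI: {ci_low:.1%} to {ci_high:.1%}")
print(f"one-sided 75% upper bound: {null_candidate:.1%}")
```

With these inputs the two-sided interval is close to the 35% to 65% quoted above, and the one-sided 75% bound lies a few percentage points above the observed 50%, giving a null that is somewhat more demanding than the point estimate.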
Differences in case mix between the historical cohort and the study sample should also be considered. End points such as tumor response and survival rate are at least partly predictable using variables such as cancer stage, tumor grade, or biomarkers. The conclusions of a phase II may be misleading if the patients accrued differ on important prognostic variables from those in the historical cohort. For example, if patients in phase II had, on average, lower stage disease than those in the historical cohort, the phase II would overestimate the value of the investigational agent. Techniques have been described that use multivariable models to adjust the comparison between phase II and historical data to account for any differences on prognostic variables (6).
Our analysis shows that over half of phase II trials require historical data to determine a null response rate. This proportion is likely to increase as more effective approaches are identified. We make some simple recommendations to improve the design and reporting of such trials. More appropriate use of historical data in phase II design will improve both the sensitivity and specificity of phase II for eventual phase III success, avoiding both unnecessary definitive trials of ineffective agents and early termination of effective drugs for lack of apparent benefit.
Grant support: NIH grant CA103169.
Note: The authors declare no conflict of interest.
Current address for V. Ballen: Mount Sinai School of Medicine, New York, NY 10029.