Abstract
Advances in anticancer therapies have provided crucial benefits for millions of patients who are living long and fulfilling lives. Although these successes should be celebrated, there is certainly room to continue improving cancer care. Increased long-term survival presents additional challenges for determining, through clinical trials, whether new therapies further extend patients’ lives as measured by overall survival (OS), commonly known as the gold standard endpoint. As a result, there is increasing reliance on earlier efficacy endpoints, which may or may not correlate with OS, to maintain the timely pace of translating innovation into novel therapies available for patients. Even when not powered as an efficacy endpoint, OS remains a critical indicator of safety for regulatory decisions and is a key aspect of the FDA’s Project Endpoint. Unfortunately, in the pursuit of earlier endpoints, many registrational clinical trials lack adequate planning, collection, and analysis of OS data, which complicates interpretation of a net clinical benefit or harm. This article shares best practices, proposes novel statistical methodologies, and provides detailed recommendations to improve the rigor of using OS data to inform benefit–risk assessments, including incorporating the following in clinical trials intending to demonstrate the safety and effectiveness of cancer therapy: prospective collection of OS data, establishment of fit-for-purpose definitions of OS detriment, and prespecification of analysis plans for using OS data to evaluate for potential harm. These improvements hold promise to help regulators, patients, and providers better understand the benefits and risks of novel therapies.
Introduction
Overall survival (OS) is the gold standard clinical trial endpoint in oncology. Although OS is a readily recognized efficacy endpoint, OS is also an important safety endpoint because it reflects therapy-driven toxicities. However, there are many situations in oncology in which OS may not be feasible as the primary efficacy endpoint. This is especially true for cancers with long natural histories, for many rare diseases, or when multiple approved therapies are available for patients to choose upon disease progression. In these settings, there is increasing reliance on earlier efficacy endpoints, such as progression-free survival (PFS), event-free survival, minimal residual disease negativity, or overall response rate (ORR) and duration of response.
When regulatory applications rely on randomized controlled trials with earlier efficacy endpoints, the FDA has continued to require OS data for assessing safety. Improvements in PFS in the setting of potential detriment in OS have not been considered to be indicative of clinical benefit. Underscoring the importance of OS as a safety metric, several recent examples demonstrated that common early endpoints do not always predict the treatment effect or potential harm (1). Analyses of six randomized trials involving PI3K inhibitors for hematologic malignancies (2), and three trials involving PARP inhibitors for ovarian cancer, demonstrated PFS improvements but were associated with potential OS detriments for patients receiving the drugs of interest (1). Another randomized trial investigating venetoclax plus bortezomib versus bortezomib alone for multiple myeloma found very favorable PFS, ORR, and minimal residual disease negativity rates for the combination therapy (3). Strikingly, the trial found a statistically significant OS detriment that suggested patients receiving the combination therapy were twice as likely to die during the trial compared with patients receiving bortezomib alone. Discordant results such as these can occur for numerous reasons, including significant toxicities within the context of modest efficacy, inadequate dose optimization, trial considerations such as crossover and lack of statistical power, or treatment effect variability between trial subgroups (1). These issues are key challenges being considered through the FDA’s Project Endpoint (4).
When OS is not a primary or key secondary endpoint, it is analyzed descriptively without formal planned statistical calculations to control for Type I errors. Limited OS data availability at the time of primary endpoint assessment further complicates analyses, particularly for diseases with long survival times. Additionally, extended follow-up for OS data collection is often absent following primary endpoint analysis. These issues create challenges for interpreting OS data in regulatory applications.
In July 2023, the FDA collaborated with the American Association for Cancer Research (AACR) and the American Statistical Association (ASA) to conduct a workshop titled “Overall Survival in Oncology Clinical Trials” to discuss emerging challenges with timely assessment of OS for novel therapies (5). Session working groups composed of researchers, statisticians, physicians, patients, regulators, and industry representatives considered issues including trial design, prespecification of OS, post hoc OS analyses, subgroups considerations in design and analysis, and incorporating OS into benefit–risk assessments. Working groups developed recommendations to improve collection and analysis of OS data, which were presented at the public workshop to gain additional input. This article presents consolidated recommendations and considerations discussed by the working groups at this joint FDA–AACR–ASA workshop to address the challenges associated with the development of a comprehensive plan for the evaluation of OS in cancer clinical trials.
Trial Design and Planning Considerations
Prospective planning, randomization, and well-designed control arms are foundational for clinical trials to efficiently generate evidence supporting the efficacy and safety of novel therapies. However, these features alone are insufficient for robust interpretation of OS. Without comprehensive planning at the trial design stage or adequate follow-up to collect OS data, trials may produce insufficient or unreliable OS results. Trial design considerations discussed by the panel to improve the use of OS data are included in Table 1. Most importantly, the intent is to encourage thoughtful and comprehensive planning during trial design for fit-for-purpose OS data collection and assessment of harm given the disease and clinical context. Even if OS is not a primary or key secondary endpoint, OS data are essential to inform patient safety.
Table 1. Trial design considerations to improve the use of OS data.
• Consider OS as a primary efficacy endpoint in the following circumstances:
  • The disease is acutely life-threatening or has a significant impact on patient survival
  • Standard-of-care treatments have proven OS benefits
  • Earlier endpoints are not validated for the disease being studied
• The panel recommended all trials be designed to collect and assess OS to inform patient safety, even if the trial is not sized to detect a statistically significant difference in OS from an efficacy standpoint
• A measure of “harm” including OS and other safety endpoints can be prespecified to rule out specific safety concerns when OS is not the primary efficacy endpoint
• Trials with crossover elements may complicate OS analysis but may be appropriate under certain circumstances
• Unequal randomization may reduce statistical power of OS analysis, but may be needed when there is substantial prior evidence of a favorable benefit–risk profile or patient recruitment is expected to be challenging with equal randomization
• Planning for adequate length of follow-up is important to evaluate patient safety and may be informed by disease setting, patient population, expected survival time, degree of uncertainty acceptable to assess trial objectives, feasibility of obtaining timely data, and expected rate of long-term toxicities
• The panel recommended Data Monitoring Committees have access to OS data for futility and safety analyses and pre-established guidance included in the committee charter
Careful consideration of the disease context and treatment intent is essential to determining whether OS should be considered as a primary efficacy endpoint for registrational trials. In trials evaluating therapies for acutely life-threatening diseases, OS events accumulate relatively quickly, and thus, efficacy of an investigational therapy may be readily demonstrated. Additionally, when available therapies have demonstrated OS improvements it is often appropriate to test new therapies against the same metric. In these settings, comparisons to standard of care (SoC) or other active comparator control arms, such as “physician’s choice,” may require a statistically significant improvement on OS or a statistically significant demonstration of noninferiority (6).
Powering trials for OS as a primary endpoint may not be practicable in many contexts, such as diseases with long survival times or rare diseases with few patients. In such instances, OS has frequently been included as a secondary endpoint with a prespecified formal efficacy analysis in conjunction with earlier efficacy endpoints. Regardless of prespecification of a formal statistical analysis for efficacy, analysis of OS as a safety endpoint to evaluate for “potential harm” is always conducted to inform benefit–risk assessments.
Additionally, there are now many disease and clinical contexts in which powering OS even as a secondary efficacy endpoint is not practical because of successes in extending patients’ lives and the many treatment options available for patients following progression. In these contexts, planned collection of OS data and prespecified OS analyses designed to evaluate for “potential harm” are currently uncommon but would increase rigor and confidence in trial results. Fit-for-purpose OS data collection and a definition of harm including OS and other safety metrics can be informed by the natural history of the disease, clinical context, SoC, known safety issues, and input from patients. It may not be practical to rule out all potential harm, such as by excluding an OS hazard ratio (HR) of 1, so different thresholds may be used to inform benefit–risk assessment and decision-making at different timepoints during a trial with planned interim analyses. Additional details and potential statistical methodologies are discussed in the subsequent section.
Certain trial design features such as unequal randomization and crossover should be carefully evaluated based on the clinical context and trial objectives because of the potential impact of these design features on the OS analysis and interpretation. These features may appeal to potential trial participants by increasing their chances of receiving the investigational therapy. This appeal may be driven inappropriately by a perceived lack of clinical equipoise, reflecting the unproven assumption that the investigational therapy is superior to SoC (7). Downplaying established benefits of SoC therapies and discouraging control arm recruitment can potentially delay the widespread availability of novel therapies, increase exposure to unknown harms, and reduce the chances of regulatory approval (8). In particular, crossover can complicate OS interpretation by obfuscating treatment effect attribution in many cases (9, 10). Crossover may be appropriate in certain clinical contexts, for example, when: a superiority analysis demonstrates an OS benefit during the trial (11, 12); the investigational therapy is approved for a later line of therapy (13); drugs similar to the investigational therapy are available off-study; simultaneously developing a drug for multiple lines of therapy; or challenging patient recruitment is expected without crossover. Additionally, reducing barriers to participation (14–16), along with well-designed control arms, can improve patient recruitment without the need for a crossover. Similar considerations apply to unequal randomization. Trials using a 2:1 or greater randomization scheme may unequally expose patients to toxicity, delay trial readout, and increase the required number of patients to adequately assess safety and efficacy (17, 18). However, unequal randomization can be preferable when there is substantial prior experience suggesting a favorable benefit–risk profile, potentially combined with early superiority/futility analysis (8).
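The cost of unequal randomization for an OS analysis can be illustrated with the widely used Schoenfeld approximation for the number of deaths required by a log-rank test. The sketch below is not taken from the workshop: the function name `required_events`, the target HR of 0.7, the one-sided alpha of 0.025, and the 80% power are illustrative assumptions only.

```python
from math import ceil, log
from statistics import NormalDist


def required_events(hr, alpha=0.025, power=0.8, ratio=1.0):
    """Approximate number of deaths needed to detect `hr` (Schoenfeld approximation).

    hr    : hazard ratio the analysis should detect (hypothetical input)
    alpha : one-sided Type I error rate
    power : 1 - Type II error rate
    ratio : randomization ratio, experimental:control (2.0 means 2:1)
    """
    z = NormalDist().inv_cdf
    p = ratio / (1.0 + ratio)  # fraction of patients allocated to the experimental arm
    return ceil((z(1.0 - alpha) + z(power)) ** 2 / (p * (1.0 - p) * log(hr) ** 2))


# Illustrative comparison of 1:1 and 2:1 allocation for the same target HR.
print("1:1 allocation:", required_events(0.7, ratio=1.0))
print("2:1 allocation:", required_events(0.7, ratio=2.0))
```

Under this approximation, the required number of deaths scales with 1 / (p(1 - p)), so moving from 1:1 to 2:1 allocation increases it by a factor of 0.25 / (2/9) = 1.125. The formula counts events rather than patients; translating events into enrollment and follow-up time would require further assumptions about accrual and survival.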
Prespecification of detailed plans for OS data collection and follow-up in a study protocol is critical to minimize missing data and to obtain reliable OS estimates that are not distorted by random highs or lows (19). Ideal follow-up durations and frequencies will vary based on the trial objectives regarding OS, the disease and treatment setting, patient population, expected toxicity profile, the presence of nonproportional hazards, expected intercurrent events, and other considerations important to patients. The long-term follow-up part of the trial may not require in-person clinic visits or the same intensity of data collection as required in the active follow-up part of a trial. It is also important to consider how to collect sufficient data regarding: intercurrent events, including subsequent therapy use; reasons for initiating subsequent therapy; time of next therapy; and crossover, as well as planned analyses to handle such intercurrent events using the International Council for Harmonisation (ICH) E9 (R1) estimand framework (20). Prematurely halting OS follow-up may yield unreliable assessments based on a limited number of OS events or provide inaccurate estimates if delayed survival benefits or potential harms are being considered. Conversely, prolonged follow-up may incorporate greater confounding from intercurrent events while delaying drug development. It is ideal to identify a time window for OS analysis when attribution to the investigational therapy is clearest. The intention of these considerations is to increase the completeness and utility of available OS data within common trial designs to support the continued rapid pace of medical advances for patients with cancer.
Prespecified Statistical Analysis Considerations
Prespecification of statistical analysis plans is crucial for achieving reliable interpretations of clinical data. Table 2 includes our condensed considerations to improve the prespecified statistical analysis of OS as a measure to identify safety signals. Importantly, it is critical to carefully consider fit-for-purpose metrics and thresholds to identify OS safety signals. Traditionally, studies often report an OS HR point estimate with a 95% confidence interval (CI), and this metric continues to be robust in many contexts. In other scenarios, such as with nonproportional hazards, it may be more appropriate to consider OS rates at landmark time points, restricted mean survival times, or other statistical analyses of survival curves (21). Prespecified statistical analyses may include formal hypothesis testing of OS if it is a primary or key secondary endpoint, or descriptive evaluations of OS if there is no formal testing or alpha allocation.
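As an illustration of the alternative summary measures mentioned above, the following minimal sketch computes a Kaplan-Meier curve, a landmark OS rate, and a restricted mean survival time (RMST) for a single arm. The function names and the toy data are hypothetical and serve only to show the calculations; a real analysis would use validated statistical software and would also report measures of uncertainty.

```python
def kaplan_meier(times, events):
    """Return [(t, S(t))] at each distinct death time.

    times  : follow-up time for each patient
    events : 1 if the patient died at that time, 0 if censored
    """
    data = sorted(zip(times, events))
    n_at_risk = len(data)
    surv = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        deaths = 0
        leaving = 0
        # Group all patients whose follow-up ends at time t.
        while i < len(data) and data[i][0] == t:
            deaths += data[i][1]
            leaving += 1
            i += 1
        if deaths > 0:
            surv *= 1.0 - deaths / n_at_risk
            curve.append((t, surv))
        n_at_risk -= leaving
    return curve


def landmark_survival(curve, t):
    """Estimated probability of being alive at landmark time t."""
    s = 1.0
    for time, surv in curve:
        if time > t:
            break
        s = surv
    return s


def rmst(curve, tau):
    """Area under the Kaplan-Meier step function from 0 to tau."""
    area, prev_t, prev_s = 0.0, 0.0, 1.0
    for time, surv in curve:
        if time >= tau:
            break
        area += prev_s * (time - prev_t)
        prev_t, prev_s = time, surv
    return area + prev_s * (tau - prev_t)


# Toy data (months); hypothetical values for illustration only.
times = [2, 4, 4, 6, 8]
events = [1, 1, 0, 1, 0]
curve = kaplan_meier(times, events)
print("Landmark OS at 5 months:", landmark_survival(curve, 5))
print("RMST up to 8 months:", rmst(curve, 8))
```

A between-arm comparison would apply the same functions to each arm and report the difference or ratio of landmark rates or RMSTs. The truncation time `tau` is itself a quantity to prespecify and should not exceed the follow-up that is actually available.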
Table 2. Considerations to improve the prespecified statistical analysis of OS as a measure to identify safety signals.
• Whether or not OS is formally tested, an analysis plan can prospectively detail the assessment timing and primary methods for analyzing OS
• Prespecify robust summary measures that can be used to identify OS safety signals, such as the OS HR. Landmark OS rates, ratios, or differences based on restricted mean survival times, or other OS metrics may also be considered
• Prespecify a threshold or a range of thresholds to indicate potential detriment in the OS summary measure, and evaluate the number of events needed to evaluate for OS detriment based on this threshold for decision-making
• When choosing such thresholds for OS detriment, consider the following:
  • Disease setting, length of expected survival, and expected benefit of the control arm
  • Input from patients and physicians
  • Expected number of OS events and acceptable level of uncertainty
  • Expectations for other efficacy and safety endpoints and regulatory pathway
  • Use of subsequent therapies or crossover
  • Potential for nonproportional hazards
  • Length of follow-up
  • All other available data
• A prespecified informed estimate of the number of OS events anticipated at each planned analysis timepoint can quantify the projected information available
• If nonproportional hazards are anticipated, early OS data may not represent the overall OS effect; prespecified extended follow-up, and additional sensitivity analyses may be needed
• Multiple prespecified sensitivity and supplementary analyses can help interpret results to account for intercurrent events or deviations from assumptions, such as proportional hazards, and are useful to investigate the robustness of results under diverse assumptions
• Prospectively consider potential intercurrent events and how they will impact OS analysis, be interpreted, and analyzed; the estimand framework can be helpful for this purpose
When OS is the primary endpoint, sufficient data collection to formally test OS for superiority is necessary. When OS is not the primary efficacy endpoint, trials can be designed to collect sufficient data to evaluate for OS detriment with some level of certainty. There are multiple approaches and considerations to mitigate the potential for substantial OS detriment that would suggest harm and, therefore, permit a conclusion of a net treatment benefit in the context of a marginal potential OS improvement but a clear positive treatment effect on early efficacy endpoints. For example, an OS HR point estimate and its 95% CI can be used to evaluate a specified threshold for OS detriment or a high probability of substantial detriment. A decision framework can be developed to evaluate for a high probability of substantial detriment and identify when additional data may be needed because of residual uncertainty (Fig. 1A). Furthermore, in disease settings with few expected OS events or potential violation of proportional hazards assumption, there are additional considerations for the thresholds specified and desired level of Type I and II errors (Fig. 1B). Identification of multiple thresholds or simulations under different parameter assumptions may help determine criteria for identifying potential harm. Regardless of the specific approach, it is important to justify the clinical relevance and prespecify a definition of harm that is appropriate to evaluate for in a particular study protocol.
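One way such a decision framework might be expressed is sketched below. This is not the framework shown in Fig. 1A, which is not reproduced here. The detriment threshold of 1.3, the 80% probability cut-off, the function name `assess_os_detriment`, and the three verdict labels are hypothetical. The sketch also assumes the common large-sample approximation that the standard error of the log HR is 1 / sqrt(p(1 - p)d) for d deaths and allocation fraction p, and it reads the normal approximation as an approximate probability statement about the true HR.

```python
from math import exp, log, sqrt
from statistics import NormalDist


def assess_os_detriment(hr_hat, n_events, threshold=1.3, alloc=0.5, harm_prob=0.8):
    """Classify an observed OS HR against a prespecified detriment threshold.

    hr_hat    : observed OS hazard ratio (experimental vs. control)
    n_events  : number of deaths contributing to the estimate
    threshold : HR regarded as substantial detriment (hypothetical value)
    alloc     : fraction of patients randomized to the experimental arm
    harm_prob : probability of exceeding the threshold that triggers concern (hypothetical)
    """
    se = 1.0 / sqrt(alloc * (1.0 - alloc) * n_events)  # approximate SE of log HR
    z = NormalDist().inv_cdf(0.975)
    lower = exp(log(hr_hat) - z * se)
    upper = exp(log(hr_hat) + z * se)
    # Approximate probability that the true HR exceeds the threshold.
    p_exceed = 1.0 - NormalDist(log(hr_hat), se).cdf(log(threshold))
    if upper < threshold:
        verdict = "detriment beyond threshold unlikely"
    elif p_exceed >= harm_prob:
        verdict = "high probability of substantial detriment"
    else:
        verdict = "residual uncertainty, more data may be needed"
    return {"ci": (lower, upper), "p_exceed": p_exceed, "verdict": verdict}


# Three hypothetical scenarios with different estimates and data maturity.
for hr_hat, d in [(0.9, 400), (1.6, 100), (1.1, 50)]:
    print(hr_hat, d, assess_os_detriment(hr_hat, d)["verdict"])
```

The third scenario shows why the number of events matters: an HR estimate near 1 that is based on few deaths neither rules out the threshold nor indicates harm, which is the situation in which additional follow-up may be requested.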
If OS has a prespecified hypothesis test, it is critical to prespecify alpha allocation to assure strong Type I error control, as well as informed projections of the available information and anticipated effect size at each planned analysis timepoint. Regardless of Type I error control, prospectively identifying the OS information fraction expected at the time of preplanned analyses would clarify possible degrees of uncertainty for different levels of data maturity, possibly via modeling and simulation. Sensitivity and supplementary analyses can also be prespecified to evaluate the robustness of results and deviations from assumptions, such as testing for proportional hazards (22, 23). These analyses may indicate additional follow-up or other post hoc supplementary analyses are needed.
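The link between information fraction and residual uncertainty can be made concrete with a simple calculation. The sketch below assumes a hypothetical schedule of 60, 120, and 240 deaths at three planned analyses, a detriment threshold of HR 1.3, and the same large-sample approximation for the standard error of the log HR, 1 / sqrt(p(1 - p)d). None of these values come from the workshop. For each analysis, it computes the probability that the upper confidence bound falls below the threshold when the therapy truly has no effect on OS.

```python
from math import log, sqrt
from statistics import NormalDist


def prob_rule_out(n_events, threshold, true_hr=1.0, alloc=0.5, alpha=0.025):
    """Probability that the one-sided (1 - alpha) upper bound for the HR is below `threshold`.

    The upper bound is below the threshold when
    log(hr_hat) < log(threshold) - z * se, with log(hr_hat) ~ N(log(true_hr), se^2).
    """
    se = 1.0 / sqrt(alloc * (1.0 - alloc) * n_events)
    z = NormalDist().inv_cdf(1.0 - alpha)
    return NormalDist().cdf((log(threshold) - log(true_hr)) / se - z)


# Hypothetical event counts at each planned analysis.
planned_events = {"interim": 60, "primary analysis": 120, "final": 240}
total = max(planned_events.values())
for name, d in planned_events.items():
    fraction = d / total
    print(f"{name}: information fraction {fraction:.2f}, "
          f"P(rule out HR >= 1.3) = {prob_rule_out(d, 1.3):.2f}")
```

Repeating the calculation under several assumed true HRs, or replacing it with a full simulation when proportional hazards are doubtful, would give a fuller picture of what each data cut can and cannot show.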
Crossover and use of subsequent therapies are examples of intercurrent events that complicate the assessment of OS; planning for them at the time of study design can improve statistical rigor (9). However, many methods to address crossover rely on unverifiable assumptions. The estimand framework is a useful way to specify strategies for intercurrent events in an analysis plan. As described in ICH E9(R1), constructing the estimand provides clarity on the clinical question of interest and the strategy to handle potential intercurrent events (20). Under the estimand framework, the intent-to-treat (ITT) principle can be reflected by a treatment policy approach, which assumes that the treatment effect of interest includes intercurrent events. Therefore, data obtained after intercurrent events can be included in the analysis using this approach, assuming that adequate data are collected after intercurrent events.
Post hoc Statistical Analysis Considerations
OS analyses that were not prespecified are considered post hoc. Ideally, OS evaluation is adequately planned to offer sufficient opportunity for patient follow-up and to estimate the treatment effect on OS. In some cases, studies initiated many years ago lacking adequate plans for patient follow-up result in sparse long-term OS data. It may be infeasible to collect additional data to reduce uncertainty in such a trial. Post hoc analyses may be warranted because of unanticipated events or results, emerging data in the disease setting, or other scenarios not considered at the planning stage. However, such analyses are inherently less credible than prespecified analyses and may have little ability to detect a difference in OS between treatment arms because they are underpowered or confounded by crossover or other factors. Table 3 includes considerations to maximize the utility of such post hoc analyses.
Table 3. Considerations to maximize the utility of post hoc analyses of OS.
• Post hoc analyses of biologically plausible subgroups are recommended to evaluate any safety signals, which may suggest that follow-up studies or indication restrictions may be necessary
• Simulation and supplementary analyses to investigate the robustness of results under diverse assumptions are useful, particularly if OS follow-up or analyses were not adequately planned
• Complete information for safety is critical; evaluating narratives and toxicity data is important to understand the causality of deaths, such as whether deaths are due to progression or toxicity
• Evaluation of potential harm based on the totality of the safety data is important, and consideration of multiple measures that reflect the full survival distribution is informative
• Post hoc thresholds for OS detriment can be based on similar considerations as for a prespecified threshold; however, in the post hoc setting, consider the following:
  • Multiple safety summary measures, including the OS HR
  • Multiple credible thresholds, or grid of thresholds, to inform decision-making
  • Projections for additional OS data that could be obtained; even if infeasible to collect, this could help contextualize the amount of information currently available
Post hoc analyses are considered descriptive and are complicated by limited OS information and post hoc selection of approaches to handle intercurrent events. Tipping point and sensitivity analyses could help evaluate how the observed results are impacted by various assumptions (22–24), and although such analyses are useful for evaluating the robustness of prespecified analyses, these approaches are more critical in the post hoc setting to evaluate a range of post hoc approaches. Simulations of future data can also evaluate the sensitivity of observed results and contextualize the amount of available OS information, including the context of what has been feasible to collect in similar trials (10, 25, 26). Such simulations may be useful in the post hoc setting or if OS data are limited. However, post hoc evaluation of intercurrent events is likely to be unreliable. It may not be possible to detect an OS detriment with improper study designs, extensive crossover, nonproportional hazards, or other potential confounding factors. Immature OS data also cause high uncertainty in treatment effect estimates.
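A simulation of future data of the kind described above might look like the following minimal sketch. It is not the method of the cited references. It assumes 1:1 randomization, so that a log HR estimate based on d deaths has variance of roughly 4 / d, and it treats the deaths already observed and the deaths still to come as independent increments combined by inverse-variance weighting. The observed HR of 1.2 on 40 deaths, the 160 additional deaths, the threshold of 1.3, and all function names are hypothetical.

```python
import random
from math import exp, log, sqrt
from statistics import NormalDist


def project_final_hr(hr_obs, d_obs, d_future, assumed_hr, n_sim=10_000, seed=1):
    """Simulate the pooled HR estimate after `d_future` additional deaths.

    hr_obs, d_obs : HR and number of deaths observed so far
    assumed_hr    : true HR assumed to generate the future deaths (scenario input)
    """
    rng = random.Random(seed)
    finals = []
    for _ in range(n_sim):
        # Log HR estimated from the future deaths alone.
        future_log_hr = rng.gauss(log(assumed_hr), sqrt(4.0 / d_future))
        # Inverse-variance weights are proportional to the number of deaths.
        pooled = (d_obs * log(hr_obs) + d_future * future_log_hr) / (d_obs + d_future)
        finals.append(exp(pooled))
    return finals


def prob_upper_bound_below(finals, d_total, threshold, level=0.95):
    """Share of simulated trials whose two-sided CI upper bound is below `threshold`."""
    z = NormalDist().inv_cdf(0.5 + level / 2.0)
    margin = exp(z * sqrt(4.0 / d_total))
    return sum(hr * margin < threshold for hr in finals) / len(finals)


# Scenario grid: what could further follow-up show under different assumed true HRs?
for assumed_hr in (0.8, 1.0, 1.2):
    finals = project_final_hr(1.2, 40, 160, assumed_hr)
    p = prob_upper_bound_below(finals, d_total=200, threshold=1.3)
    print(f"assumed HR {assumed_hr}: P(upper bound < 1.3) = {p:.3f}")
```

Because the assumed true HR cannot be verified, the output is best read across the whole scenario grid rather than for a single row. Even when the additional deaths cannot realistically be collected, the projection helps show how little or how much the currently available data can say.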
Some post hoc analyses are important for supplementary analysis of potential safety signals within the available data and identifying potential subgroups for which a restricted indication may be warranted. Obtaining complete safety information, such as evaluating narratives and toxicity data, is important for understanding death causality. Although relevant in prespecified settings, this approach is of particular importance in the post hoc setting or with limited OS data. Additionally, if data are available from other trials, drugs within the same class, real-world data, or other sources, those could be incorporated into post hoc analyses to provide supplementary evidence of a positive treatment effect or reduce concern for potential harm.
Post hoc assessments of a potential OS detriment typically exhibit high uncertainty in the observed summary statistics, thus increasing the importance of investigating the robustness of results under different scenarios. High uncertainty and marginal results may suggest additional follow-up is warranted. Several assumptions are necessary for post hoc assessments regarding the probability of varying levels of harm, and it is unclear which assumptions are most appropriate. Additionally, it is not possible to evaluate the bias when determining thresholds for OS detriment retrospectively under unverifiable assumptions. To assist assessments for the investigation of a potential OS detriment, a range of different post hoc thresholds could be used, considering multiple safety summary measures, projections for future data, and clinical context. It is also useful to assess the impact of multiple strategies to handle intercurrent events if they were not prespecified.
Subgroup Considerations
Precision medicine has led to greater subdivisions of patients in an attempt to match patients with therapies most likely to benefit them. Although precision medicine brings much promise for improving cancer outcomes, it also raises challenges related to smaller sample sizes for analyzing and interpreting findings within subgroups. Table 4 includes considerations discussed during the workshop on the role of subgroups in relation to the primary ITT analyses and the interpretation of OS subgroup results, with or without Type I error control. When evaluating subgroups, considerations are context-dependent, and it is important to prespecify planned analyses and interpretations. Prespecification of subgroups provides opportunities for better data collection and increases the credibility of subgroup findings.
Table 4. Considerations on the role of subgroups in relation to the primary ITT analyses and the interpretation of OS subgroup results.
• Dedicated analyses of known subgroups of patients, such as biomarker-positive subgroups or populations vulnerable to adverse events, can be prospectively planned and may require longer follow-up for OS
• Depending on the intended use within the overall development program, OS analyses in known subgroups may not need to be alpha-controlled. Emphasis on well-characterized subgroups, such as those defined by age, gender, race, and ethnicity, and on known prognostic or predictive biomarkers can improve the intrinsic validity of such evaluations
• Early interim subgroup analyses can be prespecified in order to mitigate concern for potential harm
• If there are no preliminary efficacy data for a particular subgroup, such as biomarker-negative patients, a study could be designed to enroll biomarker-positive patients only
• Consider the following when designing trials with known subgroups of interest:
  • Biological or mechanistic rationale for subgroups
  • Expected treatment effect, or lack of effect, in each subgroup
  • Expected contribution of the subgroups to the overall treatment effect
  • Expected prevalence, which ideally reflects the intended patient population, and enrollment in subgroups and their complement
  • Feasibility of stratification for prognostic or predictive subgroups
  • Objective of each planned analysis, whether for safety or efficacy
  • Detailed analysis plan, including statistical methods, measures of uncertainty, and prespecified thresholds for efficacy and/or harm within subgroups
  • If a formal test is planned: sample size that would be required to detect a clinically meaningful treatment effect in each subgroup
• Subgroup analyses are always of interest, even in the post hoc setting, to identify any subpopulations that may benefit the most, explore subgroups that seemed to lack benefit, evaluate potential OS detriment, or explore concerns related to external data
• Evaluate if there is repeated evidence of OS detriment in the same subgroup from other trials
• Subgroup results should be interpreted with caution, especially if OS data are immature or analyses were conducted post hoc
• Adaptive trial design features can facilitate enrichment of subgroups of interest, but may complicate assessment of the primary endpoint; prespecification of analysis methods and adequate planning are necessary to evaluate results
• If using adaptive designs, prospectively evaluate how they will impact the level of evidence from the subgroup of interest, such as a biomarker-negative subgroup
• Using adaptive design features does not “lower the bar” for establishing efficacy in subgroups
• Prespecify early futility analyses to inform continued enrollment of the patients who are outside the subgroup of interest
When designing a trial with known predictive subgroups, it is critical to prospectively identify clinical questions of interest relating to safety and efficacy for OS. If formal hypothesis testing for OS with alpha control is planned for prognostic or predictive subgroups, it is important to consider the feasibility of stratified randomization based on these subgroups when projecting enrollment during study design and when developing analysis plans. This feasibility depends on the expected prevalence and the sample size required to draw conclusions. The importance of enrolling subgroups of sufficient size and conducting alpha-controlled tests may depend on the expected treatment effect or potential for harm in each subgroup, and the expected contribution of each subgroup to the overall treatment effect. It is important for analysis plans to include details regarding measures of uncertainty and decision rules based on prespecified thresholds for evaluation of efficacy and safety results within subgroups.
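As a rough illustration of the sample-size consideration above, the sketch below uses Schoenfeld's approximation for the number of deaths a two-sided log-rank test needs under 1:1 randomization and proportional hazards. The choice of approximation, the function names, and the example inputs (target hazard ratio, subgroup prevalence, probability of death during follow-up) are assumptions made for illustration and were not specified in the workshop discussion.

```python
from math import ceil, log
from statistics import NormalDist


def required_events(hazard_ratio, alpha=0.05, power=0.80):
    """Deaths needed to detect `hazard_ratio` (Schoenfeld approximation).

    Assumes a two-sided log-rank test, 1:1 randomization, and
    proportional hazards.
    """
    if hazard_ratio <= 0 or hazard_ratio == 1:
        raise ValueError("hazard_ratio must be positive and different from 1")
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    return ceil(4 * (z_alpha + z_beta) ** 2 / log(hazard_ratio) ** 2)


def enrollment_for_subgroup(events, event_probability, prevalence):
    """All-comer enrollment needed for a subgroup to yield `events` deaths.

    `event_probability` is the assumed chance that a subgroup patient dies
    during follow-up; `prevalence` is the assumed subgroup share of enrollment.
    """
    if not (0 < event_probability <= 1 and 0 < prevalence <= 1):
        raise ValueError("probabilities must lie in (0, 1]")
    return ceil(events / (event_probability * prevalence))


# Hypothetical inputs: target HR 0.70 in a subgroup that makes up 30% of
# enrollment, with 60% of subgroup patients expected to die during follow-up.
deaths = required_events(0.70)
print(deaths, enrollment_for_subgroup(deaths, 0.60, 0.30))
```

Under these hypothetical inputs, the subgroup alone needs a few hundred deaths, which is why expected prevalence and follow-up duration weigh so heavily on whether an alpha-controlled subgroup test is feasible.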
Subgroup analysis is important to examine the consistency of observed OS benefit or harm, confirm expected benefit or harm, and uncover clinical clues in subsets of patients. However, subgroup analyses should be interpreted with caution and cannot be used to rescue a failed trial. Findings based on retrospectively defined subgroups or post hoc analyses of prespecified subgroups may result from random patterns and are considered hypothesis-generating, not confirmatory evidence of benefit. Findings based on post hoc OS subgroup analyses of a failed trial do not provide substantial evidence of effectiveness. Additionally, subgroup results from immature OS data are likely to be difficult to interpret.
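To make the point about immature subgroup data concrete, the following sketch estimates how often a subgroup would show an observed hazard ratio above 1 purely by chance, even though the therapy truly helps. It relies on the common large-sample approximation that the log hazard ratio estimate is normally distributed with variance 4 divided by the number of deaths under 1:1 randomization; that approximation, the function name, and the assumed true hazard ratio of 0.8 are illustrative assumptions rather than values from the workshop.

```python
from math import log, sqrt
from statistics import NormalDist


def prob_apparent_detriment(true_hr, events):
    """Chance the observed hazard ratio exceeds 1 when the truth is `true_hr`.

    Assumes log(observed HR) ~ Normal(log(true_hr), 4 / events), the usual
    large-sample approximation for 1:1 randomization.
    """
    if true_hr <= 0 or events <= 0:
        raise ValueError("true_hr and events must be positive")
    standard_error = 2 / sqrt(events)
    sampling_dist = NormalDist(mu=log(true_hr), sigma=standard_error)
    return 1 - sampling_dist.cdf(0.0)


# Hypothetical subgroup with a truly beneficial hazard ratio of 0.8.
for deaths in (20, 40, 100, 400):
    print(deaths, round(prob_apparent_detriment(0.8, deaths), 3))
```

With only 40 deaths, roughly one such subgroup in four would appear to show a detriment under these assumptions, and the chance shrinks as deaths accrue. This is one reason early subgroup signals are best read alongside biological plausibility and external data.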
Adaptive designs can assist enrichment of trial populations with patients most likely to receive benefit, for example, by reducing enrollment of patients from a specific subgroup based on a lack of benefit observed in a planned interim analysis (27–30). However, it is important to ensure that positive OS results for the ITT population that are driven by a subgroup are not interpreted as positive results for patients outside that subgroup; small subgroups with unclear benefit are unlikely to provide convincing evidence of a treatment effect in those patients. Additionally, adaptive designs require extensive preplanning and evaluation of prior data to inform the enrichment strategies.
Interpretation of OS subgroup results will be largely driven by the context of the trial design, disease setting, and scientific rationale. It can be helpful to evaluate if evidence from other trials consistently suggests OS detriment in the same subgroup. For example, safety concerns would arise when the ITT population analysis is favorable but data suggest that a biologically plausible subgroup is potentially harmed. In particular, data from other trials that corroborate evidence of elevated toxicity may necessitate halting the trial or other interventions that maintain an overall favorable benefit–risk profile. Additionally, if a trial demonstrates that OS benefit is driven by a subgroup, such as biomarker-positive patients, the evidence may support use of the therapy only in that subgroup rather than in the enrolled ITT population (31).
Benefit–Risk Assessment Considerations
Endpoints such as ORR and PFS support both accelerated approval and regular approval depending on the disease context, magnitude of effect, and presumed relationship with OS (32). However, even when earlier endpoints are used, OS is still evaluated. As described in the introduction, therapies that demonstrate treatment effects on early endpoints, even effects of large magnitude, would not be considered to have a favorable benefit–risk profile if there is a substantial OS detriment. Table 5 consolidates the considerations, potential regulatory implications, and best practices for incorporating early or limited OS results into the benefit–risk assessment discussed during the FDA–AACR–ASA workshop.
• When considering accelerated approval (AA), contributions to the benefit–risk assessment include the following:
  • AA is intended for products that treat serious, life-threatening conditions
  • Disease setting and definition of clinical benefit inform acceptable levels of OS uncertainty
  • PFS, ORR, and other early endpoints may be relied upon for AA, and consistency of results across relevant endpoints is important
  • Additional data from other ongoing trials or real-world data can supplement information
  • Additional follow-up data, including follow-up for OS, from the same trial or a separate confirmatory trial may be needed to verify and further characterize the anticipated clinical benefit or risk

• The need for additional postmarketing data may be influenced by the following:
  • Disease and treatment context, such as frontline, relapsed and refractory, curative, maintenance, or palliative settings
  • Severity of the condition being treated
  • Uncertainty in OS results or other efficacy and safety results
  • Feasibility of obtaining additional information in a timely fashion
  • Availability of alternative treatments
  • Potential risks and benefits associated with approved therapy

• Long-term data are important to consider in a benefit–risk assessment, and an adequate plan to collect data posttrial would include side effects, subsequent therapy, and reasons for initiation of posttrial therapy to supplement the available OS information

• An analysis that adjusts for crossover under varying assumptions about the treatment effect of subsequent therapy may be useful as supplementary information

• Consider sensitivity analyses by geographic region, because approval considers the generalizability of the data to the U.S. population; underrepresented subpopulations may need more studies

• Interpretation of post hoc OS analyses will consider the totality of evidence
When OS is not a primary endpoint, uncertain OS results and lack of clarity on potential detriment are common. Substantial uncertainty in OS results caused by data immaturity or discordant results requires additional investigation. First, efforts can be made to increase confidence in the observed results, including evaluating other endpoints, such as duration of response, assessing results from other ongoing or completed trials, or utilizing real-world data. Additionally, collecting more OS data from the same trial, continuing to follow patients for OS, minimizing the amount of missing data, or conducting a confirmatory trial can be helpful to support regular approval. When uncertainty in OS efficacy is not adequately addressed by these approaches, accelerated approval may be appropriate if favorable treatment effects have been demonstrated on PFS, ORR, or other validated earlier endpoints and a high probability of substantial potential harm has been ruled out. In situations where OS data interpretation may be confounded by crossover or subsequent therapies, earlier efficacy and safety endpoints may be helpful in understanding the clinical benefit and risk. The disease setting and what constitutes clinical benefit based on treatment effects on early endpoints play major roles in determining the acceptable amount of OS uncertainty.
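One simple way to picture what ruling out substantial harm could look like is to compare the upper confidence limit of the OS hazard ratio against a prespecified detriment threshold. The sketch below is only an illustration under stated assumptions: it uses the large-sample approximation that the standard error of the log hazard ratio is 2 divided by the square root of the number of deaths (1:1 randomization), and the threshold of 1.3, the function names, and the example inputs are hypothetical. A fit-for-purpose definition of OS detriment would need to be chosen for the specific disease setting.

```python
from math import exp, log, sqrt
from statistics import NormalDist


def hr_confidence_interval(observed_hr, events, level=0.95):
    """Approximate confidence interval for a hazard ratio.

    Uses SE(log HR) ~= 2 / sqrt(events), which assumes 1:1 randomization.
    """
    if observed_hr <= 0 or events <= 0:
        raise ValueError("observed_hr and events must be positive")
    z = NormalDist().inv_cdf(0.5 + level / 2)
    half_width = z * 2 / sqrt(events)
    center = log(observed_hr)
    return exp(center - half_width), exp(center + half_width)


def harm_ruled_out(observed_hr, events, harm_threshold, level=0.95):
    """True if the upper confidence limit lies below the harm threshold."""
    _, upper = hr_confidence_interval(observed_hr, events, level)
    return upper < harm_threshold


# Hypothetical interim and later looks at the same observed hazard ratio.
for deaths in (60, 300):
    low, high = hr_confidence_interval(0.95, deaths)
    verdict = harm_ruled_out(0.95, deaths, harm_threshold=1.3)
    print(deaths, round(low, 2), round(high, 2), verdict)
```

In this hypothetical example, the same point estimate is compatible with meaningful harm when few deaths have occurred but not once the data mature, which mirrors the value of continued OS follow-up described above.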
Regardless of the approval pathway, additional OS data in the postmarket setting may be needed to further inform benefit–risk considerations. Robust posttreatment data collection allows for further characterization of OS data and the safety and efficacy profile. Many long-term effects of therapy may impact patients’ ability to receive and benefit from subsequent therapies, but collection of this information has historically been limited. Rigorous long-term collection of safety, efficacy, and subsequent therapy data can allow for improved interpretation of existing OS data.
Conclusion
OS is an important endpoint for both safety and efficacy and is a critical component of oncology drug development programs. Improved prospective planning, collection, and assessment of OS data to evaluate for harm would greatly improve the rigor of benefit–risk assessments for novel therapies as larger proportions of patients with cancer are living long and fulfilling lives. There are numerous potential strategies to incorporate prospective OS analyses of potential harm into benefit–risk assessments, several of which were described in the present article. These considerations are highly context-dependent, which emphasizes the importance of fit-for-purpose strategies as well as early discussions with multiple stakeholders, including regulators and patients, when designing trials intended to demonstrate the safety and efficacy of oncology products. The FDA–AACR–ASA workshop that inspired this manuscript also included more discussion, disease-specific considerations, nuance, and methodological details; recordings are available (5). FDA’s Project Endpoint intends to provide additional opportunities to engage a broad range of stakeholders to advance innovative and novel endpoints to support the timely demonstration of safety and efficacy (4). In developing novel therapies, using a new lens to refocus on the gold standard oncology endpoint of OS as a valuable and critical assessment of safety, while strengthening the robustness of early endpoints such as PFS and ORR to demonstrate clinical benefit, will bring much-needed clarity to patients and their providers.
Authors’ Disclosures
R. Lu reports other support from Alumis Inc. outside the submitted work, as well as a patent for EP003580357B1: Algorithms and Methods for Assessing Late Clinical Endpoints in Prostate Cancer issued. G.D. Demetri reports personal fees from AADI Bioscience, Acrivon Therapeutics, Blueprint Medicines, CellCarta, Ikena Oncology, Kojin Therapeutics, Relay Therapeutics, PharmaMar, Tessellate BIO, WCG/Arsenal Capital, Minghui Pharmaceutical, Merck KGaA, and Zai Lab; other support from Bessor Pharmaceuticals, Boundless Bio, Erasca Pharmaceuticals, and Bayer; and personal fees and other support from Caris Life Sciences outside the submitted work. K.T. Flaherty reports personal fees from Khora Therapeutics, Clovis Oncology, Kinnate Biopharma, Scorpion Therapeutics, Strata Oncology, Checkmate Pharmaceuticals, PIC Therapeutics, Apricity, FogPharma, Tvardi, xCures, Monopteros, Vibliome, ALX Oncology, Karkinos, Soley Therapeutics, Alterome, IntrECate, PreDICTA, Genentech, and TransCode; grants from Novartis; and other support from Takeda during the conduct of the study. R.A. Mesa reports personal fees from Incyte, Bristol Myers Squibb, Novartis, CTI, Telios, Geron, MorphoSys, GSK, and Sierra outside the submitted work. M.A. Sekeres reports grants and personal fees from Bristol Myers Squibb and personal fees from Kurome, Schrodinger, and Karyopharm outside the submitted work. S. Snapinn reports personal fees from Allogene, GSK, Oculis, and Renibus outside the submitted work. K.C. Anderson reports personal fees from Pfizer, Janssen, AstraZeneca, and Daewoong and other support from OncoPep, C4 Therapeutics, Dynamic Cell Therapies, NextRNA, Window, and Starton during the conduct of the study. No disclosures were reported by the other authors.
Disclaimer
The Editor-in-Chief of Clinical Cancer Research is an author of this article. In keeping with AACR editorial policy, a senior member of the Clinical Cancer Research editorial team managed the consideration process for this submission and independently rendered the final decision about acceptability.
Acknowledgments
We would like to thank all panelists of the FDA–AACR–ASA workshop on Overall Survival in Oncology Clinical Trials not listed as authors for their input on this manuscript, including Cong Chen, R. Angelo DeClaro, Laura Esserman, Jaleh Fallah, Lola Fashoyin-Aje, Boris Freidlin, Xin Gao, Elizabeth Garrett-Mayer, Marjorie Green, Wenjuan Gu, Roy Herbst, Alexei Ionan, Bindu Kanapuru, Margret Merino, Pallavi Mishra-Kalyani, David Mitchell, Pabak Mukhopadhyay, Grzegorz Nowakowski, Mary Redman, Nicholas Richardson, Qian Shi, Harpreet Singh, Tatiana Prowell, Craig Tendler, Paz Vellanki, Qi Xia, Jianjin Xu, Anas Younes, Godwin Yung, Jian Zhao, and Emmanuel Zuber. Additionally, we thank Christine Lincoln and Jon Retzlaff for their assistance in organizing the workshop, as well as Jizhou Tian for taking notes and monitoring questions from online attendees.