## Abstract

**Purpose:** Despite development of clinical “value frameworks” by national and international groups, there remains no generally accepted method to summarize toxicity in cancer clinical trials. We explored ways to simplify toxicity data of an arm of a cancer clinical trial to a single value, termed a “weighted toxicity score” (WTS).

**Experimental Design:** We compiled 58 randomized clinical trials of FDA-approved kinase-directed inhibitors. We generated 5 models, each of which assigned different weights to each observed grade 1 to 4 toxicity. For each model, we calculated WTS values as different weighted averages of the sum of the toxicities. We correlated each WTS with the dose reduction rate in each trial, using the dose reduction rate as a clinically relevant surrogate of treatment that is too toxic. The WTS method yielding the strongest correlation with frequency of dose reduction was declared the best model.

**Results:** Nineteen of 58 trials were placebo controlled and had complete data. Of the 5 models examined, differences in dose reduction rates correlated best with differences in WTS using a model with a clinician-weighted scale for toxicities (model M5). The WTS difference thus serves as a surrogate for a desired dose reduction rate difference and could be used to adjust dose/schedule as patients are accrued to a clinical trial.

**Conclusions:** The WTS distills a tabular listing of toxicities of a treatment into a single value, and provides a simple method that can be incorporated into value frameworks, or used to guide discussion of the risks and benefits of systemic therapy. *Clin Cancer Res; 24(20); 4968–75. ©2018 AACR*.

*See related commentary by Vaishampayan, p. 4918*

Many techniques use patient-level or trials-level data to calculate the cost versus benefit of systemic therapy for cancer patients. For example, the ASCO value framework uses a semiquantitative evaluation of toxicity. In some models, patient-level data are used to guide dosing of treatment of subsequent patients in clinical trials. We sought a general method to calculate weighted averages for toxicity based on published datasets from cancer clinical trials, reducing tabular toxicity data to a single number. We term this number the weighted toxicity score (WTS), which directly compares toxicity between treatments in a given trial, allowing assessment of cost/benefit ratios in a reproducible way. We show how the WTS can be used as a gauge during a trial to affect dose selection. We also demonstrate how this model can be applied to trials using patient-reported outcomes, such as PRO-CTCAE.

## Introduction

In randomized trials of systemic therapy in oncology, clinicians examine and compare both efficacy and toxicity of treatments. Efficacy is reported using well-established parameters such as progression-free survival (PFS), and comparisons are made using *P* values and 95% confidence intervals, based on the statistical design (1). Tabular data remain the principal method by which adverse effects of treatment are reported in clinical trials. However, toxicity reporting is not done consistently from one study to the next. Occasionally, the differences in individual toxicities are reported using *P* values, but in most studies, there is only a tacit comparison between toxicities in treatment arms. Furthermore, there are very few data published on toxicity as a function of time (2).

How are benefits and toxicities compared in cancer clinical trials? The use of patient-reported outcomes constitutes one way of measuring the impact of toxicity on patients (3, 4). For example, the Patient Reported Outcomes-Common Toxicity Criteria for Adverse Events (PRO-CTCAE) was developed to capture patient adverse events directly from the patient without the filter of a clinician (5). The American Society of Clinical Oncology (ASCO), European Society for Medical Oncology (ESMO), and others have generated tools to integrate both toxicity and benefits of chemotherapy to determine the relative value of a specific treatment to a patient, and by extension to society in general (6–9). These frameworks attempt to quantitate the gestalt a clinician develops regarding the benefit and toxicity of different treatments. The ASCO framework uses a semiquantitative method to calculate net toxicity of a therapy; the ESMO framework presently does not assess toxicity. Furthermore, the ASCO and ESMO frameworks examine different aspects of patient benefit (10). Thus, the question of the role of toxicity in comparing the costs and benefits of a given therapy in a cancer clinical trial remains an open issue.

Our initial interest in drug toxicity stems from participation in clinical trials of novel agents in cancer patients. A number of agents that received regulatory approval in cancer are too toxic at the approved doses for the average patient, either in terms of acute toxicity requiring early dose reduction, or chronic toxicity for people remaining on systemic therapy for extended periods of time. In the most extreme situations, over half of patients required dose reductions in phase III trials of anticancer agents, exposing patients to undue toxicity and potentially leading to premature termination of potentially useful therapy, far in excess of what one would have expected from typical phase I trial design, in which the goal of a classic 3+3 design is to see dose-limiting toxicity in no more than 1 in 6 patients (11–14).

We sought to simplify and make more objective the quantitation of toxicity, which might allow a more direct comparison of outcomes to summary toxicity. Our hypothesis was that reducing the tabular data to a single number will facilitate the discussion of the cost-benefit ratio within an individual clinical trial.

We describe herein a weighted toxicity score (WTS), an index summarizing toxicity for a treatment on a cancer clinical trial. The WTS is the sum of the grade 1 to 4 toxicities in a clinical trial weighted by severity for each grade of toxicity. To test the idea of a WTS in real clinical trials, we sought a simple example for its application. We opted to examine placebo-controlled trials of approved kinase inhibitors in cancer as a coherent group of studies in which to develop our models. We show that the WTS calculations yield a scalar that correlates with the dose reduction rate in clinical trials, which we use as a metric to define excessively toxic therapy during the conduct of a randomized clinical trial.

We demonstrate how the WTS allows for simultaneous comparison of toxicity and efficacy in a clinical trial and show that the correlation between differences in WTS and dose reduction rates allows the WTS to be used as a tool to guide dose reductions during a clinical trial. Lastly, we show how the WTS could be used to calculate overall toxicity using existing patient self-assessment guided platforms, such as the PRO-CTCAE.

## Materials and Methods

### Trial selection

We conducted a PubMed search of randomized controlled trials in oncology between 11/2004 and 05/2016, and also used data from the FDA regarding approved therapies to ensure completeness of the selection of clinical trials.

For this proof-of-principle analysis, we sought a simple dataset from which to work. We identified trials that involved oral small molecule kinase inhibitors, a consistent class of agents that were relatively toxic; several oral agents have been approved by regulators despite a need for dose reductions in 50% or more of people receiving them. We chose randomized placebo-controlled trials because we wanted to estimate the degree of toxicity that could be ascribed to the drug, not the disease, using either differences or ratios. In other words, we assumed:

Toxicity (drug specific) ≈ toxicity (disease + drug) − toxicity (disease + placebo).

Having a basis for differences in toxicity from placebo-controlled studies, we then could consider comparing studies of different treatments. We identified 58 randomized controlled trials (RCT) of FDA- or EMA-approved small molecule oral inhibitors of kinases and other targets, of which 19 involved a drug versus placebo therapy. Many other studies involved comparisons of treatments, which was beyond the scope of our initial analysis. The 19 studies are the sample we used for our initial analysis. The 58 studies are listed in Supplementary Table S1. The reason for not including each of the other studies in the initial analysis is indicated in the table.

### Definition of WTS

We first note that toxicity data tables from cancer clinical trials indicate the worst grade of toxicity for specific patients. For example, patients who have a grade 3 toxicity who later have improvement to grade 0, 1, or 2 are only counted as having grade 3 toxicity.

We define the WTS for an arm of a clinical trial as the sum of the proportion of patients with toxicities (grades 1–4) weighted by a severity index, which increases from grade 1 to grade 4 for a specific toxicity. The idea for the WTS is shown graphically in Fig. 1. In brief, the WTS for a specific treatment is the sum of all individual toxicity scores for that treatment. The score for an *individual* toxicity of a particular treatment is $W_1\cdot\pi_1 + W_2\cdot\pi_2 + W_3\cdot\pi_3 + W_4\cdot\pi_4$, where *W* is the weight given to that grade of the toxicity and π is the proportion of patients (%) who experience that grade of toxicity (grades 1–4). More generally, summing over toxicities *i* and toxicity grades *j*, $WTS = \sum_{i,j} W_{ij}\cdot\pi_{ij}$. The weight *W*_{0} given to patients who had no toxicity (i.e., grade 0 toxicity) is 0.

In a simple case, if all weights are 1, then the WTS is the sum of the percentages of all toxicities in the toxicity tables. If in all cases all *W*_{1} and *W*_{2} are 0 and all *W*_{3} and *W*_{4} are 1, then the WTS is the sum of all grade 3 and 4 toxicities from that arm of the trial.
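The definition and the two simple cases above can be sketched in Python. This is a minimal illustration, not the authors' code: the function name `weighted_toxicity_score`, the dictionary layout, and the example percentages are all assumptions made for this sketch, and it covers only models in which every toxicity of a given grade shares one weight.

```python
def weighted_toxicity_score(proportions, weights):
    """Sum weight * percentage over every toxicity and grade.

    proportions: {toxicity name: {grade: percent of patients with that worst grade}}
    weights:     {grade: weight}; grade 0 carries weight 0 and is simply omitted.
    """
    return sum(
        weights.get(grade, 0) * pct
        for by_grade in proportions.values()
        for grade, pct in by_grade.items()
    )


# Hypothetical trial arm (illustrative numbers, not taken from any study).
arm = {
    "diarrhea": {1: 30.0, 2: 15.0, 3: 5.0, 4: 0.0},
    "hypertension": {1: 10.0, 2: 12.0, 3: 8.0, 4: 1.0},
}

all_ones = {1: 1, 2: 1, 3: 1, 4: 1}          # WTS = sum of all percentages
high_grade_only = {1: 0, 2: 0, 3: 1, 4: 1}   # WTS = sum of grade 3-4 percentages

print(weighted_toxicity_score(arm, all_ones))         # 81.0
print(weighted_toxicity_score(arm, high_grade_only))  # 14.0
```

A model with toxicity-specific weights would need a weight lookup keyed by both toxicity and grade rather than by grade alone.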

There were special cases for weighting based on available data. In the case that toxicity grades were combined in a clinical trial (e.g., grade 3 and 4 reported as one proportion), we used the mean of the two weighted toxicity values (in this case [*W*_{3} + *W*_{4}]/2, model M1 below). For models M3–M5 below, which used exponential scales, the combined data for G3 and G4 toxicities were taken to be the geometric mean of the adverse event weights, that is, √ (*W*_{3}·*W*_{4}).
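As a small worked example of the pooled grade 3–4 rule, the sketch below computes the combined weight under the arithmetic mean (stated for M1) and the geometric mean (stated for the exponential models). The helper name `combined_grade_weight` and its `exponential` flag are assumptions for illustration.

```python
import math


def combined_grade_weight(w3, w4, exponential):
    """Weight applied when grade 3 and grade 4 are reported as one proportion."""
    if exponential:
        return math.sqrt(w3 * w4)  # geometric mean, for exponential scales
    return (w3 + w4) / 2           # arithmetic mean, for the linear scale


print(combined_grade_weight(3, 4, exponential=False))      # M1 weights 3 and 4
print(combined_grade_weight(4, 8, exponential=True))       # M3 weights 4 and 8
print(combined_grade_weight(100, 1000, exponential=True))  # M4 weights 100 and 1,000
```

For the linear scale the pooled weight is 3.5; for M3 it is √32, about 5.66; for M4 it is about 316. Under the binary model both weights equal 1, so either mean gives 1.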

For the definitions of toxicity, we used the list of National Cancer Institute Common Terminology Criteria for Adverse Events (NCI-CTCAE) version 4.03, grades 1 to 4 (15). This listing of 790 different types of adverse events is commonly used for grading toxicity in cancer clinical trials. The full table of toxicity weights is not necessary when all toxicities of a specific grade are given the same weight.

### Bootstrapping to create confidence intervals around WTS scores

Because each study would yield somewhat different toxicity data if conducted again, we ran 500 parametric bootstrap simulations of the toxicity data from the 19 trials to create confidence intervals around the WTS for each model and each study.

Specifically, while the weight $W_i$, $i$ = 0–4, given to each grade of toxicity was set by the model (M1–M5) and not varied, the proportion of patients $\pi_i$ with that specific grade of toxicity was modeled. The proportions of patients $\pi$ with each grade 0–4 of each toxicity were replicated 500 times as $\pi^{*}_{K} = \pi \cdot v_{K}$, $v_{K}$ being a random number normally distributed with mean 1 and variance $1/\sqrt{n}$, $n$ being the number of patients in that arm of the trial. The weighting for each model M1–M5 was then used to determine a WTS based on the bootstrapped proportion of patients with each toxicity $\pi^{*}_{K}$. Confidence intervals for the WTS were then produced using the quantile method.
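The resampling scheme described in this section can be sketched as follows. This is an illustration under stated assumptions rather than the authors' implementation: the function name `bootstrap_wts` is hypothetical, the variance 1/√n is taken literally (so the standard deviation is n^(−1/4)), an independent multiplier is drawn for each proportion (the text does not say whether one multiplier is shared within a replicate), and perturbed proportions are clipped at zero.

```python
import random


def bootstrap_wts(proportions, weights, n_patients, n_rep=500, seed=0):
    """Return the sorted replicate WTS values and a 95% percentile interval."""
    rng = random.Random(seed)
    sd = n_patients ** -0.25  # variance 1/sqrt(n) implies sd = n**(-1/4)
    replicates = []
    for _ in range(n_rep):
        total = 0.0
        for by_grade in proportions.values():
            for grade, pct in by_grade.items():
                v = rng.gauss(1.0, sd)  # multiplicative noise with mean 1
                # Clipping at zero is an assumption of this sketch.
                total += weights.get(grade, 0) * max(pct * v, 0.0)
        replicates.append(total)
    replicates.sort()
    lower = replicates[int(0.025 * n_rep)]      # approximate 2.5th percentile
    upper = replicates[int(0.975 * n_rep) - 1]  # approximate 97.5th percentile
    return replicates, (lower, upper)


# Hypothetical arm and a grade-uniform weight set (1, 2, 4, 8).
arm = {"diarrhea": {1: 30.0, 2: 15.0, 3: 5.0, 4: 0.0}}
weak_exponential = {1: 1, 2: 2, 3: 4, 4: 8}
replicates, (lower, upper) = bootstrap_wts(arm, weak_exponential, n_patients=200)
print(round(lower, 1), round(upper, 1))
```

Because the noise scale shrinks with the number of patients, smaller arms produce wider intervals.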

### Definition of the different models M1 to M5

To correlate the calculated bootstrapped WTS to a clinically relevant outcome, we used the dose reduction rate as a measure of excessive toxicity, but we did not know *a priori* the scale that would yield a WTS that correlated best with dose reduction rate. Five models (M1–M5) were generated, each using different weights for the severity indices, which could be used in different situations. Table 1 and Supplementary Table S2 list the weighted toxicity matrices for models M1 to M5.

Model | Weighting | Grade 1 | Grade 2 | Grade 3 | Grade 4 |
---|---|---|---|---|---|
M1 | Linear | 1 | 2 | 3 | 4 |
M2 | Binary | 0 | 0 | 1 | 1 |
M3 | Weak exponential | 1 | 2 | 4 | 8 |
M4 | Strong exponential | 1 | 10 | 100 | 1000 |
M5 | Clinician-modified weak exponential^{a} | Varies | Varies | Varies | Varies |

Abbreviations: M1, linear; M2, binary; M3, weak exponential; M4, strong exponential; M5, clinician-modified weak exponential. All but model M5 use a single set of weights for all toxicities observed, whereas M5 uses a model developed by the investigators, modified from the weak exponential model M3 to weight more heavily adverse events that represent patient symptoms (Supplementary Table S2).

^{a}For model M5, the weighting of each toxicity for each of 790 CTCAE events is specified (Supplementary Table S2).

We only generated models in which the weights for higher grade toxicities were larger than the weights for lower grade toxicities. We also note that, clinically, grade 4 adverse events are much more severe than grade 1 to 2 events, such that logarithmic scales were not logical to examine. Thus, model 1 (M1) used a linear scale, with 1 point for grade 1 toxicity, 2 points for grade 2, etc. Model 2 (M2, “binary” weighting) was generated to match criteria for dose reduction in clinical trials. In M2, grade 1 to 2 toxicities were given a score of 0, and grades 3 to 4 were given a score of 1, because dose reductions are usually only given for grade 3 toxicity or worse.

The last three scales were created with comparison with outcomes data in mind, giving exponentially higher scores for each toxicity grade. Model 3 (M3) used a weak exponential score, with 1, 2, 4, or 8 points for grade 1 to 4 toxicities, respectively. Model 4 (M4) used a strong exponential score, with 1, 10, 100, or 1,000 assigned to grade 1 to 4 toxicities, respectively. Lastly, model 5 (M5) used a weak exponential for grade 1 to 4 toxicities like model M3, modified by the consensus of two of the authors and one other medical oncologist regarding severity for each of the 790 events in the CTCAE version 4.03. For example, grade 3 neuropathy received a much larger weight than grade 3 neutropenia in the linear model M1, because most symptoms are more clinically meaningful to a patient than an abnormal laboratory test (Supplementary Table S2).
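The four grade-uniform scales can be written as simple functions of the grade, which makes the difference between them easy to see. The sketch below is illustrative: the names `MODEL_WEIGHTS` and `score_single_toxicity` and the example percentages are assumptions, and M5 is left out because its per-event weights (Supplementary Table S2) are not reproduced here.

```python
GRADES = (1, 2, 3, 4)

MODEL_WEIGHTS = {
    "M1": {g: g for g in GRADES},                   # linear: 1, 2, 3, 4
    "M2": {g: 1 if g >= 3 else 0 for g in GRADES},  # binary: 0, 0, 1, 1
    "M3": {g: 2 ** (g - 1) for g in GRADES},        # weak exponential: 1, 2, 4, 8
    "M4": {g: 10 ** (g - 1) for g in GRADES},       # strong exponential: 1, 10, 100, 1000
}


def score_single_toxicity(pct_by_grade, model):
    """Score for one toxicity: sum over grades of weight * percent of patients."""
    weights = MODEL_WEIGHTS[model]
    return sum(weights[grade] * pct for grade, pct in pct_by_grade.items())


# Hypothetical toxicity: 30% grade 1, 15% grade 2, 5% grade 3, 1% grade 4.
example = {1: 30, 2: 15, 3: 5, 4: 1}
for model in MODEL_WEIGHTS:
    print(model, score_single_toxicity(example, model))
# M1 79, M2 6, M3 88, M4 1680
```

In this example the single percent of grade 4 events contributes 1,000 of the 1,680 points under M4 but only 8 of 88 under M3, which illustrates how strongly the steeper scale is driven by rare severe events.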

### WTS model testing

Each of 5 WTS models was calculated for each treatment and control arm of the 19 selected trials, using differences of bootstrapped WTS_{expt}–WTS_{placebo}, to remove the effect of the disease from the toxicity calculation. We conducted this work with the understanding that deaths on a clinical trial (grade 5 toxicities) would be evaluated independently from toxicity data. We performed similar analyses using the ratio of WTS_{expt}/WTS_{placebo}, but the statistical characteristics of this metric were inferior to calculating differences in WTS (data not shown).
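One way to form a difference of bootstrapped scores is sketched below. It assumes that replicate *k* of the experimental arm is paired with replicate *k* of the placebo arm, which the text does not specify; the function names and the toy replicate values are hypothetical.

```python
def percentile_interval(values, level=0.95):
    """Approximate percentile interval taken from the sorted replicate values."""
    ordered = sorted(values)
    n = len(ordered)
    alpha = (1.0 - level) / 2.0
    lower = ordered[int(alpha * n)]
    upper = ordered[min(n - 1, int((1.0 - alpha) * n))]
    return lower, upper


def bootstrapped_difference(expt_reps, placebo_reps):
    """Pair replicates by index and return the differences with their interval."""
    diffs = [e - p for e, p in zip(expt_reps, placebo_reps)]
    return diffs, percentile_interval(diffs)


# Toy replicates in which the experimental arm is always 10 points higher.
expt = [10.0 + i for i in range(100)]
placebo = [float(i) for i in range(100)]
diffs, interval = bootstrapped_difference(expt, placebo)
print(interval)  # (10.0, 10.0)
```

Subtracting the placebo arm removes the toxicity attributable to the disease itself, in line with the assumption stated in the trial selection section.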

To identify a toxicity metric that best correlated with a clinically important endpoint, we correlated the differences in dose reduction rate, dose interruption rate, and rate of drug discontinuation due to toxicity between the two arms of the RCT (experimental − control) with the difference in bootstrapped WTS (experimental − control) for each of the M1–M5 models. These comparisons allowed us to determine which difference in WTS best estimated the dose reduction rate that was observed in the randomized clinical trial.

We also correlated the difference in WTS for each model with other data regarding dose and schedule available from the clinical trials, for example, the proportion of patients who had treatment discontinued for toxicity, or the proportion of patients needing dose interruptions. Data were less commonly available for these other metrics, so our primary focus for this analysis was the dose reduction rate as a clinically relevant indicator of drug toxicity.

We used regression modeling to determine the metric with the strongest correlation (Pearson correlation) between each difference or ratio of WTS and the difference in dose reduction rates between experimental and placebo arms of a trial, controlling for the ln(hazard ratio) of each trial in an ANOVA. We also conducted correlations between the WTS for the 19 studies by each model M1–M5, in order to determine the relatedness of the models we had initially generated.
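For reference, the Pearson correlation used here can be computed directly from its definition. The sketch below is a plain-Python illustration with hypothetical per-trial values; it does not include the adjustment for ln(hazard ratio) described above.

```python
import math


def pearson_r(x, y):
    """Pearson correlation: covariance divided by the product of the spreads."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical per-trial differences (experimental minus placebo).
delta_wts = [2.0, 5.0, 9.0, 12.0, 4.0]
delta_dose_reduction_pct = [5.0, 20.0, 35.0, 60.0, 25.0]
print(round(pearson_r(delta_wts, delta_dose_reduction_pct), 2))
```

In the analysis described here, each of the 19 trials would contribute one pair of values per model.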

Once the regression models were developed, we used the WTS models to create tables of WTS differences that corresponded to differences in dose reduction rates between clinical trial arms. The WTS difference corresponding to a specific difference in dose reduction rate could then be selected as a target to predict when to reduce doses for patients on a particular clinical trial arm.

R version 3.2.5 (https://www.r-project.org) and R Studio Desktop version 1.0.153 (R Studio) were used for statistical analysis.

## Results

### Clinical trials reporting of toxicity

We identified 19 trials of FDA- or EMA-approved small-molecule oral kinase inhibitors compared with a placebo (Supplementary Table S1) that contained sufficient data to correlate the dose reduction rate to the toxicity score in a fashion that minimized the adverse effect the cancer itself had on the patient. As indicated in Supplementary Table S1, there are wide variations in how toxicity is reported. Not all studies report the dose reduction rate, dose interruptions, or discontinuations of therapy. In addition, there are differences between studies regarding the proportions of patients with a specific reported toxicity.

### The WTS identifies most and least toxic agents compared with control arms

We calculated WTS involving our five weighting models (M1–M5) for each arm of each randomized clinical trial of small molecule oral protein-targeted inhibitors, using a bootstrap method to provide confidence intervals around the toxicity calculations. The results of these calculations are shown in Table 2. The numbers of patients in each arm of each study are available in Supplementary Table S1.

The differences and ratios of WTS comparing the experimental arms to placebo arms of trials are shown in Table 2. In comparison with placebo, lenvatinib was consistently the most toxic agent across the comparisons. Imatinib in the adjuvant setting was the least toxic agent relative to placebo in 3 of the 5 models examined.

### Correlation of dose reduction rates and WTS

To determine if the WTS could be used to address a clinically meaningful endpoint, we correlated the differences and ratios of WTS between experimental and control arms (5 models) with the differences in the dose reduction rates of experimental versus control arms, accepting that data were collected differently in the different studies. Using the bootstrapping method, the variability of the pseudorandomized data is a function of size of the trial (e.g., randomized phase II vs. phase III) and the calculation method (Supplementary Fig. S1).

Using the bootstrapped data from the 19 placebo-controlled randomized trials, we saw the strongest correlation between differences in dose reduction rates (experimental versus control arms) and differences in WTS with the clinician-modified weak exponential model, M5 (Fig. 2; Table 3). The difference in dose reduction rate was statistically significantly correlated with the difference in WTS for model M5 (Pearson correlation 0.83). Spearman rank correlations were also statistically significant (data not shown). All models showed correlation with dose reductions using the Pearson model, indicating the consistency of the WTS technique regardless of the model. Among the simpler models (M1–M4), model M3 yielded the next strongest correlation with dose reduction rates (0.77; Table 3). Differences in dose interruption rates between experimental arm and placebo, but not differences in dose discontinuation rates, were also strongly correlated with differences in the WTS, again with the M5 model demonstrating the strongest correlation, >0.9. The relatively broad confidence intervals around the correlation coefficients indicate that any of several models will adequately describe summary toxicity data and its relationship to dose reduction rate.

Model | M1: Pearson *r* (95% CI) | M1: *P* | M2: Pearson *r* (95% CI) | M2: *P* | M3: Pearson *r* (95% CI) | M3: *P* | M4: Pearson *r* (95% CI) | M4: *P* | M5: Pearson *r* (95% CI) | M5: *P* |
---|---|---|---|---|---|---|---|---|---|---|
Difference in dose reduction rate^{a} | 0.58 (0.15–0.82) | 0.012 | 0.76 (0.45–0.90) | 0.0003 | 0.77 (0.47–0.91) | 0.0002 | 0.57 (0.13–0.82) | 0.0144 | 0.83 (0.60–0.94) | <0.0001 |
Difference in dose interruption rate^{a} | 0.71 (0.23–0.91) | 0.0095 | 0.85 (0.53–0.96) | 0.0005 | 0.88 (0.62–0.97) | 0.0002 | 0.74 (0.28–0.92) | 0.0063 | 0.91 (0.72–0.98) | <0.0001 |
Difference in dose discontinuation rate^{a} | 0.23 (−0.26–0.63) | 0.36 | 0.27 (−0.23–0.65) | 0.28 | 0.30 (−0.19–0.67) | 0.22 | 0.37 (−0.12–0.71) | 0.13 | 0.28 (−0.21–0.66) | 0.25 |
HR (PFS)^{b} | −0.47 (−0.01 to −0.77) | 0.048 | −0.56 (−0.12 to −0.81) | 0.016 | −0.57 (−0.14 to −0.82) | 0.013 | −0.48 (−0.01 to −0.77) | 0.046 | −0.52 (−0.07 to −0.79) | 0.027 |

^{a}(Treatment arm − placebo arm).

^{b}HR(PFS): Correlation of −ln HR(PFS) with WTS (5 models), experimental arm/placebo arm.
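The text does not state how the confidence intervals in Table 3 were obtained. As a plausibility check only, the sketch below applies the standard Fisher z-transformation, which is an assumption on our part, to the M5 dose reduction coefficient with all 19 trials. The function name `fisher_ci` is hypothetical.

```python
import math


def fisher_ci(r, n, z_crit=1.959964):
    """Approximate 95% CI for a Pearson correlation via the Fisher z-transform."""
    z = math.atanh(r)                # transform r to an approximately normal scale
    se = 1.0 / math.sqrt(n - 3)      # standard error on the transformed scale
    return math.tanh(z - z_crit * se), math.tanh(z + z_crit * se)


lower, upper = fisher_ci(0.83, 19)
print(round(lower, 2), round(upper, 2))
```

This gives roughly 0.60 to 0.93, close to the 0.60–0.94 reported for M5, with the small gap plausibly due to rounding of the coefficient. The wider intervals in the dose interruption row are consistent with fewer trials reporting that outcome.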

In searching for other variables that might be related to dose reduction rate, there was also a statistically significant correlation seen between the ln(HR) for trials and the dose reduction rate (Table 3). By multivariate analysis, both the ln(HR) and change in WTS were statistically significant factors associated with the difference in dose reduction rate (experimental − control) in these clinical trials. These data are consistent with the recognized clinical finding that both toxicity and response are a function of dose/schedule, at least in some clinical situations.

The linear relationship between difference in dose reduction rates and difference in WTS between experimental and control arms allows calculation of a target WTS difference for a given difference in dose reduction rates (Table 4; Supplementary Fig. S2). The tables allow the investigator to define a clinically important difference in WTS during conduct of a clinical trial, steering dose and schedule for patients. For example, if the difference in dose reduction rate between control and experimental treatment is to be 20%, exceeding a summary toxicity score using model M3 of more than ∼3.2 is an indication for dose reductions for subsequent patients enrolled on a phase III clinical trial. By the same token, if the WTS difference were below ∼3.2, consideration could be given to more intensive therapy for subsequent patients. The same tables may be used for phase II trials, with dose reduction rate used instead of difference in dose reduction rate, and WTS substituted for ΔWTS.

Δ(DDR)^{a}, % | M3: ΔWTS^{b} (95% CI) | M5: ΔWTS^{b} (95% CI) |
---|---|---|
0 | 0.00 (−4.68, 4.68) | 0.00 (−21.37, 21.37) |
10 | 1.61 (−3.08, 6.31) | 8.63 (−12.79, 30.06) |
20 | 3.23 (−1.50, 7.95) | 17.27 (−4.30, 38.84) |
30 | 4.84 (0.06, 9.62) | 25.90 (4.09, 47.72) |
40 | 6.45 (1.60, 11.30) | 34.54 (12.38, 56.70) |
50 | 8.06 (3.12, 13.01) | 43.17 (20.59, 65.76) |
60 | 9.68 (4.62, 14.74) | 51.81 (28.71, 74.91) |
70 | 11.29 (6.10, 16.48) | 60.44 (36.75, 84.14) |

Abbreviation: CI, confidence interval.

^{a}ΔDDR: Target difference in dose reduction rate (experimental–control).

^{b}ΔWTS: Predicted difference in WTS for models M3 and M5.
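The point estimates in Table 4 lie on a straight line through the origin, so a target ΔWTS can be read off for any target difference in dose reduction rate. In the sketch below, the slopes are inferred from the 70% row of Table 4 rather than taken from a reported regression equation, and the function names and the decision wording are illustrative assumptions.

```python
# Slopes in WTS points per percentage point of ΔDDR, inferred from Table 4 (70% row).
SLOPE = {"M3": 11.29 / 70, "M5": 60.44 / 70}


def target_delta_wts(delta_ddr_pct, model):
    """Predicted WTS difference for a target difference in dose reduction rate (%)."""
    return SLOPE[model] * delta_ddr_pct


def dose_signal(observed_delta_wts, delta_ddr_pct, model):
    """Compare an observed WTS difference with the target for the chosen model."""
    target = target_delta_wts(delta_ddr_pct, model)
    if observed_delta_wts > target:
        return "consider dose reduction"
    return "below target"


print(round(target_delta_wts(20, "M3"), 2))  # 3.23, matching the 20% row
print(round(target_delta_wts(20, "M5"), 2))  # 17.27, matching the 20% row
print(dose_signal(4.0, 20, "M3"))            # consider dose reduction
```

This sketch uses point estimates only. The 95% intervals in Table 4 are wide, so a threshold applied in practice would need to account for that uncertainty.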

### Applying WTS to randomized comparative cancer clinical trials data

With summary toxicity reduced to a scalar, it was possible to calculate the differences in toxicity between treatment arms of trials. Examples from the kidney cancer literature highlight the use of WTS to compare the benefit of each treatment with the toxicity associated with it. In a clinical trial of pazopanib versus sunitinib in advanced or metastatic renal cancer, pazopanib was associated with a favorable change in 11 of 14 health-related quality of life measures compared with sunitinib, without significantly affecting PFS or overall survival (OS; ref. 16). Median PFS on pazopanib was 8.4 months versus 9.5 months on sunitinib, meeting the prespecified criterion for noninferiority. Using the M5 scale, the modeled WTS for pazopanib was 24 versus 35 for sunitinib. The difference of 35 – 24 = 11 [95% confidence interval (CI) of (9.8, 12.8) from bootstrapped data] was statistically significant (*P* < 0.001), indicating that pazopanib was 32% less toxic than sunitinib with similar efficacy, consistent with the published quality of life data.

As another example, we compared the combination of lenvatinib + everolimus with either agent alone in a three-arm randomized clinical trial. In this study, lenvatinib/everolimus significantly prolonged PFS compared with everolimus alone [median 14.6 months vs. 5.5 months; hazard ratio (HR) 0.40, *P* = 0.0005], but not compared with lenvatinib alone (7.4 months; HR 0.66, *P* = 0.12; ref. 17). A total of 71% of patients on the combination required a dose reduction, versus 62% of people on lenvatinib and 26% on everolimus. Using the M5 toxicity scale, everolimus yielded a WTS of 24, lenvatinib gave a WTS of 49, and the WTS for the combination was 43, mostly by virtue of less proteinuria seen in the group receiving combination therapy versus lenvatinib alone. Comparing lenvatinib alone with the lenvatinib–everolimus combination, given the numerically greater toxicity of lenvatinib alone [49 − 43 = 6, 95% CI (−3, 16) using bootstrapped data, *P* = not significant], the data provided by the WTS model support use of the combination over lenvatinib alone.

### Application of WTS concept to PRO-CTCAE

The WTS concept is not confined to NCI-CTCAE version 4.03-defined trials. For example, in the patient-reported outcomes tool PRO-CTCAE, both severity of the toxicity and chronicity are evaluated. As a result, an element of toxicity “area under the curve” is achieved that is not part of the regular CTCAE. A WTS model for this idea is provided in Supplementary Table S3. The WTS for each adverse event can be defined as the product of three factors: toxicity weight (W), proportion of patients with a specific toxicity, and either functional interference score or intensity score. The sum of these products becomes the WTS for the specific treatment and can be defined for a particular time period. The sum of the WTS for each assessed time period divided by the total time on treatment gives a marker of intensity of adverse events per unit time. Like the CTCAE WTS assessment, all low-grade or high-grade toxicities could be combined and weighted by specific grades, like model M1, or individually, like model M5.
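The three-factor product and the per-unit-time normalization described above can be sketched as follows. The function names, the tuple layout, and every number are hypothetical; the actual weights and scoring scheme are those of Supplementary Table S3, which is not reproduced here.

```python
def pro_wts_for_period(events):
    """WTS for one assessment period.

    events: iterable of (toxicity weight, percent of patients, interference or
    intensity score) tuples, one per adverse event.
    """
    return sum(weight * pct * score for weight, pct, score in events)


def wts_per_unit_time(period_scores, total_time):
    """Sum of per-period WTS values divided by total time on treatment."""
    return sum(period_scores) / total_time


# Two hypothetical assessment periods with two adverse events each.
period_1 = [(1, 40.0, 2), (4, 10.0, 3)]  # 1*40*2 + 4*10*3 = 200
period_2 = [(1, 30.0, 1), (4, 5.0, 2)]   # 1*30*1 + 4*5*2  = 70
scores = [pro_wts_for_period(period_1), pro_wts_for_period(period_2)]
print(scores)                          # [200.0, 70.0]
print(wts_per_unit_time(scores, 3.0))  # 90.0 per unit time over 3 time units
```

Dividing by time on treatment is what gives this variant its "area under the curve" character, since persistent low-grade events accumulate across assessment periods.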

Few trials have been completed with data reported using PRO-CTCAE. One available study is RTOG 1012, which tested placebo versus liquid honey versus honey lozenge in relieving radiation-associated side effects in lung cancer patients. In this study, clinicians substantially underreported patient adverse events compared with patients themselves. G1–2 and G3–4 events from CTCAE and low- and high-grade PRO-CTCAE events were reported (18). Examining the sum of low-grade events and weighting high-grade events 4 times as much as low-grade events (considering intensity, proportions of patients, and functional impairment as separate variables) yields a WTS of 51 for control patients and 54 for the experimental group; no statistically significant difference was observed between experimental and control groups regarding adverse event outcomes in the trial.

## Discussion

Phase I trials of novel systemic therapeutic agents are often designed to achieve a low rate of dose-limiting toxicity (DLT), for example, no more than 1 in 6 in a 3 + 3 design to select the maximum-tolerated dose for phase II and later studies. Because some of the drugs examined in this analysis had dose reduction rates over 50%, it is clear that improvements in the incorporation of toxicity in dose selection in cancer clinical trials are necessary. The WTS provides a straightforward method to summarize toxicity in a clinical trial. In our own example, we focused on kinase-directed therapies, and future work will be necessary to expand to more traditional cytotoxic agents or immunotherapeutic agents.

The clinician-modified scale M5 had the strongest correlation with dose reduction rates in placebo-controlled clinical trials, and the simpler M3 scale had the next strongest; the confidence intervals on the correlation coefficients indicate that any of a number of linear models may be acceptable for scoring cumulative toxicity burden.

This work is one of many efforts to assess the overall burden of toxicity experienced by patients during a clinical trial. For example, the TAME method captured acute toxicity (T), late adverse events (A), mortality risk (M), and end results (outcomes, E) in radiation oncology trials (19). TAME gives all high-grade toxicity the same weight and does not include lower-grade toxicity in its assessment. Lee and colleagues used patient-level data and a regression method to assign an overall toxicity burden score (TBS) that defines dose-limiting toxicity (TBS ≥1) in a phase I clinical trial, replacing individual high-grade toxicities as a rationale for declaring a dose-limiting toxicity (20). Their work was extended to show that different types of toxicities of the same grade create different toxicity burdens, underscoring the idea that not all adverse events have the same effect on a patient (21). Other methods used ANOVA (22) or Bayesian approaches (23) focusing more on toxicity prediction during a trial to determine which patients will have undue toxicity as the study is conducted.

Though we developed the WTS model to examine trial-level data, this model can be used to trigger dose reductions for patients in a clinical trial (Table 4; Supplementary Fig. S2), using WTS as a target or threshold for dose reduction for future patients, much as the TBS of Lee et al. (20) was designed to do in earlier-phase clinical trials. The investigator may choose a WTS target that is clinically relevant based on the parameters for the specific trial.
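The threshold idea can be sketched in code. The example below is a simplified illustration under stated assumptions, not the procedure of Table 4: it assumes the WTS of an arm is the sum, over toxicities and grades, of a grade weight multiplied by the fraction of patients with that toxicity at that grade. The grade weights, toxicity frequencies, and target difference are all hypothetical and are not the weights of models M1–M5; the function names are illustrative.

```python
from typing import Dict, Mapping

# Hypothetical grade weights (NOT the published M1-M5 weights).
ILLUSTRATIVE_GRADE_WEIGHTS: Dict[int, float] = {1: 1.0, 2: 2.0, 3: 3.0, 4: 4.0}

# An arm maps each toxicity name to {grade: fraction of patients affected}.
Arm = Mapping[str, Mapping[int, float]]


def weighted_toxicity_score(
    arm: Arm, grade_weights: Mapping[int, float] = ILLUSTRATIVE_GRADE_WEIGHTS
) -> float:
    """Sum weight * fraction over every toxicity and grade in one arm."""
    return sum(
        grade_weights[grade] * fraction
        for grades in arm.values()
        for grade, fraction in grades.items()
    )


def exceeds_wts_target(
    treatment: Arm,
    control: Arm,
    target_difference: float,
    grade_weights: Mapping[int, float] = ILLUSTRATIVE_GRADE_WEIGHTS,
) -> bool:
    """True if the treatment-minus-control WTS difference is above the target."""
    difference = weighted_toxicity_score(
        treatment, grade_weights
    ) - weighted_toxicity_score(control, grade_weights)
    return difference > target_difference


# Made-up interim toxicity frequencies, for illustration only.
treatment_arm = {
    "diarrhea": {1: 0.30, 2: 0.20, 3: 0.10},
    "fatigue": {1: 0.25, 2: 0.10},
}
placebo_arm = {
    "diarrhea": {1: 0.10},
    "fatigue": {1: 0.20, 2: 0.05},
}

# Hypothetical investigator-chosen target for the WTS difference.
reduce_dose = exceeds_wts_target(treatment_arm, placebo_arm, target_difference=1.0)
print(reduce_dose)
```

In this sketch the investigator-chosen target is passed as `target_difference`; in practice the target and the weighting model would be specified in the trial protocol.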

The updated ASCO value framework also uses fixed scores for specific grades of toxicities, and it also uses data not generally available in reports of clinical trials (i.e., scoring of significant toxicity that persists one year after start of therapy). Instead of the semiquantitative toxicity burden assessment used in the ASCO value framework, the quantitative toxicity summaries presented here (WTS) or by Lee et al. (TBS) may be simpler metrics to include in further iterations of the ASCO value framework; such toxicity assessment in the ESMO framework has been deferred for future discussion. The present work differs from prior studies in that it conducts a comparative analysis of study-level data and does not require the examination of individual patient-level data to draw conclusions about the toxicity of one arm of a clinical trial versus another.

Given the striking difference between toxicity reported by patients themselves and clinician-scored toxicity (24), and the impact of patient-reported toxicity in improving patient outcomes (3), it is clear that the application of this or other methods for calculation of toxicity burden in clinical trials will need to incorporate patient-reported outcomes data as exemplified by the PRO-CTCAE (25). Applying the WTS model to the PRO-CTCAE (Supplementary Table S3) is one approach that may merit examination in the future, though it will benefit from patient and further clinician input. The first step in applying this model to patient-reported outcome data will be to see what the WTS metric yields as such clinical trials are completed. Similarly, some aspects of this work will need to be repeated once data have been gathered using the newest version of the CTCAE, version 5.0.

We focused this work on small molecule “targeted” agents, given the clinical experience that the doses approved for clinical use of targeted agents are often too high for routine use and must be reduced (26). There are many reasons for the observed toxicities. Oral kinase inhibitors frequently are given on a continuous rather than an intermittent basis. The use of oral agents introduces the variable of absorption kinetics, which is not an issue for agents administered intravenously. Most targeted drugs are administered to adults as a flat dose, without regard to body surface area or body mass. Finally, in the clinical trials that led to approval of targeted agents in cancer, doses used in phase III were typically not adjusted from prior clinical trials to account for toxicity seen in phase II, suggesting that inadequate dose finding was conducted in earlier phase trials of such agents.

Limitations of this work include the fact that reported toxicities are the worst ones experienced by patients and do not reflect toxicity over time, which will often abate with dose reduction or interruptions (2). PRO-CTCAE begins to address this issue in a patient-focused manner. Second, we only examined small molecule oral kinase inhibitors and similar compounds, with which many toxicities of standard cytotoxic agents, such as myelosuppression, are absent. Further work will extend our modeling to other cancer treatments as well as an examination of individual patient-randomized data, to determine the models for toxicity that may be most appropriate for conduct of dose-adapted randomized trials.

In summary, despite potential drawbacks, we believe this work can spur further quantitative examination of toxicity in cancer clinical trials. For example, Supplementary Table S1 supports the idea of structured toxicity data reporting (27). To combine the data using the WTS methodology, the scale to use (M1–M5) may vary depending on the ease of use and clinical needs. The strongest correlations were seen with the relatively complex toxicity-by-toxicity scoring method of model M5, but model M3 provided similar information and is much more easily calculated. The WTS metric allows an investigator to define a clinically relevant difference in overall toxicity burden that triggers dose reductions for patients on cancer clinical trials. In the end, this analysis points out that more sophisticated evaluation of both radiological and patient-reported outcomes can be compared simultaneously with toxicity endpoints in new models to provide an ever-greater degree of personalization of oncological care.

## Disclosure of Potential Conflicts of Interest

No potential conflicts of interest were disclosed.

## Authors' Contributions

**Conception and design:** R.G. Maki

**Development of methodology:** M. Carbini, M. Suárez-Fariñas, R.G. Maki

**Acquisition of data (provided animals, acquired and managed patients, provided facilities, etc.):** M. Carbini, R.G. Maki

**Analysis and interpretation of data (e.g., statistical analysis, biostatistics, computational analysis):** M. Carbini, M. Suárez-Fariñas, R.G. Maki

**Writing, review, and/or revision of the manuscript:** M. Carbini, R.G. Maki

**Administrative, technical, or material support (i.e., reporting or organizing data, constructing databases):** R.G. Maki

**Study supervision:** R.G. Maki

## Acknowledgments

This work was supported in part by the Quad W Foundation; the idea for this manuscript was derived from discussions at the ECCO-AACR-EORTC-ESMO Workshop on Methods in Clinical Cancer Research. The authors are grateful to the reviewers for substantive changes that improved the manuscript. R.G. Maki receives support from the Sarcoma Alliance for Research through Collaboration and the Fondazione Enrico Pallazzo.

The costs of publication of this article were defrayed in part by the payment of page charges. This article must therefore be hereby marked *advertisement* in accordance with 18 U.S.C. Section 1734 solely to indicate this fact.