Several studies have reported gene signatures that were predictive of response to chemotherapy in internal cross-validation, and some also showed promising predictive values in small independent validation cohorts. As a result, it has become increasingly common to employ gene expression profiling as a predictive marker discovery tool in Phase II clinical trials. The rationale is that a semi-quantitative and unbiased look at thousands of transcripts in cancers will reveal strong associations between the expression of some genes and response to therapy. However, there are several reasons why this supervised approach to predictor discovery may not yield reliable predictors under many circumstances. The multiple comparison problem inherent to microarray analysis is well known. It stems from the large number of variables (genes) that are compared between two usually small data sets. This leads to a large number of very small p-values, many of which are due to chance. Another confounder relates to the coordinated expression of thousands of genes. Individual transcripts are not independent variables; their expression levels are highly correlated with one another. Clinically important phenotypic characteristics of cancer are often associated with, and likely caused by, this coordinated expression of thousands of genes. For example, estrogen receptor (ER)-positive breast cancers differ from ER-negative cancers in the expression of several thousand genes. The gene expression pattern of high-grade breast cancers is also very different from that of low-grade breast cancers. These structured, large-scale gene expression patterns that are associated with clinical phenotypic features can have a profound influence on the predictive marker discovery process.

For example, ER-negative, high-grade breast cancers are more sensitive to many different types of chemotherapy than ER-positive, low-grade tumors. A simple comparison of the transcriptional profiles of breast cancers that responded to chemotherapy with those that did not can reveal many differentially expressed genes. However, most of these genes will reflect the gene expression differences that underlie the phenotypic differences between the two response groups. The resulting pharmacogenomic response predictor may therefore be a predictor of phenotype (i.e. high-grade, ER-negative cancers). To what extent these signatures include genes that are predictive of response to a particular drug, as opposed to being dominated by phenotype-associated genes predictive of general chemotherapy sensitivity, remains unknown. The probability that a simple supervised pharmacogenomic discovery approach can lead to regimen-specific predictors depends on (i) the extent to which the response groups are balanced for strong phenotypic markers (or the extent to which the analysis can adjust and account for them) and (ii) the extent of the molecular differences that determine drug-specific response. If drug sensitivity is influenced by 2-3 fold higher or lower expression of a few dozen genes, such a gene signature may not be readily discovered through supervised pharmacogenomic analysis of data from a typical Phase II trial including 30-80 patients. These modest gene expression differences between responders and non-responders will be masked by the large-scale differences due to any phenotypic imbalance between the groups. The technical noise of microarray experiments can also obscure small-scale gene expression differences. For example, when 24,000 measurements are performed (e.g. Affymetrix U133A gene chip) and the overall concordance in a technical replicate is 97.98%, 1.31% of all measurements can have > 2-fold variation. This means that 314 genes can be > 2-fold decreased or increased from one experiment to another due to technical noise alone.
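The arithmetic behind the technical-noise figure can be checked in a few lines. This is only a sketch: the two constants come from the text above, while the function and variable names are illustrative.

```python
# Figures taken from the text: about 24,000 measurements per array, of which
# 1.31% show > 2-fold variation between technical replicates.
N_MEASUREMENTS = 24_000
FRACTION_OVER_2FOLD = 0.0131


def expected_noisy_measurements(n_measurements: int, fraction: float) -> int:
    """Expected count of measurements exceeding the fold-change cut-off by noise alone."""
    return round(n_measurements * fraction)


noisy = expected_noisy_measurements(N_MEASUREMENTS, FRACTION_OVER_2FOLD)
print(noisy)  # 314
```

A true signal carried by a few dozen genes is therefore several-fold smaller than the number of measurements that can cross the same 2-fold threshold by noise alone.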

To illustrate these points, we performed simulation experiments and examined whether supervised analysis of gene expression data from Phase II studies could have identified HER-2 over-expression as a predictor of response to trastuzumab therapy. Over-expression of HER-2 protein (and mRNA) due to gene amplification is the only currently known predictor of response to trastuzumab therapy in breast cancer. Approximately 30% of all breast cancers have HER-2 over-expression, and about 25-30% of HER-2 over-expressing cases respond to trastuzumab therapy. Real gene expression data from 132 newly diagnosed breast cancers were used to simulate 50,000 single-agent Phase II trastuzumab studies. True HER-2 gene amplification was assessed by FISH for each case. Only 3.67% of the simulated studies yielded HER-2 mRNA as the top predictor; >96% of the individual “studies” picked a different gene as the most predictive of trastuzumab response. HER-2 was included in the top 10-gene list only 9.73% of the time. However, when HER-2 was defined a priori as a potential predictor, 99.6% of the simulated studies confirmed statistically significant over-expression among responders. This observation has led us to believe that candidate predictive marker testing may be more efficient than de novo predictor discovery in conjunction with Phase II clinical trials. We suggest a tandem, two-step Phase II trial design to rapidly evaluate a priori defined response markers in the context of a prospective clinical trial.
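The ranking step of such a simulation can be sketched as follows. This is not the authors' procedure or data. It draws a purely synthetic cohort in which gene 0 stands in for HER-2 mRNA and all other genes are independent Gaussian noise. The trial size, the number of genes, the size of the HER-2 shift and the use of a Welch t-statistic for ranking are all assumptions made for illustration. The sketch is not expected to reproduce the percentages reported above, and it does not cover the a priori significance test.

```python
import random


def welch_t(a, b):
    """Welch t-statistic for two lists of numbers (each needs at least 2 values)."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    va = sum((x - ma) ** 2 for x in a) / (len(a) - 1)
    vb = sum((x - mb) ** 2 for x in b) / (len(b) - 1)
    se = (va / len(a) + vb / len(b)) ** 0.5
    return (ma - mb) / se if se > 0 else 0.0


def simulate(n_trials=100, n_cases=132, n_genes=200, trial_size=40,
             her2_prevalence=0.30, response_rate=0.275, her2_shift=3.0, seed=0):
    """Return (evaluable trials, trials with gene 0 ranked first, trials with gene 0 in top 10)."""
    rng = random.Random(seed)
    # Synthetic cohort (assumption): gene 0 plays the role of HER-2 mRNA.
    her2_pos = [rng.random() < her2_prevalence for _ in range(n_cases)]
    expr = [[rng.gauss(0.0, 1.0) for _ in range(n_genes)] for _ in range(n_cases)]
    for case, positive in enumerate(her2_pos):
        if positive:
            expr[case][0] += her2_shift
    evaluable = top1 = top10 = 0
    for _ in range(n_trials):
        cases = rng.sample(range(n_cases), trial_size)
        # Only HER-2-positive cases can respond, at the assumed response rate.
        responds = [her2_pos[c] and rng.random() < response_rate for c in cases]
        resp = [c for c, r in zip(cases, responds) if r]
        nonresp = [c for c, r in zip(cases, responds) if not r]
        if len(resp) < 2 or len(nonresp) < 2:
            continue  # too few cases in one group to compute a t-statistic
        evaluable += 1
        scores = [abs(welch_t([expr[c][g] for c in resp],
                              [expr[c][g] for c in nonresp]))
                  for g in range(n_genes)]
        rank = sum(s > scores[0] for s in scores) + 1  # rank of gene 0 among all genes
        top1 += rank == 1
        top10 += rank <= 10
    return evaluable, top1, top10


evaluable, top1, top10 = simulate()
if evaluable:
    print(f"gene 0 ranked first in {top1 / evaluable:.1%} of {evaluable} evaluable trials")
    print(f"gene 0 in the top 10 in {top10 / evaluable:.1%} of evaluable trials")
```

With only a handful of responders per simulated trial, noise genes can produce large t-statistics by chance, which is the mechanism the paragraph above describes.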

Tandem, two-step Phase II clinical trial design for predictive marker evaluation.

Enough is known about the mechanism of action of most drugs that one could rationally propose one or more response predictors. These could include the expression level of a single gene, complex gene signatures or any other molecular measurement. The candidate predictor must be fully defined, including cut-off values for positivity and negativity, before testing in a clinical trial. Conceptually, testing a rationally designed response predictor in a prospective clinical trial is no different from testing a candidate drug in a therapeutic study. The two-stage Phase II trial design has been used for several decades to identify drugs with promising clinical activity. The goal of these studies is to determine whether a drug has enough clinical activity to warrant larger-scale evaluation. During the first stage of a classical optimal two-stage Phase II study, “n1” patients are entered into the trial, and if fewer than “r1” responses are observed, accrual terminates for lack of activity. Otherwise, accrual continues to a total of “n” evaluable patients. At the end of this second stage, the drug is recommended for further evaluation if the final response rate is > “r”.
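The operating characteristics of such a two-stage rule follow from binomial probabilities. The sketch below follows the stopping convention stated above (stop if fewer than r1 responses among n1 patients). For exact integer comparisons it expresses the final threshold as a response count that must be exceeded, which is equivalent to a rate threshold of that count divided by n. The design parameters in the usage lines are hypothetical and do not come from the text.

```python
from math import comb


def binom_pmf(k: int, n: int, p: float) -> float:
    """Probability of exactly k responses among n patients at true response rate p."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


def two_stage_characteristics(n1: int, r1: int, n: int, r: int, p: float):
    """Return (P(early termination), P(drug recommended)) at true response rate p.

    Stage 1 stops if fewer than r1 responses are seen among n1 patients.
    After n patients in total, the drug is recommended if responses exceed r.
    """
    p_early_stop = sum(binom_pmf(k, n1, p) for k in range(r1))
    n2 = n - n1
    p_recommend = 0.0
    for x1 in range(r1, n1 + 1):        # stage 1 outcomes that continue
        for x2 in range(n2 + 1):        # stage 2 outcomes
            if x1 + x2 > r:
                p_recommend += binom_pmf(x1, n1, p) * binom_pmf(x2, n2, p)
    return p_early_stop, p_recommend


# Hypothetical design: n1 = 10, r1 = 1, n = 29, final threshold r = 3 responses.
for true_rate in (0.05, 0.20):
    pet, rec = two_stage_characteristics(10, 1, 29, 3, true_rate)
    print(f"p={true_rate:.2f}: P(early stop)={pet:.3f}, P(recommend)={rec:.3f}")
```

Evaluating the rule at an uninteresting and at a target response rate gives the false-positive rate and the power of the design, respectively.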

A similar Phase II trial design can be applied to prospectively evaluate putative response markers in conjunction with new drugs. Assume that a drug has completed Phase I evaluation and a dose has been selected for Phase II testing, that at least one (but preferably more than one) putative predictive marker is available, and that the response rate in unselected patients is currently unknown. A tandem two-stage Phase II clinical trial design could be applied to test the drug and the predictor(s) simultaneously. The goal of the study is to determine whether the drug is likely to have a certain level of activity in unselected patients and, if the activity is below the level of interest, whether a particular patient selection method can enrich the responding population enough to meet the targeted level of activity in the molecularly selected group. The study starts out as a 2-stage Phase II trial for unselected patients with an early stopping rule for futility. If a sufficient number of responses is observed during the first stage, the study proceeds to the second stage to establish the benefit rate more precisely in unselected patients. However, if too few events are seen during the initial stage, instead of stopping accrual for futility, the trial remains open for response marker-positive patients only, and a second optimal 2-stage trial commences. This second step is introduced because it is very unlikely that the small group of patients who participated in the first stage (typically n<20) included a sufficient number of marker-positive cases to draw conclusions about the activity of the drug in this molecularly defined subset. If too few events occur after accruing “n” marker-positive cases during the second step of the study, the trial is discontinued following the early stopping rules and the marker is rejected. Otherwise, the study proceeds to complete accrual of additional marker-positive patients in order to estimate the benefit rate more precisely. Sample size calculations for the tandem 2-step design follow the same rules as for a classical 2-stage or Bayesian Phase II design.
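The decision flow at the end of each first stage can be written as a small function. This is a sketch of the logic described above, not a trial protocol: the function name, the action strings and the example thresholds are illustrative assumptions, and "sufficient" is taken to mean at least the pre-specified minimum number of responses.

```python
def next_action(step: int, responses: int, min_responses: int) -> str:
    """Decide what follows the first stage of step 1 (unselected) or step 2 (marker-positive).

    step          -- 1 for the trial in unselected patients, 2 for marker-positive patients
    responses     -- responses observed in the first stage of that step
    min_responses -- pre-specified minimum needed to continue (hypothetical value)
    """
    sufficient = responses >= min_responses
    if step == 1:
        if sufficient:
            return "continue to stage 2 in unselected patients"
        # Instead of stopping for futility, restrict accrual to marker-positive cases.
        return "start step 2: accrue marker-positive patients only"
    if step == 2:
        if sufficient:
            return "complete accrual of marker-positive patients"
        return "stop trial and reject marker"
    raise ValueError("step must be 1 or 2")


print(next_action(step=1, responses=0, min_responses=1))
print(next_action(step=2, responses=2, min_responses=1))
```

The only departure from a classical two-stage design is the first branch: failure in unselected patients triggers the marker-selected step rather than ending the trial.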

If there are multiple, non-overlapping response markers for the same drug, all of these could be tested in a parallel multi-arm study. Accrual to each marker arm could occur simultaneously, with results analyzed separately for each arm. If the predictors capture overlapping patient populations, more complex adaptive randomization designs could be applied to preferentially randomize patients into better-performing marker arms. It is expected that only a small fraction of the total patient population will be marker-positive, because the sensitive population is small and the marker is assumed to be reasonably sensitive and specific. Therefore, this design implies that a large number of patients will be screened during the second step of the study to find marker-positive individuals who are eligible for therapy. The exact number needed to screen will depend on marker prevalence. To maximize eligibility for treatment among the screened patients, multiple different predictors can be tested simultaneously on each case. Patients who are positive for any given marker will receive treatment, but each marker arm is analyzed separately in a parallel multi-arm design.
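The dependence of screening effort on marker prevalence can be made concrete with a short calculation. This sketch gives only the expected number of patients to screen for a single marker; the accrual targets and prevalence values in the usage lines are hypothetical, and the actual number in any one trial would vary around this expectation.

```python
from fractions import Fraction
from math import ceil


def number_needed_to_screen(n_marker_positive: int, prevalence: float) -> int:
    """Expected number of patients to screen to find n_marker_positive marker-positive cases."""
    if not 0 < prevalence <= 1:
        raise ValueError("prevalence must be in (0, 1]")
    # Fraction(str(...)) avoids floating-point artefacts such as 30 / 0.1 != 300.
    return ceil(n_marker_positive / Fraction(str(prevalence)))


# Hypothetical example: 30 marker-positive patients are needed for one arm.
for prevalence in (0.30, 0.10, 0.05):
    print(prevalence, number_needed_to_screen(30, prevalence))
```

Because the required screening grows as the inverse of prevalence, testing several predictors on each screened case raises the share of screened patients who become eligible for treatment.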

98th AACR Annual Meeting-- Apr 14-18, 2007; Los Angeles, CA