Developing and Validating a Multivariable Prognostic-Predictive Classifier for Treatment Escalation of Oropharyngeal Squamous Cell Carcinoma: The PREDICTR-OPC Study

Abstract Purpose: While there are several prognostic classifiers, to date, there are no validated predictive models that inform treatment selection for oropharyngeal squamous cell carcinoma (OPSCC). Our aim was to develop clinical and/or biomarker predictive models for patient outcome and treatment escalation for OPSCC. Experimental Design: We retrospectively collated clinical data and samples from a consecutive cohort of OPSCC cases treated with curative intent at ten secondary care centers in United Kingdom and Poland between 1999 and 2012. We constructed tissue microarrays, which were stained and scored for 10 biomarkers. We then undertook multivariable regression of eight clinical parameters and 10 biomarkers on a development cohort of 600 patients. Models were validated on an independent, retrospectively collected, 385-patient cohort. Results: A total of 985 subjects (median follow-up 5.03 years, range: 4.73–5.21 years) were included. The final biomarker classifier, comprising p16 and survivin immunohistochemistry, high-risk human papillomavirus (HPV) DNA in situ hybridization, and tumor-infiltrating lymphocytes, predicted benefit from combined surgery + adjuvant chemo/radiotherapy over primary chemoradiotherapy in the high-risk group [3-year overall survival (OS) 63.1% vs. 41.1%, respectively, HR = 0.32; 95% confidence interval (CI), 0.16–0.65; P = 0.002], but not in the low-risk group (HR = 0.4; 95% CI, 0.14–1.24; P = 0.114). On further adjustment by propensity scores, the adjusted HR in the high-risk group was 0.34, 95% CI = 0.17–0.67, P = 0.002, and in the low-risk group HR was 0.5, 95% CI = 0.1–2.38, P = 0.384. The concordance index was 0.73. Conclusions: We have developed a prognostic classifier, which also appears to demonstrate moderate predictive ability. External validation in a prospective setting is now underway to confirm this and prepare for clinical adoption.

To undertake external (Grade 3) validation of the biomarker classifier, we used an independent cohort (n=385) of consecutive oropharyngeal cancer patients undergoing curative treatment between 2002 and 2011, collated as part of the HPV UK Prevalence study 1 , from Aintree University Hospital NHS Foundation Trust, Liverpool (n=146), University Hospitals Bristol NHS Foundation Trust (n=66), and University Hospitals Coventry and Warwickshire NHS Trust (n=115).58 samples in the validation cohort were missing data on centre of origin.
The centres were selected to ensure a mix of geographic location, centre size and institutional treatment protocols.
Patients were all treated with curative intent, with either chemoradiotherapy, radiotherapy alone or surgery with or without radiotherapy/chemoradiotherapy.
In accordance with UK National Multidisciplinary Guidelines 2 all patients underwent regular follow-up after treatment and were reviewed every 6-8 weeks in first year, every 8-12 weeks in second year, and every 4-6 monthly thereafter, for a period of at least 3 years or until death.
Baseline characteristics, treatment, and outcome data were collated from patients' medical records by clinicians who were blinded to biomarker analyses.
All staging was performed according to the AJCC/UICC TNM 7 th edition clinical staging manual 3 and converted to the new staging system following the publication of AJCC/UICC TNM 8 th edition 4 .Formalin-fixed paraffin-embedded tumour samples from diagnostic biopsies were obtained for each patient.The study received ethical approval from the National Research Ethics Service Committee West Midlands (10/H1210/9).

Laboratory methods
Haematoxylin and eosin (H&E) stained sections from formalin-fixed paraffin-embedded tissue blocks were reviewed by a pathologist to confirm the diagnosis of squamous cell carcinoma and TMAs were constructed using an automated TMA machine (TMA Grand Master 3DHISTECH, Hungary) according to published guidelines 5 .Up to three 1mm cores were obtained from each of the  S1).H-score of continuous variables (BCL2, COX2, Cyclin D1, EGFR-external, HIF1α, PLK1, Survivin) was scaled to 0-10 by dividing H-score by 30, as previously described 8 , and further transformed to z-scores (mean = 0, standard deviation = 1) to enable the input scores for this study to be comparable to scores arising from different clinical cohorts and distributions.p16 was dichotomised into positive and negative categories according to a previously described, clinically validated cut off ('strong and diffuse nuclear and cytoplasmic staining present in ≥70% of the tumour'; H-score equivalent ≥2 intensity x ≥70% = H score ≥140) 9,10 .

Missing value imputation
We first undertook a complete case analysis, with no imputation of missing data, since the survival analysis demonstrated no statistically significant differences in survival between subjects with complete data sets and those with missing data (Supplementary Table S3, Supplementary Figure S2).We then undertook imputation of missing values for molecular variables in four different ways using predictive mean matching and compared the models proposed by complete data sets with those created using imputation of missing values.Four imputation methods were based on combination of two parameters: 1) pmmtype = {1, 2} and 2) boot.method= {simple, approximate bayesian} using R package Hmisc v4.1-1.For details of parameters, see R function Hmisc::aregImpute.

Outcomes
A Cox proportional hazards model was used for survival analyses.The outcome of the model was interpreted as a hazards ratio, which represents an incremental increase in the hazard (event: death of any causes (OS) or disease specific (DSS)) in the intermediate-and high-risk groups relative to low-risk group.
Positive coefficients in the model indicate association with poor outcome, while negative coefficients indicate association with good outcome.

Model building, predictor handling and risk groups
All models were trained exclusively on a pre-assigned training cohort (60% of data) only.The independent validation cohort (40% of data) was used to predict individual patient risk scores using the models trained on the training cohorts.Hscore scaling and z-transformation was performed on training and validation sets separately.For continuous variables, mean and standard deviation was estimated, and Wilcox rank-sum test was used to assess the difference between training and validation cohorts.For categorical (factor) variables, counts were reported, and Fisher's exact test was used to assess the difference between training and validation cohorts.P-values were adjusted for multiple comparisons using Benjamini-Hochberg procedure thereby controlling for false discovery rate.
For survival modeling, differences between the survival groups were assessed by log-rank test.
Univariable models were created using a Cox proportional hazards model.Model performance was also evaluated.Discrimination tests refer to the model's ability to discriminate between those who have an event and those who do not 11 .This was assessed by the Concordance Index (Harrell's C-Index), which is the probability that for two randomly selected cases the predicted and the actual observed event times have the same order.C-index of 0.5 represents random agreement, and 1.0 perfect agreement between model prediction and reality.
Performance was also evaluated using sensitivity, positive predictive value (PPV, precision) and negative predictive value (NPV).For sensitivity, PPV and NPV; patients with censored survival status below 5 years were removed.From the remaining patients, in order to establish ground truth, patients having an event before 5 years were considered as true high-risk and the ones having no event in

Data on comparison of risk factors between the low-and high-risk groups in the molecular biomarkers only model
T category showed statistically significant heterogeneity (p-adjusted=0.015) between low-and high-risk groups when stratified by surgery (Supplementary Table S7).Gender, N category, smoking status and HPV status did not.To assess the effect of the heterogeneity of T category across cases in risk groups when stratified by surgery a comparison was undertaken of overall survival outcomes across T categories focussing on the high-risk group (Supplementary Figure S4).A significantly higher survival was still seen in the T2 high-risk patients treated with surgery compared to non-surgical treatments (HR=0.37,95% CI=0.14-1, p=0.042).There were trends in survival between the two modalities in the T1 and T3 groups with widely separated curves, but these did not reach statistical significance.
Materials and methods. 1  Supplementary Table S5.
Univariable associations of all markers and clinical covariates with overall survival.S11.

Sensitivity comparison of models using complete cases set versus imputed
Imputation was performed on molecular biomarkers only and not on patient's clinical covariates.
donor blocks and transferred to a recipient block.H&E-stained sections of the TMAs were examined to quality assure tumour sampling prior to downstream testing.Freshly cut 4μm sections were used to perform immunohistochemistry (BCL-2, COX-2, Cyclin D1, EGFR external, HIF-1α, p16, PLK1, Survivin) and high-risk HPV DNA in situ hybridization (HR-HPV ISH).The assays used were either verified diagnostic tests performed in a pathology laboratory with Clinical Pathology Accreditation (CPA (UK) Ltd) or 'research use only' reagents that were optimized for the study (Supplementary Table

5
years were treated as true low-risk.Calibration tests reflect 'agreement between outcome predictions from the model and the observed outcomes' 11 .Calibration plots were created on training set (resampling: 100 times) depicting estimation of bias-corrected predicted survival probability versus actual (OS or DSS) survival probability values at 5year.For the validation set calibration, predicted survival probabilities (using Training set fit) were plotted against actual (OS or DSS) survival probabilities at 5-year.Hazard regression (hare) from R package polspline (v1.1.12)was used for the estimation of survival probabilities.Calibration analyses were performed using R package rms (v4.5-0).
) shows proportion of missing values across various molecular and clinical covariates.Heatmap (right) shows combinatorial prevalence of missing values.Red indicates presence and grey indicates absence.(A, B) Calibration plots of the training and validation cohorts for the biomarker only and the composite classifiers.(C-F) Comparison of overall survival of surgery versus no surgery groups in high-risk groups across T categories in the validation cohort of the molecular biomarkers only model.

Table S3 .
Overall survival difference between missing data group and complete data group by biomarker, suggesting no survival differences between the samples with complete molecular data versus the samples with missing data.n

Table S6 .
Multivariable prognostic model of overall survival based on molecular biomarkers using the training cohort (n = 266, number of events = 69).

Table S9 .
Multivariable prognostic model of overall survival based on clinical cofactors only, with TNM8, using the training cohort (n= 266, number of events= 69).