## Abstract

Independent validation of risk prediction models in prospective cohorts is required for risk-stratified cancer prevention. Such studies often have a two-phase design, where information on expensive biomarkers is ascertained in a nested substudy of the original cohort.

We propose a simple approach for evaluating model discrimination that accounts for incomplete follow-up and gains efficiency by using data from all individuals in the cohort irrespective of whether they were sampled in the substudy. For evaluating the AUC, we estimated probabilities of risk-scores for cases being larger than those in controls conditional on partial risk-scores, computed using partial covariate information. The proposed method was compared with an inverse probability weighted (IPW) approach that used information only from the subjects in the substudy. We evaluated age-stratified AUC of a model including questionnaire-based risk factors and inflammation biomarkers to predict 10-year risk of lung cancer using data from the Prostate, Lung, Colorectal, and Ovarian Cancer (1993–2009) trial (30,297 ever-smokers, 1,253 patients with lung cancer).

For estimating age-stratified AUC of the combined lung cancer risk model, the proposed method was 3.8 to 5.3 times more efficient compared with the IPW approach across the different age groups. Extensive simulation studies also demonstrated substantial efficiency gain compared with the IPW approach.

Incorporating information from all individuals in a two-phase cohort study can substantially improve precision of discrimination measures of lung cancer risk models.

Novel, simple, and practically useful methods are proposed for evaluating risk models, a critical step toward risk-stratified cancer prevention.

## Introduction

Risk-stratified cancer prevention requires evaluation of risk prediction models in independent prospective studies that do not contribute to model development. Modern cohort studies often collect data on risk factors using a complex two-phase design. Although all subjects contribute information on some questionnaire-based risk factors, information on expensive biomarkers is collected only in a nested substudy of the original cohort, typically stratified with respect to case/control status and certain other covariates observed for all subjects. Methods for analyzing data from such two-phase studies to fit various regression models (e.g., logistic regression or the Cox model) have been well studied (1–9). There has been, however, limited investigation of the efficient use of data from two-phase cohort studies for model validation. We focus on evaluating the discriminatory ability of commonly used models for cancer risk prediction that quantify risks of individuals through an underlying risk-score or “linear predictor,” that is, the sum of the risk factors multiplied by their log-relative risks.

The AUC for evaluating model discrimination can be defined as the probability that the risk-score for a randomly selected subject with the disease is higher than that for a randomly selected subject without the disease. In two-phase studies, one can estimate AUC based only on the second-phase subjects with complete risk factor information using the inverse probability weighted (IPW) approach, which accounts for the bias due to complex nonrandom sampling (10–17). Alternative imputation-based approaches could be more efficient (18, 19), but they typically require parametric assumptions for the imputation distribution. This is less desirable in model validation studies, which are supposed to be empirical in nature.

We propose an alternative method that utilizes information from both the phases of two-phase studies to precisely estimate AUC for evaluating a model predicting risk over a prespecified time interval of τ-years accounting for incomplete follow-up. The subjects not included in the second phase contribute information through a scalar partial risk-score, derived from the observable risk factors. Using this method, we evaluate the AUC of a model incorporating information from questionnaire-based risk factors and four inflammation biomarkers [C-reactive protein (CRP), serum amyloid A (SAA), soluble tumor necrosis factor receptor 2 (sTNFRII), and monokine induced by gamma interferon (CXCL9/MIG)] to predict 10-year risk of lung cancer. We use data from a two-phase cohort study (30,297 ever-smokers, 1,253 patients with lung cancer) from the Prostate, Lung, Colorectal, and Ovarian Cancer (PLCO: 1993–2009) trial.

## Materials and Methods

### Lung cancer data

We evaluated the discrimination of a lung cancer risk prediction model that has potential clinical application for identifying subjects for CT-screening. The CT scan procedure uses low-dose radiation from X-ray machines to scan the body in a helical path and produce detailed images of regions inside the body. It has been demonstrated as an effective method for reducing lung cancer mortality compared with chest radiography (20, 21). The US Preventive Services Task Force recommended annual CT-screening for lung cancer in certain risk factor–based subgroups of individuals (22–24). However, screening could be more beneficial if the high-risk subjects can be identified on the basis of individual risk predictions (21, 25–27), using comprehensive models with the major risk factors.

The original model included some questionnaire-based risk factors: age, gender, race, education, BMI, smoking pack-year categories, number of years since stopped smoking cigarettes, number of years smoked, indicator of >1 pack/day, presence/absence of emphysema, and lung cancer family history (28). It was developed using data from the nonscreening arm of the PLCO study, which included approximately 39,000 ever smokers aged 55 to 74 years with questionnaire-based risk factor information (Fig. 1).

We wanted to evaluate the potential added value of four inflammation biomarkers (CRP, SAA, sTNFRII, and CXCL9/MIG), using data collected in two nested case–control samples within the screening arm of the PLCO study: the discovery sample (960 ever-smokers ages 55–74 years, 500 cases and 460 controls) and the replication sample (929 ever-smokers ages 55–74 years, 468 cases and 461 controls); information on sample selection is available in earlier papers (29, 30). CRP has been shown to be consistently associated with lung cancer (31–33), while the other three were identified as the most promising among a larger set of biomarkers investigated in those studies. Relative risk estimates of these biomarkers were obtained, after multivariate adjustment for other questionnaire-based factors, using the discovery sample (Supplementary Table S6). To compute the partial risk-score, we used published relative risk estimates (Supplementary Table S6) of questionnaire-based risk factors based on the nonscreening arm participants of the PLCO study (28). These estimates were not adjusted for inflammation biomarkers because the biomarker information was not available for those participants. In our investigations using the discovery sample, which had both questionnaire-based risk factors and inflammation biomarkers, the relative risks for the questionnaire-based risk factors changed minimally after adjusting for these biomarkers. We decided to use the published estimates to ensure that the questionnaire-based risk factor information from the screening arm participants was not used for estimating relative risks for those factors.

The current analysis independently evaluates discriminatory accuracy of a model for *τ* = 10-year risk of lung cancer based on the questionnaire-based risk factors and four inflammation biomarkers within categories based on age at enrollment (≤59, 60–64, 65–69, ≥70 years) using the screening arm of the PLCO study and the replication case–control study and accounting for incomplete follow-up. Because neither the questionnaire-based risk factors from the screening arm nor the biomarker information from the replication case–control study was utilized for model building, the resulting two-phase study can be assumed to be independent of model development (Fig. 1).

### Partial risk-score approach

Suppose a study includes |${N_1}$| subjects who developed the disease within a prespecified time interval (i.e., cases) and |${N_0}$| subjects who did not (i.e., controls), with |$N = {N_1} + {N_0}$|. Let |${{\bf X}}$| denote the design vector associated with the risk factors included in a risk prediction model for a binary disease outcome (e.g., lung cancer), and let |${{\bf \beta }}$| denote the corresponding vector of log-relative-risk parameters. The “risk-score,” |$S = {{\bf X\beta }}$|, quantifies the association of risk factors with the disease risk. Here, |${{\bf X}}$| can include individual risk factors and possibly their interactions as specified by the model. Both the model and the log-relative-risk parameters |${{\bf \beta }}$| are prespecified on the basis of analysis of prior studies, and the current study aims to assess the discriminatory accuracy of the given model. Some expensive biomarkers may be measured only on a subsample, selected on the basis of case–control status and certain other covariates. The risk factors observed on all the subjects in the cohort are called phase I risk factors and those observed only on the subsample are called phase II risk factors. Accordingly, we can partition |${{\bf X}}$| as |${{\bf X}} = ( {{{\bf Z}},{{\bf W}}} )$| and |${{\bf \beta }}$| as |${{\bf \beta }} = ( {{{{\bf \beta }}_{{\bf Z}}},{{{\bf \beta }}_{{\bf W}}}} )$|, where |${{\bf Z}}$| denotes the subvector of |${{\bf X}}$| observed on all subjects in the study, |${{\bf W}}$| denotes the subvector of |${{\bf X}}$| observed only on the subjects in the subsample, and |${{{\bf \beta }}_{{\bf Z}}}$| and |${{{\bf \beta }}_{{\bf W}}}$| are the log-relative risks associated with |${{\bf Z}}$| and |${{\bf W}}$|, respectively. If the model includes interaction terms among phase I and phase II risk factors, then |${{\bf Z}}$|, by definition, will not include elements for the corresponding interaction terms because they are not “observable” at phase I.

The AUC is defined as |${\delta _0} = {\rm{Pr}}( {{S_1} \ge {S_0}} )$|, where |${S_1}$| and |${S_0}$| are the full risk-scores for a random case and a random control, respectively. Using the observed and missing design vector profiles, we decompose the full risk-scores into a partial risk-score and a missing risk-score: |${S_d} = {S_{d,{\rm{obs}}}} + {S_{d,{\rm{mis}}}}$|, for |$d = 0,1$|. For subjects included in the second phase, the full risk-scores are observed, but for subjects not included, only the partial risk-scores are observed. Let |${R_1}$| (|${R_0}$|) be the indicator of inclusion of a case (control) in the second-phase sample. The selection probabilities for second-phase subjects, denoted as |${\pi _1}( {{{\bf Z}}_1} ) = \Pr ( {R_1} = 1 \mid {{\bf Z}}_1 )$| for cases and |${\pi _0}( {{{\bf Z}}_0} ) = \Pr ( {R_0} = 1 \mid {{\bf Z}}_0 )$| for controls, are assumed known and may depend on some selection variables in the observed design vector profile.

In two-phase studies, the IPW estimator of AUC (15) can be obtained by the following formula:

|${\hat{\delta }_{{\rm{IPW}}}} = \bigg( \sum\limits_{i = 1}^{{N_1}} \sum\limits_{j = 1}^{{N_0}} \frac{{R_{1i}}}{{\pi _1}( {{\bf Z}}_{1i} )} \frac{{R_{0j}}}{{\pi _0}( {{\bf Z}}_{0j} )} I( {S_{1i}} \ge {S_{0j}} ) \bigg) \bigg/ \bigg( \sum\limits_{i = 1}^{{N_1}} \sum\limits_{j = 1}^{{N_0}} \frac{{R_{1i}}}{{\pi _1}( {{\bf Z}}_{1i} )} \frac{{R_{0j}}}{{\pi _0}( {{\bf Z}}_{0j} )} \bigg)$|
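The IPW formula above can be sketched in a few lines of Python. This is an illustrative sketch only, not the authors' R implementation; the function name `ipw_auc` and the tuple layout are assumptions made for the example. Only phase II subjects (those with inclusion indicator equal to 1) are passed in, so the indicators drop out of the sums.

```python
def ipw_auc(cases, controls):
    """Inverse-probability-weighted AUC from phase II subjects only.

    cases, controls: lists of (full_risk_score, selection_probability)
    tuples for subjects sampled into phase II (hypothetical layout).
    """
    num = 0.0
    den = 0.0
    for s1, pi1 in cases:
        for s0, pi0 in controls:
            w = 1.0 / (pi1 * pi0)  # weight of this case-control pair
            den += w
            if s1 >= s0:           # concordant pair: case score >= control score
                num += w
    return num / den


# Toy data: two sampled cases, two sampled controls.
cases = [(2.0, 0.5), (3.0, 0.25)]
controls = [(1.0, 1.0), (2.5, 1.0)]
auc = ipw_auc(cases, controls)
```

When every sampled subject has the same selection probability, the weights cancel and the function reduces to the ordinary empirical AUC over phase II pairs.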

The IPW approach may be inefficient as it discards the partial risk factor information from the subjects not included in the second phase subsample. To improve efficiency, we propose solving an alternative estimating equation: |$\mathop \sum \limits_{i\ = \ 1}^{{N_1}} \;\mathop \sum \limits_{j\ = \ 1}^{{N_0}} ({\delta _{ij}} - \ {\delta _0}) = 0$|, where

The above decomposition of |${\delta _{ij}}$| into four components corresponds to the inclusion status of the different case–control pairs in the phase II subsample. Each term of Equation 1 includes a conditional probability that the risk-score for a case is greater than that of the control, given the partial risk-scores for the case–control pair. We consider fine categories (e.g., deciles) of the partial risk-score and estimate these conditional probabilities empirically based on the second-phase subjects using sampling weights (details in supplement). Solving the estimating equation for |${\delta _0}$|, the proposed two-phase estimator is given by:

|${\hat{\delta }_{{\rm{TPS}}}} = \frac{1}{{{N_1}{N_0}}}\sum\limits_{i = 1}^{{N_1}} \sum\limits_{j = 1}^{{N_0}} {\hat{\delta }_{ij}}$|,

where |${\hat{\delta }_{ij}}$| is obtained by plugging the estimators of the conditional probabilities into Equation 1. Two assumptions ensure consistency of the proposed estimator: (i) the selection of subjects to the second phase is conditionally independent of the phase II risk factors given the phase I risk factors, and (ii) the disease status and the phase II risk factors are conditionally independent of any sampling selection variables given the partial risk-score. We have derived an influence function-based variance formula for the proposed estimator (details in supplement). The R code implementing these estimators, with examples and illustrations of its use, is available on GitHub (https://github.com/parichoy/two_phase_AUC).
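To make the idea of borrowing information from phase I concrete, the following Python sketch implements a simplified stratum-based plug-in. It is not the authors' estimator and does not reproduce their four-term decomposition: as an assumption for illustration, every case–control pair contributes the estimated probability of concordance given the partial risk-score categories of the pair, with that probability estimated from weighted phase II pairs. The function name and dictionary keys are hypothetical.

```python
from collections import Counter, defaultdict


def stratified_plugin_auc(cases, controls):
    """Simplified stratum-based plug-in AUC (illustrative assumption).

    cases, controls: lists of dicts with keys
      'stratum': category of the partial risk-score (known for everyone)
      'full':    full risk-score, or None if not sampled into phase II
      'pi':      phase II selection probability (used only if 'full' is set)
    """
    # Stratum counts use ALL phase I subjects.
    n1 = Counter(c["stratum"] for c in cases)
    n0 = Counter(c["stratum"] for c in controls)

    # Weighted concordance within each pair of strata uses phase II pairs only.
    num = defaultdict(float)
    den = defaultdict(float)
    for a in cases:
        if a["full"] is None:
            continue
        for b in controls:
            if b["full"] is None:
                continue
            w = 1.0 / (a["pi"] * b["pi"])
            key = (a["stratum"], b["stratum"])
            den[key] += w
            if a["full"] >= b["full"]:
                num[key] += w

    total = 0.0
    for k, n1k in n1.items():
        for l, n0l in n0.items():
            if den[(k, l)] == 0.0:
                raise ValueError(f"no phase II pairs for strata {(k, l)}")
            total += n1k * n0l * num[(k, l)] / den[(k, l)]
    return total / (len(cases) * len(controls))
```

When every subject is in phase II with selection probability 1, this sketch returns the ordinary empirical AUC; subjects without full risk-scores change the result only through the stratum counts.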

### Handling loss to follow-up

In cohort studies, incomplete follow-up can cause misclassification of binary disease status. Here we propose a modification of the method to evaluate the AUC of a model predicting risk over a fixed time period, τ-years, accounting for incomplete follow-up. The subjects who developed disease within τ-years were considered as cases and those who did not were designated as controls. Incomplete follow-up is accounted for by computing the adjusted two-phase and IPW estimators, denoted by |${\hat{\delta }_{{\rm{TPS,adj}}}}$| and |${\hat{\delta }_{{\rm{IPW,adj}}}}$|, based on all the cases and the subset of controls with at least τ-years of follow-up, and adjusting for this selection using the inverse of the probability that a subject is followed up for at least τ-years (details in supplement). This probability, |${\pi _F}( {{\bf Z}} ) = \Pr ( F \ge \tau \mid {{\bf Z}} )$|, where |$F$| denotes the observed follow-up, can be computed using a logistic regression model with the phase I risk factors as predictors. SEs were appropriately adjusted to account for the additional uncertainty introduced by these random weights (details in supplement).
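The follow-up adjustment can be illustrated for the IPW estimator with a short Python sketch. The supplement's exact formula is not reproduced here; this sketch assumes that the phase II selection weight and the follow-up weight simply multiply, and that the follow-up probabilities have already been estimated (for example, by the logistic regression mentioned above). The function name and tuple layout are hypothetical.

```python
def followup_adjusted_ipw_auc(cases, controls, tau):
    """IPW AUC restricted to controls followed for at least tau years.

    cases:    list of (full_score, pi_selection) for phase II cases
    controls: list of (full_score, pi_selection, followup_years, pi_followup)
              for phase II controls; pi_followup is the estimated
              probability of being followed for at least tau years.
    """
    num = 0.0
    den = 0.0
    for s1, pi1 in cases:
        for s0, pi0, followup, pi_f in controls:
            if followup < tau:
                continue  # control status is uncertain, so the subject is dropped
            w = 1.0 / (pi1 * pi0 * pi_f)  # selection weight times follow-up weight
            den += w
            if s1 >= s0:
                num += w
    return num / den


cases = [(2.0, 1.0), (3.0, 1.0)]
controls = [(1.0, 1.0, 12.0, 0.8), (2.5, 1.0, 11.0, 0.5), (5.0, 1.0, 4.0, 0.9)]
adjusted_auc = followup_adjusted_ipw_auc(cases, controls, tau=10.0)
```

In the toy data the third control has only 4 years of follow-up and is excluded, while the two remaining controls are up-weighted by the inverse of their follow-up probabilities.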

### Simulation methods

Data for a cohort study with 50,000 individuals were simulated on the basis of a model including eight independent risk factors: |${X_1},{X_2},{X_3},{X_4}$| are continuously distributed as standard normal variates, and |${X_5},{X_6},{X_7},{X_8}$| are binary, distributed as Bernoulli random variables each with probability |$P = 0.5$|. Let |${{\bf X}} = ( {{X_1}, \ldots ,{X_8}} )$| and |${{\bf \beta }}$| be the associated vector of log-relative risks, set to |${{\bf \beta }} = {( {\log ( {1.1} ),\log ( {1.1} ),\log ( {1.1} ),\log ( {1.1} ),\log ( {1.2} ),\log ( {1.2} ), - \log ( {1.2} ), - \log ( {1.2} )} )^{T}}$|. The age of disease onset |$T$| was generated from the Cox model: |$\lambda ( {t \mid {{\bf X}}} ) = {\lambda _0}( t )\exp ( {{{\bf X\beta }}} )$|, with |${\lambda _0}( t ) = \lambda \gamma {t^{\gamma - 1}}$| being the hazard function of a |${\rm{Weibull}}( {\lambda ,\gamma } )$| distribution. The parameters |$\lambda $| and |$\gamma $| were chosen such that the probabilities of developing the disease by age 50 and 70 years were approximately 5% and 12%, respectively. We generated age of entry and observed follow-up from separate discrete uniform distributions between 40 and 60 years and 19 and 21 years, respectively. We assume that each subject is disease-free at study entry.
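The onset-age generation can be sketched by inverting the Weibull–Cox survival function, |$S( {t \mid {{\bf X}}} ) = \exp \{ - \lambda {t^\gamma }\exp ( {{{\bf X\beta }}} )\} $|. The paper does not report the values of |$\lambda $| and |$\gamma $|; as an assumption for this sketch, they are calibrated so that the baseline (zero linear predictor) probabilities of disease by ages 50 and 70 are exactly 5% and 12%. All names below are illustrative.

```python
import math
import random

# Assumed calibration at zero linear predictor: P(T <= 50) = 0.05, P(T <= 70) = 0.12.
P50, P70 = 0.05, 0.12
GAMMA = math.log(math.log(1 - P70) / math.log(1 - P50)) / math.log(70 / 50)
LAM = -math.log(1 - P50) / 50 ** GAMMA

# Log-relative risks as specified in the simulation setting.
BETA = [math.log(1.1)] * 4 + [math.log(1.2)] * 2 + [-math.log(1.2)] * 2


def draw_risk_factors(rng):
    """Four standard normal and four Bernoulli(0.5) risk factors."""
    normals = [rng.gauss(0.0, 1.0) for _ in range(4)]
    binaries = [1.0 if rng.random() < 0.5 else 0.0 for _ in range(4)]
    return normals + binaries


def onset_age(x, u):
    """Solve S(t | x) = u for t, with u in (0, 1]."""
    lp = sum(xi * bi for xi, bi in zip(x, BETA))
    return (-math.log(u) / (LAM * math.exp(lp))) ** (1.0 / GAMMA)


def baseline_cdf(t):
    """P(T <= t) at zero linear predictor."""
    return 1.0 - math.exp(-LAM * t ** GAMMA)


rng = random.Random(2024)
ages = [onset_age(draw_risk_factors(rng), 1.0 - rng.random()) for _ in range(5)]
```

Because the binary risk factors enter with opposite signs, the mean linear predictor is zero under this |${{\bf \beta }}$|, which is why calibrating at the baseline is a reasonable first approximation to the marginal probabilities.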

We use alternative designs to select a nested case–control sample from the full cohort and assume that data on two risk factors, |${X_7},{X_8}$|, are available only on the subsample. The sampling designs considered were (i) simple case–control sampling, that is, selecting random samples of cases and controls from the cohort, and (ii) stratified case–control sampling, that is, selecting cases and controls by frequency matching at the deciles of the partial risk-score. In each design, we vary the fraction of cases |$( \eta )$| sampled over the values 1, 0.75, 0.5, and 0.25, keeping the case–control ratio approximately |$1\!:\!1$|. The selection probabilities of the controls are empirically estimated from each simulation (details in supplement). We define |$f$| as the ratio of the variance of the partial risk-score |$( {{S_{{\rm{obs}}}}} )$| to the variance of the full risk-score |$( S )$| and use it as an index to quantify the relative importance of the phase I risk factors compared with the entire set of risk factors. For our chosen |${{\bf \beta }}$|, we have |$f \approx 0.75$|. We vary |$f$| over the values 0.1, 0.2, 0.4, 0.5, and 0.75 by modifying the values of the Cox model parameters.
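The value of |$f$| for the chosen |${{\bf \beta }}$| can be checked directly. Because the eight risk factors are independent, the variance of a risk-score is the sum of squared coefficients times the variance of each factor (1 for a standard normal, 0.25 for a Bernoulli with probability 0.5). The short Python calculation below is an illustrative check; the variable names are not from the paper.

```python
import math

b_normal = math.log(1.1)   # coefficient of each of X1..X4
b_binary = math.log(1.2)   # magnitude of the coefficient of each of X5..X8

var_normal = 1.0           # variance of a standard normal variate
var_bernoulli = 0.5 * 0.5  # variance of a Bernoulli(0.5) variable

# Partial risk-score uses the phase I factors X1..X6.
var_partial = 4 * b_normal ** 2 * var_normal + 2 * b_binary ** 2 * var_bernoulli
# Missing risk-score uses the phase II factors X7, X8 (sign does not affect variance).
var_missing = 2 * b_binary ** 2 * var_bernoulli

f = var_partial / (var_partial + var_missing)  # about 0.76
```

The result, roughly 0.76, is in line with the stated |$f \approx 0.75$|.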

For each simulation setting, we generate 1,000 cohort datasets with complete risk factor information and compute the AUC from each dataset as the empirical proportion of all possible case–control pairings for which the risk-score for the case is greater than that of the control. The average of the 1,000 AUC values was considered as the true AUC of the underlying model. Finite sample performance of the proposed estimator was assessed, and its efficiency was evaluated relative to the IPW estimator.

### Impact of incomplete follow-up

We performed additional simulations to assess the impact of incomplete follow-up on the estimation of AUC for a model predicting risk over 10 years. We slightly modified our simulation setting to generate a cohort of 50,000 subjects, all of whom were assumed to be followed up for at least 10 years, and each subject was classified as a case or a control depending on whether the subject developed the disease by 10 years. Because all subjects in this setting were followed beyond the specified time interval, the target AUC could be unambiguously defined as the probability of the risk-score for the cases being greater than that of the controls. We considered scenarios with moderate association of risk factors with disease (i.e., setting the log-relative risk at the |${{\bf \beta }}$| specified in the simulation setting) and strong association of risk factors with disease (i.e., setting the log-relative risk at |$3 \times {{\bf \beta }}$|).

In the same simulation setting, we then introduced random variation in the follow-up time and first considered scenarios where it did not depend on risk factors. We generated observed follow-up from a truncated normal distribution between 0 and 20 years with mean 12 years and standard deviation 4 years. We also considered scenarios where observed follow-up depended on risk factors, by adding a linear predictor term to the mean of the truncated normal distribution. The percentage of variability of observed follow-up explained by this linear predictor was set at 1%, 5%, or 20%. We used simple case–control sampling with a sampling fraction of cases of 0.5 and a case–control ratio of 1:1 to generate the nested case–control study. In this setting, we evaluated the bias and standard error (over 1,000 simulations) of the unadjusted two-phase estimator, where case–control status was defined on the basis of whether a subject developed the disease within 10 years, ignoring incomplete follow-up. We also evaluated an adjusted two-phase estimator that uses information on all the cases diagnosed within 10 years and the subset of controls with at least 10 years of follow-up, and accounts for this selection using the inverse of the probability of follow-up beyond 10 years.

## Results

### Lung cancer example

The median follow-up time of the smokers in the PLCO screening arm was nearly 12 years with an interquartile range of 2.6 years. They were categorized on the basis of age at enrollment: ≤59 years, 60–64 years, 65–69 years, ≥70 years. The risk-score explained less than 1% variability of the observed follow-up in each subcohort. Table 1 shows the AUC estimates and the SEs in these subcohorts based on three combinations of risk factors: (i) questionnaire-based risk factors only, (ii) questionnaire-based risk factors and CRP, and (iii) questionnaire-based risk factors and all four inflammation biomarkers. For computing the proposed AUC estimator, we stratified the partial risk-score into six categories based on sextiles. For combination (i), we evaluated the standard AUC estimator accounting for incomplete follow-up based on all smokers who developed lung cancer within 10 years or were followed up for at least 10 years in the PLCO screening arm (phase I sample). The AUC estimates for each type of model were lower in the oldest age group compared with the other groups. For combinations (ii) and (iii), comparison of the proposed and IPW estimators showed that although point estimates were similar within the limits of uncertainty, the precision of the proposed estimator was much higher across age categories. Moreover, the inclusion of inflammation biomarkers led to some improvement of AUC in the older age groups.

| Age (in years) | Phase I^{a}: number of subjects (number of cases) | Phase II^{b}: number of subjects (number of cases) | AUC estimate (SE): questionnaire-based risk factors only^{c} | AUC estimate (SE): questionnaire-based risk factors + CRP, TPS | AUC estimate (SE): questionnaire-based risk factors + CRP, IPW | AUC estimate (SE): questionnaire-based risk factors + four inflammation biomarkers, TPS | AUC estimate (SE): questionnaire-based risk factors + four inflammation biomarkers, IPW |
|---|---|---|---|---|---|---|---|
| ≤59 | 10,616 (247) | 189 (101) | 0.792 (0.013) | 0.81 (0.017) | 0.791 (0.043) | 0.791 (0.02) | 0.778 (0.046) |
| 60–64 | 9,793 (363) | 255 (135) | 0.762 (0.012) | 0.76 (0.016) | 0.758 (0.032) | 0.759 (0.017) | 0.752 (0.033) |
| 65–69 | 6,693 (402) | 251 (137) | 0.789 (0.011) | 0.808 (0.014) | 0.796 (0.038) | 0.811 (0.017) | 0.799 (0.038) |
| ≥70 | 3,195 (241) | 148 (83) | 0.713 (0.016) | 0.725 (0.019) | 0.723 (0.05) | 0.738 (0.021) | 0.736 (0.048) |
| Total | 30,297 (1,253) | 843 (456) | | | | | |

Abbreviations: CRP, C-reactive protein; IPW, inverse probability weighted estimator; TPS, two-phase estimator.

^{a}Phase I subjects are the ever-smokers ages 55–74 years in the screening arm of the PLCO trial who were diagnosed with lung cancer within 10 years or who were followed up for at least 10 years.

^{b}Phase II subjects are those phase I subjects included in the replication study (nested case–control study within PLCO screening arm).

^{c}For questionnaire-based risk factor only model, the standard AUC estimator based on phase I sample is reported after accounting for incomplete follow-up using the inverse of the probability that a subject is followed for at least 10 years.
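The efficiency of the two-phase estimator relative to IPW can be read off Table 1 as the squared ratio of the standard errors. The following Python sketch uses the SEs transcribed from the four-biomarker columns of the table; treating relative efficiency as the ratio of variances is the assumption made here, and the variable names are illustrative.

```python
# (SE of TPS, SE of IPW) for the model with four inflammation biomarkers, from Table 1.
se_by_age = {
    "<=59": (0.020, 0.046),
    "60-64": (0.017, 0.033),
    "65-69": (0.017, 0.038),
    ">=70": (0.021, 0.048),
}


def relative_efficiency(se_tps, se_ipw):
    """Variance of IPW divided by variance of TPS."""
    return (se_ipw / se_tps) ** 2


rel_eff = {age: round(relative_efficiency(*pair), 1) for age, pair in se_by_age.items()}
```

The resulting values span 3.8 to 5.3 across the age groups.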

### Simulation results

Table 2 shows the simulation results evaluating the performance of the proposed and the IPW estimators under alternative sampling schemes with sampling fraction for cases *η* = 1, 0.5, 0.25 and |$f = 0.75$|. In all scenarios, both estimators had very small bias, and the confidence intervals, constructed using influence function-based variance estimates, achieved the nominal 95% level. The percent bias in the SE estimate was also very small. Moreover, the proposed estimator was much more efficient than the IPW estimator. When the second-phase sample included all the cases and a random sample of the controls, the proposed approach led to nearly 50% efficiency gain. Compared with simple case–control sampling, stratified case–control sampling of the second-phase subjects led to modest efficiency loss. Both estimators performed similarly in settings where binary risk factors had probabilities less than 0.5 (Supplementary Tables S1 and S2) and when a continuous risk factor *X*_{4} and a binary risk factor *X*_{5} were available only in the phase II sample (Supplementary Table S5).

| Sampling scheme | Bias, TPS | Bias, IPW | Empirical SE (×10^{3})^{b}, TPS | Empirical SE (×10^{3})^{b}, IPW | Average of SE estimate (×10^{3})^{c}, TPS | Average of SE estimate (×10^{3})^{c}, IPW | Coverage, TPS | Coverage, IPW |
|---|---|---|---|---|---|---|---|---|
| Simple case–control sampling (case–control ratio 1:1) | | | | | | | | |
| Sampling fraction of cases = 1 | 9.56 × 10^{−5} | 0.0002 | 5.1 | 6.3 | 5.3 | 6.5 | 0.96 | 0.94 |
| Sampling fraction of cases = 0.5 | 0.0002 | −4.5 × 10^{−5} | 6.2 | 9.1 | 6.3 | 9.2 | 0.95 | 0.96 |
| Sampling fraction of cases = 0.25 | 0.0002 | 0.0004 | 7.5 | 12.9 | 7.9 | 13 | 0.96 | 0.95 |
| Stratified case–control sampling (case–control ratio 1:1) | | | | | | | | |
| Sampling fraction of cases = 1 | 7.08 × 10^{−5} | −6.39 × 10^{−5} | 5.6 | 6.8 | 5.3 | 6.6 | 0.94 | 0.94 |
| Sampling fraction of cases = 0.5 | 5.74 × 10^{−5} | 0.0001 | 6.2 | 9 | 6.3 | 9.3 | 0.96 | 0.96 |
| Sampling fraction of cases = 0.25 | −0.0002 | 0.0001 | 7.6 | 13.1 | 7.9 | 13.1 | 0.96 | 0.95 |

Abbreviations: IPW, inverse probability weighted estimator; TPS, two-phase estimator.

^{a}Approximately 92% of study subjects are censored. The true value of AUC based on all the risk factors in the underlying model is 0.577 and the empirical SE based on 1,000 simulated cohort datasets is 0.0048. Results are shown under alternative sampling schemes with varying sampling probabilities and a fixed value of *f* = 0.75. The true AUC based on the partial risk score is 0.567.

^{b}Empirical SE (×10^{3}): 10^{3} times the empirical SE of estimated AUC over 1,000 simulations.

^{c}Average of SE estimate (×10^{3}): 10^{3} times the mean of the estimated standard errors across 1,000 simulations. SEs are estimated using the influence function-based variance estimator.

Figure 2 shows the relative efficiency of the proposed estimator compared with the IPW estimator as a function of |$\eta $| under simple case–control sampling of the second-phase subjects. For fixed *f*, the relative efficiency increased as *η* decreased because the IPW estimator failed to incorporate information from the increasing number of unselected subjects. Moreover, for fixed *η*, the relative efficiency increased with *f*, as the partial risk-score explained a larger proportion of the variability of the full risk-score and the proposed estimator gained efficiency by using the partial risk-score from phase I subjects (Tables 2 and 3).

| Sampling scheme | Bias, TPS | Bias, IPW | Empirical SE (×10^{3})^{b}, TPS | Empirical SE (×10^{3})^{b}, IPW | Average of SE estimate (×10^{3})^{c}, TPS | Average of SE estimate (×10^{3})^{c}, IPW | Coverage, TPS | Coverage, IPW |
|---|---|---|---|---|---|---|---|---|
| Simple case–control sampling (case–control ratio 1:1) | | | | | | | | |
| Sampling fraction of cases = 1 | 5.85 × 10^{−5} | 0.0002 | 6 | 6.7 | 5.8 | 6.5 | 0.94 | 0.94 |
| Sampling fraction of cases = 0.5 | 0.0004 | 0.0005 | 7.4 | 9.2 | 7.6 | 9.2 | 0.96 | 0.95 |
| Sampling fraction of cases = 0.25 | 0.0002 | 0.0005 | 10.8 | 13.2 | 10.3 | 13.1 | 0.95 | 0.95 |
| Stratified case–control sampling (case–control ratio 1:1) | | | | | | | | |
| Sampling fraction of cases = 1 | 0.0001 | 0.0002 | 5.7 | 6.3 | 5.8 | 6.6 | 0.95 | 0.95 |
| Sampling fraction of cases = 0.5 | 0.0001 | 2.36 × 10^{−5} | 7.6 | 9.2 | 7.6 | 9.3 | 0.95 | 0.96 |
| Sampling fraction of cases = 0.25 | 0.0003 | 5.95 × 10^{−5} | 10.3 | 13 | 10.3 | 13.1 | 0.96 | 0.96 |

Abbreviations: IPW, inverse probability weighted estimator; TPS, two-phase estimator.

^{a}Approximately 92% of study subjects are censored. The true value of AUC of the underlying model is 0.596 and the empirical SE based on 1,000 simulated cohort datasets is 0.0047. Results are shown under alternative sampling schemes with varying sampling probabilities and a fixed value of *f* = 0.5; the true AUC based on the partial risk score is 0.568.

^{b}Empirical SE (×10^{3}): 10^{3} times the empirical SE of estimated AUC over 1,000 simulations.

^{c}Average of SE estimate (×10^{3}): 10^{3} times the mean of the estimated SEs across 1,000 simulations. SEs are estimated using the influence function-based variance estimator.
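The trends in relative efficiency can be recomputed from the tabulated values. The Python sketch below uses the average SE estimates under simple case–control sampling, transcribed from Tables 2 and 3; using the average SE estimates rather than the empirical SEs, and defining relative efficiency as the squared SE ratio, are choices made for this illustration.

```python
# avg_se[f][eta] = (average SE estimate of TPS, of IPW), both x 10^3,
# simple case-control sampling, from Table 2 (f = 0.75) and Table 3 (f = 0.5).
avg_se = {
    0.75: {1.0: (5.3, 6.5), 0.5: (6.3, 9.2), 0.25: (7.9, 13.0)},
    0.50: {1.0: (5.8, 6.5), 0.5: (7.6, 9.2), 0.25: (10.3, 13.1)},
}

# Relative efficiency of TPS versus IPW: variance of IPW over variance of TPS.
rel_eff = {
    f: {eta: (ipw / tps) ** 2 for eta, (tps, ipw) in rows.items()}
    for f, rows in avg_se.items()
}

for f, rows in rel_eff.items():
    for eta, value in rows.items():
        print(f"f = {f}, eta = {eta}: relative efficiency = {value:.2f}")
```

With these inputs the relative efficiency rises as the sampling fraction of cases falls and is larger at *f* = 0.75 than at *f* = 0.5 for every sampling fraction; at *f* = 0.75 with all cases sampled it is about 1.5.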

Table 4 shows the impact of incomplete follow-up on the estimation of AUC for a model predicting 10-year risk using the proposed method. Incomplete follow-up created noticeable bias in the unadjusted two-phase estimator only when the underlying disease risk-score was fairly strongly associated with the follow-up time (i.e., dependent censoring). When the risk-score was strongly predictive of the disease and the AUC was high (e.g., 0.714), the percentage biases were small. But when the risk-score was only weakly predictive of the disease and the AUC was moderate (e.g., 0.575), the percentage biases were more notable. In such scenarios, the adjusted two-phase estimator that accounts for incomplete follow-up had a lower bias but a slightly larger variance than the unadjusted two-phase estimator that ignored incomplete follow-up. The SEs for both estimators were slightly higher in settings with moderate discrimination than in settings with high discrimination.

| Percentage of variability of observed follow-up explained by risk-score^{b} | Moderate association of risk factors on disease (target^{a} 10-year AUC = 0.575): unadjusted for censoring | Moderate association of risk factors on disease (target^{a} 10-year AUC = 0.575): adjusted for censoring | Strong association of risk factors on disease (target^{a} 10-year AUC = 0.714): unadjusted for censoring | Strong association of risk factors on disease (target^{a} 10-year AUC = 0.714): adjusted for censoring |
|---|---|---|---|---|
| 0 | 0.575 (0.0111) | 0.575 (0.0107) | 0.712 (0.0101) | 0.712 (0.0101) |
| 1 | 0.578 (0.0111) | 0.575 (0.011) | 0.715 (0.01) | 0.712 (0.0101) |
| 5 | 0.582 (0.011) | 0.574 (0.011) | 0.718 (0.0099) | 0.712 (0.0099) |
| 20 | 0.589 (0.011) | 0.573 (0.0115) | 0.723 (0.0099) | 0.71 (0.0105) |

^{a}The target AUCs over a 10-year time interval were based on simulated cohort where all subjects were followed for at least 10 years.

^{b}We generated follow-up time such that subjects can be censored prior to specified time interval. We allowed the follow-up time to potentially depend on the disease risk-score. We show the average and SE (over 1,000 simulated cohort datasets) of the unadjusted two-phase estimator that ignores the incomplete nature of follow-up. In the same setting, we also report the average and SE (over 1,000 simulated cohort datasets) of the adjusted two-phase estimator that uses information from all the cases and a subset of the controls with at least 10 years of follow-up and accounts for selection using the inverse of probability of a control subject being followed up for at least 10 years.
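
As a rough illustration of the adjustment described in footnote b, the sketch below computes a Mann-Whitney-type AUC in which each control with at least 10 years of follow-up is weighted by the inverse of its probability of being followed that long. The function name `weighted_auc`, the toy scores, and the follow-up probabilities are hypothetical and are not taken from the study. In practice these probabilities would have to be estimated, and the two-phase sampling is ignored here for brevity.

```python
from typing import Optional, Sequence


def weighted_auc(
    case_scores: Sequence[float],
    control_scores: Sequence[float],
    control_weights: Optional[Sequence[float]] = None,
) -> float:
    """Weighted proportion of (case, control) pairs in which the case has
    the higher risk-score; tied pairs contribute one half."""
    if control_weights is None:
        control_weights = [1.0] * len(control_scores)
    num = 0.0
    den = 0.0
    for rc in case_scores:
        for r0, w in zip(control_scores, control_weights):
            den += w
            if rc > r0:
                num += w
            elif rc == r0:
                num += 0.5 * w
    return num / den


# Toy data (hypothetical): risk-scores of all cases, and of the controls
# who were followed for at least 10 years.
cases = [2.1, 1.4, 0.9]
controls = [0.5, 1.0, 1.6, 0.2]

# Assumed probability that each control is followed for at least 10 years.
# Here the higher-risk control is less likely to complete follow-up,
# which mimics dependent censoring.
p_followed = [0.9, 0.8, 0.5, 0.9]
weights = [1.0 / p for p in p_followed]

auc_unadjusted = weighted_auc(cases, controls)
auc_adjusted = weighted_auc(cases, controls, weights)
```

In this toy example the unadjusted value is 0.75, and the adjusted value is lower because the under-represented high-risk control is weighted up. This is the same direction of correction seen in Table 4, where the unadjusted estimates drift above the target as censoring becomes more dependent on the risk-score.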

## Discussion

Modern cohort studies of cancer often collect information on expensive biomarkers in a subsample of subjects, leading to a two-phase data structure. Two-phase data structures can also arise in complex observational data settings, for example, with electronic health records in integrated healthcare systems (34). In such settings, we propose a simple estimator of AUC that gains efficiency by combining complete risk factor information from the second-phase sample with partial risk factor information from subjects not included in the second phase, and that accounts for incomplete follow-up of subjects. This study demonstrates the enhanced precision of the proposed estimator compared with an IPW approach that uses information only from the subsample. This implies that an AUC estimate from a single validation study using the proposed method is more likely to be close to the truth.

The AUC measures how well a model discriminates between cases and controls. It may not alone be adequate to assess suitability of models for clinical applications in risk-stratified cancer prevention. A model also needs to be assessed for calibration, that is, whether it can produce unbiased estimates of risk (35–37). Moreover, there are other measures of discriminatory performance that can more directly measure clinical utility of models (38–40). Although we focus only on the popular metric AUC, the framework can be generalized for evaluating other metrics assessing model calibration and discrimination in two-phase studies.

Our work is focused on evaluating a given risk prediction model with the regression parameters supplied externally, a problem intrinsically different from efficient estimation of regression parameters. Intuitively, the former involves efficient evaluation of univariate scalar risk-scores derived from potentially numerous risk factors. However, such dimension reduction may not be very efficient in the latter problem. Because of the intrinsic difference in the target quantity of interest, most of the existing methods (e.g., ref. 41) are not directly applicable in our setting. Some existing methods consider study settings with “verification” bias where the disease diagnosis itself may be available on a subset of subjects in the main study (42–51). Although it is possible that some of the underlying concepts are transportable, these methods are not directly applicable in our setting with missing data on risk factors.

We considered simple binary classification of disease status over a fixed prediction interval, for example, whether a patient is diagnosed with lung cancer in 10 years. Future research should explore possible extensions of this method to estimate time-dependent measures of risk model discrimination. For example, the method proposed by Zheng and colleagues (17) to estimate time-dependent AUC essentially employs IPW analysis of the phase II data, with the sampling probability estimated nonparametrically using phase I data to achieve the efficiency gain. However, in settings with numerous phase I covariates, as is common in studies of cancer epidemiology, nonparametric estimation of sampling probabilities conditional on all the phase I covariates will be infeasible. Our proposed method, which uses the partial risk-score for dimension reduction, can in principle also be used in such time-to-event settings, but this requires more detailed investigation.

The estimation of conditional probabilities accounts for the nonrandom sampling of second-phase subjects, assuming that the sampling probabilities are known by design. Simulation studies show that estimating the sampling probabilities, as opposed to using known values, does not alter the precision of AUC estimation in the proposed method (Supplementary Table S3). In studies of cancer epidemiology where design information is unknown, it may be necessary to estimate these probabilities under parametric assumptions. Doubly robust estimation (e.g., ref. 52) could be more relevant in such settings, as it guarantees unbiasedness if either the selection mechanism or the distribution of missing risk factors is correctly specified.
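
To make the role of the sampling probabilities concrete, the following sketch estimates them as empirical sampling fractions within strata and then uses their inverses as weights in a pairwise IPW estimate of AUC from substudy subjects only. This is a generic illustration under stated assumptions, not the exact estimator used in this study: the strata (case/control status by a hypothetical age group), the function names, and the toy data are all invented for the example.

```python
from collections import Counter
from typing import Dict, Hashable, Iterable, List, Tuple


def estimate_sampling_probs(
    phase1_strata: Iterable[Hashable], phase2_strata: Iterable[Hashable]
) -> Dict[Hashable, float]:
    """Empirical sampling fraction per stratum:
    (number sampled into phase II) / (number in the full cohort)."""
    n1 = Counter(phase1_strata)
    n2 = Counter(phase2_strata)
    return {s: n2[s] / n1[s] for s in n1}


def ipw_auc(
    cases: List[Tuple[float, float]], controls: List[Tuple[float, float]]
) -> float:
    """Pairwise IPW estimate of AUC. Each subject is a (score, weight) pair,
    and a (case, control) pair is weighted by the product of the two weights.
    Tied scores contribute one half."""
    num = 0.0
    den = 0.0
    for rc, wc in cases:
        for r0, w0 in controls:
            w = wc * w0
            den += w
            if rc > r0:
                num += w
            elif rc == r0:
                num += 0.5 * w
    return num / den


# Hypothetical cohort: stratum = (disease status, age group).
phase1 = (
    [("case", "young")] * 2 + [("case", "old")] * 2
    + [("control", "young")] * 4 + [("control", "old")] * 4
)
# Hypothetical substudy members: (stratum, full risk-score).
phase2 = [
    (("case", "young"), 1.2), (("case", "young"), 0.7), (("case", "old"), 2.0),
    (("control", "young"), 0.5), (("control", "old"), 1.0), (("control", "old"), 1.5),
]

probs = estimate_sampling_probs(phase1, [s for s, _ in phase2])
cases = [(r, 1.0 / probs[s]) for s, r in phase2 if s[0] == "case"]
controls = [(r, 1.0 / probs[s]) for s, r in phase2 if s[0] == "control"]
auc_ipw = ipw_auc(cases, controls)
```

With sampling fractions known by design, `probs` would simply be replaced by the design values. The estimated and known versions coincide whenever the realized sampling fractions match the design.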

Violation of assumptions underlying our method can lead to bias. In the lung cancer example, there was evidence of violation of the assumption that the disease status and the phase II risk factors are conditionally independent of any sampling selection variables given the partial risk-score. Simulation studies, however, showed only moderate bias even under explicit violation of this assumption, induced by incorporating “matching factors” that were strongly related to disease and certain risk factors but were not themselves part of the risk-score (Supplementary Table S4). One can consider stratified estimation of model evaluation statistics within categories of matching factors.

The empirical estimates of conditional probabilities based on binning into categories involve subjectivity in the choice of the number of categories (e.g., deciles) and are not fully nonparametric. Since our approach, based on scalar partial risk-scores, already involves dimension reduction, one can construct such estimates based on kernel smoothing. Under suitable regularity conditions (53–55), the two estimators will be asymptotically equivalent when the window/bin lengths are allowed to decrease at suitable rates, but in small samples the estimator based on kernel smoothing could be more efficient.
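
The binning idea can be sketched as follows. This is a simplified illustration in the spirit of the approach, not a reproduction of the estimator defined in this article: it assumes that, within a category of the partial risk-score, second-phase cases (controls) are representative of all cases (controls) in that category, it ignores censoring, and all names and data are hypothetical. The AUC is assembled as a sum over pairs of categories (k, l) of the proportion of all cases in category k, times the proportion of all controls in category l, times the substudy-based probability that a case in k has a higher full risk-score than a control in l.

```python
from collections import defaultdict
from typing import Hashable, Iterable, List, Tuple


def pair_prob(case_scores: List[float], control_scores: List[float]) -> float:
    """Fraction of (case, control) pairs in which the case has the higher
    full risk-score; tied pairs contribute one half."""
    total = 0.0
    for rc in case_scores:
        for r0 in control_scores:
            total += 1.0 if rc > r0 else 0.5 if rc == r0 else 0.0
    return total / (len(case_scores) * len(control_scores))


def binned_two_phase_auc(
    phase1: Iterable[Tuple[bool, Hashable]],
    phase2: Iterable[Tuple[bool, Hashable, float]],
) -> float:
    """phase1: (is_case, category of partial risk-score) for every cohort member.
    phase2: (is_case, category, full risk-score) for substudy members."""
    counts = {True: defaultdict(int), False: defaultdict(int)}
    for is_case, cat in phase1:
        counts[bool(is_case)][cat] += 1
    scores = {True: defaultdict(list), False: defaultdict(list)}
    for is_case, cat, r in phase2:
        scores[bool(is_case)][cat].append(r)

    n_cases = sum(counts[True].values())
    n_controls = sum(counts[False].values())
    auc = 0.0
    for k, nk in counts[True].items():
        for l, nl in counts[False].items():
            if not scores[True][k] or not scores[False][l]:
                # A sparse cell would need coarser categories or smoothing.
                raise ValueError(f"no substudy data for categories {k!r}, {l!r}")
            theta = pair_prob(scores[True][k], scores[False][l])
            auc += (nk / n_cases) * (nl / n_controls) * theta
    return auc


# Toy cohort with two categories (0 = low, 1 = high partial risk-score).
phase1 = [(True, 1), (True, 1), (True, 0),
          (False, 0), (False, 0), (False, 0), (False, 1)]
phase2 = [(True, 1, 2.0), (True, 0, 1.0),
          (False, 0, 0.5), (False, 0, 1.5), (False, 1, 1.8)]
auc_tps = binned_two_phase_auc(phase1, phase2)
```

The `ValueError` branch shows where the subjectivity discussed above enters: with many categories, some cells contain no substudy subjects, so the number of categories has to be traded off against cell sparsity. A kernel-smoothed version would replace the hard category membership with weights that decay with distance in the partial risk-score.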

Our simulation results indicate that when the second-phase subjects are selected by frequency-matching of cases and controls within deciles of partial risk-score, as opposed to simple case–control sampling, there is no efficiency gain of the two-phase estimator compared with the IPW estimator. Frequency-matching of cases and controls within categories of a matching factor may lead to increased precision in estimating associations of certain risk factors (e.g., exposure of interest). However, loss of information on the association between the matching factor and disease at the design stage may result in inefficiency in the estimation of overall measures of model validation (e.g., AUC) that require robust estimation of parameters associated with all the risk factors, including potential matching factors.

In summary, we have proposed a novel, simple-to-implement, and practically useful method for enhancing the precision of discrimination measures of risk prediction models. This is particularly useful in studies of risk model validation, a critical step toward risk-stratified cancer prevention.

## Disclosure of Potential Conflicts of Interest

No potential conflicts of interest were disclosed.

## Authors' Contributions

**Conception and design:** P. Pal Choudhury, N. Chatterjee

**Development of methodology:** P. Pal Choudhury, N. Chatterjee

**Acquisition of data (provided animals, acquired and managed patients, provided facilities, etc.):** A.K. Chaturvedi

**Analysis and interpretation of data (e.g., statistical analysis, biostatistics, computational analysis):** P. Pal Choudhury, A.K. Chaturvedi

**Writing, review, and/or revision of the manuscript:** P. Pal Choudhury, A.K. Chaturvedi, N. Chatterjee

**Administrative, technical, or material support (i.e., reporting or organizing data, constructing databases):** P. Pal Choudhury

**Study supervision:** P. Pal Choudhury, N. Chatterjee

## Acknowledgments

The first author would like to thank Dr. Mustapha Abubakar (DCEG, NCI) for his helpful comments in improving the presentation of the manuscript. The work of P. Pal Choudhury and N. Chatterjee was supported by the Patient-Centered Outcomes Research Institute (PCORI) Award (ME-1602-34530). The work of A.K. Chaturvedi was supported by the Intramural Research Program, Division of Cancer Epidemiology and Genetics, NCI, NIH, Department of Health and Human Services. The statements and opinions in this article are solely the responsibility of the authors and do not necessarily represent the views of the PCORI, its Board of Governors, or Methodology Committee or the NCI, NIH, or the Department of Health and Human Services.

The costs of publication of this article were defrayed in part by the payment of page charges. This article must therefore be hereby marked *advertisement* in accordance with 18 U.S.C. Section 1734 solely to indicate this fact.