Abstract
The effectiveness and efficiency of cancer screening in real-world settings depend on many factors, including test sensitivity and specificity. Outside of select experimental studies, not everyone receives a gold standard test that can serve as a comparator in estimating screening test accuracy. Thus, many studies of screening test accuracy use the passage of time to infer whether or not cancer was present at the time of the screening test, particularly for patients with a negative screening test. We define the accuracy assessment interval as the period of time after a screening test that is used to estimate the test's accuracy. We describe how the length of this interval may bias sensitivity and specificity estimates. We call for future research to quantify bias and uncertainty in accuracy estimates and to provide guidance on setting accuracy assessment interval lengths for different cancers and screening modalities.
Epidemiologists have an important role in evaluating healthcare interventions, such as screening. The effectiveness and efficiency of cancer screening depend on many factors, including test sensitivity and specificity. Estimating screening test sensitivity and specificity using observational data deserves greater attention in the epidemiologic literature, particularly with respect to a concept that we herein define as the accuracy assessment interval.
To estimate sensitivity and specificity, a perfect gold standard test would, ideally, be administered to everyone at the same time as the screening test being evaluated. However, this usually does not occur in practice. With an invasive test (e.g., biopsy) as the gold standard, only persons who screen positive are likely to have screening results verified, which can lead to verification bias (1, 2). Studies of screening test accuracy often use the passage of time to infer whether or not cancer was present at the time of the screening test, particularly for patients with a negative screening test (2–4). This is an example of differential verification bias: some people receive a gold standard test (e.g., biopsy or imaging), while others receive an imperfect referent standard (i.e., presence or absence of a cancer diagnosis during a particular time interval; ref. 5). We define the accuracy assessment interval as the period of time after a screening test that is used to estimate its accuracy. In this commentary, we explore how the length of the accuracy assessment interval may contribute to bias in estimates of screening test sensitivity and specificity.
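The idea of using an interval to stand in for a gold standard can be sketched in code. The following is a minimal illustration, not the method of any cited study: the `ScreenRecord` structure, the function names, and the four records are all hypothetical and invented for this example. It treats "cancer diagnosed within the interval" as the referent standard and shows that the same records can yield different estimates under different interval lengths.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class ScreenRecord:
    """One screening episode (hypothetical structure, for illustration only)."""
    screen_date: date
    screen_positive: bool
    diagnosis_date: Optional[date]  # None if no cancer diagnosis is on record


def diagnosed_in_interval(rec: ScreenRecord, interval_days: int) -> bool:
    """Referent standard: was cancer diagnosed within the interval after screening?"""
    if rec.diagnosis_date is None:
        return False
    elapsed = (rec.diagnosis_date - rec.screen_date).days
    return 0 <= elapsed <= interval_days


def observed_accuracy(records, interval_days):
    """Return (sensitivity, specificity) as observed under the chosen interval."""
    tp = fp = fn = tn = 0
    for rec in records:
        cancer = diagnosed_in_interval(rec, interval_days)
        if rec.screen_positive and cancer:
            tp += 1
        elif rec.screen_positive:
            fp += 1
        elif cancer:
            fn += 1
        else:
            tn += 1
    sensitivity = tp / (tp + fn) if (tp + fn) else float("nan")
    specificity = tn / (tn + fp) if (tn + fp) else float("nan")
    return sensitivity, specificity


# Invented records, all screened on the same day.
records = [
    ScreenRecord(date(2020, 1, 1), True, date(2020, 2, 1)),   # screen+, diagnosed after 1 month
    ScreenRecord(date(2020, 1, 1), True, None),               # screen+, never diagnosed
    ScreenRecord(date(2020, 1, 1), False, date(2021, 6, 1)),  # screen-, diagnosed after 17 months
    ScreenRecord(date(2020, 1, 1), False, None),              # screen-, never diagnosed
]
sens_1y, spec_1y = observed_accuracy(records, 365)
sens_2y, spec_2y = observed_accuracy(records, 730)
```

With these toy records, the cancer diagnosed at 17 months is counted as a true negative under the 365-day interval and as a false negative under the 730-day interval, so observed sensitivity falls from 1.0 to 0.5. The code cannot say which classification is right, because the truth at the time of screening is not in the data.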
The chosen length of the accuracy assessment interval can affect accuracy estimates, especially sensitivity. For example, Hofvind and colleagues estimated the sensitivity of mammography using 2-year and 1-year follow-up intervals (74.9% vs. 82.0%, respectively; ref. 6). There were no differences in specificity. In a study of hemoccult testing for colorectal cancer, Allison and colleagues reported sensitivities of 50%, 43%, and 25% using 1-year, 2-year, and 4-year follow-up periods, respectively (7). Specificity did not differ meaningfully across follow-up periods (98.8% for the 1-year and 2-year follow-up periods and 98.7% for the 4-year follow-up period). Others have noted that the “optimal duration of follow-up has not been standardized” (4) and that short follow-up intervals might miss cancers that were truly present at screening, whereas long intervals might include cancers that developed after screening (4, 8). To our knowledge, however, this issue has not been addressed in detail. We posit that investigators should select the accuracy assessment interval, or intervals, that will most accurately estimate the screening test's true sensitivity and specificity. Ideally, the accuracy assessment interval is long enough that any cancer present at the time of screening will be diagnosed during the interval, yet short enough that new cancers are unlikely to develop, be detected during the interval, and be falsely classified as having been present at the time of the screening exam. For example, if the accuracy assessment interval for the fecal immunochemical test (FIT) is set to 2 years, we would want the following conditions to be met:
(i) If a person does not truly have colorectal cancer at the time of a screening FIT, the person will not develop cancer and be diagnosed within 2 years.
(ii) If a person truly has colorectal cancer at the time of a screening FIT, the cancer will be diagnosed within 2 years.
It is unlikely that these conditions will be met for everyone included in an analysis, so there will be error in the estimates of sensitivity and specificity; the question then becomes how to minimize these errors. Table 1 shows screening test classification based on observed data (i.e., screening test result and accuracy assessment interval classification) and the truth. Classification according to these three factors allows us to conceptualize screening test results as: correct true positives (cTP), incorrect true positives (iTP), correct false positives (cFP), incorrect false positives (iFP), correct true negatives (cTN), incorrect true negatives (iTN), correct false negatives (cFN), and incorrect false negatives (iFN). The terms correct versus incorrect describe the agreement between the accuracy assessment interval classification and the truth. Positive versus negative refer to the screening test result. True versus false describe agreement between the screening test result and the accuracy assessment interval classification. For example, a person with a negative FIT who is not diagnosed with cancer during the accuracy assessment interval (e.g., 2 years) is a cTN if cancer was truly absent at the time of the negative FIT and an iTN if cancer was truly present at the time of the FIT. A person with a positive FIT who is not diagnosed with cancer during the accuracy assessment interval is a cFP if cancer was truly absent at the time of the positive FIT and an iFP if cancer was truly present at the time of the FIT.
Screening test result | Cancer diagnosed during accuracy assessment interval | No cancer diagnosed during accuracy assessment interval |
---|---|---|
Positive | Cancer is truly present: cTP. Cancer is truly absent: iTP (should be false positives). Misclassifying false positives as true positives: cFP→iTP | Cancer is truly absent: cFP. Cancer is truly present: iFP (should be true positives). Misclassifying true positives as false positives: cTP→iFP |
Negative | Cancer is truly present: cFN. Cancer is truly absent: iFN (should be true negatives). Misclassifying true negatives as false negatives: cTN→iFN | Cancer is truly absent: cTN. Cancer is truly present: iTN (should be false negatives). Misclassifying false negatives as true negatives: cFN→iTN |
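The eight-category scheme in Table 1 can be expressed as a small function. This is a minimal sketch with hypothetical names (`classify` and its parameters are not from the source). Note that `truly_present` is unobservable in real data, so a function like this is mainly useful in simulations where the truth is known by construction.

```python
def classify(screen_positive: bool, diagnosed_in_interval: bool, truly_present: bool) -> str:
    """Return one of the eight Table 1 labels: cTP, iTP, cFP, iFP, cTN, iTN, cFN, iFN."""
    # Observed cell: screening result combined with the interval-based classification.
    if screen_positive:
        observed = "TP" if diagnosed_in_interval else "FP"
    else:
        observed = "FN" if diagnosed_in_interval else "TN"
    # "Correct" means the interval-based classification agrees with the truth.
    prefix = "c" if diagnosed_in_interval == truly_present else "i"
    return prefix + observed


# The two FIT examples from the text: a negative FIT with no diagnosis in the interval.
label_absent = classify(screen_positive=False, diagnosed_in_interval=False, truly_present=False)
label_present = classify(screen_positive=False, diagnosed_in_interval=False, truly_present=True)
```

Here `label_absent` is `"cTN"` and `label_present` is `"iTN"`, matching the worked example in the preceding paragraph.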
Observed sensitivity and specificity depend, in part, on the relative frequency of different types of errors (i.e., misclassifying false positives as true positives [cFP→iTP], true positives as false positives [cTP→iFP], false negatives as true negatives [cFN→iTN], and true negatives as false negatives [cTN→iFN]) as given by the equations below.
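Treating the Table 1 categories as counts, the observed and true quantities can be written as follows. This is a reconstruction from the definitions in Table 1 rather than the authors' own notation; the hat symbols for observed estimates are an assumption introduced here.

```latex
\begin{align}
\widehat{\mathrm{Se}} &= \frac{\mathrm{cTP} + \mathrm{iTP}}
                              {\mathrm{cTP} + \mathrm{iTP} + \mathrm{cFN} + \mathrm{iFN}}
  & \text{(observed sensitivity)} \\
\widehat{\mathrm{Sp}} &= \frac{\mathrm{cTN} + \mathrm{iTN}}
                              {\mathrm{cTN} + \mathrm{iTN} + \mathrm{cFP} + \mathrm{iFP}}
  & \text{(observed specificity)} \\
\mathrm{Se} &= \frac{\mathrm{cTP} + \mathrm{iFP}}
                    {\mathrm{cTP} + \mathrm{iFP} + \mathrm{cFN} + \mathrm{iTN}}
  & \text{(true sensitivity)} \\
\mathrm{Sp} &= \frac{\mathrm{cTN} + \mathrm{iFN}}
                    {\mathrm{cTN} + \mathrm{iFN} + \mathrm{cFP} + \mathrm{iTP}}
  & \text{(true specificity)}
\end{align}
```

The observed quantities group people by what the accuracy assessment interval shows (cancer diagnosed or not), whereas the true quantities group them by whether cancer was actually present at screening. The two coincide only when all four "incorrect" counts are zero.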
There are tradeoffs associated with lengthening and shortening the accuracy assessment interval. With a longer accuracy assessment interval, we are more likely to correctly classify a cancer that is present at the time of the screening test as “present.” Some negative screening tests shift from being classified as true negatives to false negatives (which decreases estimated sensitivity and specificity) and some positive screening tests shift from being classified as false positives to true positives (which increases estimated sensitivity and specificity). For example:
Increasing the length of the accuracy assessment interval risks misclassifying TNs as FNs (cTN→iFN) and misclassifying FPs as TPs (cFP→iTP). As a result, we might mistakenly conclude that new cancers that developed during the accuracy assessment interval had been present at the time of the screening test.
Increasing the length of the accuracy assessment interval helps correctly identify FNs (i.e., iTN→cFN) and correctly identify TPs (i.e., iFP→cTP). Having a longer accuracy assessment interval helps identify cancers that were truly present at the time of the screening test.
The reverse occurs as the length of the accuracy assessment interval is shortened. There is, thus, an inherent tradeoff between lengthening versus shortening the accuracy assessment interval. Table 2 and Supplementary Fig. S1 show the complexity of potential problems caused by accuracy assessment intervals that are too long or too short.
Accuracy assessment interval length | Problem | Possible causes | Impact on estimates of sensitivity and specificity |
---|---|---|---|
Too long | Cancers that develop and are detected after screening are incorrectly classified as having been present at screening | Rate of new cancer development is fast relative to accuracy assessment interval length | Misclassifying false positives as true positives (cFP→iTP) increases estimated sensitivity and specificity; misclassifying true negatives as false negatives (cTN→iFN) decreases estimated sensitivity and specificity |
 | Among people with a negative screening test, cancers detected at a subsequent screening test are incorrectly classified as having been present at the initial screening test | Recommended screening interval is shorter than accuracy assessment interval | Misclassifying true negatives as false negatives (cTN→iFN) decreases estimated sensitivity and specificity |
Too short | Cancers that were truly present at screening are not detected during the accuracy assessment interval | People with positive screening tests do not have sufficiently rapid follow-up diagnostic tests relative to accuracy assessment interval length | Misclassifying true positives as false positives (cTP→iFP) decreases estimated sensitivity and specificity |
 | | Slow progression from asymptomatic cancer to symptom-detected cancer | Misclassifying true positives as false positives (cTP→iFP) decreases estimated sensitivity and specificity; misclassifying false negatives as true negatives (cFN→iTN) increases estimated sensitivity and specificity |
 | | Recommended screening interval is longer than accuracy assessment interval, which decreases opportunities for detection | Misclassifying false negatives as true negatives (cFN→iTN) increases estimated sensitivity and specificity |
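The directions of bias listed in Table 2 can be checked numerically. The sketch below is illustrative only: the function names and the `SHIFTS` mapping are hypothetical, and the baseline counts are borrowed from the "unobserved truth" panel of the hypothetical example in Table 3. It moves one person between observed cells for each type of misclassification and records the change in estimated sensitivity and specificity.

```python
def accuracy(tp, fp, fn, tn):
    """Return (sensitivity, specificity) from observed 2x2 counts."""
    return tp / (tp + fn), tn / (tn + fp)


# Each misclassification moves people from one observed cell to another.
SHIFTS = {
    "cFP->iTP": ("fp", "tp"),  # false positives counted as true positives
    "cTN->iFN": ("tn", "fn"),  # true negatives counted as false negatives
    "cTP->iFP": ("tp", "fp"),  # true positives counted as false positives
    "cFN->iTN": ("fn", "tn"),  # false negatives counted as true negatives
}


def effect_of_shift(cells, shift, n=1.0):
    """Move n people between observed cells; return the change in (sensitivity, specificity)."""
    src, dst = SHIFTS[shift]
    shifted = dict(cells)
    shifted[src] -= n
    shifted[dst] += n
    sens_before, spec_before = accuracy(**cells)
    sens_after, spec_after = accuracy(**shifted)
    return sens_after - sens_before, spec_after - spec_before


# Baseline counts taken from the hypothetical population in Table 3.
baseline = {"tp": 8.0, "fp": 20.0, "fn": 2.0, "tn": 970.0}
effects = {name: effect_of_shift(baseline, name) for name in SHIFTS}
```

For this baseline, `effects["cFP->iTP"]` and `effects["cFN->iTN"]` are positive in both components, while `effects["cTN->iFN"]` and `effects["cTP->iFP"]` are negative in both, in agreement with the last column of Table 2. The magnitudes depend on the baseline counts and are not meant to generalize.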
Table 3 presents a hypothetical example showing how changing the accuracy assessment interval can affect estimates of sensitivity and specificity. In this hypothetical population with 1% cancer prevalence, true sensitivity and specificity are, respectively, 80.0% and 98.0%. We assume that during a 6-month accuracy assessment interval, 0.02% of cancer-free people develop and are diagnosed with cancer and that 70% of those who screened positive receive a cancer-confirming follow-up test. We assume that during a 12-month accuracy assessment interval, 0.05% of cancer-free people develop and are diagnosed with cancer and that 90% of those who screened positive receive a cancer-confirming follow-up test. In this particular hypothetical example, the estimates of specificity are quite similar (and close to the truth) for both accuracy assessment intervals. Sensitivity is underestimated using both accuracy assessment intervals, but to a greater degree with the shorter accuracy assessment interval. Different assumptions would yield different patterns; thus, the table is primarily intended to show that different accuracy assessment intervals can indeed give rise to different accuracy estimates.
Unobserved truth | | | | | |
---|---|---|---|---|---|
 | Cancer | No cancer | Total | | |
Screen + | 8 | 20 | 28 | | |
Screen – | 2 | 970 | 972 | | |
Total | 10 | 990 | 1,000 | | |
Sensitivity | 80.0% | | | | |
Specificity | 98.0% | | | | |
Observed with 6-month accuracy assessment interval^a | | | | | |
 | Cancer | No cancer | Total | TP misclassified as FP | 2.400 |
Screen + | 5.604 | 22.396 | 28 | TN misclassified as FN | 0.194 |
Screen – | 2.194 | 969.806 | 972 | FP misclassified as TP | 0.004 |
Total | 7.798 | 992.202 | 1,000 | FN misclassified as TN | 0 |
Sensitivity | 71.9% | | | | |
Specificity | 97.7% | | | | |
Observed with 12-month accuracy assessment interval^b | | | | | |
 | Cancer | No cancer | Total | TP misclassified as FP | 0.800 |
Screen + | 7.210 | 20.790 | 28 | TN misclassified as FN | 0.485 |
Screen – | 2.485 | 969.515 | 972 | FP misclassified as TP | 0.010 |
Total | 9.695 | 990.305 | 1,000 | FN misclassified as TN | 0 |
Sensitivity | 74.4% | | | | |
Specificity | 97.9% | | | | |
Abbreviations: FP, false positive; FN, false negative; TP, true positive; TN, true negative.
^a We assume that during a 6-month accuracy assessment interval, 0.02% of the no-cancer group develops and is diagnosed with cancer and that 70% of the screen-positive group receives a cancer-confirming follow-up test.
^b We assume that during a 12-month accuracy assessment interval, 0.05% of the no-cancer group develops and is diagnosed with cancer and that 90% of the screen-positive group receives a cancer-confirming follow-up test.
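The arithmetic behind Table 3 can be reproduced with a short calculation. The sketch below is one reading of the stated assumptions that matches the tabulated numbers: the follow-up proportion is applied to truly positive screens with cancer, the new-cancer proportion is applied to both cancer-free cells, and no false negatives are misclassified as true negatives. The function name and parameter names are hypothetical.

```python
def observed_counts(tp, fp, fn, tn, p_new_cancer, p_followup):
    """Expected observed 2x2 counts and accuracy under the Table 3 assumptions.

    tp, fp, fn, tn: true (unobserved) counts at the time of screening.
    p_new_cancer:   proportion of cancer-free people who develop and are
                    diagnosed with cancer during the interval.
    p_followup:     proportion of screen-positive people with cancer who get a
                    cancer-confirming follow-up test during the interval.
    """
    tp_as_fp = tp * (1 - p_followup)  # cTP -> iFP: cancer not confirmed in the interval
    tn_as_fn = tn * p_new_cancer      # cTN -> iFN: new cancer after a negative screen
    fp_as_tp = fp * p_new_cancer      # cFP -> iTP: new cancer after a positive screen
    fn_as_tn = 0.0                    # Table 3 lists 0 for FN misclassified as TN

    obs_tp = tp - tp_as_fp + fp_as_tp
    obs_fp = fp + tp_as_fp - fp_as_tp
    obs_fn = fn + tn_as_fn - fn_as_tn
    obs_tn = tn - tn_as_fn + fn_as_tn
    return {
        "tp": obs_tp, "fp": obs_fp, "fn": obs_fn, "tn": obs_tn,
        "sensitivity": obs_tp / (obs_tp + obs_fn),
        "specificity": obs_tn / (obs_tn + obs_fp),
    }


# True counts from the "unobserved truth" panel of Table 3.
truth = dict(tp=8, fp=20, fn=2, tn=970)
six_month = observed_counts(**truth, p_new_cancer=0.0002, p_followup=0.70)
twelve_month = observed_counts(**truth, p_new_cancer=0.0005, p_followup=0.90)
```

Rounded to one decimal place, `six_month` gives 71.9% sensitivity and 97.7% specificity, and `twelve_month` gives 74.4% and 97.9%, matching Table 3. Changing `p_new_cancer` or `p_followup` is a quick way to see how other assumptions would alter the pattern.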
Studies that compute sensitivity based on cancers diagnosed between screening rounds (9–14) implicitly use the screening interval as the accuracy assessment interval. This approach is intuitive and reasonable, but it may not always be the best choice. Apparent interval cancers (i.e., those that occur after a negative screening test and before the next screening test) likely include both cancers that were missed at a screening test (false negatives) and de novo cancers. Thus, although the observed interval cancer rate is an important screening quality measure, it is not a pure measure of test sensitivity and may also have limitations with respect to computing specificity. The screening interval and accuracy assessment interval therefore need not be the same length; they are distinct concepts that serve different purposes. However, setting the accuracy assessment interval to be the same as the screening interval may have some advantages, including increasing the likelihood that a cancer missed by the first screening test will be diagnosed (i.e., that false negatives are correctly classified as such rather than misclassified as true negatives). Future work should comprehensively (and quantitatively) evaluate the benefits and drawbacks of using the screening interval as the accuracy assessment interval. Factors to consider include adherence (particularly differential adherence) to screening guidelines, disease natural history, and the implications of different screening intervals across screening modalities for a particular cancer.
There is a need for guidance in the literature about how to set the length of the accuracy assessment interval. Doing so requires information or assumptions about: (i) how rapidly most new cancers develop; (ii) how long it takes new cancers to become symptomatic and/or detectable; (iii) whether and when people will present for follow-up testing after a positive screen and for diagnostic testing if cancer symptoms are present; (iv) the recommended screening interval; and (v) rates of loss to follow-up. Many nuances need consideration, such as how to establish accuracy assessment interval(s) when comparing different screening modalities (e.g., FIT versus screening colonoscopy; Pap test alone versus co-testing with Pap and HPV testing).
We acknowledge that there is unlikely to be a perfect accuracy assessment interval for a particular screening test. For example, three years might be within the time frame needed to detect missed cancers but past the point at which some new cancers develop. Ultimately, setting the length of the accuracy assessment interval is a decision based on weighing these tradeoffs. Future studies (both empirical and simulation based) should investigate how to correct for bias and incorporate uncertainty in estimates due to the inherent challenges in having to artificially set an accuracy assessment interval. Existing research on verification bias and imperfect gold standards (15) may help epidemiologists develop guidance and tools to set accuracy assessment intervals and quantify the resulting bias and uncertainty in sensitivity and specificity estimates.
Authors' Disclosures
J. Chubak reports grants from NIH during the conduct of the study; grants from Amgen, Inc. outside the submitted work. A.N. Burnett-Hartman reports grants from NCI at the NIH during the conduct of the study. W.E. Barlow reports grants from NCI during the conduct of the study. D.A. Corley reports grants from NCI during the conduct of the study. C. Neslund-Dudas reports grants from NIH during the conduct of the study. A. Vachani reports grants from NIH/NCI during the conduct of the study; personal fees from Johnson & Johnson; grants from MagArray, Broncus Medical; and grants from PreCyte outside the submitted work. J.A. Tiro reports grants from NCI/NIH during the conduct of the study. A. Kamineni reports grants from NCI during the conduct of the study. No disclosures were reported by the other authors.
Disclaimer
The views expressed here are those of the authors only and do not represent any official position of the NCI or NIH.
Acknowledgments
The authors thank the participating PROSPR II Research Centers. A list of the PROSPR II investigators and contributing research staff is provided at: http://healthcaredelivery.cancer.gov/prospr/. The authors are also grateful to Dr. Rebecca Hubbard for her comments on the manuscript. This manuscript was written as part of the NCI-funded Population-based Research to Optimize the Screening Process (PROSPR II) consortium. The overall aim of PROSPR II is to conduct multisite, coordinated, transdisciplinary research to evaluate and improve cervical, colorectal, and lung cancer screening processes. The three PROSPR II Research Centers and their associated sites reflect the diversity of US delivery system organizations. UM1CA222035 (principal investigators: Chubak, Corley, Halm, Kamineni, Schottinger, Skinner), UM1CA229140 (principal investigators: Haas, Kamineni, Tiro), UM1CA221939 (principal investigators: Ritzwoller, Vachani), UM24CA221936 (principal investigators: Li and Zheng).