The effectiveness and efficiency of cancer screening in real-world settings depend on many factors, including test sensitivity and specificity. Outside of select experimental studies, not everyone receives a gold standard test that can serve as a comparator in estimating screening test accuracy. Thus, many studies of screening test accuracy use the passage of time to infer whether or not cancer was present at the time of the screening test, particularly for patients with a negative screening test. We define the accuracy assessment interval as the period of time after a screening test that is used to estimate the test's accuracy. We describe how the length of this interval may bias sensitivity and specificity estimates. We call for future research to quantify bias and uncertainty in accuracy estimates and to provide guidance on setting accuracy assessment interval lengths for different cancers and screening modalities.

Epidemiologists have an important role in evaluating healthcare interventions, such as screening. The effectiveness and efficiency of cancer screening depend on many factors, including test sensitivity and specificity. Estimating screening test sensitivity and specificity using observational data deserves greater attention in the epidemiologic literature, particularly with respect to a concept that we herein define as the accuracy assessment interval.

To estimate sensitivity and specificity, a perfect gold standard test would, ideally, be administered to everyone at the same time as the screening test being evaluated. However, this usually does not occur in practice. With an invasive test (e.g., biopsy) as the gold standard, only persons who screen positive are likely to have screening results verified, which can lead to verification bias (1, 2). Studies of screening test accuracy often use the passage of time to infer whether or not cancer was present at the time of the screening test, particularly for patients with a negative screening test (2–4). This is an example of differential verification bias: some people receive a gold standard test (e.g., biopsy or imaging), while others receive an imperfect referent standard (i.e., presence or absence of a cancer diagnosis during a particular time interval; ref. 5). We define the accuracy assessment interval as the period of time after a screening test that is used to estimate its accuracy. In this commentary, we explore how the length of the accuracy assessment interval may contribute to bias in estimates of screening test sensitivity and specificity.

The chosen length of the accuracy assessment interval can affect accuracy estimates, especially sensitivity. For example, Hofvind and colleagues estimated the sensitivity of mammography using 2-year and 1-year follow-up intervals (74.9% vs. 82.0%, respectively; ref. 6); there were no differences in specificity. In a study of hemoccult testing for colorectal cancer, Allison and colleagues reported sensitivities of 50%, 43%, and 25% using 1-year, 2-year, and 4-year follow-up periods, respectively (7). Specificity did not differ meaningfully among the follow-up periods (98.8% for the 1-year and 2-year periods and 98.7% for the 4-year period). Others have noted that the “optimal duration of follow-up has not been standardized” (4) and that short follow-up intervals might miss cancers that were truly present at screening, while long intervals might include cancers that developed after screening (4, 8). To our knowledge, however, this issue has not been addressed in detail. We posit that one should select an accuracy assessment interval, or intervals, that will most accurately estimate the screening test's true sensitivity and specificity. Ideally, the accuracy assessment interval is long enough that any cancer present at the time of screening will be diagnosed during the interval, yet short enough that new cancers are unlikely to develop, be detected during the interval, and be falsely classified as having been present at the time of the screening exam. For example, if the accuracy assessment interval for the fecal immunochemical test (FIT) is set to 2 years, we would want the following conditions to be met:

  • (i)  If a person does not truly have colorectal cancer at the time of a screening FIT, the person will not develop cancer and be diagnosed within 2 years.

  • (ii)  If a person truly has colorectal cancer at the time of a screening FIT, the cancer will be diagnosed within 2 years.

It is unlikely that these conditions will be met for everyone included in an analysis, so there will be error in the estimates of sensitivity and specificity; the question becomes how to minimize these errors. Table 1 shows screening test classification based on observed data (i.e., screening test result and accuracy assessment interval classification) and the truth. Classification according to these three factors allows us to conceptualize screening test results as: correct true positives (cTP), incorrect true positives (iTP), correct false positives (cFP), incorrect false positives (iFP), correct true negatives (cTN), incorrect true negatives (iTN), correct false negatives (cFN), and incorrect false negatives (iFN). Correct versus incorrect describes the agreement between the accuracy assessment interval classification and the truth. Positive versus negative refers to the screening test result. True versus false describes the agreement between the screening test result and the accuracy assessment interval classification. For example, a person with a negative FIT who is not diagnosed with cancer during the accuracy assessment interval (e.g., 2 years) is a cTN if cancer was truly absent at the time of the FIT and an iTN if cancer was truly present at the time of the FIT. A person with a positive FIT who is not diagnosed with cancer during the accuracy assessment interval is a cFP if cancer was truly absent at the time of the FIT and an iFP if cancer was truly present at the time of the FIT.
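The three-way classification described above can be expressed as a short function. The sketch below is illustrative only (the function and argument names are ours, not from the article); it maps each person's screening result, interval-based classification, and true cancer status at screening to one of the eight categories:

```python
def classify(screen_positive: bool, diagnosed_in_interval: bool,
             truly_had_cancer: bool) -> str:
    """Return one of cTP, iTP, cFP, iFP, cTN, iTN, cFN, iFN."""
    # True/false: agreement between the screening result and the
    # accuracy assessment interval classification.
    if screen_positive:
        label = "TP" if diagnosed_in_interval else "FP"
    else:
        label = "FN" if diagnosed_in_interval else "TN"
    # Correct/incorrect: agreement between the interval classification
    # and the person's true cancer status at the time of screening.
    prefix = "c" if diagnosed_in_interval == truly_had_cancer else "i"
    return prefix + label

# Example: a negative FIT, no diagnosis in the interval, but cancer
# truly present at screening is an incorrect true negative.
classify(screen_positive=False, diagnosed_in_interval=False,
         truly_had_cancer=True)  # → "iTN"
```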

Table 1.

Screening accuracy classification based on observed screening results, observed cancer diagnoses during accuracy assessment interval, and true cancer status at screening.

| Screening test result | Cancer diagnosed during accuracy assessment interval | No cancer diagnosed during accuracy assessment interval |
| --- | --- | --- |
| Positive | Cancer is truly present: cTP; cancer is truly absent: iTP (should be a false positive). Misclassifying false positives as true positives (cFP→iTP) increases estimated sensitivity by inflating its numerator and increases estimated specificity by deflating its denominator. | Cancer is truly absent: cFP; cancer is truly present: iFP (should be a true positive). Misclassifying true positives as false positives (cTP→iFP) decreases estimated sensitivity by deflating its numerator and decreases estimated specificity by inflating its denominator. |
| Negative | Cancer is truly present: cFN; cancer is truly absent: iFN (should be a true negative). Misclassifying true negatives as false negatives (cTN→iFN) decreases estimated sensitivity by inflating its denominator and decreases estimated specificity by deflating its numerator. | Cancer is truly absent: cTN; cancer is truly present: iTN (should be a false negative). Misclassifying false negatives as true negatives (cFN→iTN) increases estimated sensitivity by deflating its denominator and increases estimated specificity by inflating its numerator. |

Observed sensitivity and specificity depend, in part, on the relative frequency of different types of errors (i.e., misclassifying false positives as true positives [cFP→iTP], true positives as false positives [cTP→iFP], false negatives as true negatives [cFN→iTN], and true negatives as false negatives [cTN→iFN]) as given by the equations below.
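A reconstruction of those equations, consistent with the Table 1 definitions (the observed estimates count classifications made from the accuracy assessment interval, while the true values count classifications made from true cancer status at screening):

```latex
% Observed (estimated) values, based on interval classification:
\widehat{\text{sensitivity}} = \frac{cTP + iTP}{cTP + iTP + cFN + iFN},
\qquad
\widehat{\text{specificity}} = \frac{cTN + iTN}{cTN + iTN + cFP + iFP}

% True values, based on true cancer status at screening:
\text{sensitivity} = \frac{cTP + iFP}{cTP + iFP + cFN + iTN},
\qquad
\text{specificity} = \frac{cTN + iFN}{cTN + iFN + cFP + iTP}
```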

There are tradeoffs associated with lengthening and shortening the accuracy assessment interval. With a longer accuracy assessment interval, we are more likely to correctly classify a cancer that is present at the time of the screening test as "present." Some negative screening tests shift from being classified as true negatives to false negatives (which decreases estimated sensitivity and specificity), and some positive screening tests shift from being classified as false positives to true positives (which increases estimated sensitivity and specificity). For example:

  • Increasing the length of the accuracy assessment interval risks misclassifying TNs as FNs (cTN→iFN) and misclassifying FPs as TPs (cFP→iTP). As a result, we might mistakenly conclude that new cancers that developed during the accuracy assessment interval had been present at the time of the screening test.

  • Increasing the length of the accuracy assessment interval helps correctly identify FNs (i.e., iTN→cFN) and correctly identify TPs (i.e., iFP→cTP). Having a longer accuracy assessment interval helps identify cancers that were truly present at the time of the screening test.

The reverse occurs as the accuracy assessment interval is shortened; there is thus an inherent tradeoff between lengthening and shortening the interval. Table 2 and Supplementary Fig. S1 show the complexity of potential problems caused by accuracy assessment intervals that are too long or too short.

Table 2.

Impact on estimated sensitivity and specificity from accuracy assessment intervals that are too long or too short.

| Accuracy assessment interval length | Problem | Possible causes | Impact on estimates of sensitivity and specificity |
| --- | --- | --- | --- |
| Too long | Cancers that develop and are detected after screening are incorrectly classified as having been present at screening | Rate of new cancer development is fast relative to accuracy assessment interval length | Misclassifying false positives as true positives (cFP→iTP) increases estimated sensitivity and specificity; misclassifying true negatives as false negatives (cTN→iFN) decreases estimated sensitivity and specificity |
| Too long | Among people with a negative screening test, cancers detected at a subsequent screening test are incorrectly classified as having been present at the initial screening test | Recommended screening interval is shorter than the accuracy assessment interval | Misclassifying true negatives as false negatives (cTN→iFN) decreases estimated sensitivity and specificity |
| Too short | Cancers that were truly present at screening are not detected during the accuracy assessment interval | People with positive screening tests do not have sufficiently rapid follow-up diagnostic tests relative to accuracy assessment interval length | Misclassifying true positives as false positives (cTP→iFP) decreases estimated sensitivity and specificity |
| Too short |  | Slow progression from asymptomatic cancer to symptom-detected cancer | Misclassifying true positives as false positives (cTP→iFP) decreases estimated sensitivity and specificity; misclassifying false negatives as true negatives (cFN→iTN) increases estimated sensitivity and specificity |
| Too short |  | Recommended screening interval is longer than the accuracy assessment interval, which decreases opportunities for detection | Misclassifying false negatives as true negatives (cFN→iTN) increases estimated sensitivity and specificity |

Table 3 presents a hypothetical example showing how changing the accuracy assessment interval can affect estimates of sensitivity and specificity. In this hypothetical population with 1% cancer prevalence, true sensitivity and specificity are 80.0% and 98.0%, respectively. We assume that during a 6-month accuracy assessment interval, 0.02% of cancer-free people develop and are diagnosed with cancer and that 70% of those who screened positive receive a cancer-confirming follow-up test. We assume that during a 12-month accuracy assessment interval, 0.05% of cancer-free people develop and are diagnosed with cancer and that 90% of those who screened positive receive a cancer-confirming follow-up test. In this particular hypothetical example, the estimates of specificity are quite similar (and close to the truth) for both accuracy assessment intervals. Sensitivity is underestimated using both accuracy assessment intervals, but to a greater degree with the shorter interval. Different assumptions would yield different patterns; thus, the table is primarily intended to show that different accuracy assessment intervals can indeed give rise to different accuracy estimates.

Table 3.

Hypothetical example of the impact on estimated sensitivity and specificity of using different accuracy assessment interval lengths.

Unobserved truth

|  | Cancer | No cancer | Total |
| --- | --- | --- | --- |
| Screen + | 8 | 20 | 28 |
| Screen – | 2 | 970 | 972 |
| Total | 10 | 990 | 1,000 |

Sensitivity: 80.0%; specificity: 98.0%.

Observed with 6-month accuracy assessment interval (a)

|  | Cancer | No cancer | Total |
| --- | --- | --- | --- |
| Screen + | 5.604 | 22.396 | 28 |
| Screen – | 2.194 | 969.806 | 972 |
| Total | 7.798 | 992.202 | 1,000 |

Sensitivity: 71.9%; specificity: 97.7%. (TP misclassified as FP: 2.400; TN misclassified as FN: 0.194; FP misclassified as TP: 0.004; FN misclassified as TN: 0.)

Observed with 12-month accuracy assessment interval (b)

|  | Cancer | No cancer | Total |
| --- | --- | --- | --- |
| Screen + | 7.210 | 20.790 | 28 |
| Screen – | 2.485 | 969.515 | 972 |
| Total | 9.695 | 990.305 | 1,000 |

Sensitivity: 74.4%; specificity: 97.9%. (TP misclassified as FP: 0.800; TN misclassified as FN: 0.485; FP misclassified as TP: 0.010; FN misclassified as TN: 0.)

Abbreviations: FP, false positive; FN, false negative; TP, true positive; TN, true negative.

(a) We assume that during a 6-month accuracy assessment interval, 0.02% of the no-cancer group develops and is diagnosed with cancer and that 70% of the screen-positive group receives a cancer-confirming follow-up test.

(b) We assume that during a 12-month accuracy assessment interval, 0.05% of the no-cancer group develops and is diagnosed with cancer and that 90% of the screen-positive group receives a cancer-confirming follow-up test.
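The Table 3 panels can be reproduced with a short calculation under the footnoted assumptions. The sketch below is illustrative (the function name and arguments are ours); counts are expected values per 1,000 screened people, and, as the table implies, all true false negatives are assumed to be diagnosed during the interval:

```python
def observed_accuracy(tp, fp, fn, tn, p_followup, p_new_cancer):
    """Observed sensitivity and specificity for a given accuracy
    assessment interval.

    tp, fp, fn, tn: true counts at the time of screening.
    p_followup: share of screen-positives who receive a
        cancer-confirming follow-up test during the interval.
    p_new_cancer: share of cancer-free people who develop and are
        diagnosed with cancer during the interval.
    """
    obs_tp = tp * p_followup + fp * p_new_cancer          # cTP + iTP
    obs_fp = tp * (1 - p_followup) + fp * (1 - p_new_cancer)  # iFP + cFP
    obs_fn = fn + tn * p_new_cancer                       # cFN + iFN
    obs_tn = tn * (1 - p_new_cancer)                      # cTN + iTN
    sensitivity = obs_tp / (obs_tp + obs_fn)
    specificity = obs_tn / (obs_tn + obs_fp)
    return sensitivity, specificity

# 6-month interval from Table 3: sensitivity ≈ 0.719, specificity ≈ 0.977
sens6, spec6 = observed_accuracy(tp=8, fp=20, fn=2, tn=970,
                                 p_followup=0.70, p_new_cancer=0.0002)
```

Changing `p_followup` to 0.90 and `p_new_cancer` to 0.0005 reproduces the 12-month panel (sensitivity ≈ 0.744, specificity ≈ 0.979).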

Studies that compute sensitivity based on cancers diagnosed between screening rounds (9–14) implicitly use the screening interval as the accuracy assessment interval. This approach is intuitive and reasonable, but it may not always be the best choice. Apparent interval cancers (i.e., those that occur after a negative screening test and before the next screening test) likely include both cancers that were missed at a screening test (false negatives) and de novo cancers. Thus, although the observed interval cancer rate is an important screening quality measure, it is not a pure measure of test sensitivity, and it may also have limitations with respect to computing specificity. The screening interval and accuracy assessment interval therefore need not be the same length; they are distinct concepts that serve different purposes. However, setting the accuracy assessment interval equal to the screening interval may have some advantages, including increasing the likelihood that a cancer missed by the first screening test will be diagnosed (i.e., that false negatives are correctly classified as such rather than misclassified as true negatives). Future work should comprehensively and quantitatively evaluate the benefits and drawbacks of using the screening interval as the accuracy assessment interval. Factors to consider include adherence (particularly differential adherence) to screening guidelines, disease natural history, and the implications of different screening intervals across screening modalities for a particular cancer.

There is a need for guidance in the literature on how to set the length of the accuracy assessment interval. Doing so requires information or assumptions about: (i) how rapidly most new cancers develop; (ii) how long it takes new cancers to become symptomatic and/or detectable; (iii) if/when people will present for follow-up testing after a positive screen and for diagnostic testing if cancer symptoms are present; (iv) the recommended screening interval; and (v) rates of loss to follow-up. Many nuances need consideration, such as how to establish accuracy assessment interval(s) when comparing different screening modalities (e.g., FIT versus screening colonoscopy; Pap test alone versus co-testing with Pap and HPV testing).

We acknowledge that there is unlikely to be a perfect accuracy assessment interval for a particular screening test. For example, three years might be within the time frame needed to detect missed cancers but past the point at which some new cancers develop. Ultimately, setting the length of the accuracy assessment interval is a decision based on weighing these tradeoffs. Future studies (both empirical and simulation based) should investigate how to correct for bias and incorporate uncertainty in estimates due to the inherent challenges in having to artificially set an accuracy assessment interval. Existing research on verification bias and imperfect gold standards (15) may help epidemiologists develop guidance and tools to set accuracy assessment intervals and quantify the resulting bias and uncertainty in sensitivity and specificity estimates.

J. Chubak reports grants from the NIH during the conduct of the study and grants from Amgen, Inc. outside the submitted work. A.N. Burnett-Hartman reports grants from the NCI at the NIH during the conduct of the study. W.E. Barlow reports grants from the NCI during the conduct of the study. D.A. Corley reports grants from the NCI during the conduct of the study. C. Neslund-Dudas reports grants from the NIH during the conduct of the study. A. Vachani reports grants from the NIH/NCI during the conduct of the study; personal fees from Johnson & Johnson; and grants from MagArray, Broncus Medical, and PreCyte outside the submitted work. J.A. Tiro reports grants from the NCI/NIH during the conduct of the study. A. Kamineni reports grants from the NCI during the conduct of the study. No disclosures were reported by the other authors.

The views expressed here are those of the authors only and do not represent any official position of the NCI or NIH.

The authors thank the participating PROSPR II Research Centers. A list of the PROSPR II investigators and contributing research staff is provided at: http://healthcaredelivery.cancer.gov/prospr/. The authors are also grateful to Dr. Rebecca Hubbard for her comments on the manuscript. This manuscript was written as part of the NCI-funded Population-based Research to Optimize the Screening Process (PROSPR II) consortium. The overall aim of PROSPR II is to conduct multisite, coordinated, transdisciplinary research to evaluate and improve cervical, colorectal, and lung cancer screening processes. The three PROSPR II Research Centers and their associated sites reflect the diversity of US delivery system organizations. UM1CA222035 (principal investigators: Chubak, Corley, Halm, Kamineni, Schottinger, Skinner), UM1CA229140 (principal investigators: Haas, Kamineni, Tiro), UM1CA221939 (principal investigators: Ritzwoller, Vachani), UM24CA221936 (principal investigators: Li and Zheng).

1. Pepe MS, Alonzo TA. Comparing disease screening tests when true disease status is ascertained only for screen positives. Biostatistics 2001;2:249–60.

2. O'Sullivan JW, Banerjee A, Heneghan C, Pluddemann A. Verification bias. BMJ Evid Based Med 2018;23:54–5.

3. Lin JS, Piper MA, Perdue LA, Rutter CM, Webber EM, O'Connor E, et al. Screening for colorectal cancer: updated evidence report and systematic review for the US Preventive Services Task Force. JAMA 2016;315:2576–94.

4. Rosman AS, Korsten MA. Effect of verification bias on the sensitivity of fecal occult blood testing: a meta-analysis. J Gen Intern Med 2010;25:1211–21.

5. Alonzo TA, Brinton JT, Ringham BM, Glueck DH. Bias in estimating accuracy of a binary screening test with differential disease verification. Stat Med 2011;30:1852–64.

6. Hofvind S, Geller BM, Skelly J, Vacek PM. Sensitivity and specificity of mammographic screening as practiced in Vermont and Norway. Br J Radiol 2012;85:e1226–32.

7. Allison JE, Feldman R, Tekawa IS. Hemoccult screening in detecting colorectal neoplasm: sensitivity, specificity, and predictive value. Long-term follow-up in a large group practice setting. Ann Intern Med 1990;112:328–33.

8. Glueck DH, Lamb MM, O'Donnell CI, Ringham BM, Brinton JT, Muller KE, et al. Bias in trials comparing paired continuous tests can cause researchers to choose the wrong screening modality. BMC Med Res Methodol 2009;9:4.

9. Blom J, Tornberg S. Interval cancers in a guaiac-based colorectal cancer screening program: consequences on sensitivity. J Med Screen 2017;24:146–52.

10. Bordas P, Jonsson H, Nystrom L, Lenner P. Interval cancer incidence and episode sensitivity in the Norrbotten Mammography Screening Programme, Sweden. J Med Screen 2009;16:39–45.

11. Elena PM, Nehmat H, Ermes M, Piera C, Maria Q, Guia M, et al. Quality of mammography screening in the Milan programme: evidence of improved sensitivity based on interval cancer proportional incidence and radiological review. Breast 2009;18:208–10.

12. Hakama M, Auvinen A, Day NE, Miller AB. Sensitivity in cancer screening. J Med Screen 2007;14:174–7.

13. Sarkeala T, Hakama M, Saarenmaa I, Hakulinen T, Forsman H, Anttila A. Episode sensitivity in association with process indicators in the Finnish breast cancer screening program. Int J Cancer 2006;118:174–9.

14. Zorzi M, Guzzinati S, Puliti D, Paci E. A simple method to estimate the episode and program sensitivity of breast cancer screening programs. J Med Screen 2010;17:132–8.

15. Umemneku Chikere CM, Wilson KJ, Allen AJ, Vale L. Comparative diagnostic accuracy studies with an imperfect reference standard: a comparison of correction methods. BMC Med Res Methodol 2021;21:67.

Supplementary data