Background: Survey data used to study trends in cancer screening may overestimate screening utilization while potentially underestimating existing disparities in use.
Methods: We conducted a literature review and meta-analysis of validation studies examining the accuracy of self-reported cancer-screening histories. We calculated summary random-effects estimates for sensitivity and specificity, separately for mammography, clinical breast exam (CBE), Pap smear, prostate-specific antigen testing (PSA), digital rectal exam, fecal occult blood testing, and colorectal endoscopy.
Results: Sensitivity was highest for mammogram, CBE, and Pap smear (0.95, 0.94, and 0.93, respectively) and lowest for PSA and digital rectal exam histories (0.71 and 0.75). Specificity was highest for endoscopy, fecal occult blood testing, and PSA (0.90, 0.78, and 0.73, respectively) and lowest for CBE, Pap smear, and mammogram histories (0.26, 0.48, and 0.61, respectively). Sensitivity and specificity summary estimates tended to be lower in predominantly Black and Hispanic samples compared with predominantly White samples. When estimates of self-report accuracy from this meta-analysis were applied to cancer-screening prevalence estimates from the National Health Interview Survey, results suggested that prevalence estimates are artificially increased and disparities in prevalence are artificially decreased by inaccurate self-reports.
Conclusions: National survey data are overestimating cancer-screening utilization for several common procedures and may be masking disparities in screening due to racial/ethnic differences in reporting accuracy. (Cancer Epidemiol Biomarkers Prev 2008;17(4):748–57)
Screening for breast, cervical, and colorectal cancer is generally recognized as effective in reducing morbidity and mortality from these cancers, and both the U.S. Preventive Services Task Force (USPSTF) and the American Cancer Society (ACS) provide recommendations regarding screening frequencies for each. ACS currently recommends annual Pap smears to detect cervical cancer, beginning at age 21 years, and annual mammograms to detect breast cancer, beginning at age 40 years (1), whereas USPSTF recommends Pap smears every 3 years and mammograms every 1 to 2 years (2, 3). For detection of colorectal cancer, each recommends either fecal occult blood test (FOBT) annually (ACS) or biennially (USPSTF); sigmoidoscopy every 5 years; or colonoscopy every 10 years, beginning at age 50 years (1, 4). For prostate cancer, ACS (but not USPSTF) recommends annual digital rectal exam (DRE) and prostate-specific antigen (PSA) testing at age 50 years for men at average risk and at age 45 years for African-American men (1). USPSTF recommends neither for nor against PSA screening.
Currently, epidemiologic surveys are the most common method used for monitoring trends in population compliance with these screening recommendations. According to data from the 2000 National Health Interview Survey (NHIS), ∼82% of women over age 25 years obtained a Pap smear within the past 3 years, 70% obtained a mammogram in the past 2 years, and 41% of adults obtained colorectal cancer screening in accordance with guidelines (5). The Healthy People goals call for rates of adherence to national cancer-screening guidelines of 97% for Pap smear, 70% for biennial mammography, and 50% for colorectal cancer screening by 2010 (6). Another set of goals for Healthy People 2010 relates to reducing racial/ethnic and socioeconomic disparities in cancer screening. These goals implicitly assume that the data that are relied upon to estimate national rates of cancer screening are accurate.
Numerous validation studies have documented the levels of agreement/disagreement between survey reports of cancer-screening practices and actual behavior as documented in patient charts, laboratory records, and other auxiliary sources (7-61). We conducted a literature review and meta-analysis of studies validating cancer-screening self-reports. Our aims were to establish estimates of accuracy that could be used to inform the interpretation of national survey data on cancer-screening utilization, and that could be used by clinicians when deciding whether to refer a patient for screening based on self-reported prior screening. We also sought to determine whether survey methodology, documentation strategy, sample race/ethnicity, and other sample sociodemographic characteristics were associated with self-report accuracy.
Materials and Methods
We searched for articles published in Medline between January 1, 1966, and July 2005. We searched for article titles containing any of the words “accuracy,” “validity,” “specificity,” “sensitivity,” “reliability,” or “reproducibility” along with titles containing either “self-report,” “recall,” or “patient reports” and one of the following: “cancer,” “cervical,” “pap,” “colon,” “colorectal,” “fobt,” “fecal,” “prostate,” “psa,” “breast,” “mammo,” “skin,” or “melanoma.” We searched the references of identified articles to locate additional articles of interest that might have been missed in the original search. We identified 55 articles presenting data on self-reported and medically documented screening histories (7-61). We excluded 10 studies that were not conducted in the United States (7, 9, 10, 11, 18, 44, 45, 53, 56, 57), including one study each on clinical skin cancer exam and stomach fluoroscopy (7, 53). We excluded three additional studies that compared self-reported screening rates in one sample of individuals with medically documented screening rates in another sample because accuracy measures could not be calculated (37, 38, 60). In addition, one study, conducted within the United States, validated the date of the most recent screening exam (61) without calculating accuracy measures for specific time frames and was excluded.
These 14 exclusions left 41 studies of the accuracy of self-reported breast, cervical, prostate, and colorectal cancer screening. From these studies, we limited our analysis to the 29 studies that collected data on all four contingencies (true positives, false positives, false negatives, and true negatives) and could therefore estimate sensitivity and specificity (and positive predictive value; refs. 8, 12-15, 17, 20, 22, 25-27, 30-32, 34-36, 39-41, 43, 47, 49, 50, 52, 54, 55, 58, 59).
Self-report time frames varied from study to study. Two studies asked subjects whether they had received a test on the same day as the interview (17, 43), whereas other studies asked subjects to report if they had ever had the procedure (47, 54, 58). The most frequent time frame was the past 2 years (14, 20, 25, 26, 27, 31, 35, 45), followed by the past 1, 3, and 5 years, respectively. For studies that calculated measures for multiple time frames, we included only one time frame in our analysis to reduce dependencies in our data and to ensure that no single study contributed more information to the meta-analysis simply because it obtained multiple time frames on the same individuals. When forced to make this choice, we chose a reporting interval that corresponded closely to national screening guidelines for a given procedure: (a) 3 years for Pap smear reports; (b) 5 years for colorectal endoscopy; and (c) 2 years for other screening procedures. Several of the 29 studies included in our analysis reported accuracy estimates for more than one sample subgroup; in all, the 29 included studies contained 39 sample subgroups in our final analysis data set.
From each study, we abstracted the following: first author, year of publication, mid-point year of interviews, and whether the validation study seemed to be part of a larger intervention study or a validation substudy from a larger sample of respondents.
For each study subgroup, we abstracted the following: interview mode (face-to-face, telephone, self-administered report), interview location (home, clinic), type of facility from which medical documentation was obtained, and the number of “don't know” responses and how they were handled in the analysis. We also abstracted sample distribution of age, sex, education, and race/ethnicity. We categorized samples as dominant (>70%) White, Hispanic, Black, Native American, Asian American, mixed non-White dominant, and race/ethnicity not specified. We abstracted the types of documentation sought from these facilities, which were described with terms such as “charts,” “medical records,” “outpatient medical records,” “medical record audits,” “facility records,” pathology and radiology records, “computerized databases,” and “HMO records,” along with many other brief and relatively uninformative descriptions of the documentation process.
For each self-reported versus documented screening history, we identified the number of positive reports that were documented as positive (true positives), the number of positive reports that were documented as negative (false positives), the number of negative reports that were documented as positive (false negatives), and the number of negative reports that were documented as negative (true negatives). From these numbers, we calculated values for sensitivity, specificity, and positive predictive value. Sensitivity was defined as the probability that a history documented as positive was reported as positive (number of true positives divided by the number of true positives plus false negatives). Specificity was defined as the probability that a history documented as negative was reported as negative (number of true negatives divided by the number of true negatives plus false positives). Positive predictive value was defined as the probability that a positive self-report was documented as positive (number of true positives divided by the number of true positives plus false positives). We included positive predictive value in our analyses because it may be of particular interest to clinicians. The reports-to-records ratio was defined as the percentage of self-reported positive screening histories divided by the percentage of documented positive histories. A reports-to-records ratio greater than 1 was consistent with net overreporting of positive histories by respondents, and a reports-to-records ratio less than 1 was consistent with net underreporting. In a few instances, one of the two cells involved in a given measure had a value of zero, prohibiting the simple calculation of a SE. In these rare instances, we added one observation to each cell and recalculated the accuracy estimate.
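The definitions above can be sketched in a few lines of Python. This is only an illustration: the function names and the example counts are hypothetical and are not taken from any of the included studies, and the zero-cell rule is our reading of the text (add one observation to each of the two cells involved in a measure when either is zero).

```python
def proportion(numerator_cell, other_cell):
    """Return numerator_cell / (numerator_cell + other_cell).

    Assumption: if either of the two cells is zero, one observation is
    added to each cell before the proportion is computed.
    """
    if numerator_cell == 0 or other_cell == 0:
        numerator_cell += 1
        other_cell += 1
    return numerator_cell / (numerator_cell + other_cell)


def accuracy_measures(tp, fp, fn, tn):
    """Accuracy measures for one self-report versus documentation table.

    tp: reported positive, documented positive
    fp: reported positive, documented negative
    fn: reported negative, documented positive
    tn: reported negative, documented negative
    """
    return {
        "sensitivity": proportion(tp, fn),
        "specificity": proportion(tn, fp),
        "ppv": proportion(tp, fp),
        # The shared denominator (total N) cancels, so the ratio of
        # percentages equals the ratio of counts.
        "reports_to_records": (tp + fp) / (tp + fn),
    }


# Hypothetical counts, for illustration only.
example = accuracy_measures(tp=90, fp=30, fn=10, tn=70)
```

With these hypothetical counts, the reports-to-records ratio exceeds 1 (120 positive reports against 100 documented positives), which corresponds to net overreporting.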
Sensitivity, specificity, and positive predictive value were each nonnormally distributed. After taking a logit-transformation of the individual accuracy estimates, distributions were approximately normal. We estimated the SE of each logit-transformed estimate, and plotted point estimates and 95% confidence intervals on the probability scale in order of increasing SE.
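A minimal sketch of the logit step follows. The text does not give the SE formula, so the usual large-sample (delta-method) SE for a logit-transformed proportion, sqrt(1/events + 1/nonevents), is assumed here; the function name and the counts are hypothetical.

```python
import math


def logit_interval(events, nonevents, z=1.96):
    """Point estimate and 95% CI for a proportion via the logit scale.

    events / nonevents are the two cells behind one accuracy measure,
    e.g. true positives and false negatives for sensitivity.
    Assumption: SE(logit p) = sqrt(1/events + 1/nonevents).
    Both cells must be nonzero.
    """
    logit = math.log(events / nonevents)
    se = math.sqrt(1.0 / events + 1.0 / nonevents)

    def expit(x):
        # Inverse of the logit transform: back to the probability scale.
        return 1.0 / (1.0 + math.exp(-x))

    return {
        "estimate": expit(logit),
        "se_logit": se,
        "lower": expit(logit - z * se),
        "upper": expit(logit + z * se),
    }


# Hypothetical sensitivity data: 90 true positives, 10 false negatives.
ci = logit_interval(90, 10)
```

Because the interval is built on the logit scale and then back-transformed, its limits always stay between 0 and 1 and are asymmetric around the point estimate when the estimate is near either boundary.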
Summary Estimates by Screening Procedure
We used the meta command in Stata version 9 (StataCorp) to generate summary estimates of sensitivity, specificity, and positive predictive value for each screening procedure by taking weighted averages of individual study/subgroup estimates. We estimated both “fixed-effects” and “random-effects” summary estimates. Fixed-effects estimates were generated by taking an inverse-variance weighted average of the study-specific, logit-transformed estimates of accuracy. Fixed-effects estimates give more weight to larger studies and assume that each is an estimate of a common value. Random-effects estimates (also known as DerSimonian and Laird estimates) assume that there are true differences in estimates between studies (62). This between-study variance is estimated and then added to the denominator of the precision weights. The result is that larger and smaller studies are more evenly weighted and the confidence interval for the summary effect is usually wider than for the corresponding fixed-effect estimate. Random-effects estimation also provides a test for homogeneity in study-/subgroup-specific estimates. P values for the vast majority of tests of homogeneity were highly statistically significant (P < 0.0005), indicating a great deal of heterogeneity among study estimates; therefore, we relied on random-effects summary estimates when interpreting results.
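The fixed-effects and random-effects calculations described above can be sketched as follows. This is a plain Python illustration of inverse-variance pooling with the DerSimonian and Laird moment estimator of between-study variance, not a reproduction of the Stata routine; the function name and the input values are hypothetical.

```python
import math


def pool_logits(estimates, ses):
    """Pool logit-transformed estimates (fixed and random effects)."""
    # Fixed effects: inverse-variance weights.
    w = [1.0 / se ** 2 for se in ses]
    sum_w = sum(w)
    fixed = sum(wi * yi for wi, yi in zip(w, estimates)) / sum_w

    # Cochran's Q, the statistic behind the test of homogeneity.
    q = sum(wi * (yi - fixed) ** 2 for wi, yi in zip(w, estimates))
    df = len(estimates) - 1

    # DerSimonian-Laird moment estimate of between-study variance,
    # truncated at zero.
    c = sum_w - sum(wi ** 2 for wi in w) / sum_w
    tau2 = max(0.0, (q - df) / c) if c > 0 else 0.0

    # Random effects: tau2 is added to the denominator of each weight.
    w_re = [1.0 / (se ** 2 + tau2) for se in ses]
    sum_w_re = sum(w_re)
    random_ = sum(wi * yi for wi, yi in zip(w_re, estimates)) / sum_w_re

    return {
        "fixed": fixed,
        "se_fixed": math.sqrt(1.0 / sum_w),
        "random": random_,
        "se_random": math.sqrt(1.0 / sum_w_re),
        "q": q,
        "df": df,
        "tau2": tau2,
    }


# Hypothetical logit estimates and SEs from three study subgroups.
pooled = pool_logits([2.0, 3.0, 1.0], [0.2, 0.3, 0.5])
# Homogeneous input: tau2 is zero and both summaries coincide.
same = pool_logits([1.0, 1.0, 1.0], [0.2, 0.3, 0.5])
```

In the heterogeneous example the estimated between-study variance is positive, so the random-effects weights are more even across subgroups and the random-effects SE is larger than the fixed-effects SE. Summary estimates would be back-transformed from the logit scale to the probability scale before reporting.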
Individual validation studies of cancer-screening histories, especially substudies undertaken to support results from a larger study, may not be published if accuracy is determined to be poor. The resulting absence of these studies in the literature would tend to bias summary estimates upward in the meta-analysis. To test for the possibility of publication bias, we constructed funnel plots and calculated Egger's statistics as implemented in Stata's metabias and metafunnel commands (63). When the P value for the bias statistic was <0.10, we implemented Stata's metatrim command, which uses a nonparametric “trim and fill” method to reestimate the summary estimate, after accounting for possible publication bias (64).
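The regression behind Egger's statistic can be sketched as below. This is an illustrative ordinary least-squares version (standardized estimate regressed on precision), not the Stata implementation; the function name and data are hypothetical, and the P value (from a t distribution with n − 2 degrees of freedom) is not computed here.

```python
import math


def egger_regression(estimates, ses):
    """Egger-style regression of estimate/SE on 1/SE.

    The intercept measures funnel-plot asymmetry: a value far from zero
    suggests that small studies differ systematically from large ones.
    """
    n = len(estimates)
    if n < 3:
        raise ValueError("at least three estimates are needed")

    x = [1.0 / se for se in ses]                      # precision
    y = [est / se for est, se in zip(estimates, ses)]  # standardized estimate

    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))

    slope = sxy / sxx
    intercept = mean_y - slope * mean_x

    # Residual variance and the SE of the intercept.
    rss = sum((yi - intercept - slope * xi) ** 2 for xi, yi in zip(x, y))
    s2 = rss / (n - 2)
    se_intercept = math.sqrt(s2 * (1.0 / n + mean_x ** 2 / sxx))
    t_stat = intercept / se_intercept if se_intercept > 0 else math.nan

    return {
        "intercept": intercept,
        "slope": slope,
        "se_intercept": se_intercept,
        "t": t_stat,
    }


# Hypothetical data built so that estimates rise with SE:
# estimate = 2.0 + 1.5 * SE, i.e. a small-study effect of 1.5.
example_ses = [0.1, 0.2, 0.4, 0.5]
example_estimates = [2.0 + 1.5 * se for se in example_ses]
egger = egger_regression(example_estimates, example_ses)
```

In this constructed example the intercept recovers the built-in small-study effect (1.5) and the slope recovers the underlying common estimate (2.0); with real data the fit would not be exact.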
Corrected Estimates from NHIS 2005
Table 1 lists the 29 studies and 39 study strata that contributed a total of 87 observations to the final analysis data set. Individual point estimates and 95% confidence intervals for sensitivity, specificity, and positive predictive value are presented in Fig. 1 for self-report histories of mammogram, Pap smear, FOBT, and colorectal endoscopy. There was a great deal of variation in estimated accuracy across studies for a given screening procedure. Sensitivity was positively correlated with the prevalence of self-reported use and with positive predictive value, but inversely correlated with specificity (ρ = 0.76, 0.60, and −0.49, respectively). In addition, specificity was inversely correlated with both prevalence of self-reported use and with positive predictive value (ρ = −0.86 and −0.24, respectively; results not shown). The mean (median) values for reports-to-records ratio were 1.2 (1.1) for clinical breast exam, 1.2 (1.2) for mammography, 1.4 (1.2) for Pap smear, 0.9 (0.9) for PSA testing, 1.3 (1.2) for DRE, 1.4 (1.4) for FOBT, and 2.2 (1.7) for colorectal endoscopy (results not shown).
Summary estimates of fixed and random effects are presented in Table 2. In almost every case, there was highly statistically significant between-estimate variance; therefore, we relied on the random-effects estimates when interpreting results. Sensitivity was highest for mammogram, clinical breast exam (CBE), and Pap smear (0.95, 0.94, and 0.93, respectively) and lowest for PSA and colorectal screening histories. Specificity was highest for endoscopy, FOBT, and PSA (0.90, 0.78, and 0.73, respectively) and lowest for CBE, Pap smear, and mammogram histories (0.26, 0.48, and 0.61, respectively). For the 21 random-effects summary estimates across the seven screening procedures and three measures, we found eight instances of possible publication bias, six of which seemed to inflate the accuracy estimate by 2 to 8 percentage points, and two that seemed to lower the estimate by 4 and 8 points, respectively (Table 2).
Socioeconomic and Study Design Factors
Information on socioeconomic status was limited; the distribution of age, income, insurance status, and other factors was usually not available. We were able to categorize a little more than half of the samples on the percentage with a high school degree; this somewhat crude measure was not associated with reporting accuracy (results not shown). Sex was also not associated with self-report accuracy (results not shown). Neither the type of documentation facility nor the types of documentation relied on was consistently associated with self-report accuracy (results not shown). Interviews conducted in the clinic and face-to-face interviews, in general, tended to be associated with reduced self-report accuracy compared with telephone-administered or self-administered interviews (results not shown).
Of the 39 sample subgroups analyzed, 15, 9, and 4 subgroups were predominantly White, Black, and Hispanic, respectively. The remainder were either mixed minority dominant (n = 2), did not have a dominant race/ethnicity (n = 7), or were missing data on race/ethnicity (n = 2). Three studies directly compared predominantly White subgroups to predominantly Black and/or Hispanic subgroups with respect to mammography (27, 54), CBE (27), Pap smear (27, 54), PSA (26), DRE (26, 27), FOBT (26, 27), and colorectal endoscopy (26, 27). These comparisons are presented in Table 3. These within-study comparisons suggested that there was a reduced sensitivity of self-reporting for Hispanics compared with Whites for both mammography and DRE (27), but no differences were suggested for specificity (Table 3).
When meta-regressions were done on all studies together, sensitivity tended to be lowest for predominantly Hispanic samples, whereas specificity tended to be lower for both predominantly Hispanic and predominantly Black samples compared with predominantly White samples (Table 4). Compared with White samples, Hispanic ethnicity was associated with reduced sensitivity of mammogram and DRE self-reporting. Hispanic ethnicity was also associated with reduced specificity of Pap smear self-reporting and reduced positive predictive value for mammogram and endoscopy self-reporting. Predominantly Black samples were associated with reduced specificity of Pap smear reporting (Table 4).
Adjusted Prevalence Estimates from the NHIS
We adjusted prevalence estimates from the 1998 and 2000 NHIS using the procedure-specific random-effects summary estimates of sensitivity and specificity that were generated by the meta-analysis (Table 2). The estimated prevalence of mammography in the past 2 years was lowered with adjustment from 0.70 to 0.56, and the prevalence of Pap smear in the past 3 years was lowered from 0.82 to 0.74 (Table 5). Prevalence estimates for PSA and colorectal screening were also lowered after adjustment for self-report accuracy in this manner. We then adjusted race/ethnicity–specific mammogram and Pap smear prevalence estimates from NHIS, using the procedure-specific random-effects summary estimates of sensitivity and specificity that were generated separately for non-Hispanic White, non-Hispanic Black, and Hispanic individuals (Table 4). After adjusting prevalence estimates in this manner, Black-White and Hispanic-White disparities in mammogram and Pap smear prevalence estimates seemed to be considerably larger than those based on the observed estimates alone (Table 6).
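The text does not state the adjustment formula. One standard correction for misclassification of a binary measure, assumed here purely for illustration, is adjusted = (observed + specificity − 1) / (sensitivity + specificity − 1). The sketch below applies it to the rounded values quoted above; the function name is hypothetical.

```python
def adjust_prevalence(observed, sensitivity, specificity):
    """Correct an observed (self-reported) prevalence for misclassification.

    Assumed formula:
        adjusted = (observed + specificity - 1) / (sensitivity + specificity - 1)
    It is only meaningful when sensitivity + specificity > 1.
    """
    denominator = sensitivity + specificity - 1.0
    if denominator <= 0:
        raise ValueError("sensitivity + specificity must exceed 1")
    adjusted = (observed + specificity - 1.0) / denominator
    # Keep the result inside the valid range for a proportion.
    return min(1.0, max(0.0, adjusted))


# Rounded inputs quoted in the text (observed prevalence, sensitivity,
# specificity) for mammography and Pap smear.
mammogram_adjusted = adjust_prevalence(0.70, 0.95, 0.61)
pap_adjusted = adjust_prevalence(0.82, 0.93, 0.48)
```

With these rounded inputs the formula gives roughly 0.55 for mammography and 0.73 for Pap smear, close to the adjusted values of 0.56 and 0.74 reported above; the small differences may reflect rounding of the summary estimates, or a somewhat different procedure than the one assumed here.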
The main goal of this analysis was to generate estimates of self-report accuracy for specific cancer-screening procedures that could then be applied to national survey data to improve the accuracy of national prevalence estimates for colorectal cancer screening and screening for other cancers, estimates that are largely based on self-reports. We examined sensitivity as a measure of underreporting and examined both specificity and positive predictive value as measures of overreporting.
We found that estimates of self-report accuracy varied substantially from study to study, making it difficult to arrive at stable estimates of self-report accuracy for different procedures and accuracy measures. Study heterogeneity could be the result of many different factors related to sample socioeconomic and cultural characteristics, the context in which self-reports were elicited (e.g., survey mode and location, wording of question), and the completeness and accuracy of documentation of true screening histories, which are functions of both the strategy for documenting procedures and, equally importantly, the thoroughness with which that strategy is implemented. Details on the strategy and implementation of documenting cancer-screening procedures were scant, and determining the intensity of documentation was generally not possible. The quality of the sources and strategies used to document survey answers might influence conclusions about self-report accuracy in this meta-analysis (67-69). In particular, we thought that studies with a more rigorous documentation strategy would yield higher (and more accurate) specificity and positive predictive value estimates; however, there was little evidence to suggest that this was true. It is likely that some of the heterogeneity in results was due to unmeasured differences in the quality of documentation.
Some validation studies might be conducted to support the results of a larger study rather than to produce generalizable information about self-report accuracy. For such studies, publication bias is likely because there would be less incentive to publish low-accuracy estimates that reflected poorly on the implementation of the larger study. With this in mind, we excluded, from our analysis, validation studies that confirmed only the accuracy of positive self-report histories (from which only positive predictive values could be estimated). Publication bias was stronger for the 15 excluded study subgroups with Pap smear results than was observed in our main analysis, but there was no evidence of publication bias for the 14 excluded study subgroups with mammography results (results not shown).
Relative to underreporting, overreporting seemed to be a greater concern for most of the screening procedures examined. Overreporting for both mammography and Pap smear has been shown to be related to forward telescoping of dates, when an event is remembered as being more recent than it actually was (42, 61). In addition, Pap smears may be overreported if women mistake a routine gynecologic exam without a Pap test as including a Pap test. Incomplete documentation of a test in the medical record would cause a true positive screening history to be classified as a false positive, thereby biasing specificity estimates downward. Specificity for both CBE and DRE is likely to be biased downward by this mechanism, because these exams do not generate a separate report and are not reimbursed well, further reducing the incentive to document. For this reason, determining the true extent of overreporting of these two tests may not be possible using current documentation strategies. The reason for overreporting of colorectal endoscopies is not clear but may be a reflection of a general tendency for respondents to overreport socially desirable behaviors.
Underreporting seems to be infrequent for breast and cervical cancer-screening self-reports, but more of a concern for prostate and colorectal screening. The relatively poor sensitivity for PSA testing (compared with that for other procedures) may be due to the lack of saliency for the patient. PSA testing does not require a separate appointment; the only cue to the patient is a blood draw indistinguishable from any other blood draw, and men may not always be told that one purpose of the blood draw is for PSA testing. In contrast, colorectal endoscopy is a highly salient event because of the preparation required on the part of the patient and the invasiveness of the procedure. Nonetheless, underreporting of colorectal endoscopy seemed to be similar in magnitude to the other two colorectal exams (FOBT and DRE). Unlike breast and cervical screening, prostate cancer screening is unique to men, whereas colorectal screening is obtained by both men and women. It is possible that the greater underreporting for these tests is due to a greater tendency of men to underreport cancer-screening histories, although we found no evidence of this in either stratified analyses or meta-regressions (results not shown). Underreporting for PSA and DRE could also be a function of embarrassment, although this has not been studied to our knowledge. Because sensitivity is calculated only for positively documented histories, subjects with incomplete documentation are excluded from the calculation of sensitivity. If the true sensitivity was differentially higher among individuals with incomplete documentation of their screening history, this would artificially lower the observed sensitivity, but this pattern of misclassification seems unlikely. For this reason, incomplete documentation in the medical record is not a likely explanation for low sensitivity of prostate and colorectal screening self-reports.
African American and Hispanic women seemed to have a greater tendency to overreport mammogram and Pap smear histories compared with Whites, although the number of studies compared was small and race/ethnicity–stratified estimates were unstable for all of the procedures we examined. The apparently greater overreporting observed in Black and Hispanic versus White subgroups could be partly due to differences in documentation quality or survey context. For example, being part of an intervention program, being interviewed face-to-face, and being interviewed in the clinic, all of which might be associated with overreporting, were all somewhat more common among predominantly Black than among predominantly White subgroups. Although not evident from our meta-analysis, another possibility is that documentation quality is lower at facilities where minority patients are more likely to receive health care. Differences in self-report accuracy may also be partly a consequence of variability in the comprehension of survey questions (70).
Two studies of racial/ethnic differences in self-report accuracy that were not a part of this meta-analysis deserve mention. First, a validation study by McPhee et al. (42), which was excluded from this meta-analysis because it was limited to confirming positive self-reports, found that African American, Hispanic, Chinese, and Filipino women were all more likely than White women to overreport both mammogram and Pap smear histories and that forward telescoping of dates was more common among ethnic minority women. In addition, these differences in self-report accuracy remained after controlling for type of facility. Second, in a study of mammography use (estimated using both claims and survey data) among older Medicare enrollees, the authors found that the survey data failed to identify racial/ethnic disparities in screening that were apparent from analyses using claims data (71). These two studies, in combination with our meta-analysis, suggest that racial and ethnic minority women are more likely to overreport mammography (and perhaps Pap smear histories) when compared with White women. The data are less clear for other screening procedures.
In addition to the apparent differences in overreporting cancer-screening histories, Hispanic individuals may have a greater tendency than Whites to underreport mammogram and colorectal screening histories, according to our meta-analysis. These are two highly invasive and salient procedures. Cultural differences may cause Hispanics to underreport these procedures out of embarrassment or privacy concerns. Alternatively, it may be that underreporting by Hispanics is a more general phenomenon that was easier to detect in these two generally well-documented procedures; differences in sensitivity could be masked by less complete documentation of other procedures.
Implications for Clinicians
According to this analysis, nearly half of self-reported positive colorectal screening histories and approximately one fifth of self-reported positive breast and cervical cancer-screening histories are likely to be negative. During a doctor's visit, individuals may be even more likely to overreport their cancer-screening histories because of social desirability influences arising from face-to-face contact in a clinic setting. There was some evidence for this in our analyses, although the number of clinic-based validation studies was small. Positive predictive value for mammography, which was 0.79 across all studies in our analysis, was only 0.66 for clinic-based interviews. Given the apparent extent of overreporting of cancer-screening histories, clinicians should not rely on patient reports when making recommendations for future screening. Reliance on patient self-reports could create a barrier to implementation of timely follow-up screening.
Implications for Tracking National Utilization
The Healthy People 2010 goals call for increasing the percentage of women adhering to national cancer-screening guidelines. Healthy People 2010 calls for an increase, by 2010, in Pap smear utilization in the preceding 3 years from 92% to 97%, mammography in the preceding 2 years from 67% to 70%, annual FOBT from 35% to 50%, and colorectal endoscopy in the preceding 5 years from 37% to 50% (6). Results from this meta-analysis indicate that we are probably further from these goals than survey data suggest. Another broad goal of Healthy People 2010 is the reduction of disparities in health and health care utilization. Again, according to this meta-analysis, disparities in cancer screening by race/ethnicity are likely to be larger than they seem to be in national survey data. These inaccuracies need to be taken into account when interpreting progress toward the Healthy People 2010 goals of increasing utilization and reducing disparities. Because the NHIS is the major source of data on cancer screening used for tracking prevalence in the U.S. population, validation studies should be undertaken for a sample of respondents within the NHIS, and designed with enough power to detect meaningful differences in sensitivity and specificity for different racial/ethnic and socioeconomic groups.
Grant support: National Cancer Institute grant 5 P50 CA 106743.