Abstract
The Gail model and the model developed by Tyrer and Cuzick are two questionnaire-based approaches with demonstrated ability to predict development of breast cancer in a general population.
We compared calibration, discrimination, and net reclassification of these models, using data from questionnaires sent every 2 years to 76,922 participants in the Nurses' Health Study between 1980 and 2006, with 4,384 incident invasive breast cancers identified by 2008 (median follow-up, 24 years; range, 1–28 years). In a random one third sample of women, we also compared the performance of these models with predictions from the Rosner–Colditz model estimated from the remaining participants.
Both the Gail and Tyrer–Cuzick models showed evidence of miscalibration (Hosmer–Lemeshow P < 0.001 for each) with notable (P < 0.01) overprediction in higher-risk women (2-year risk above about 1%) and underprediction in lower-risk women (risk below about 0.25%). The Tyrer–Cuzick model had slightly higher C-statistics both overall (P < 0.001) and in age-specific comparisons than the Gail model (overall C, 0.63 for Tyrer–Cuzick vs. 0.61 for the Gail model). Evaluation of net reclassification did not favor either model. In the one third sample, the Rosner–Colditz model had better calibration and discrimination than the other two models. All models had C-statistics <0.60 among women ages ≥70 years.
Both the Gail and Tyrer–Cuzick models had some ability to discriminate breast cancer cases and noncases, but have limitations in their model fit.
Refinements may be needed to questionnaire-based approaches to predict breast cancer in older and higher-risk women.
Introduction
Breast cancer prediction rules, based solely on questionnaire information without data from biomarkers or mammograms, can be implemented noninvasively and at minimal cost in large populations. Although these prediction rules have limitations in their overall ability to distinguish women who will and will not develop breast cancer (1–3), they have been utilized for risk stratification for chemoprevention and screening protocols (4–6).
Information on the relative performance of the alternative risk models in general populations is still somewhat limited, with available evidence indicating modest concordance in risk classification and limited discrimination in external validation (3, 7). Perhaps the two most widely evaluated models that do not require biomarker or mammographic data are the Breast Cancer Risk Assessment Tool (BCRAT) developed by Gail and colleagues (8–12) and the International Breast Cancer Intervention Study (IBIS) risk score developed by Tyrer and Cuzick (13). Explicit comparisons of discrimination, calibration, and classification performance between these two models have used selected populations of higher-risk women enriched for family history or risk factors such as high rates of delayed childbirth (1, 14, 15). Further, these studies included relatively small numbers of breast cancer cases (<250 in each study), limiting the ability to evaluate the accuracy of classification of women across a wide range of clinical risk categories. All three found better calibration and discrimination with the Tyrer–Cuzick model relative to the Gail model. The impact of enrichment of the study populations with women who have a positive family history is unclear.
In this article, we compare metrics of model performance, including calibration, discrimination, and ability to reclassify cases into higher clinical risk categories and noncases into lower-risk categories (net reclassification indices) between the Gail and Tyrer–Cuzick models in the broad population of U.S. nurses participating in the Nurses' Health Study, including a higher percentage of women at average risk. Also, we compare the performance of these models with that of an updated version of the alternative Rosner–Colditz risk prediction model, as developed in a sample of participants in the Nurses' Health Study, and evaluated in an independent sample (16–19).
Materials and Methods
The Nurses' Health Study cohort was established in 1976 when 121,701 female registered nurses ages 30 to 55 years responded to a mailed questionnaire inquiring about risk factors for breast cancer, including reproductive factors, menopausal hormone therapy use, anthropometric variables, benign breast disease, and family history of breast cancer. The risk factor data have been updated by means of repeat questionnaires sent every 2 years up to the present time (20).
Alcohol consumption, both current and at age 18 years, was ascertained in 1980, with information updated in 1984, and then every 4 years from 1986 to 2006. Measures of family history of breast and ovarian cancer, utilized in the Tyrer–Cuzick model, were assessed at several times during follow-up (21). Information on breast cancer in a woman's mother and the number of her sisters with breast cancer was collected first in 1976, then updated in 1982, 1988, 1992, 1996, 2000, and 2004, with updates on age at diagnosis for each in 1996, 2000, and 2004. Women were asked about breast cancer in their maternal and paternal grandmothers in 1988; in their daughters in 2000 and 2004; and about ovarian cancer in their mothers and sisters in 1992, 1996, and 2000, and in their daughters in 2004.
Identification of breast cancer cases
On each questionnaire, women were asked whether breast cancer had been diagnosed and, if so, the date of diagnosis. All women (or their next of kin, if deceased) were contacted for permission to review their medical records so as to confirm the diagnosis. Cases of invasive breast cancer from 1980 to 2008 for which we had a pathology report were included in these analyses. We excluded women with types of menopause other than natural menopause or bilateral oophorectomy, because their true age at menopause and menopausal status could not be determined. We also excluded women with prevalent cancer (other than nonmelanoma skin cancer) in 1980 and women with missing data for weight at age 18 years, age at first birth, parity, age at menarche, age at menopause, or menopausal hormone therapy use.
During follow-up from 1980 to 2006 of 76,922 women with complete data on baseline risk factors (768,948 2-year intervals), 4,384 women developed invasive breast cancer. We censored women who developed another type of cancer (except nonmelanoma skin cancer) at their diagnosis date.
Analysis
All estimates of risk from the Gail, Tyrer–Cuzick, and Rosner–Colditz models used 2-year risk windows. This was expected to maximize predictive performance, as all models used time-varying covariates which were updated at 2-year intervals. Thus, for a woman still cancer free at the beginning of a follow-up interval, her risk over the subsequent 2 years was estimated based on her risk factor profile at that time. For variables not updated at each questionnaire, including family history and alcohol use information, we carried forward responses from prior questionnaires. This approach parallels previous strategies used to evaluate time-varying risk (22–24).
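The carry-forward rule for covariates that were not collected on every questionnaire can be illustrated with a minimal sketch. This is not the study's code: the function name `carry_forward`, the toy questionnaire years, and the intake values are assumptions made only for this example.

```python
from typing import Dict, List, Optional


def carry_forward(responses: Dict[int, Optional[float]],
                  interval_starts: List[int]) -> Dict[int, Optional[float]]:
    """For each interval start year, return the most recent non-missing
    response reported at or before that year (None if nothing reported yet)."""
    filled: Dict[int, Optional[float]] = {}
    last: Optional[float] = None
    reported_years = sorted(responses)
    idx = 0
    for start in sorted(interval_starts):
        # Consume every questionnaire returned up to this interval start.
        while idx < len(reported_years) and reported_years[idx] <= start:
            value = responses[reported_years[idx]]
            if value is not None:
                last = value
            idx += 1
        filled[start] = last
    return filled


# Hypothetical alcohol intake (g/day) reported in 1980 and 1984 only.
alcohol = {1980: 5.0, 1984: 10.0}
intervals = [1980, 1982, 1984, 1986]
filled = carry_forward(alcohol, intervals)
print(filled)  # {1980: 5.0, 1982: 5.0, 1984: 10.0, 1986: 10.0}
```

Each 2-year interval then carries the covariate value known at its start, which is the profile a risk model would see when predicting risk over the following 2 years.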
Rockhill and colleagues (25) previously evaluated the fit and discriminatory ability of the BCRAT model in the Nurses' Health Study, based on data from 1992 through 1997. We used the BRCa_RAM SAS macro developed by the Division of Cancer Epidemiology and Genetics at the National Cancer Institute (http://dceg.cancer.gov/tools/risk-assessment/bcrasasmacro) to estimate a woman's Gail model risk of developing breast cancer over a 2-year period, separately for every 2-year interval with updated risk factor information, beginning in 1980 and continuing as long as a woman was alive, reporting risk factor information, and free of breast cancer and other cancer types except nonmelanoma skin cancer. The variables in the Gail model and their assessment in the Nurses' Health Study are described in Supplementary Table S1. As in Rockhill and colleagues (25), the presence of hyperplasia was coded as missing because this variable was only assessed in a small group of participants in the Nurses' Health Study. Although imputation of hyperplasia status can be useful, we chose not to apply imputation models that include the outcome (breast cancer development); such imputation has been found to have only a small impact on the C-statistic for prediction (26). Also, we were able to classify women at the beginning of an interval only with regard to ever/never history of previous benign breast biopsy, rather than 0, 1, or ≥2 biopsies as specified in the Gail model.
We also estimated a woman's 2-year risk of breast cancer, separately for each of the time intervals she contributed to the analysis based on her updated information from the Tyrer–Cuzick model, as implemented from a command line version downloaded from http://www.ems-trials.org/riskevaluator, as directed by a personal communication from the authors. Variables included in the Tyrer–Cuzick and Gail models and their assessment in the Nurses' Health Study are described in Supplementary Table S1. As for the Gail model, we set to missing the indicators of hyperplasia status and also did not have information on a woman's Ashkenazi heritage, her expected future duration of hormone therapy, bilaterality of breast cancer in relatives, or on her genetic testing or that of her relatives. We also invoked the model's missing data option for family history variables in a woman's second- or third-degree relatives (except for available information on grandmothers which was utilized).
Evaluation of calibration of the models compared observed and expected risks within deciles of predicted risks for each of the Tyrer–Cuzick and Gail models. The unit of analysis for these comparisons was the observed and predicted outcome within a 2-year interval. We used the large sample confidence interval (CI) for the ratio of expected to observed events based on log transformation of this ratio and the delta method, as previously applied by Park and colleagues (27). Consistent with this CI, we used the Z-statistic defined as log(E/O)/sqrt(1/O) to test the null hypothesis that the expected to observed ratio (E/O) was equal to 1 within a decile of predicted risk. In addition to decile-specific expected to observed event ratios and their CIs, we used the Hosmer–Lemeshow test statistic as an indicator of calibration. Graphical display of the observed versus expected numbers of cases within each decile of risk included 95% CIs for the observed count, with use of a log transformation for variance stabilization, as above. Subgroup analyses evaluated calibration for each model separately using intervals in women age <50, 50–59, 60–69, and ≥70 when the interval started.
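The E/O ratio, its log-scale CI, and the Z-statistic defined above can be written out in a few lines. The following is a minimal sketch rather than the analysis code used in the study; the function name `eo_ratio_ci` and the example counts are hypothetical, and the standard error sqrt(1/O) is taken directly from the Z-statistic given in the text.

```python
import math
from typing import Tuple


def eo_ratio_ci(expected: float, observed: int,
                z_crit: float = 1.96) -> Tuple[float, float, float, float]:
    """Return (E/O, lower CI limit, upper CI limit, Z-statistic).

    The CI is built on the log scale with SE[log(E/O)] = sqrt(1/O),
    matching the Z-statistic log(E/O)/sqrt(1/O) described in the text.
    """
    if observed <= 0 or expected <= 0:
        raise ValueError("expected and observed counts must be positive")
    ratio = expected / observed
    se_log = math.sqrt(1.0 / observed)
    lower = ratio * math.exp(-z_crit * se_log)
    upper = ratio * math.exp(z_crit * se_log)
    z_stat = math.log(ratio) / se_log
    return ratio, lower, upper, z_stat


# Hypothetical decile: 60 cases expected by the model, 50 observed.
ratio, lower, upper, z_stat = eo_ratio_ci(expected=60.0, observed=50)
print(f"E/O = {ratio:.2f}, 95% CI {lower:.2f} to {upper:.2f}, Z = {z_stat:.2f}")
```

A CI that excludes 1 (equivalently, |Z| above the chosen critical value) indicates over- or underprediction within that decile.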
We also compared discrimination between the two models, both overall and within age groups, with age defined at the beginning of each 2-year interval. Overall, age-adjusted, and age-specific C-statistics and their standard errors were estimated and compared between models using the approach of Rosner and Glynn (28).
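For readers less familiar with the C-statistic, the quantity being compared can be sketched as the proportion of case/noncase pairs in which the case received the higher predicted risk. The example below is only an illustration of that definition on toy data; it does not implement the standard errors or between-model tests of Rosner and Glynn (28), and the function name `c_statistic` and the risk values are assumptions.

```python
from typing import Sequence


def c_statistic(risks: Sequence[float], events: Sequence[int]) -> float:
    """Proportion of case/noncase pairs in which the case has the higher
    predicted risk; tied pairs count one half (Mann-Whitney form)."""
    case_risks = [r for r, e in zip(risks, events) if e == 1]
    noncase_risks = [r for r, e in zip(risks, events) if e == 0]
    if not case_risks or not noncase_risks:
        raise ValueError("need at least one case and one noncase")
    concordant = 0.0
    for rc in case_risks:
        for rn in noncase_risks:
            if rc > rn:
                concordant += 1.0
            elif rc == rn:
                concordant += 0.5
    return concordant / (len(case_risks) * len(noncase_risks))


# Toy 2-year risks for six intervals; 1 marks an interval with a case.
risks = [0.002, 0.004, 0.006, 0.008, 0.010, 0.012]
events = [0, 0, 1, 0, 1, 1]
c = c_statistic(risks, events)
print(round(c, 3))  # 8 of 9 pairs are concordant
```

A value of 0.5 corresponds to no discrimination and 1.0 to perfect separation of cases from noncases. The pairwise loop is quadratic in sample size, so a rank-based formula would be preferable for data on the scale of this cohort.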
To evaluate risk reclassification based on alternative models, we used four a priori–chosen absolute 2-year risk categories suggested by Tice and colleagues (29): 0–<0.4%; 0.4–<0.67%; 0.67–<1.0%; and ≥1.0%. Following recommendations of Kerr and colleagues (30), we report reclassification percentages separately for breast cancer cases and noncases, again with 2-year time windows as the unit of analysis. Additional subgroup analyses considered risk reclassification separately among intervals in each of the four age groups defined above. As additional subgroup analyses, we considered calibration, discrimination, and reclassification in intervals among women with a family history of breast cancer in a first-degree relative.
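The category assignment and the reclassification percentages can be sketched as follows. This is a hedged illustration, not the study's implementation: the cutpoints are the four categories listed above, while the function names and the toy risks are hypothetical. In line with the recommendation of Kerr and colleagues (30), the comparison would be run once on intervals with a case and once on intervals without.

```python
from bisect import bisect_right
from typing import Sequence, Tuple

# Category boundaries for 2-year risk, in percent, as listed in the text:
# 0-<0.4, 0.4-<0.67, 0.67-<1.0, and >=1.0.
CUTPOINTS = [0.4, 0.67, 1.0]


def risk_category(risk_pct: float) -> int:
    """Map a 2-year risk (percent) to category 0..3; lower bounds are inclusive."""
    return bisect_right(CUTPOINTS, risk_pct)


def pct_moved_up_down(model_a: Sequence[float],
                      model_b: Sequence[float]) -> Tuple[float, float]:
    """Percent of intervals that model B places in a higher / lower
    category than model A."""
    if len(model_a) != len(model_b) or not model_a:
        raise ValueError("inputs must be non-empty and of equal length")
    n = len(model_a)
    pairs = [(risk_category(a), risk_category(b)) for a, b in zip(model_a, model_b)]
    up = sum(1 for ca, cb in pairs if cb > ca)
    down = sum(1 for ca, cb in pairs if cb < ca)
    return 100.0 * up / n, 100.0 * down / n


# Hypothetical 2-year risks (percent) for four case intervals under two models.
model_a_cases = [0.30, 0.50, 0.80, 1.20]
model_b_cases = [0.45, 0.50, 0.60, 1.50]
up_pct, down_pct = pct_moved_up_down(model_a_cases, model_b_cases)
net_cases = up_pct - down_pct  # net reclassification among cases
print(up_pct, down_pct, net_cases)
```

Among cases, a positive net value favors model B; among noncases, the sign convention reverses, because moving a noncase to a lower category is the desirable direction.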
We also compared calibration and discrimination of the Gail and Tyrer–Cuzick models with that of the Rosner–Colditz model. Estimates of the parameters of the Rosner–Colditz model were obtained using all available study time in a two third random sample of study participants, and its calibration and discrimination were evaluated in the other third of the study population over the same time period from 1980 until 2008 (19). Herein, we also use this one third sample of the study population to compare calibration and discrimination of the Gail and Tyrer–Cuzick models with that of the Rosner–Colditz model.
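A split of this kind is made at the level of participants rather than intervals, so that all 2-year intervals from one woman fall in the same sample. The sketch below illustrates the idea under stated assumptions: the function name `split_participants`, the seed, and the toy identifiers are hypothetical, and the study's actual randomization procedure is not described in the text.

```python
import random
from typing import Iterable, Set, Tuple


def split_participants(ids: Iterable[int],
                       train_fraction: float = 2 / 3,
                       seed: int = 0) -> Tuple[Set[int], Set[int]]:
    """Randomly assign participant IDs to an estimation sample and a
    held-out validation sample; the seed makes the split reproducible."""
    rng = random.Random(seed)
    shuffled = list(ids)
    rng.shuffle(shuffled)
    n_train = round(len(shuffled) * train_fraction)
    return set(shuffled[:n_train]), set(shuffled[n_train:])


# Nine hypothetical participant IDs: six for estimation, three for validation.
estimation_ids, validation_ids = split_participants(range(9))
print(len(estimation_ids), len(validation_ids))  # 6 3
```

Intervals would then be routed to a sample by looking up the participant ID, which keeps repeated observations on the same woman out of both estimation and validation.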
Results
In the 768,948 2-year intervals during the time period from 1980 to 2008, 4,384 women developed incident, invasive breast cancer for an average 2-year risk of 0.57%. Supplementary Table S2 compares distributions of characteristics at the beginning of intervals among all women, those with a history of breast cancer in a first-degree relative, and those who developed breast cancer during that interval.
Overall, both the Gail model and the Tyrer–Cuzick model slightly overestimated the number of incident breast cancer cases in the Nurses' Health Study. Specifically, the average 2-year predicted risk from the Gail model was 0.60%, and this model predicted 5% more cases than observed (95% CI, 2%–8%; Table 1). The average 2-year predicted risk from the Tyrer–Cuzick model was 0.62%, and this model predicted 9% more cases than observed (95% CI, 5%–12%; Table 1). However, agreement between observed and predicted numbers of cases varied substantially according to predicted risk. Both models substantially underestimated the number of cases in the lowest decile of their predicted risk (24% fewer expected cases than observed for the Gail model and 19% fewer expected cases than observed for the Tyrer–Cuzick model). Conversely, both models substantially overestimated the number of cases in the highest decile of their predicted risk (40% more expected than observed for the Gail model and 34% more expected than observed for the Tyrer–Cuzick model). Graphical comparisons of observed versus expected counts illustrated these differences but showed good agreement for predictions within deciles 2 to 9 of each model (Fig. 1A and B). For both models, the Hosmer–Lemeshow test of the null hypothesis that the model is adequately calibrated was highly significant, suggesting some miscalibration.
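As a rough consistency check on the overall figures, the expected to observed ratios can be approximated from the rounded average risks, and the CI for the Gail model from the log-transformation approach described in the Methods. Because this sketch uses rounded published values, small discrepancies from the reported figures are to be expected.

```latex
\begin{align*}
\left.\frac{E}{O}\right|_{\text{Gail}} &\approx \frac{0.60\%}{0.57\%} \approx 1.05, \\
\left.\frac{E}{O}\right|_{\text{Tyrer--Cuzick}} &\approx \frac{0.62\%}{0.57\%} \approx 1.09, \\
\text{95\% CI (Gail)} &\approx 1.05 \times \exp\!\left(\pm \frac{1.96}{\sqrt{4384}}\right) \approx (1.02,\ 1.08).
\end{align*}
```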
Separate analyses of calibration for the two models restricted to women within each of four age groups (<50, 50–59, 60–69, and ≥70) found evidence for miscalibration of each model within each age group (Supplementary Tables S3–S6). In particular, underprediction of risk was noted for both models among lower-risk women younger than 50, and overprediction of risk was seen in higher-risk women in the two age groups age 60 or above.
Discrimination, as measured by the C-statistic, was better for the Tyrer–Cuzick model (0.629) than for the Gail model (0.608; Table 2). When discrimination was examined separately in each of four age groups, discrimination was slightly better by the Tyrer–Cuzick model in each age group. A weighted average of the age-specific C-statistics, which somewhat adjusts for age, found lower C-statistics from each model (0.600 for the Tyrer–Cuzick model and 0.574 for the Gail model).
A comparison of the ability to reclassify cases into meaningfully higher-risk groups and noncases into meaningfully lower-risk groups found different conclusions for these two comparisons (Table 3). The Tyrer–Cuzick model reclassified 27.3% of incident cases into a higher-risk category than the Gail model, whereas the Gail model reclassified 15.1% of cases into a higher-risk category than the Tyrer–Cuzick model, for a net reclassification of cases of 12.2%. Conversely, the Gail model reclassified 22.4% of noncases into a lower-risk category than the Tyrer–Cuzick model, whereas the Tyrer–Cuzick model reclassified 16.2% of noncases into a lower-risk category, for a net reclassification of noncases of 6.2%. Some heterogeneity in this reclassification pattern was observed when reclassification was evaluated separately in each of four age groups (Supplementary Tables S7–S10). Specifically, in the three younger age groups (women under age 70), the Tyrer–Cuzick model reclassified a higher percentage of cases to a higher-risk category and the Gail model reclassified a higher percentage of noncases to a lower-risk category. For women age ≥70, in contrast, the Gail model reclassified a higher percentage of cases to a higher-risk category, and the Tyrer–Cuzick model reclassified a higher percentage of noncases to a lower-risk category.
In additional subgroup analyses of intervals in women who had a family history of breast cancer (Supplementary Tables S11–S13), risk remained overestimated among women in the highest risk groups for both models. The magnitude of overestimation was greater in this subgroup than observed in the whole population (47% and 50% more expected than observed for the Gail and Tyrer–Cuzick models, respectively; Supplementary Table S11 and Supplementary Figs. S1A and S1B). Discrimination remained better for the Tyrer–Cuzick model relative to the Gail model, but it was overall weaker for both models in this restricted population relative to the results in the entire cohort. As for the overall analyses, the Tyrer–Cuzick model reclassified more cases to higher-risk categories, whereas the Gail model reclassified more noncases to lower-risk categories among women with a family history.
In the one third sample of women set aside for validation of the refitted Rosner–Colditz model, 1,418 incident breast cancer cases occurred in 254,767 2-year intervals for a 2-year risk of 0.56%. In this validation sample, the Rosner–Colditz model had an average 2-year predicted risk of 0.58% (Table 4), which yielded an overall ratio of expected to observed numbers of events of 1.04 (95% CI, 0.98–1.09). Overall, calibration of the Rosner–Colditz model was adequate in this independent sample (Hosmer–Lemeshow χ2 P = 0.18). Within this validation sample, both the Gail and Tyrer–Cuzick models showed the same patterns seen in the entire dataset, with fewer predicted than observed events in the lowest-risk decile and more predicted than observed events in the highest-risk decile (Table 4).
Comparisons of model discrimination within the one third validation sample showed that the Rosner–Colditz model had a higher overall C-statistic than the Gail model (0.65 vs. 0.60) and also higher than the Tyrer–Cuzick model (0.65 vs. 0.63; Table 5). As seen for the other two models in the entire dataset, the Rosner–Colditz model also had the weakest age group–specific discrimination among women age 70 years or older (0.59).
Discussion
We used data from 26 years of experience in the Nurses' Health Study to compare the performance of alternative simple models, based only on information obtained from questionnaires, to predict the occurrence of invasive breast cancer. Overall, we confirmed that each of the Gail, Tyrer–Cuzick, and Rosner–Colditz models has only moderate ability to predict breast cancer (1–3, 13–15). New findings from our study include evidence of miscalibration in the Gail and Tyrer–Cuzick models, especially among women in the lowest- and highest-risk groups, better reclassification of cases to higher-risk categories by the Tyrer–Cuzick model relative to the Gail model, and better reclassification of noncases to lower-risk categories by the Gail model relative to the Tyrer–Cuzick model.
Additional testing, including measures of mammographic density and testing for relevant genetic variation, can somewhat improve model discrimination (29, 31–34). Addition of mammographic density and risk factor–based prediction models could be easily accommodated, with appropriate referral of women according to level of risk to consider chemoprevention or lifestyle changes (e.g., weight loss, physical activity). SNP assessment and polygene score generation are not yet routine and still face hurdles before integration into a routine breast cancer risk assessment at first screening mammogram. Other costly and logistically complex measures such as endogenous hormones improve prediction (measured by the C-statistic) in the Rosner–Colditz model by about 5%, but only in analyses restricted to postmenopausal women not using postmenopausal hormones at blood collection (35). Also, although models including only information from questionnaires are probably not sensitive enough to excuse a woman from screening on the basis of a low predicted risk, they are explicitly used in cross-national guidelines to direct clinical decisions (4–6, 36).
Three previous studies made direct comparisons of predictions from the Gail and Tyrer–Cuzick models, each conducted in study populations enriched for family history or risk factors such as delayed childbirth (1, 14, 15). In all three, the Gail model was found to underestimate risk (as indexed by a ratio of expected to observed events significantly below 1), whereas the CI for the expected to observed ratio from the Tyrer–Cuzick model included 1 for each. Further, each of these comparisons found better discrimination (as indexed by higher C-statistics) from the Tyrer–Cuzick relative to the Gail model. However, the relatively small number of incident cases included in each of these studies (<250) limited the power to detect deviations between observed and expected event counts, especially within deciles of risk such as the lowest- and highest-risk women. Further, oversampling of high-risk women, and particularly those with a positive family history, may have favored the performance of the Tyrer–Cuzick model which particularly focuses on this component of risk.
Our study among a larger population spanning all levels of risk agreed with this previous literature in finding slightly better discrimination with the Tyrer–Cuzick relative to the Gail model, and extended the previous work by showing that discrimination under the Tyrer–Cuzick model was slightly better within each of four age groups. We also extended previous work by finding decreased discrimination under both models in older women. In contrast to previous studies, we found evidence for miscalibration of both models, and that predicted risks differed from observed risks particularly in the lowest- and highest-risk women. Specifically, both models underestimated risk among women in their lowest predicted decile of risk, and overestimated risk among women in their highest predicted decile of risk, particularly among women with a family history of breast cancer. With respect to risk reclassification across established categories of clinical risk, we found that the Tyrer–Cuzick model was more likely to reclassify women who developed breast cancer during the 2-year interval to a higher-risk category, but the Gail model was more likely to reclassify women who did not develop breast cancer to a lower-risk category. These overall patterns of risk reclassification were different among women age 70 or older. Even when reclassification is separately considered among cases and noncases, interpretation of these indices is problematic when models exhibit some level of miscalibration (37).
Relative to an evaluation of a previous version of the Gail model performed within the Nurses' Health Study at a time when no women were age 75 or older, and hence average breast cancer risk was lower (25), we found a slightly higher level of discrimination [C-statistic 0.61 (95% CI, 0.60–0.62), compared with 0.58 (95% CI, 0.56–0.60) in Rockhill and colleagues (25)]. Consistent with that report, we found the ratio of expected to observed cases under the Gail model to be less than 1 for lower-risk women and greater than 1 for higher-risk women, but the magnitude of this heterogeneity was greater in our updated analysis (ranging from 0.76 in the lowest decile to 1.40 in the highest decile of predicted risk, as seen in Table 1). Also, although Rockhill and colleagues observed that the risk among women in the highest decile of estimated risk was 2.83 times that of women in the lowest decile, the corresponding relative risk in the current analysis was 3.95 (Table 1). These trends likely reflect the greater range of risks corresponding to the wider age range in our updated data.
Our comparison of the Tyrer–Cuzick and Gail models with the Rosner–Colditz model in a separate sample of Nurses' Health Study participants found better discrimination and calibration in the Rosner–Colditz model. These three models include several common variables, but also involve different parameterizations of some of these variables, including interactions involving menopausal status in the Rosner–Colditz model. The models also include some different variables, such as extended family history information in the Tyrer–Cuzick model and consideration of alcohol consumption history and more details on postmenopausal hormone therapy in the Rosner–Colditz model. Although the Nurses' Health Study has maintained a focus on risk factors for breast cancer since its inception, several components of the Gail and Tyrer–Cuzick models were not measured. Also, key variables including measures of family history were not updated at each questionnaire. Although the unmeasured components were not highly prevalent characteristics, their unavailability somewhat limited our comparisons. It is likely that a small group of women had their risk of breast cancer underestimated because of this missing information, but overall risk in the entire study population was slightly but significantly overestimated by both the Tyrer–Cuzick and Gail models. A future question is whether simpler models are possible that would attain nearly equivalent performance in prediction and be more easily integrated into routine breast health services. Considerable effort is currently underway to improve simple models, while limiting the burden of data collection to maximize participation and enhance generalizability (38–40).
In summary, our comparison of three readily implemented risk prediction rules for breast cancer found somewhat better discrimination in the Rosner–Colditz model. We also saw evidence for miscalibration of the Gail and Tyrer–Cuzick models, particularly among the highest- and lowest-risk women in the Nurses' Health Study. The Rosner–Colditz model includes more variables, which take longer to assess. For women in the extreme deciles of risk, prediction from the Rosner–Colditz model is somewhat more accurate than prediction from the Tyrer–Cuzick and Gail models.
Disclosure of Potential Conflicts of Interest
No potential conflicts of interest were disclosed.
Authors' Contributions
Conception and design: R.J. Glynn, G.A. Colditz, R.M. Tamimi, W.W. Willett, B. Rosner
Development of methodology: R.J. Glynn, G.A. Colditz
Acquisition of data (provided animals, acquired and managed patients, provided facilities, etc.): G.A. Colditz, W.W. Willett
Analysis and interpretation of data (e.g., statistical analysis, biostatistics, computational analysis): R.J. Glynn, G.A. Colditz, W.Y. Chen, S.E. Hankinson, B. Rosner
Writing, review, and/or revision of the manuscript: R.J. Glynn, G.A. Colditz, R.M. Tamimi, W.Y. Chen, S.E. Hankinson, W.W. Willett, B. Rosner
Administrative, technical, or material support (i.e., reporting or organizing data, constructing databases): R.J. Glynn, R.M. Tamimi
Study supervision: R.J. Glynn, B. Rosner
Acknowledgments
This project was funded by a cohort infrastructure grant (UM1 CA186107), and a program project grant (P01 CA87969) from the National Cancer Institute.