Abstract
Mammographic density is a strong risk factor for breast cancer and is reported clinically as part of Breast Imaging Reporting and Data System (BI-RADS) results issued by radiologists. Automated density assessment is needed for both full-field digital mammography (FFDM) and digital breast tomosynthesis (DBT), as both types of examinations are acquired in standard clinical practice. We trained a deep learning model to automate the estimation of BI-RADS density using a prospective Washington University clinic-based cohort of 9,714 women, entering the cohort in 2013 with follow-up through October 31, 2020. The cohort included 27% non-Hispanic Black women. The trained algorithm was assessed in an external validation cohort of 18,360 women screened at Emory beginning January 1, 2013, and followed through December 31, 2020, which included 42% non-Hispanic Black women. Our model-estimated BI-RADS density demonstrated substantial agreement with the density assessed by radiologists. In the external validation, agreement with radiologists was 81% for category B and 77% for category C in FFDM and 83% for category B and 74% for category C in DBT, demonstrating clear separation of women with dense breasts. We obtained a Cohen’s κ of 0.72 (95% confidence interval, 0.71–0.73) in FFDM and 0.71 (95% confidence interval, 0.69–0.73) in DBT. We provide a consistent and fully automated BI-RADS density estimation for both FFDM and DBT using a deep learning model. The software can be easily implemented anywhere for clinical use and risk prediction.
Prevention Relevance: The proposed model can reduce interobserver variability in BI-RADS density assessment, thereby providing more standard and consistent density assessment for use in decisions about supplemental screening and risk assessment.
Introduction
In the era of precision prevention and tailored screening, there is an increasing emphasis on getting the right prevention to the right women at the right time and tailoring screening examination protocols to women based on individual risk. Underpinning this approach is the need for accurate risk assessments that can be generated and delivered in real time in the clinic (1). Strong evidence shows that adding mammographic density to breast cancer risk prediction models improves their performance (2, 3). A systematic review identified seven studies (out of 11) that showed a significant increase in the AUC when mammographic breast density was added to the prediction model; the increase in the AUC ranged from 0.03 to 0.14. Thus, the major breast cancer risk models now also include a measure of breast density (4), which is considered an intermediate marker of risk as well as a surrogate endpoint in prevention trials (5, 6).
There is a long record of epidemiologic investigations using mammographic density estimated from films and, more recently, from digital images (7–11). Additional research has focused on texture and other features beyond breast density (12–15). Current clinical mammographic density assessment relies heavily on subjective radiologist assessment, as described in the fifth edition of the Breast Imaging Reporting and Data System (BI-RADS), rather than on quantitative volumetric analysis (16). However, interobserver variability among radiologists is inevitable, and variability occurs even for the same radiologist from year to year. This results in inconsistent and potentially less accurate recommendations for supplemental screening examinations, such as MRI or ultrasound, and changes in the calculated risk assessment (17–20).
Accurate, efficient, and consistent processing of mammograms in real time to guide subsequent clinical decisions is, therefore, a priority (1). The U.S. FDA will mandate national reporting of breast density to women beginning September 10, 2024, which will further drive the clinical need for accurate and consistent density assessment.
Previous work on automated density estimation has mostly been based on full-field digital mammography (FFDM) images (21, 22). However, screening practice in the United States has moved from FFDM to digital breast tomosynthesis (DBT). DBT was approved by the FDA for all women in 2011 (23). It improves the cancer detection rate on screening (24) and has demonstrated usefulness in both screening and diagnostic settings (25). The uptake of tomosynthesis for breast screening has varied over time based on insurance status and other population characteristics (23, 26). It was approved by Medicare in 2015, but through 2017, women from underrepresented racial/ethnic groups, women with lower education and income, and women living in rural residences had been slower to access this technology (26).
Both FFDM and DBT are used in clinical practice, and more than 80% of mammography screening procedures now use DBT (26); it is therefore imperative that mammographic density assessment can be automated for both FFDM and DBT images. We note that fully automated tools exist, e.g., Volpara and Quantra, which can perform mammographic density estimation for both FFDM and DBT. However, both Volpara and Quantra rely on raw mammogram data or “for processing” images (21), which are typically not stored for longer than a month at most clinics and research institutions because they are not used for image interpretation. In this study, we provide a deep learning algorithm that can be applied directly to processed or “for presentation” FFDM and DBT images, which are used for image interpretation in clinical practice. We draw on two routine screening services (27, 28) with a mixture of FFDM and DBT studies to develop and externally validate a deep learning model that automates density assessment per the fifth edition of the BI-RADS Atlas (29).
Materials and Methods
Analytic data set
The Joanne Knight Breast Health Cohort at Washington University (WashU cohort) is used as the source of training data in this study (30). This cohort was recruited from November 2008 to April 2012 through an American College of Radiology–accredited and –designated comprehensive breast imaging center providing routine breast screening in St. Louis and includes 10,481 women free from breast cancer, of whom 27% are non-Hispanic (NH) Black women (30). The age at entry ranges from 23.1 to 93.3 years, and 61% of the women are postmenopausal. Eligibility criteria included consent for follow-up and attending a routine screening visit. We excluded 389 women whose entry examination led to the diagnosis of breast cancer and 121 women whose retrieved images did not contain all four standard views. In 2015, the breast health service transitioned to performing all screening mammograms with DBT. All mammograms were uniformly processed using Hologic machines. For this analysis, we restricted the FFDM mammograms to those obtained from 2013 onward to ensure that the recorded BI-RADS density used the fifth edition definitions, identifying 9,714 women. From this cohort, we identified women free from cancer at their first available DBT examination after 2015 (4,736 women). On entry to the cohort, women self-reported breast cancer risk factors using established and validated measures (31).
External validation cohort
The external validation data set is drawn from the Emory Breast Imaging Dataset (EMBED), which includes 116,902 patients with up to 8 years of mammograms (28). The public-access cohort represents a 20% random sample of the full EMBED, containing de-identified mammograms of 22,383 diverse women (42% NH Black) who underwent screening or diagnostic mammograms from January 2013 through December 2020. The age at entry ranges from 20.2 to 89 years. As in the WashU cohort, we applied exclusions: we excluded women who had diagnostic images (n = 2,734) and women whose images did not contain all four standard views (n = 1,289), leaving a cohort of 18,360 women. Approximately 35.9% (n = 6,586) of the women underwent DBT, of whom 58.5% (n = 3,855) have both FFDM and DBT from different breast screening visits included in EMBED. The data included age, race, and time from the initial digital screening mammogram to breast cancer diagnosis. Mammograms were obtained using Hologic (92%), GE HealthCare (6%), and Fujifilm (2%) machines. All BI-RADS density values recorded at Emory use the fifth edition.
Training the density assessment model
Model training in FFDM
We transformed Digital Imaging and Communications in Medicine (DICOM) files from the presentation view into 16-bit PNG files using the pydicom and PIL tools. In the training dataset, all images were processed on Hologic machines; in the external dataset, images were processed on Hologic (92%), GE (6%), and Fujifilm (2%) machines. Our model takes mediolateral oblique and craniocaudal images as input; for FFDM, these are the standard four views of the mammogram. All mammograms were resized to 1,664 × 2,048 pixels in this analysis and were de-meaned (centered) and normalized in a column-wise fashion. The mean and SD were saved from the training dataset and subsequently applied to the external validation data.
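As an illustration, a minimal Python sketch of this preprocessing step is given below; the function names, interpolation method, and intensity rescaling are assumptions for illustration and not the study's exact implementation.

```python
# Illustrative preprocessing sketch (not the study's exact code).
import numpy as np
import pydicom
from PIL import Image

TARGET_SIZE = (1664, 2048)  # width x height used in this analysis


def dicom_to_png16(dicom_path: str, png_path: str) -> None:
    """Convert a 'for presentation' DICOM image to a resized 16-bit PNG."""
    ds = pydicom.dcmread(dicom_path)
    arr = ds.pixel_array.astype(np.float32)
    # Rescale intensities to the 16-bit range (an assumed normalization choice).
    arr = (arr - arr.min()) / max(arr.max() - arr.min(), 1e-8) * 65535.0
    img = Image.fromarray(arr.astype(np.uint16))
    img = img.resize(TARGET_SIZE, Image.BILINEAR)
    img.save(png_path)


def columnwise_standardize(img: np.ndarray, col_mean: np.ndarray, col_std: np.ndarray) -> np.ndarray:
    """De-mean and normalize each pixel column with statistics saved from
    the training set and re-used unchanged for external validation."""
    return (img - col_mean[None, :]) / (col_std[None, :] + 1e-8)
```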
Each view of the mammogram is independently encoded using ResNet-18, with a global max pooling layer compressing the image representation into a 512-dimensional vector (32). Concatenating the four views for each woman yields a 2,048-dimensional vector that summarizes the information embedded in the mammograms. The model was trained using the Adam optimizer with a learning rate of 10⁻⁴ and a weight decay of 10⁻⁵. Results are reported for the epoch that had the lowest cross-entropy loss on the validation set.
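The following PyTorch sketch illustrates the four-view encoder and optimizer settings described above. The single-channel input adaptation and the form of the classification head are assumptions for illustration, not necessarily identical to the published model.

```python
# Illustrative PyTorch sketch of the four-view ResNet-18 encoder.
import torch
import torch.nn as nn
from torchvision.models import resnet18


class FourViewDensityModel(nn.Module):
    def __init__(self, num_classes: int = 4):
        super().__init__()
        backbone = resnet18(weights=None)
        # Accept single-channel mammograms (an assumed adaptation).
        backbone.conv1 = nn.Conv2d(1, 64, kernel_size=7, stride=2, padding=3, bias=False)
        # Keep all layers up to, but not including, the average pool and fc head.
        self.encoder = nn.Sequential(*list(backbone.children())[:-2])
        self.pool = nn.AdaptiveMaxPool2d(1)          # global max pooling -> 512-d per view
        self.head = nn.Linear(512 * 4, num_classes)  # 2,048-d concatenated representation

    def forward(self, views: torch.Tensor) -> torch.Tensor:
        # views: (batch, 4, 1, H, W) for the L-CC, R-CC, L-MLO, and R-MLO views
        feats = [self.pool(self.encoder(views[:, i])).flatten(1) for i in range(4)]
        return self.head(torch.cat(feats, dim=1))


model = FourViewDensityModel()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4, weight_decay=1e-5)
criterion = nn.CrossEntropyLoss()  # validation cross-entropy selects the reported epoch
```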
Transfer learning for synthetic DBT
Similar preprocessing procedures were applied to the synthetic DBT images. Specifically, we utilized synthesized DBT images that are automatically generated from a series of raw 2D projections. All synthesized DBT images were resized to 1,664 × 2,048 pixels in this analysis and were de-meaned (centered) and normalized in a column-wise fashion. The mean and SD were saved from the training dataset and subsequently applied to the external validation data.
Transfer learning is a powerful technique in machine learning that leverages a model pretrained on one task for a new, related task (33). In this study, we freeze the weights of the model previously trained on FFDM images and replace its final classification layer with a new layer for synthetic DBT images. This involves training the model with a low learning rate of 10⁻⁵ to adapt the weights learned from FFDM images to the synthetic DBT images. Fine-tuning allows the model to retain the features learned from FFDM screening while adjusting to the specific characteristics of synthetic DBT images.
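A minimal sketch of this fine-tuning step, continuing from the model sketch above, is shown below. The checkpoint filename and the exact set of trainable parameters are assumptions for illustration.

```python
# Hedged transfer-learning sketch: adapt the FFDM-trained model to synthetic DBT.
import torch
import torch.nn as nn

ffdm_model = FourViewDensityModel()  # class from the sketch above
ffdm_model.load_state_dict(torch.load("ffdm_density_model.pt"))  # hypothetical checkpoint

# Freeze the encoder weights learned on FFDM images.
for param in ffdm_model.encoder.parameters():
    param.requires_grad = False

# Replace the final classification layer for the synthetic DBT task.
ffdm_model.head = nn.Linear(512 * 4, 4)

# Fine-tune with a low learning rate so FFDM features are adapted, not overwritten.
optimizer = torch.optim.Adam(
    (p for p in ffdm_model.parameters() if p.requires_grad), lr=1e-5
)
```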
Density output
The output from our model is a continuous measure that is converted into BI-RADS categories using three model-defined cut-off points. The resulting intervals are (0, 1.5) for BI-RADS A, (1.501, 2.3) for BI-RADS B, (2.301, 3.3) for BI-RADS C, and (3.301, ∞) for BI-RADS D. These cut-offs are agnostic to FFDM or synthetic DBT. For illustration, Supplementary Fig. S1 shows the distribution of the continuous density measures estimated from FFDM by our proposed model in the external validation cohort.
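This thresholding can be expressed as a simple function, sketched below using the cut-off points listed above:

```python
# Map the continuous density output to a BI-RADS category (same cut-offs
# for FFDM and synthetic DBT).
def to_birads(score: float) -> str:
    if score <= 1.5:
        return "A"
    if score <= 2.3:
        return "B"
    if score <= 3.3:
        return "C"
    return "D"


# Example: a score of 2.9 falls in the (2.301, 3.3) interval, i.e., BI-RADS C.
assert to_birads(2.9) == "C"
```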
Statistical analysis
We developed two deep learning models using the WashU cohort to estimate BI-RADS density at each of the examinations (34). Our model takes all four views of processed or “for presentation” mammograms (craniocaudal and mediolateral oblique views) as input. The first model was trained using only FFDM mammograms. The second model was fine-tuned from the FFDM model to accommodate synthetic DBT images that are generated from the series of raw projections. This model for synthetic DBT uses a transfer learning approach that transfers knowledge from the pretrained FFDM model to the new synthetic DBT task.
Testing and validation
The WashU dataset was randomly split, with 20% of women in testing, 15% in validation, and the rest in training. To assess the classification performance of the proposed algorithm within the WashU cohort, a confusion matrix was generated for the 20% of women in the testing set. We emphasize that the Emory cohort is used only for testing; all women in the constructed Emory cohort were scored with the model trained on the WashU cohort to record model performance. Model performance is reported using a confusion matrix that compares the absolute counts of radiologist-scored BI-RADS density versus BI-RADS density estimated via the proposed method, and the concordance between radiologists’ BI-RADS scores and the model-estimated BI-RADS density is measured using Cohen’s κ.
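For illustration, a woman-level split with these proportions could be generated as sketched below; the identifiers and random seed are placeholders, not the study's actual split.

```python
# Illustrative woman-level 65/15/20 train/validation/test split.
import numpy as np

woman_ids = np.arange(9714)  # one placeholder identifier per woman
rng = np.random.default_rng(seed=0)  # seed chosen arbitrarily for the example
rng.shuffle(woman_ids)

n = len(woman_ids)
test_ids = woman_ids[: int(0.20 * n)]               # 20% of women held out for testing
val_ids = woman_ids[int(0.20 * n): int(0.35 * n)]   # next 15% for validation
train_ids = woman_ids[int(0.35 * n):]               # remaining 65% for training
```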
Misclassification error is reported in the confusion matrix both by the four BI-RADS categories (A, B, C, and D) and by nondense (A/B) versus dense (C/D).
Additionally, we evaluate the concordance between radiologists’ assessments using BI-RADS (fifth edition) and the BI-RADS density estimated by the proposed method. Cohen’s κ is a measure used to quantify interrater agreement (35): if the raters are in complete agreement, then κ = 1; if there is no agreement, then κ = 0. Performance for both misclassification error and interrater agreement is reported separately for FFDM and synthetic DBT.
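Both metrics can be computed with standard scikit-learn routines, as in the short illustration below; the readings shown are placeholder examples, not study data.

```python
# Confusion matrix and Cohen's kappa comparing radiologist and model BI-RADS.
from sklearn.metrics import cohen_kappa_score, confusion_matrix

labels = ["A", "B", "C", "D"]
radiologist = ["B", "C", "C", "A", "D", "B"]    # placeholder radiologist readings
model_birads = ["B", "C", "B", "A", "D", "B"]   # placeholder model estimates

cm = confusion_matrix(radiologist, model_birads, labels=labels)
kappa = cohen_kappa_score(radiologist, model_birads, labels=labels)
print(cm)
print(f"Cohen's kappa: {kappa:.2f}")
```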
This prospective cohort study was supported by WashU. Ethical approval was obtained from the Institutional Review Board of WashU in St. Louis. Written informed consent was obtained for study participation, and the study was conducted in accordance with the Declaration of Helsinki. The Emory cohort de-identified data were shared following Institutional Review Board approval.
Data availability
Development data mammogram images at WashU are available with data use agreement. Requests to access the data should be directed to the corresponding authors. External validation data from Emory are publicly available at https://github.com/Emory-HITI/EMBED_Open_Data.
Results
Cohort characteristics
In the WashU cohort, breast cancer risk factors were assessed at entry for the women in this prospective study (Table 1). There was no important difference between FFDM and synthetic DBT in the distribution of radiologist-assigned qualitative breast density [fifth edition BI-RADS A/B categories (“not dense”) vs. BI-RADS C/D categories (“dense”)]. The cohort included 26% NH Black and 70% NH White women.
Table 1. Cohort characteristics in the WashU derivation cohort and the Emory validation cohort.

| | WashU derivation cohort, FFDM (n = 9,714) | WashU derivation cohort, DBT (n = 4,736) | Emory validation cohort, FFDM (n = 15,629) | Emory validation cohort, DBT (n = 6,586) |
|---|---|---|---|---|
| Age (years), mean (SD) | 55.7 (10.0) | 54.6 (8.7) | 55.6 (12.2) | 57.8 (11.6) |
| BI-RADS density, n (%) | | | | |
| A | 999 (10.3%) | 454 (9.6%) | 1,647 (10.6%) | 720 (10.9%) |
| B | 4,926 (50.8%) | 2,323 (49.0%) | 6,488 (41.5%) | 2,628 (39.9%) |
| C | 3,350 (34.5%) | 1,705 (36.1%) | 6,565 (42.0%) | 2,825 (42.9%) |
| D | 422 (4.4%) | 239 (5.0%) | 926 (5.9%) | 413 (6.3%) |
| NR | 0 (0%) | 15 (0.3%) | 0 (0%) | 0 (0%) |
| Race, n (%) | | | | |
| White | 6,768 (69.8%) | 3,321 (70.1%) | 6,413 (41.1%) | 2,560 (38.9%) |
| Black | 2,549 (26.3%) | 1,251 (26.4%) | 6,584 (42.1%) | 3,031 (46.0%) |
| Asian | 83 (0.9%) | 36 (0.8%) | 968 (6.2%) | 465 (7.1%) |
| Others | 88 (0.9%) | 34 (0.7%) | 268 (1.7%) | 62 (0.9%) |
| NR | 209 (2.1%) | 94 (2.0%) | 1,393 (8.9%) | 468 (7.1%) |
| Breast cancer cases, n (%) | 469 (4.8%) | 105 (2.2%) | 408 (2.6%) | 133 (2.0%) |
Abbreviation: NR, not reported.
A comparable BI-RADS distribution and ethnic diversity in the Emory external validation cohort are reported in Table 1. The cohort included 42% NH Black women. There was no important difference between FFDM and synthetic DBT in the distribution of radiologist-assigned qualitative breast density [fifth edition BI-RADS A/B categories (“not dense”) vs. BI-RADS C/D categories (“dense”)].
Model performance in FFDM
We show the estimated misclassification counts against radiologists’ readings in a confusion matrix in Fig. 1 using FFDM. The model was first evaluated in an internal validation set composed of a random 20% sample of the WashU cohort that was left out from the training data. The BI-RADS classification predicted by our proposed model exhibits close agreement with the radiologists’ scoring: the model agrees with the radiologists’ score 84% of the time for women with nondense (BI-RADS A/B) breasts and 91% of the time for women with dense (BI-RADS C/D) breasts. The confusion matrix separated by the four BI-RADS categories is also displayed in Fig. 1. This resulted in a Cohen’s κ of 0.74 [95% confidence interval (CI), 0.73–0.75] for the interrater agreement using the four categories in the WashU cohort.
Similarly, when evaluating performance in the Emory external validation cohort, the proposed model–estimated BI-RADS density agrees with the radiologists’ score 87% of the time for women with nondense (BI-RADS A/B) breasts and 84% of the time for women with dense (BI-RADS C/D) breasts. The confusion matrix separated by the four BI-RADS categories is also displayed in Fig. 1. Importantly, categories B and C had high agreement with radiologists (B, 81%; C, 77%) in the external validation cohort. This resulted in a Cohen’s κ of 0.72 (95% CI, 0.71–0.73) for the interrater agreement using the four categories in the external validation cohort.
Model performance in synthetic DBT
We further show the estimated misclassification counts in a confusion matrix in Fig. 2 when using the synthetic DBT images. We see similar performance compared with FFDM. In the WashU internal validation cohort, the model agrees with the radiologists’ score 84% of the time for women with nondense (BI-RADS A/B) breasts and 90% of the time for women with dense (BI-RADS C/D) breasts. The separate results for the four categories are displayed in Fig. 2, resulting in a Cohen’s κ of 0.74 (95% CI, 0.73–0.75).
When evaluated in the external cohort, the proposed model–estimated BI-RADS density agrees with the radiologists’ score 90% of the time for women with nondense (BI-RADS A/B) breasts and 80% of the time for women with dense (BI-RADS C/D) breasts. Importantly, in the four-category setting, categories B and C had high agreement with radiologists (B, 83%; C, 74%) in the external validation cohort. The confusion matrix separated by the four BI-RADS categories is also displayed in Fig. 2, resulting in a Cohen’s κ of 0.71 (95% CI, 0.69–0.73).
Subgroup analysis
We also report results in our external validation dataset for the subset of women who were diagnosed with breast cancer during follow-up since baseline. Supplementary Fig. S2 shows the results obtained with FFDM, and Supplementary Fig. S3 shows the results obtained with synthetic DBT.
Discussion
With the widespread use of DBT in the United States, it is imperative that automated density measures are readily available. In this study, we report an automated tool to assess mammographic density in both FFDM and synthetic DBT mammography. The deep learning algorithm agrees well with radiologists’ BI-RADS ratings in both the internal validation and the independent, racially diverse external validation with 42% NH Black women. There is strong agreement between the deep learning algorithm and radiologists when comparing dense versus nondense breasts in the external validation. When looking at the four categories of density, the Cohen’s κ for interrater agreement in the external validation cohort was 0.71 to 0.72 in FFDM and synthetic DBT, which is stronger than values reported in other studies. Thus, this study extends beyond previous work, in which the women were largely limited to NH White women.
Most automated breast density estimation tools focus on density estimation using FFDM images. Given the uptake of DBT in the United States achieving more than 80% coverage in screening mammography exams as of 2021 (26), the proposed deep learning algorithm is timely and among the first that can accommodate synthetic DBT images.
The performance of the tool differs between data sets. We do not know the number of radiologists at Emory or their distribution across the community settings, which include two community hospitals, one large inner-city hospital, and one private academic hospital. This contrasts with WashU, where breast radiologists worked and read images in one location. Thus, variation between data sets may reflect usual practice-to-practice variation (36). Although these screening centers have ongoing quality improvement/quality assurance programs in place as required by the FDA, we do not have access to these internal performance data (37).
Our proposed algorithm has advantages over existing mammographic density estimation tools that function for both FFDM and DBT. For instance, Volpara and Quantra both require raw or “for processing” images as input; however, most institutions do not store these for more than a month, meaning that examinations cannot subsequently be reprocessed after acquisition (21). We overcome that limitation in this study by using processed or “for presentation” FFDM and synthetic DBT images that are permanently archived. This algorithm could tie into routine breast imaging services and deliver density output and BI-RADS category to the reading radiologist as an aid to classifying density, which is now a reportable feature in the United States (37). This is similar to other programs in use and aims to reduce variability among providers over time (38).
In the external validation, we obtained a Cohen’s κ of 0.72 (95% CI, 0.71–0.73) for FFDM and 0.71 (95% CI, 0.69–0.73) for synthetic DBT. In previous studies comparing interrater agreement for FFDM (21), Volpara v1.5.0 reported a Cohen’s κ of 0.57 (95% CI, 0.55–0.59) and Quantra v2.0 reported a κ of 0.46 (95% CI, 0.44–0.47). Additionally, our BI-RADS density output is obtained by categorizing a continuous measure of density estimated by the proposed algorithm; such a continuous measure may be more sensitive when studying changes in density over time (3).
There are limitations to the study. Given the clinician-to-clinician variability in reporting BI-RADS density, a real-time assessment with multiple radiologists reading the same set of images would be more robust. This study has been largely limited to images generated on Hologic machines, and broader evaluation on images from other manufacturers is warranted; however, our experience using transfer learning to adapt to the DBT setting reassures us that this is not a major issue moving forward.
Despite these limitations, this study has several strengths. The external mammography data are drawn from diverse external validation data sources, including a mammography screening service in urban Atlanta. The model can use processed or “for presentation” images from synthetic DBT, adding to accessibility. These real-world data add to the generalizability of the model’s validation.
In conclusion, we provide a consistent and fully automated BI-RADS estimation method for both FFDM and synthetic DBT using the same deep learning model. The software can be easily implemented in all clinical practices.
Authors’ Disclosures
S. Jiang reports grants from the NCI during the conduct of the study; in addition, S. Jiang has a patent pending. G.A. Colditz reports grants from the NCI during the conduct of the study; in addition, G.A. Colditz has a patent pending. No disclosures were reported by the other authors.
Authors’ Contributions
S. Jiang: Conceptualization, resources, software, formal analysis, validation, investigation, visualization, methodology, writing–original draft, writing–review and editing. D.L. Bennett: Investigation, writing–review and editing. S. Chen: Data curation, writing–review and editing. A.T. Toriola: Investigation, writing–review and editing. G.A. Colditz: Conceptualization, resources, data curation, validation, investigation, methodology, project administration, writing–review and editing.
Acknowledgments
This work was supported by the NCI (R01CA246592 awarded to A.T. Toriola, R37CA256810 awarded to S. Jiang, and P30CA091842 awarded to T.J. Eberlein).
Note: Supplementary data for this article are available at Cancer Prevention Research Online (http://cancerprevres.aacrjournals.org/).