## Abstract

Mammographic density (MD) is an established predictor of breast cancer risk. However, there is limited information on the robustness of the risk associations for different study designs and the associated methodologic challenges. Our analysis includes 165 samples from studies published since 2006. We use a weakly informative Bayesian approach to avoid unduly optimistic estimates of uncertainty, as found in the previous literature. We find that the existing consensus from previous review studies has underestimated the strength and precision of MD as a risk marker. Moreover, although much of the published literature is based on categorical measurement of MD, there are tangible advantages in using continuous data in terms of estimate precision and relevance for different patient populations. Estimates based on the percentage of MD are more precise for lower density women, whereas absolute MD has advantages for higher density. We show that older results might not be a good proxy for current and future findings, and it would be pertinent to adjust clinical interpretations based on the older data. Using an appropriate estimation method cognizant of the importance of heterogeneity is critical to obtaining reliable and robust clinical findings that are relevant for broad patient populations.

This article is featured in Highlights of This Issue, p. 1

## Introduction

The objective of this study is to provide an updated comprehensive meta-analysis on the association between mammographic density (MD) and breast cancer, with a particular focus on methodologic challenges that arise using MD data. Since the influential contribution by McCormack and dos Santos Silva (MCDSS; ref. 1), a few more systematic reviews on the subject have been published, but these are either only descriptive and qualitatively focused or restricted to specific questions, such as results for cancer subtypes, menopausal status, or specific populations, and therefore also limited in sample size (2–8). Hence, there is a need for condensing the breadth of the relevant findings of the last 10 years. Although MD has become a widely accepted risk factor for breast cancer, there is still little guidance on the overall precision and robustness of this association for different MD measurement and analysis strategies.

Our analysis goes beyond the scope of a typical meta-analysis. We take the opportunity to compare our results, based on more recent data, with those in ref. 1. This analysis is motivated by a broader interest in exploring to what extent results from older studies are good predictors of what we would expect to find based on current and possibly future data. Because there have been a plethora of technological developments and new insights in the appropriate measurement and analysis of MD over a relatively short period of time, it is a valid and clinically important hypothesis that these changes might have influenced findings on the MD–breast cancer association. If this was the case, there might be a need to adjust any reference knowledge used in clinical practice which may be based on older data.

As a key focus of this paper, we specifically discuss methodological challenges that arise for MD data and similar epidemiologic analyses where relatively few studies are available and within- and between-study variance are nonnegligible. In particular, in the applied literature, discussion of effect estimates often seems to take precedence over a thorough and methodologically careful analysis of heterogeneity. In this paper, we demonstrate that appropriate estimation of heterogeneity is critical to obtaining reliable and robust clinical findings that are relevant for broad patient populations.

## Materials and Methods

### Search strategy and selection criteria

We apply the same search criteria as in ref. 1 (English-language journals, “breast” or “mamm*” in any field and “cancer*,” “malig*,” “carcin*”, or “neopl*” in the title and at least one of the following terms in the title: “breast densit*,” “parenchym*,” “mammo* pattern*,” “radiological pattern*,” “Wolfe*,” “Tabar*,” “BIRAD*,” “mammo* feature*,” “breast pattern*,” “mammo* densit*”, or “tissue densit*”). We consider all relevant studies providing original findings (excluding systematic reviews and comments) published online or in print since the cutoff date in ref. 1—November 18, 2005, and September 15, 2016—for inclusion. To avoid double-counting, we exclude from the analysis articles that use the same underlying data, with those that use larger parts of the original data, consider wider MD ranges, more subpopulations and/or covariates as well as more recent articles preferred.

In the time since the publication of the study (1), qualitative measures of MD have been increasingly abandoned in favor of both categorical and continuous quantitative measures. We follow this development and exclude any purely qualitative measurement classifications, such as the Wolfe grade and the Tabar scale, but we still include qualitative measurements as part of categorical classifications that are suitable for both quantitative and qualitative measurements. We also analyze absolute MD (AMD), which has not been considered in ref. 1. It is common for MD to be measured in terms of quantiles of the sample distribution or other sample-specific quantitative scales. To create an objective scale for including these results, fixed categorical classifications are defined for percentage of MD (PMD; with categories 0%–10%, 10%–25%, 25%–50%, 50%–75% and above 75%) and AMD (with categories below 20 cm^{2}, 20–35 cm^{2}, 35–50 cm^{2} and above 50 cm^{2}). Following (1), estimates based on similar scales with different cutoff values were assigned to the category that contains their midpoint. Studies had to be excluded if data transformations based on sample characteristics, such as standard deviations, were used without reporting the transformation factors, because it was impossible to objectively classify them. Hence, our analysis includes continuous and categorical PMD and AMD as well as the BIRADS classification (9) due to its extensive use in some countries.

For each study, we extracted the findings in terms of the odds ratio, standard error, and 95% confidence interval as well as information on study characteristics. A detailed list of the included studies is provided in Supplementary Table S1 (Supplementary Document 1). The data are available upon request from the authors. We identified a total of 64 relevant studies, published since the end of 2005 (Supplementary Table S2, Supplementary Document 1). Of these, 38 are based on distinct underlying data sets (Supplementary Table S1, Supplementary Document 1). This is the sample that was used, providing sample sizes between 6 and 26 studies per measurement category. Hence, some of the included studies are used to provide data for several measurement scales, but the underlying samples are distinct for each analysis.

### Statistical analysis

Random-effects models were estimated to obtain estimates of the overall effect size μ as well as its variation. We argue that standard estimation strategies (any estimation method that does not make any further assumptions on the distribution of the heterogeneity parameter τ, such as the maximum likelihood, restricted maximum likelihood, and method-of-moments estimators) are insufficient, because they have problems with consistently dealing with small sample size and large within-study variance (10). Both are particular concerns for meta-analysis of the MD–breast cancer literature, given the relatively small number of available studies for each density classification and the typically large confidence intervals around study effect estimates (Supplementary Table S1, Supplementary Document 1).

One way such data characteristics may be reflected in estimation results is in terms of boundary solutions for the between-study heterogeneity parameter τ. Boundary estimates are likely statistical artifacts, because standard methods only define τ for nonnegative values |\hat{\tau } \in{\mathbb{\tf="TTca8cd593" {R}}^ + }$|. Accepting boundary solutions for τ results in underestimating the true level of uncertainty associated with the overall effect μ, since |\widehat {se}( {\hat{\mu }} ) = \sqrt {{\sum _i}{{( {s_i^2 + {\tau ^2}} )}^{ - 1}}} $|, where |\widehat {se}( {\hat{\mu }} )$| is the standard error estimate associated with the estimated overall effect and |s_i^2$| is the variance of the study-level random error |{\varepsilon _i}$|. Hence, estimated confidence intervals are too narrow. Therefore, it has been widely argued that a zero estimate is not sufficient for accepting the null hypothesis of zero variance between studies (11). In addition to giving a false sense of homogeneity of the evidence base, misestimation of τ also directly affects the estimated overall effect size and raises important questions relating to the robustness of its interpretation.

To address this issue, we used a weakly-informative Bayesian method, because it is an intuitive approach for incorporating prior beliefs about τ, specifically that there is positive between-study variation. Following (12), a gamma prior was assumed for τ with shape parameter α = 2 and scale parameter λ such that |\lambda ={ \frac{{\alpha - 1}}{{\tilde{\tau }}}}$|, where |\tilde{\tau }$| is a plausible value for τ. A gamma distribution with shape parameter > 1 always has a positive mode, even if the likelihood estimate for τ is zero. A large value of λ = 1000 was chosen, as suggested in refs. 12 and 13, in order to obtain a relatively flat distribution that does not impose unduly strong prior beliefs about τ. An objective uniform prior for μ was used to obtain a joint log posterior of |p( {\mu ,\tau |{\bm{y}}} ) = \ell ( {\mu ,{\tau ^2}} ) + ( {\alpha - 1} )\log \tau - \lambda \tau + c$|, where |\ell ( {\mu ,{\tau ^2}} )$| is the standard log-likelihood function and *c* is a constant. The respective mode of the marginal posterior densities for μ and τ gives the reported estimated effects and credible intervals (CI) are based on the highest posterior density interval. The estimation results were obtained using the R package bayesmeta (14).

To study potential sources of heterogeneity, mixed-effects meta-regression models, rather than a stratified approach, were used to avoid compromising sample size. A stratified analysis also does not provide any information regarding the causal relevance of a covariate as an effect modifier. For each (categorical) moderator variable considered, each resulting category was required to contain at least three observations to avoid capturing outliers. A Markov chain Monte Carlo–based estimation strategy was used for obtaining meta-regression results. The simulations were based on three chains with 5,000 burn-in iterations. Convergence was assessed by Gelman and Rubin's convergence diagnostic (15). The simulations were run using the R package runjags (16). The reported results are based on a specification that controls for all moderator variables that were found to have *P* values below 0.05 in preliminary univariate meta-regressions in order to control for potential confounding effects.

To facilitate comparison with the previous literature, several of the MD estimates from ref. 1 were reestimated using the above-mentioned Bayesian approach. We could not include 4 of the 14 studies covered by ref. 1; 1 because the results are based on time-to-event data and 3 because their reference category is too wide for a meaningful comparison with our results. For the BIRADS scale, the data in ref. 1 only include two studies that are also based on the same patient population. Because of the extremely small sample size (of effectively one) and to avoid double-counting, we did not analyze the BIRADS results reported in ref. 1.

A two-sample *t* test and a two-sample χ^{2} test for binned data were used to statistically test for differences between the probability distributions of the effect estimates based on our more recent data and the MCDSS data set. To assess the underlying differences driving the observed disparities in estimates, a meta-regression was conducted using the earliest year of mammography for the study population as a continuous moderator variable. A categorical specification where different regimes are identified for each decade of data collection (before the 1990s, 1990s, and 2000s) was also examined.

## Results

### Overall effect size: Continuous density measures

Table 1 and Figs. 1–4 provide the estimation results. For the studies published since 2006, the overall effect size was 1.132 for a 10-percentage point increase in MD. The corresponding estimate for AMD was also 1.132 for an increase of 10 cm^{2}. For both, the predicted probability of not having a positive association between MD and breast cancer was essentially zero (Table 1). Hence, strong confidence in MD as a risk factor for breast cancer is appropriate.

. | Overall OR estimate (95% CI) . | Heterogeneity estimate |{\bm{\tau }}$| (95% CI) . | I2 statistic in % . | Probability of |{\bm{\mu }} >\!\!0$| . | Predicted OR . | Predicted |{\bm{\tau }}$| . | Number of samples (studies) included . | Ratio of width 95% CI and OR . |
---|---|---|---|---|---|---|---|---|

Continuous PMD | 1.1317 (1.0656–1.2140) | 0.0092 (0.0051–0.0168) | 89.32% | 99.95% | 1.1352 | 0.0105 | 15 (8) | 0.0131 |

Continuous AMD | 1.1315 (1.0232–1.2725) | 0.0082 (0.0032–0.0247) | 83.38% | 98.65% | 1.1372 | 0.0120 | 7 (5) | 0.0218 |

Categorical PMD: 10%–25% | 1.4366 (1.3250- 1.5549) | 0.0679 (0.0072–0.1715) | 15.44% | 100% | 1.4360 | 0.0852 | 24 (21) | 0.1600 |

Categorical PMD: 25%–50% | 1.9845 (1.7427- 2.2632) | 0.1826 (0.0944–0.3229) | 51.20% | 100% | 1.9857 | 0.2018 | 21 (19) | 0.2622 |

Categorical PMD: 50%–75% | 2.5115 (2.0969- 3.0397) | 0.2855 (0.1670–0.4848) | 64.92% | 100% | 2.5197 | 0.3158 | 21 (19) | 0.3755 |

Categorical PMD: >75% | 3.8436 (2.1917–7.2227) | 0.2660 (0.0112–1.2550) | 41.00% | 99.77% | 3.9190 | 0.5035 | 6 (6) | 1.389 |

BIRADS–B | 1.625 (1.2964–2.1292) | 0.1650 (0.0153–0.5304) | 36.66% | 99.88% | 1.6503 | 0.2444 | 11 (10) | 0.5125 |

BIRADS–C | 1.9536 (1.2228–2.9829) | 0.4890 (0.2041–1.0696) | 80.57% | 99.39% | 1.9234 | 0.5939 | 11 (10) | 0.901 |

BIRADS–D | 2.0044 (1.1221–3.4231) | 0.5788 (0.1559–1.3264) | 74.1% | 98.69% | 1.9727 | 0.6982 | 11 (10) | 1.148 |

Categorical AMD: 20–35 cm^{2} | 1.4849 (1.2289–1.7695) | 0.1325 (0.0148–0.3888) | 43.53% | 99.89% | 1.4785 | 0.1832 | 9 (8) | 0.3641 |

Categorical AMD: 35–50 cm^{2} | 1.7625 (1.5521–2.0296) | 0.0686 (0.0044–0.2468) | 14.4% | 100% | 1.7716 | 0.1097 | 11 (9) | 0.2709 |

Categorical AMD: >50 cm^{2} | 2.2117 (1.9208–2.6363) | 0.1160 (0.0144–0.3247) | 35.74% | 100% | 2.2376 | 0.1562 | 11 (9) | 0.3235 |

. | Overall OR estimate (95% CI) . | Heterogeneity estimate |{\bm{\tau }}$| (95% CI) . | I2 statistic in % . | Probability of |{\bm{\mu }} >\!\!0$| . | Predicted OR . | Predicted |{\bm{\tau }}$| . | Number of samples (studies) included . | Ratio of width 95% CI and OR . |
---|---|---|---|---|---|---|---|---|

Continuous PMD | 1.1317 (1.0656–1.2140) | 0.0092 (0.0051–0.0168) | 89.32% | 99.95% | 1.1352 | 0.0105 | 15 (8) | 0.0131 |

Continuous AMD | 1.1315 (1.0232–1.2725) | 0.0082 (0.0032–0.0247) | 83.38% | 98.65% | 1.1372 | 0.0120 | 7 (5) | 0.0218 |

Categorical PMD: 10%–25% | 1.4366 (1.3250- 1.5549) | 0.0679 (0.0072–0.1715) | 15.44% | 100% | 1.4360 | 0.0852 | 24 (21) | 0.1600 |

Categorical PMD: 25%–50% | 1.9845 (1.7427- 2.2632) | 0.1826 (0.0944–0.3229) | 51.20% | 100% | 1.9857 | 0.2018 | 21 (19) | 0.2622 |

Categorical PMD: 50%–75% | 2.5115 (2.0969- 3.0397) | 0.2855 (0.1670–0.4848) | 64.92% | 100% | 2.5197 | 0.3158 | 21 (19) | 0.3755 |

Categorical PMD: >75% | 3.8436 (2.1917–7.2227) | 0.2660 (0.0112–1.2550) | 41.00% | 99.77% | 3.9190 | 0.5035 | 6 (6) | 1.389 |

BIRADS–B | 1.625 (1.2964–2.1292) | 0.1650 (0.0153–0.5304) | 36.66% | 99.88% | 1.6503 | 0.2444 | 11 (10) | 0.5125 |

BIRADS–C | 1.9536 (1.2228–2.9829) | 0.4890 (0.2041–1.0696) | 80.57% | 99.39% | 1.9234 | 0.5939 | 11 (10) | 0.901 |

BIRADS–D | 2.0044 (1.1221–3.4231) | 0.5788 (0.1559–1.3264) | 74.1% | 98.69% | 1.9727 | 0.6982 | 11 (10) | 1.148 |

Categorical AMD: 20–35 cm^{2} | 1.4849 (1.2289–1.7695) | 0.1325 (0.0148–0.3888) | 43.53% | 99.89% | 1.4785 | 0.1832 | 9 (8) | 0.3641 |

Categorical AMD: 35–50 cm^{2} | 1.7625 (1.5521–2.0296) | 0.0686 (0.0044–0.2468) | 14.4% | 100% | 1.7716 | 0.1097 | 11 (9) | 0.2709 |

Categorical AMD: >50 cm^{2} | 2.2117 (1.9208–2.6363) | 0.1160 (0.0144–0.3247) | 35.74% | 100% | 2.2376 | 0.1562 | 11 (9) | 0.3235 |

The range of CI for the OR was moderate for PMD (about 0.17 for a 10-percentage point increase), and for AMD, it was slightly wider (about 0.25 for a 10-cm^{2} increase). On the basis of the available data, a new study would be predicted to find an odds ratio of 1.135 for a 10-percentage point increase in PMD and 1.137 for a 10-cm^{2} increase in AMD.

### Overall effect size: Categorical density measures

For both the generic categorical and the BIRADS classifications, there was a consistently increasing risk gradient for PMD (see also Supplementary Fig. S1, Supplementary Document 2). For both, the probability of a negative association between MD and breast cancer was exactly or essentially zero (Table 1), mirroring the results for continuous MD.

As expected, CIs widen with increasing density categories for both the Boyd and BIRADS classifications. This was also evident from the statistic provided in Table 1 that relates the width of the CI to the effect estimate. An increase in the relative width of the CI from 0.17 to 1.39 for the categorical classification and from 0.51 to 1.15 across the BIRADS categories was observed. Hence, risk estimates became less precise for increasing PMD.

With increasing density, estimate precision decreases (see also Supplementary Fig. S2, Supplementary Document 2, which shows the estimated posterior density distributions for categorical PMD, centered at log odds of 1 and where the bold line section indicates the location and width of the estimated CI, respectively). Because of this increased uncertainty, predicted effect estimates for additional studies also deviated further from the estimated overall odds ratios for higher density categories (Table 1). In absolute terms, the CIs were about a third wider for BIRADS-classified than for categorical PMD effects (average of 0.55 vs. 0.85 relative to the size of the effect estimate).

The estimated risk gradient for a categorical classification of AMD is not characterized by a substantial widening in CIs for higher density categories, as was observed for PMD. This was also evident from the results in Table 1 and the centered density plot in Supplementary Fig. S2. This suggests that there is no loss in precision when moving up the risk gradient. In fact, the effect estimate gains precision. These results suggest that AMD is a more precise predictor of breast cancer risk for higher density women.

We have included all available studies to conduct the above analyses and the analyses for continuous MD in the “Overall effect size: Continuous density measures” section, not just those considering both PMD and AMD (which would have restricted our sample to only six studies for categorical MD and only three studies for continuous MD). This allows us to maintain adequate sample sizes where we can be confident that variance and ultimately effect size estimates are sufficiently precise. At the same time, we found, as further detailed in the “Meta-regression” section, that there is very limited evidence that there are systematic differences between key study characteristics across the data that would warrant a restricted analysis that would effectively omit the majority of the available studies from consideration. For exploratory purposes, we have performed an analysis by using only studies that report both PMD and AMD. The results obtained for the entire data set hold supporting our conclusion that AMD is a more precise predictor of breast cancer risk for higher density women and that relying on the entire data set for analysis improves the statistical properties without influencing the result. These results can be obtained from the authors upon request.

### Heterogeneity

For continuous PMD, we estimated a low level of between-study heterogeneity of 0.009. This estimate was also relatively precise (CI, 0.005–0.017). At the same time, between-study heterogeneity accounted for more than 89% of overall observed variance (Table 1). *I*^{2} is generally interpreted as low if it falls below 50%, moderate between 50% and 75%, and high above 75%. An *I*^{2} value close to 0% indicates a very spurious result. Hence, for PMD, differences in study designs and other systematic disparities between studies were minimal, can be precisely estimated, and were the main driver of overall variability. Therefore, we can be confident in the estimated effect of PMD on breast cancer.

Similarly, we found a high level of precision for continuous AMD, which was also characterized by a large *I*^{2} statistic (τ = 0.008, *I*^{2} = 83%, see Table 1). An additional study would be predicted to increase heterogeneity slightly to a point estimate of 0.01 for PMD and 0.012 for AMD.

The heterogeneity estimates for categorical PMD were substantially larger and less precise. Table 1 shows τ estimates between 0.07 and 0.58 that were increasing with higher density categories, although more moderately for the highest density categories (above 75% and BIRADS D), where within-study variance was an important factor. Heterogeneity among BIRADS-classified studies was higher than that for studies based on the categorical scale. The average τ is about twice as large for BIRADS studies (0.19 vs. 0.41; peak τ for categorical PMD is 0.27 for the above 75% category). In addition, average *I*^{2} was also substantially larger for BIRADS-measured MD (43% for categorical PMD vs. 64% for BIRADS). For both categorical classifications and particularly for BIRADS-measured MD, we have less confidence that the overall effects have general applicability compared with the continuous PMD results.

For AMD, a lower decrease in precision was observed when comparing continuous with categorical results. Categorical AMD precision levels were generally lower than those for categorical PMD, ranging between 0.07 and 0.13. They also do not systematically increase with higher density categories, and their respective *I*^{2} statistics fall in the low range below 50%. This suggests that the overall effect estimate is relatively precise, but the odds ratios reported in the underlying studies are characterized by substantial imprecision.

### Importance of a robust estimation strategy

To provide a visual demonstration of the importance of choosing a robust estimation strategy, the probability distributions of the effect estimates for categorical PMD data are used. The Bayesian estimation method naturally provides such a (posterior) density distribution. For estimated effects obtained from a frequentist estimator, it is obtained from drawing a large number of observations randomly from a normal distribution with mean and standard deviation set to the estimated |\hat{\mu }\ $| and its standard error. Supplementary Fig. S3 (Supplementary Document 2) shows substantial differences in the shape of the curves (capturing 99% of the probability mass in each plot, respectively) for the 10% to 25% and above 75% categories. The distributions based on the Bayesian estimator are flatter, attributing a higher level of uncertainty to the point estimate. In contrast, for the two intermediate categories, the distributions based on the Bayesian approach as used in this paper and a standard frequentist estimator are almost congruent.

If a standard frequentist estimation method was used, boundary estimates would be estimated for τ for categorical PMD (10%–25% and above 75% categories) and AMD (35–50 cm^{2} and above 50 cm^{2} categories). For the 10% to 25% and above 75% PMD categories, the Bayesian estimation strategy produces τ estimates of 0.07 and 0.27, respectively (Table 1). Even the former is substantially larger than the τ estimate for continuous PMD of 0.009, which does not produce a boundary solution using a standard estimator. Hence, not all analyses for low levels of true between-study heterogeneity will generate boundary solutions. Instead, it is specific data characteristics that cause this issue. The reason a boundary estimate is obtained for these data is likely related to the substantial degree of within-study variance. This is evident from the *I*^{2} statistic. Using the Bayesian estimator, we obtain *I*^{2} values of 15% and 41% for 10% to 25% and above 75% PMD. Hence, the majority of variance stems from the imprecision of the results reported in the included studies. For the AMD categories 35 to 50 cm^{2} and above 50 cm^{2}, the Bayesian estimator yields τ values of 0.07 and 0.12 and *I*^{2} statistics of 14% and 36%, respectively. Hence, using a more robust estimation strategy reveals a nonnegligible degree of between-study heterogeneity that needs to be taken into account when interpreting the estimation results. Therefore, using the Bayesian estimator provides us with a more robust estimation result, independently of data characteristics.

### Meta-regression

Meta-regression was used to identify potential moderators of the observed heterogeneity between studies. Across all categorical and continuous MD classifications, the strongest evidence was found for matched study design (continuous PMD, 25%–50% and 50%–75% PMD), use of an automatic thresholding program (25%–50% and 50%–75% PMD), adjustment for key variables (50%–75% PMD) and invasive cancer type (50%–75% PMD) as moderator variables.

Matched study designs, compared with those without matching, are predicted to find a 1-percentage point higher risk of breast cancer for a 10% increase in PMD or about 72-percentage points higher risk for women with 10% to 25% PMD (compared with those below 10% PMD). Similar interpretations apply to the other effects in Table 2.

. | . | Matching^{a}
. | Thresholding^{b}
. | Adjustment^{c}
. | Invasive^{d}
. |
---|---|---|---|---|---|

Continuous PMD | |\hat{\beta }$| (95% CI) | 0.0114 (0.0012–0.0206) | |||

OR intercept (95% CI) | 1.0847 (1.0422–1.1549) | ||||

OR incl. |\hat{\beta }$| (95% CI) | 1.0972 (1.0435–1.1790) | ||||

Categorical PMD: 25%–50% | |\hat{\beta }$| (95% CI) | 0.4077 (0.1726–0.6443) | 0.3051 (0.0213–0.5408) | ||

OR intercept (95% CI) | 1.4475 (1.173–1.7678) | 1.5928 (1.2995–2.0009) | |||

OR incl. |\hat{\beta }$| (95% CI) | 2.1760 (1.3873–3.3904) | 2.1611 (1.3256–3.4764) | |||

Categorical PMD: 50%–75% | |\hat{\beta }$| (95% CI) | 0.5118 (0.1698–0.8712) | 0.4242 (0.0277–0.7616) | 0.4284 (0.0453–0.7673) | −0.3970 (-0.8094–0.0001) |

OR intercept (95% CI) | 1.6468 (1.223–2.2797) | 1.9034 (1.4004–2.5696) | 1.9073 (1.3548–2.5157) | 3.3720 (2.3980–4.7325) | |

OR incl. |\hat{\beta }$| (95% CI) | 2.7465 (1.4359–5.4943) | 2.9090 (1.4584–5.6765) | 2.9273 (1.4044–5.4690) | 2.2670 (1.0612–4.8141) |

. | . | Matching^{a}
. | Thresholding^{b}
. | Adjustment^{c}
. | Invasive^{d}
. |
---|---|---|---|---|---|

Continuous PMD | |\hat{\beta }$| (95% CI) | 0.0114 (0.0012–0.0206) | |||

OR intercept (95% CI) | 1.0847 (1.0422–1.1549) | ||||

OR incl. |\hat{\beta }$| (95% CI) | 1.0972 (1.0435–1.1790) | ||||

Categorical PMD: 25%–50% | |\hat{\beta }$| (95% CI) | 0.4077 (0.1726–0.6443) | 0.3051 (0.0213–0.5408) | ||

OR intercept (95% CI) | 1.4475 (1.173–1.7678) | 1.5928 (1.2995–2.0009) | |||

OR incl. |\hat{\beta }$| (95% CI) | 2.1760 (1.3873–3.3904) | 2.1611 (1.3256–3.4764) | |||

Categorical PMD: 50%–75% | |\hat{\beta }$| (95% CI) | 0.5118 (0.1698–0.8712) | 0.4242 (0.0277–0.7616) | 0.4284 (0.0453–0.7673) | −0.3970 (-0.8094–0.0001) |

OR intercept (95% CI) | 1.6468 (1.223–2.2797) | 1.9034 (1.4004–2.5696) | 1.9073 (1.3548–2.5157) | 3.3720 (2.3980–4.7325) | |

OR incl. |\hat{\beta }$| (95% CI) | 2.7465 (1.4359–5.4943) | 2.9090 (1.4584–5.6765) | 2.9273 (1.4044–5.4690) | 2.2670 (1.0612–4.8141) |

^{a}Matching is defined as use of matched case–control study setup; reference category: no matching.

^{b}Thresholding is defined as use of an automatic thresholding program; reference category: no use of automatic thresholding.

^{c}Adjustment is defined as inclusion of confounders, at a minimum age and/or body mass index (BMI); reference category: no adjustment.

^{d}Invasive is defined as invasive cancer type; reference category: noninvasive cancer.

There is no evidence for differences in results based on the following categories: digital versus film images, mammographic view, qualitative versus quantitative measurement, invasive versus noninvasive cancer diagnosis, incident versus prevalent MD, or high-risk versus low-risk women (by family history of breast cancer).

## Comparison with the Previous Literature and the Impact of Time of Data Collection

The results from our reestimation of the MCDSS (1) data set, covering all studies on the MD–breast cancer association before 2006, are provided in Supplementary Table S3 (Supplementary Document 3). Notably, of the 11 estimation results for categorical PMD originally reported in ref. 1, 9 have a zero heterogeneity estimate, while our reestimated heterogeneity levels were consistently large, suggesting that there is in fact substantial heterogeneity between studies. We also found relatively low values for the *I*^{2} statistic, suggesting that within-study variance was substantial, which is a likely driver of the boundary estimates reported in ref. 1.

Comparing these results with the estimates for our data set covering the more recent literature, we found differences in both the size and precision of effect estimates. Figure 5 shows a bar chart plotting the overall effect estimate and its confidence interval based on the (1) data using different estimation strategies and compares them with the estimation results for our data set. The estimated effect for our more recent data was larger, except for the 10% to 25% PMD category. The probability that our effect estimates were lower (for 10%–25% PMD)/higher (for the other categories) than the MCDSS's point estimates was large (above 80%) for all categories (see Supplementary Table S3 and Supplementary Fig. S4, Supplementary Document 3).

Moreover, it is evident from Fig. 5 that CIs were narrower for the more recent data. However, comparing them only to MCDSS's original estimates based on standard frequentist methodology may give the impression of broadly similar levels of precision. This is misleading, as evident when considering the Bayesian estimates for the MCDSS data, which have much wider CIs, with differences between frequentist and Bayesian CIs ranging from 0.33 for Boyd category C to more than 4.80 for Boyd category F (Supplementary Table S3, Supplementary Document 3). Hence, the Bayesian methodology suggests that a lower degree of confidence in the exact size of the effect estimate is warranted. Although the Bayesian estimator provides more conservative precision estimates, at the same time, precision has improved over time.

To statistically test these observed differences, the estimate distributions generated from the MCDSS sample and from the new data were compared. Two underlying data sets were used in studies published both before and after 2006; these were included in the more recent sample only (other ways of treating them, i.e., inclusion in the MCDSS sample, in both samples or exclusion, does not qualitatively affect the findings). Consistent evidence of differences was found across all categories, but the 10% to 25% PMD category (Table 3), in which the observed differences in the estimated effect size were much smaller.

. | Categorical PMD: 10%–25% . | Categorical PMD: 25%–50% . | Categorical PMD: 50%–75% . | Categorical PMD: >75% . |
---|---|---|---|---|

t test | 0.1259 | 0.0352 | 0.0085 | 0.0009 |

χ^{2} test | 0.0846 | 0.0120 | 0.0008 | 0.0012 |

. | Categorical PMD: 10%–25% . | Categorical PMD: 25%–50% . | Categorical PMD: 50%–75% . | Categorical PMD: >75% . |
---|---|---|---|---|

t test | 0.1259 | 0.0352 | 0.0085 | 0.0009 |

χ^{2} test | 0.0846 | 0.0120 | 0.0008 | 0.0012 |

NOTE: A two-sample *t* test and a two-sample χ^{2} test for binned data were used to statistically test for differences between the probability distributions of the effect estimates based on our more recent data versus the MCDSS data set.

^{a}*P* values for *H*_{0} = no difference between distributions.

One intuitive explanation for these differences between our results and those reported by the MCDSS is time of data collection. This is further explored in Supplementary Document 3. It is plausible that changes in MD measurement technology and study protocols may have had an impact on the data. However, because relatively older data are still frequently used in academic and clinical settings (e.g., as part of large multiyear data sets), it is important to evaluate to what extent this may influence findings on the MD–breast cancer association. To further study the impact of using older data, we consider the year of mammography directly as a moderator variable for the MD–breast cancer association. This information was obtained from the study population descriptions provided in the studies.

We found that year of mammography had a negative effect on the effect of PMD on breast cancer for the 10% to 25% category (−0.008; see Supplementary Table S4, Supplementary Document 3). With each year later that data collection commences, the predicted effect of MD on breast cancer decreased by about 1-percentage point. In this model, it continued falling linearly with time (see Supplementary Fig. S5, Supplementary Document 3, in red). However, this effect can no longer be distinguished from zero for higher density categories. Detailed results can be obtained upon request from the authors. To what extent the nonsignificance of these results might be driven by the lower sample size available for some of the larger density categories would be an interesting area for future research.

The above model assumed a linear relationship between time of data collection and the MD–breast cancer association. If we instead consider a categorical specification that allows for different regimes for studies with data collection commencing before and after 1990 (Supplementary Table S4 and Supplementary Fig. S5 in blue, Supplementary Document 3), the model predicts an average odds ratio of 1.587 for studies with data collection before 1990. This is predicted to fall to 1.348 for more recent data. This suggests that the biggest changes have occurred in the 1990s, but further work on this issue is required.

## Discussion

This paper confirms the relevance of MD as a risk factor for breast cancer by systematically reviewing the growing literature on this topic. Independently of how MD is measured, the probability of a zero or negative association between MD and breast cancer is essentially zero. However, we find tangible advantages for using continuous over categorical measurement in terms of the reliability of the estimates. This is particularly relevant, because it cautions against the current trend where the majority of published studies on the effect of MD are based on categorical data. Estimates for continuous PMD are more precise, and heterogeneity is driven by between-study rather than within-study differences. Therefore, we can attach a higher level of confidence to the effect estimates and their general relevance for different patient populations compared with those based categorical data. In particular, we observe that precision is lost when moving up the risk gradient from lower to higher density categories. This is less of a concern for AMD categories. However generally, studies based on AMD are characterized by greater imprecision due to within-study variance. Hence, we cannot provide a universal recommendation that either PMD or AMD is more reliable, but PMD seems to be the preferred measure for lower density, whereas AMD has advantages for higher density women.

In general, we argue for a greater focus on heterogeneity when preparing and evaluating meta-analyses such as this one. As argued in “Statistical analysis,” relying routinely on standard frequentist methods for estimating random effects can result in underestimating the true level of uncertainty associated with the overall effect and thus obtaining confidence intervals that are too narrow. If heterogeneity is underestimated, this suggests that there are a range of (possibly unknown) factors that mediate the true effect, potentially compromising its suitability as a general risk marker.

We used the weakly-informative Bayesian estimator proposed in ref. 12, because it allowed us to incorporate prior beliefs about the positive nature of the between-study heterogeneity parameter without imposing unduly strong prior assumptions. When comparing estimates based on this Bayesian estimator with those based the standard frequentist approach, we showed that the former provides more robust and more conservative results, independent of data characteristics. We also found that focusing solely on comparing effect estimates that seem similar across estimation methods would have been misleading, because it does not provide information about the level of certainty appropriate for these point estimates.

In terms of potential sources of heterogeneity, we found some evidence for a number of candidate moderator variables, but overall, we caution against placing too much emphasis on any of the meta-regression results, because the overall sample size is relatively small and the effects cannot be consistently replicated across MD measurement approaches. It would be interesting to revisit this analysis in the future when more studies are available. However, at this point, the evidence for effective differences across different groups of patients or study designs is weak.

We compared our results with those from the influential meta-analysis in ref. 1. We reestimated the results of ref. 1 using the Bayesian methodology to incorporate further advances in the methodologic literature since 2006 for small and noisy data sets. We found differences in both the size and precision of effect estimates. For more recent data, the estimated overall effect of PMD on breast cancer was smaller for lower levels of density and larger for moderate to high levels of density. Most importantly, we also observed an increase in precision of the estimated effects for more recent data.

Looking more closely at the effect of the time of data collection on study results, we found that the true effect of PMD on breast cancer decreases over time, at least for lower levels of MD. Further nonlinear analysis shows that this effect is particularly pronounced when comparing the effect of PMD measurements obtained from the 1990s onward with the predicted breast cancer risk based on earlier mammograms. More data are needed to evaluate more recent changes (in screening technology and other measurement and study design related factors).

Combining all these findings, we conclude that older results might be less reliable in predicting what we would expect to find with future studies of MD and breast cancer. Because we can also show that estimate precision has increased for more recent studies, an intuitive interpretation is an improvement in the quality of the available data such that results from older studies might be less reliable. Given these changes in results seen over time, it would be important to adjust clinical interpretations based on older data.

Given the findings of this analysis, we would also like to reiterate the need to pay particular attention to statistical methodology when reviewing studies on MD and breast cancer (and other epidemiologic associations where relatively few studies are available and within- and between-study variance are nonnegligible). This is critical in containing the influence of noise due to differing data quality and study settings on the reliability of results and the derived clinical interpretations. Our findings add further weight and urgency to the clinical case for using MD as a key risk factor for breast cancer, because this updated evidence suggests that the existing consensus from previous review studies has underestimated the strength of MD as a risk marker for breast cancer in terms of its reliability and robustness.

## Disclosure of Potential Conflicts of Interest

No potential conflicts of interest were disclosed.

## Acknowledgments

The authors thank the two anonymous reviewers for helpful comments on an earlier version of the manuscript. The authors also acknowledge funding from the Cancer Council Western Australia/Cancer Australia and the National Breast Cancer Foundation.

This work was supported by grant APP1085750 from the Cancer Council Western Australia/Cancer Australia and a Research Fellowship (ECF-17-010) from the National Breast Cancer Foundation (both awarded to J. Stone).

The costs of publication of this article were defrayed in part by the payment of page charges. This article must therefore be hereby marked *advertisement* in accordance with 18 U.S.C. Section 1734 solely to indicate this fact.