Incorporating Alternative Polygenic Risk Scores into the BOADICEA Breast Cancer Risk Prediction Model

Abstract Background: The multifactorial risk prediction model BOADICEA enables identification of women at higher or lower risk of developing breast cancer. BOADICEA models genetic susceptibility in terms of the effects of rare variants in breast cancer susceptibility genes and a polygenic component, decomposed into an unmeasured and a measured component - the polygenic risk score (PRS). The current version was developed using a 313 SNP PRS. Here, we evaluated approaches to incorporating this PRS and alternative PRS in BOADICEA. Methods: The mean, SD, and proportion of the overall polygenic component explained by the PRS (α2) need to be estimated. \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document} }{}$\alpha $\end{document} was estimated using logistic regression, where the age-specific log-OR is constrained to be a function of the age-dependent polygenic relative risk in BOADICEA; and using a retrospective likelihood (RL) approach that models, in addition, the unmeasured polygenic component. Results: Parameters were computed for 11 PRS, including 6 variations of the 313 SNP PRS used in clinical trials and implementation studies. The logistic regression approach underestimates \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document} }{}$\alpha $\end{document}, as compared with the RL estimates. The RL \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document} }{}$\alpha $\end{document} estimates were very close to those obtained by assuming proportionality to the OR per 1 SD, with the constant of proportionality estimated using the 313 SNP PRS. Small variations in the SNPs included in the PRS can lead to large differences in the mean. Conclusions: BOADICEA can be readily adapted to different PRS in a manner that maintains consistency of the model. Impact : The methods described facilitate comprehensive breast cancer risk assessment.


Introduction
BOADICEA (1, 2) is a risk prediction algorithm for predicting breast and ovarian cancer risk on the basis of genetic and nongenetic factors. The algorithm incorporates the effects of common genetic variants, summarized in a polygenic risk score (PRS), in addition to the effects of pathogenic variants in major breast cancer susceptibility genes, other lifestyle/hormonal risk factors, and cancer family history.
The current version (v6; refs. 1, 2) has been specifically developed to incorporate the 313 SNP PRS of Mavaddat and colleagues (3); this PRS was developed using the very large data set of the Breast Cancer Association Consortium (BCAC) and extensively validated in prospective studies. However, as larger genome-wide association studies (GWAS) and novel statistical methods become available, new PRS are being continually developed. In addition, PRS developed for clinical translation and generated in different health care systems use a variety of technologies, including both targeted sequencing panels and genotyping arrays, and surrogate SNPs are often required. The BOADICEA algorithm itself is flexible and can incorporate any PRS for which the relevant parameters are known. These parameters are the mean (m) and SD (s) of the PRS in the population, and the proportion (a 2 ) of the polygenic variance attributable to the PRS. In practice, the PRS can be normalized and supplied as a Z-score, in which case only parameter a is required. By modelling the PRS as the proportion of a (fixed) polygenic component, the predicted familial risks remain consistent, irrespective of the PRS used, and importantly, there is no double counting of the effect of the PRS and cancer family history.
Here, we discuss the incorporation of alternative PRSs into BOA-DICEA, and provide the relevant parameters for a number of PRS that have been developed, including several that are in use in clinical applications.

Materials and Methods
BOADICEA models breast cancer risks such that the incidence of breast cancer at age t is of the form (1, 4): Here l 0 ðtÞ is the baseline incidence. The term d gðiÞ ðtÞ models the major gene component for individual i (d k ðtÞ being the age-specific log-HR associated with genotype k). s P ðtÞx ðiÞ P models the polygenic component, s P ðtÞ being the polygenic SD and x ðiÞ P the normalized polygenic component for individual i. The final term models the effects of other risk factors. The polygenic variance s 2 P ðtÞ is allowed to be agedependent and assumed to be a linear function of age t: The parameters g and have been previously estimated, using complex segregation analysis, as 4.86 and À0.06 respectively (4).
The PRS is incorporated into BOADICEA by partitioning the total polygenic component x P into the sum of a known component x K measured by the PRS, and an unmeasured residual component x R ð1). The variance due to the known component is of the form (3): s K ðtÞ can also be interpreted as the age-specific log-HR per unit SD of the PRS, conditional on other risk factors. Note that in Mavaddat and colleagues (3) equation (B) is written s 2 K ðtÞ ¼ g 2 ðaþbtÞ. The change of symbols is for consistency with Lee and colleagues (1) and the Canrisk platform (www.canrisk.org), where the proportion of the polygenic variance explained by the PRS is denoted as a 2 .

Estimation of a and incorporating alternative PRS
The key parameter is a. The first approach to estimating this parameter makes the simplifying assumption that the polygenic SD of the known polygenic component in BOADICEA, s K ðtÞ can be approximated by the marginal age-specific log-HR per unit SD of the PRS (ref. 3; see Supplementary Methods). This can then be estimated using cohort data or (approximately, making the rare disease assumption) case-control data, by first transforming the PRS using: where x K is the standardised (per unit SD) version of the proposed PRS. S 0 is then included as a covariate in a Cox or logistic regression model: the parameter (log-HR or log-OR) estimate corresponding to the covariate S 0 gives the required a parameter, which we denote a GLM . This method was applied to 22,767 controls and 16,151 women diagnosed with invasive breast cancer from the validation and prospective test sets used in Mavaddat and colleagues (ref. 3; Supplementary Tables S1 and S2). The analysis was restricted to women of European ancestry with age of diagnosis or last observation less than 80 years [after application of inclusion/exclusion criteria, mean age at diagnosis ¼ 59.9 (SD ¼ 10) years for cases, and 57.1 (SD ¼ 10.4) years for controls]. Analyses were adjusted for country in which the study was conducted (15 countries) and 10 principal components. The above analyses make the simplifying assumption that the marginal PRS effect size is a good approximation to the effect size conditional on other risk factors. This is likely to be a reasonable assumption for nongenetic risk factors, which have relatively small effects on risk and appear to be independent of the PRS, as shown in recent analyses of the combined effect of breast cancer PRS and individual SNPs with life-style/environmental risk factors including questionnaire-based factors (5-10). However, it may not be true for other genetic factors, in particular the unmeasured polygenic component. Although the PRS and the residual polygenic component are assumed to be conditionally independent, individuals with a high polygenic component are more likely to develop the disease at an early age. This results in a negative correlation between the PRS and the residual polygenic component at later ages, which leads to an underestimation of the PRS effect size if the latter is not allowed for. To address this problem, we also estimated a using a retrospective likelihood approach ða RL ), applied to the same BCAC data set. In this analysis, the observed PRS is computed conditional on the phenotypes of the individuals (age of diagnosis and case-control status), explicitly allowing for the unmeasured polygenic component. Details are given in the Supplementary Methods. This approach requires overall population age-specific incidence rates to be specified. For this purpose, the rates for England and Wales 2016-2018 were used (https://www.cancerresearchuk.org/health-professional/cancerstatistics/incidence/age).
Because the mean PRS varies by country, we first regressed the PRS on country and principal components, adjusted for case-control status, and performed the analyses on the residual PRS. The likelihood was maximized using the optimize function in R. 95% confidence intervals (CI) were obtained using a grid of values for a RL , and finding the difference between the log-likelihoods and the maximum loglikelihood.
As a third approach, we derived an approximate estimate a from the log-OR per unit SD (h), by calibrating against PRS 313 as a standard. From equations (A) and (B) in the methods above it can be seen that, under the rare disease assumption, the marginal HR associated with the PRS should approximate the conditional HR. If differential age effects can also be ignored, a should therefore be approximately proportional to h. This allows a to be estimated using PRS313 as a standard. Thus: a APP ¼ h h 0 a 0 where h 0 and a 0 are the corresponding estimates for PRS313. This provides a simple method that could be applied to PRS developed and validated on a different data set.
We computed the relevant parameters for PRS313 and 10 additional PRS [Supplementary Tables S3 and S4; SNP positions based on Genome Reference Consortium Human Build 37 (GRCh37)].
PRS313 includes two variants (22_29203724_C_T and 22_29551872_ A_G) which are correlated with the protein truncating variant CHEK2 Ã 1100delC, and some of the derivative PRS also include these SNPs. This could result in overestimation of risk in CHEK2 Ã 1100delC carriers if the PRS is used in conjunction with gene-panel testing, because BOADICEA assumes that the PRS and major gene genotypes are independent in the population. We therefore also considered PRS without these variants. (Note that CHEK2 p.I157T (22_29121087_A_G) is also included in PRS313 but is only weakly correlated with CHEK2 Ã 1100delC and does not introduce a bias). The means and SDs of each PRS, and the proportion (a) of the polygenic variance attributable to these alternative PRS were derived in the same data set (Supplementary Tables S1 and S2), namely the validation and prospective sets described by Mavaddat and colleagues (3).
All studies included in this analysis were approved by the relevant local ethical review boards and used appropriate consent procedures. SEARCH was approved by the NRES Committee East of England -Cambridge South.

Data availability
Data were generated by the authors and is available on request.

PRS examples
Eleven alternative PRS were constructed. Six of these are modifications of the PRS313, designed for clinical implementation. The BRIDGES PRS was developed as an next-generation sequencing (NGS) panel test to facilitate clinical translational studies of BOADI-CEA implemented in the context of genetic testing of women with a family history (https://bridges-research.eu/). Of 313 variants, 295 could be designed and a further 11 were replaced by surrogate markers (r 2 > 0.9 in Europeans). The PERSPECTIVE I&I PRS was designed to facilitate risk stratified screening in the context of population-based mammographic screening in Ontario and Quebec (11). This PRS was designed as an NGS panel: 287 of 313 markers could be designed and a further 8 were surrogates. The EastGLH PRS was designed by the NHS East Genomic Laboratory Hub for use in a randomized control trial of women testing positive for an inherited pathogenic/likely pathogenic gene variant in BRCA1, BRCA2, PALB2, CHEK2, or ATM, using a NGS panel of 303 markers (12). The PRISMA PRS, designed as genotyping array of 268 markers (37 surrogates), was developed to provide multifactorial cancer risks to women attending genetic clinics in Spain. The eMERGE PRS consisted of 308 markers and is part of a large US study aiming to communicate PRS-based genome-informed risk assessment across multiple diseases (https://emerge-network.org). DBDS299, using data from the Danish Blood Donor Study (https:// bmjopen.bmj.com/content/9/6/e028401) is used in a research study to validate BOADICEA in the Danish population. In addition, we included the earlier PRS77 developed using BCAC data and comprising genome-wide significant SNPs, PRS3820 developed by Mavaddat and colleagues (3) using Lasso penalized regression, and two PRS (WISDOM75 and WISDOM120) developed for the WIS-DOM clinical trial (ref. 13; Clinical Trials identifier NCT02620852). We also considered all of the above PRS without 22_29203724_C_T and 22_29551872_A_G, SNPs correlated with CHEK2 Ã 1100delC, as described in Materials and Methods. Table 1 summarizes the estimated parameters for PRS313 and each of the alternative PRS. As expected, the 6 PRS that are variations on PRS313 have very similar effect sizes, expressed as log-OR per 1 SD, reflecting the fact that only a few variants are not accounted for. The a RL parameters for these 6 PRS are also similar, and only marginally lower than PRS313 estimate (0.441; 95% CI, 0.430-0.445). The effect sizes for PRS77 (both in terms of the log-OR per 1 SD and a) were smaller than for PRS313, while PRS3820 had larger effect sizes. The two WISDOM and PRISMA PRSs also had somewhat smaller effect sizes than PRS313. Removal of the 2 chromosome 22 SNPs had only a small effect on the estimated log-OR per 1SD, and afor example reducing a RL from 0.441 to 0.439 for PRS313. The a values computed using the simpler logistic regression approach (a GLM ) were smaller than those generated using the a RL for all PRS.

PRS parameters
We note that the a RL are approximately proportional to the PRS effect sizes, expressed as odds-ratio per 1 SD (Table 1; Fig. 1). Using the PRS313 as the standard, the predicted a value assuming proportionality is given by a APP ¼ 0:887h (Table 1; Fig. 1). These predicted values were very similar to the a RL values for all PRS.

Discussion
We evaluated approaches to incorporating alternate breast cancer PRSs into the risk prediction algorithm BOADICEA. The a values computed using the simpler logistic regression approach (a GLM ) were consistently smaller than those generated using the a RL , for all PRS. This difference can be explained by the fact that the logistic regression approach does not account for the residual component. Women with a high polygenic component are more likely to develop the disease at an early age, resulting in a negative correlation between the PRS and the residual polygenic component, which leads to an underestimation of the PRS effect size if the latter is not allowed for, a phenomenon related to index event bias (14).
We showed further that the a parameters derived from the log-OR estimate by assuming proportionality were very close to the a RL estimates. This suggests that this approach is likely to be reasonably accurate for other PRS, at least across the range of effect sizes considered here, providing a very straightforward approach to incorporating a PRS developed on another data set if a log-OR estimate is already available.
A striking observation is the very large difference in the means of the different PRS. This reflects the fact that the removal of a few SNPs with important weights can have a substantial effect on the mean. For example, the means for the PRS excluding the chromosome 22 SNPs are higher. While the mean has no intrinsic significance, this emphasizes the importance of correctly normalizing the PRS. In particular, because BOADICEA also incorporates the effects of CHEK2 protein truncating variants, we recommend using the PRS without these SNPs when gene-panel testing is performed.
It is important to note that estimates derived from European ancestry populations may not be applicable to individuals of other ancestries. The effect sizes may differ among populations, for example due to differences in linkage disequilibrium structure. This has been shown for PRS313, for which somewhat smaller effect sizes have been estimated in Asian and African-American populations (15)(16)(17)(18). In addition, the mean PRS can vary significantly by population-PRS313 has a higher mean in both Asian and African-American populations than in Europeans. This again emphasizes the importance of calibrating the PRS to the relevant population distribution. The argument that the a RL is preferable and provides a more reliable estimate of a should also hold in non-European populations.
The analyses used here adjusted the PRS for both the country in which the study was conducted and ancestry informative principal components. An adjustment is necessary because the mean PRS varies by country, even among European populations (and this is not reflected in differences in incidence rates). However, it is possible that adjustment for both country and principal components is over-conservative. Further analyses in large population-specific data sets may be able to address this. The approaches described allow BOADICEA to be adapted for use with any PRS in a consistent manner. However, it should be emphasized that the main validations of BOADICEA used PRS313 (19)(20)(21)(22)(23). For PRS that are substantially different, and particularly as more informative PRS are generated through larger GWAS, further prospective validation in independent external data sets will be required. We also note that the current formulation of BOADICEA assumes that the age-specific effects of the PRS and the residual polygenic component (as measured by the log-HR per 1 SD) are proportional. This significantly simplifies the algorithm, but it is possible that better predictions may be available by allowing differential age-specific effects.
The BOADICEA algorithm has been extensively validated, particularly when incorporating PRS313 (19)(20)(21)(22)(23) in addition to other risk factors. It is available through the CanRisk (www.canrisk.org) tool (24) and is widely used in the context of women with cancer family history or undergoing gene-panel testing, including several ongoing clinical implementation studies. The CanRisk tool provides the facility to incorporate a PRS as a Z-score, providing that the a parameter is known. The methods described here allow other PRSs to be used with BOADICEA via CanRisk and hence facilitate more widespread comprehensive breast cancer risk assessment.