A Genome-Wide Gene-Based Gene–Environment Interaction Study of Breast Cancer in More than 90,000 Women

Genome-wide association studies (GWAS) have identified more than 200 susceptibility loci for breast cancer, but these variants explain less than a fifth of the disease risk. Although gene–environment interactions have been proposed to account for some of the remaining heritability, few studies have empirically assessed this. We obtained genotype and risk factor data from 46,060 cases and 47,929 controls of European ancestry from population-based studies within the Breast Cancer Association Consortium (BCAC). We built gene expression prediction models for 4,864 genes with a significant (P < 0.01) heritable component using the transcriptome and genotype data from the Genotype-Tissue Expression (GTEx) project. We leveraged predicted gene expression information to investigate the interactions between gene-centric genetic variation and 14 established risk factors in association with breast cancer risk, using a mixed-effects score test. After adjusting for number of tests using Bonferroni correction, no interaction remained statistically significant. The strongest interaction observed was between the predicted expression of the C13orf45 gene and age at first full-term pregnancy (PGXE = 4.44 × 10−6). In this transcriptome-informed genome-wide gene–environment interaction study of breast cancer, we found no strong support for the role of gene expression in modifying the associations between established risk factors and breast cancer risk. Our study suggests a limited role of gene–environment interactions in breast cancer risk.


Introduction
Breast cancer is the most commonly diagnosed malignancy in women. In 2020, breast cancer was estimated to be newly diagnosed in 2.3 million women, and meanwhile caused more than 680,000 deaths worldwide (1). Both genetic and environmental factors have been found to contribute to the etiology of breast cancer. Twin studies have estimated that approximately 30% of variance in breast cancer incidence can be explained by genetic variation (2,3). Genome-wide association studies (GWAS) have identified more than 200 independent loci that are associated with breast cancer risk (4). However, these single-nucleotide polymorphisms (SNPs) only explain approximately 19% of the familial relative risk. Meanwhile, observational studies have demonstrated that several environmental and lifestyle risk factors, including age at menarche, body mass index (BMI), alcohol consumption, parity, and use of menopausal hormone therapy (MHT), also affect the risk of breast cancer (5)(6)(7)(8)(9)(10)(11). Exploring the interplay of genetic and environmental risk factors (GxE interactions) is thus crucial in understanding the development of breast cancer.
The Breast Cancer Association Consortium (BCAC) has published multiple studies which reported various interactions between individual SNPs and established risk factors. Nickels and colleagues reported potential interactions between genetic variants and several environmental and lifestyle factors, including number of full-term pregnancies, alcohol consumption, and ever being parous (12). Schoeps and colleagues reported that two SNPs on locus q. may interact with postmenopausal BMI to significantly affect the risk of breast cancer (13). However, other previous genome-wide gene-environmental interaction studies (GWEIS) reported no statistically significant interactions between SNPs and established breast cancer risk factors (4,(14)(15)(16)(17)(18)(19)(20). Statistical power remains one of the primary issues in GWEIS, as they require much larger sample sizes for detecting interactions as compared with marginal associations of similar magnitude (4,21).
Novel statistical methods, such as gene-based testing that incorporates functional information, can substantially reduce the burden of multiple comparisons. As most GWAS hits fall outside of the coding region of genes and are enriched in regulatory elements, it has been hypothesized that many GWASidentified genotype-phenotype associations are driven by the regulatory function on the expression of nearby genes (22)(23)(24). Wu and colleagues conducted a transcriptome-wide association study (TWAS) of breast cancer that systematically investigated the association between predicted gene expression and disease risk, and reported 48 statistically significant genes associations (25).
These results suggest that incorporating SNP-specific regulatory information on gene expression could help discovering meaningful GxE interactions.
In this study, we utilized the genotype and environmental risk factor data collected by the Breast Cancer Association Consortium (BCAC). Using breast tissue-specific transcriptome and genotype data from the Genotype-Tissue Expression (GTEx) project, we built gene expression prediction models for 4,864 genes with a significant heritable component. We then systematically assessed the interactions between these genes and 14 established risk factors in relation to the risk of breast cancer, using a mixed-effects score test called MiSTi (mixed-effects score test for interactions; ref. 26). Our study is the first to incorporate genetically determined gene expression data in the investigation of GxE interactions in breast cancer.

Study Sample
For this study, we obtained breast cancer cases and controls from the cohort studies and population-based case-control studies participating in BCAC. BCAC is a well-established, international collaborative consortium of 84 epidemiologic and clinical breast cancer studies, which is integrated by investigators interested in the inherited risk of breast cancer (4). Genotype data were generated using either the iCOGs or OncoArray genotyping platforms.
Details of the genotype calling, imputation, and quality control processes have been described elsewhere (67). Genotypes were imputed for all samples using the October 2014 (version 3) release of the 1000 Genomes Project dataset as the reference panel. The imputation was conducted using a two-stage approach, using SHAPEIT2 for phasing and IMPUTEv2 for imputation. Approximately 11.8 million SNPs with minor allele frequency (MAF) > 0.5% and imputation quality score (INFO) > 0.3 were included in our analysis.

Building the Prediction Model of Gene Expression
We used the RNA-sequencing and genotype data from 251 individuals published by the GTEx project version 7 to construct prediction models of gene expression in mammary tissue. Details of the GTEx project have been described elsewhere (68).
We built gene expression prediction models for each gene using the "FUSION" pipeline. Only the 1,217,312 SNPs included in the HapMap Phase 3 were included in building the prediction models. To estimate the genetically modulated expression of each gene, we included variants located within 500 kb on either side of the gene boundary. SNP-heritability of each gene was estimated using the REML algorithm implemented in the GCTA software (69). Gene expression models were constructed only if the SNP-heritability of gene expression was statistically significant at P < 0.01. Three prediction schemes, single best eQTL (Top1), LASSO regression, and elastic-net regression, were then utilized to build expression models for each heritable gene. The prediction accuracy of each derived model was then estimated using 5-fold cross-validation, and the best performing model was selected as the final model for each gene. We built gene expression prediction models for a total of 5,043 genes, of which we had breast cancer genotype data for 4,864 genes. The gene expression prediction models were then used as functional weights in the subsequent interaction analyses.

Collection of Breast Cancer Risk Factors
All demographic and breast cancer risk factor data were self-reported via interview or questionnaire prior to or shortly after breast cancer diagnosis (for cases) or the reference date (for controls, defined as the diagnosis date of matched breast cancer case). A total of 14 risk factors were included in the present analysis: age at first full-term pregnancy (per 5-year), average lifetime alcohol consumption (per 10 g/day), age at menarche (per 2-year), premenopausal BMI (per 5 kg/m 2 ), postmenopausal BMI (per 5 kg/m 2 ), breastfeeding history (yes/no), duration of breastfeeding (per 12-month), height (per 5 cm), history of oral contraceptive (OC) use (yes/no), parous (yes/no), number of full-term births (1/2/3/4+), current smoking status, current use of estrogen only (E-only) MHT, and current use of estrogen plus progestogen (E+P) MHT. BMI was analyzed separately for pre-and postmenopausal women, as the association between BMI and breast cancer risk varies across life stages (70). Analyses of reproductive factors were limited to parous women only and analyses of MHT use were limited to postmenopausal women.

Investigating Interactions between Predicted Gene Expression and Environmental Factors
We utilized a mixed-effects based analysis tool "MiSTi" (mixed-effects score test for interactions) to assess potential GxE interactions (26). MiSTi is a hierarchical model that assesses the joint interactions of a set of variants with environmental factors, by leveraging functional information across the variants. The GxE interaction is modeled by two components, one fixed and one random effects component. The fixed-effect component incorporates variantspecific functional information as weights to calculate the weighted burden of the variants, and then quantifies their interaction with the environmental factor. The random effects component involves any residual GxE interaction effect that cannot be addressed by the fixed effects. Here, the fixed effect component represents the interaction between predicted gene expression and the environmental factor, whereas the random effects component represents the residual interaction effects of any SNPs that were not accounted for in predicted gene expression. MiSTi includes a novel testing procedure, which derives two independent score statistics for the fixed effect and the random variance component separately and combines these two statistics through an adaptive weighted linear combination (aMiSTi) to assess the evidence of overall GxE interactions. The statistical power for GxE interaction analysis using MiSTi may be affected by multiple factors, including the LD structure of the gene, proportion of the variation in gene expression explained by the genetic regulatory variants, consistency of direction of effect between random and fixed effect, etc (71). Simulation analysis suggested that under type I error rate of 0.05, a sample size of 5,000 cases and 5,000 controls, for a gene harboring 100 genetic variants of which 27 were functional, MiTIi had an 81.3% of power to detect a significant GxE interaction using the aMiSTi approach when the fixed and random component had the same direction of interaction effect (26).
In each GxE interaction model, we adjusted for study, age (at diagnosis for cases; at reference date for controls), and first five principal components for population structure. For tests of current MHT use (E-only and E+P), we further adjusted for former use of the corresponding MHT (yes/no) in the model, to account for the association between former use of MHT (which attenuates with time since cessation) and breast cancer. To adjust for multiple comparisons, we considered any interactions with aMiSTi p-value < 0.05/(4,864 × 14) = 7.34 × 10 −7 as statistically significant. Because Bonferroni correction makes the strong assumption of independent tests and results in a stringent threshold for

AACRJournals.org
Cancer Res Commun; 2(4) April 2022 significance, we also report GxE interactions with a P value corresponding to a false discovery rate (FDR)<0.2 using the Benjamini-Hochberg (BH) approach as suggestive findings.

Data Availability Statement
The data generated in this study are available upon request from the corresponding author.

Results
The distribution of environmental factors in the study sample is summarized in   Supplementary Fig. S1. We observed an inflation of interaction test statistics for current use of E-only and E+P MHT and thus, any results for MHT use should be interpreted with caution. Overall, no interactions remained statistically significant after adjusting for number of tests performed using Bonferroni correction. The strongest evidence of interaction was observed for the Corf gene on chromosome 13 and age at the first full-term pregnancy ( Table   2, P GXE = 4.44 × 10 −6 ). The heritability of Corf expression was estimated to 0.21, based on 580 SNPs. However, the interaction was mainly driven by the random effects component (P = 1.03 × 10 −6 ) rather than fixed effects component (P = 0.62), which indicates there may be some SNP interaction effects that are beyond the predicted gene expression. Six additional GxE interactions were identified with an FDR-corrected P GXE < 0.2 ( Table 2). These included interactions between RP-D. (3q23) and age at menarche (P GXE = 1.60 × 10 −5 ); EML (2p21) and use of OCs (P GXE = 2.91 × 10 −5 ); history of breastfeeding and AC. (2q37.3, P GXE = 6.85 × 10 −5 ) and AKAP (12p13.32, P GXE = 3.58 × 10 −5 ) in parous women; smoking status and PMSP (7q11.23, P GXE = 4.00 × 10 −5 ), and RP-I. (11q14.1, P GXE = 6.94 × 10 −5 ).

Discussion
In this large transcriptome-informed investigation of GxE interactions in breast cancer, we systematically studied the interactions between predicted gene expression and fourteen behavioral and environmental risk factors. No interaction remained statistically significant after adjusting for number of tests.
However, we identified seven interactions between genes and environmental factors, including age at first full-term pregnancy, age at menarche, breast feeding history, smoking status, and use of OCs, as suggestive findings with FDR-corrected P < 0.20. Our findings did not support a significant role played by gene expression in modifying the associations between established risk factors and breast cancer risk.
The strongest interaction identified was between the Corf gene and age at the first full-term pregnancy. Corf, or LMODN, is a long noncoding RNA (lncRNA) located downstream of the LIM domain only protein 7 (LMO). Few studies have directly focused on the function of Corf gene. The expression of LMO has been found to play an important role in skeletal muscle transcription and cardiac development (72)(73)(74). Irregular expression of the LMO gene has been linked to multiple types of cancer, including breast, thyroid and lung (75)(76)(77)(78). Specifically, Hu and colleagues reported that the knockdown of LMO gene in the breast cancer cell line MDA-MB-231 could impair cell migration (76). In the same study, the upregulation of LMO was also found in the stroma of invasive breast carcinoma, which presumably correlated with the expression of serum response factors that regulate muscle and actin cytoskeleton functions. Epidemiologic studies have consistently shown the positive association between later age at first birth and higher incidence of breast cancer (79)(80)(81), which can at least be partially explained by pregnancy-induced changes in sex hormones. Earlier differentiation of mammary epithelium induced by estrogen and progestogen at pregnancy can reduce the susceptibility of neoplastic transformation and lower the subsequent disease risk (82). However, there is no direct evidence that this mechanism might interplay with the expression of Corf or LMO, and therefore, functional follow-up would be needed to explore this potential finding further.
Some of the six additional genes with an FDR-corrected P interaction < 0.2 identified in our study have previously been linked to breast cancer development.
The translocation and fusion of echinoderm microtubule-associated proteinlike 4 (EML) and anaplastic lymphoma kinase (ALK) have been implicated in various cancers. For example, the EML-ALK fusion has been observed in patients with non-small cell lung cancer (83)(84)(85), as well as in tumor samples from patients with breast and colorectal cancer (86). ALK gene was observed to amplify in most inflammatory breast cancer (IBC; ref. 87), a rare form of disease characterized by an early average age of diagnosis, aggressive histopathologic features, and poor survival (88). There is evidence that IBC cases has a higher prevalence of OC use than other breast cancer cases (89), which suggests that EML may interact with the effect of OC use through inflammatory-related pathways. AKAP is a member of A-kinase anchoring proteins, which has been recognized as a cancer-testis antigen for multiple types of cancer, including ovarian, hepatocellular, and colorectal (90)(91)(92). In an investigation of 162 AACRJournals.org Cancer Res Commun; 2(4) April 2022 tumor and normal tissues of breast, lack of AKAP expression was observed to be significantly associated with triple-negative breast cancer, breast tumor size, tumor stage, and 5-year disease-free survival (93). The PMSP gene has been suggested to interact through gene expression with PMS (94), a gene linked to poor survival from breast cancer (95). Noticeably, PMSP gene belongs to the mismatch repair (MMR) system, which has been observed to have a stronger effect among smokers in affecting colorectal cancer risk, relative to the never smokers (96). Further studies are needed to confirm these suggestive interactions and corresponding biological mechanisms with more direct evidence.
None of the suggestive interaction identified in our study has been observed by previous GxE studies of breast cancer. Otherwise, we were not able to replicate any significant interactions reported by the other studies, including for the genes harboring the variants with significant GxE interaction. This inconsistency could potentially be attributed to various reasons, such as different study populations, analysis approaches and importantly, adjustment for multiple testing. Given the huge number of tests (4,864 genes × 14 environmental risk factors) performed in our analysis, we performed a conservative Bonferroni correction approach and defined a threshold of P < 7.34 × 10 −7 as statistically significant. As this stringent threshold may yield false negative results, we further adopted a more liberal threshold and reported all GxE interactions with FDR-corrected P < 0.20 for each environmental factor.
Our study has several strengths. First, to our knowledge this is the first study to incorporate breast tissue specific gene expression models to inform our GxE interaction analysis. Previous research has suggested that breast cancer susceptibility loci are enriched in regulatory regions identified in breast tissue or cell lines (67,97). Based on this tissue specificity, we utilized genotype and gene expression data from mammary tissue to build gene expression prediction models, and used these models as prior information when assessing GxE interactions. By using a mixed-effects score test which enables the consideration of both fixed and random effects of the interaction, we were able to take into account the effect of genetic variants not involved in gene expression regulation. To avoid potential selection bias, we limited our study population to breast cancer cases and controls from population-based studies. However, our study was based on European ancestry women only, and thus our study conclusions may not be applicable to women with other ancestry. For certain suggestive GxE interactions detected, the results were mainly driven by the random effect component rather than the fixed effect, which made it challenging to explain the mechanisms or pathway underneath. A proportion of the studies included in our analysis adopted the case-control study design, which collected risk factor data based on self-report approaches. Consequently, the risk factor data, although centrally harmonized across all studies, might still be susceptible to recall bias. Our study did not stratify the breast cancer cases by menopausal or estrogen receptor (ER) status and investigate the subtype-specific GxE interaction, which may be a missed opportunity as the disease etiology differs across these subtypes. The results for current use of estrogen-only and estrogen plus progestogen MHT showed evidence of inflated type I error rates, indicating potential issues with distribution or modeling of those risk factors.
In conclusion, our study incorporated information on gene expression to investigate comprehensively the interactions between environmental risk factors and genetic variants on breast cancer risk using a mixed-effects score test approach. Our findings suggest a lack of evidence to demonstrate the role played by gene expression in modifying the associations between established risk factors and breast cancer risk.

Authors' Disclosures
X. Wang reports currently employed by Flatiron Health, Inc., and reports being a stockholder in Roche. Both of these relationships were after the completion of the submitted work, and were not in association with the submitted work.