Numerous common genetic variants have been linked to breast cancer risk, but they only partially explain the total breast cancer heritability. Inference from Nordic population-based twin data indicates rare high-risk loci as the chief determinant of breast cancer risk. Here, we use haplotypes, rather than single variants, to identify rare high-risk loci for breast cancer. With computationally phased genotypes from 181,034 white British women in the UK Biobank, a genome-wide haplotype–breast cancer association analysis was conducted using sliding windows of 5 to 500 consecutive array-genotyped variants. In the discovery stage, haplotype–breast cancer associations were evaluated retrospectively in the prestudy-enrollment data including 5,487 breast cancer cases. Breast cancer hazard ratios (HR) for additive haplotypic effects were estimated using Cox regression. The replication analysis included a prospective cohort of women free of breast cancer at enrollment, of whom 3,524 later developed breast cancer. This two-stage analysis detected 13 rare loci (frequency <1%), each associated with an appreciable breast cancer-risk increase (discovery: HRs = 2.84–6.10, P < 5 × 10–8; replication: HRs = 2.08–5.61, P < 0.01). In contrast, the variants that formed these rare haplotypes individually exhibited much smaller effects. Functional annotation revealed extensive cis-regulatory DNA elements in breast cancer–related cells underlying the replicated rare haplotypes. Using phased, imputed genotypes from 30,064 cases and 25,282 controls in the DRIVE OncoArray case–control study, 6 of the 13 rare–loci associations were found generalizable (odds ratio estimates: 1.48–7.67, P < 0.05). This study demonstrates the complementary advantage of utilizing rare haplotypes to capture novel risk loci and suggests the potential for the discovery of more genetic elements contributing to cancer heritability as large data sets of germline whole-genome sequencing become available.

Significance:

A genome-wide two-stage haplotype analysis identifies rare haplotypes associated with breast cancer risk and suggests that the rare risk haplotypes represent long-range interactions with regulatory consequences influencing cancer risk.

Common- and rare-variant models are two contrasting hypotheses under active debate regarding how germline genetic variants drive disease risk (1). In the past decade, efforts for identifying germline determinants of breast cancer risk have been largely led by genome-wide association studies (GWAS) of single-nucleotide polymorphisms (SNP), which have identified hundreds of common risk variants in large samples (2, 3). Polygenic risk scores comprised such GWAS-discovered risk variants have shown ability to grade women's breast cancer risk by several fold (3, 4). Inference from the Nordic twin data (5), however, clearly supports the rare-variant model over the common-variant counterpart, explaining the specific epidemiologic patterns of breast cancer in the largest, population-based twin study of cancer (Yasui and colleagues submitted as a separate manuscript), consistent with McClellan and King's hypothesis that genetic heterogeneity involving many rare mutations drives the risk of breast cancer and other common diseases (6).

GWAS methodologies for testing common-variant associations with disease risk are mature and have widely been applied. In GWAS, genotype imputation increases genomic coverage and allows researchers to combine multiple studies with potentially different genotyping platforms for higher statistical power (7). On the other hand, testing rare-variant associations with disease risk across the genome in a sample of unrelated subjects has been challenging, if not infeasible, because doing so requires sample sizes large enough to identify rare-variant carriers and a genotyping platform appropriate for rare variants. Whole-genome sequencing (WGS) data from a large cohort would enable studies of rare-variant associations. This important data collection is currently under way, but WGS data are still unavailable in large cohorts or case–control studies.

Here, we consider rare haplotypes, instead of rare, single variants, formed by sets of genotyped variants on an array in large epidemiologic studies for identifying rare, genetic determinants of breast cancer risk. We submit two reasons for considering rare haplotypes: multiple loci interact to exert biological function (e.g., those of regulatory regions to control gene expression); and each risk mutation must have arisen on a chromosome whose rare haplotypes may serve as proxies if unbroken by recombination over time (e.g., the mutation arose recently). Note, however, that this analysis does not consider haplotypes involving an ungenotyped variant and thus is far from a comprehensive examination of all rare haplotypes. Nonetheless, the novel rare haplotypes we identify provide further support for the rare-variant model and suggest consideration of haplotypes for the search for additional genetic contributions to breast cancer risk (and other common diseases).

Specifically, this paper reports genome-wide discovery and replication analyses involving haplotypes of variants genotyped in the UK Biobank (UKBB; ref. 8) with its Axiom Array for breast cancer–risk association in 181,034 “white British” women, including 9,011 breast cancer cases. The UKBB findings were further evaluated using the Discovery, Biology, and Risk Inherited Variants in Breast Cancer (DRIVE) project (9) for generalizability evaluation with 30,064 cases and 25,282 controls of females with European genetic ancestry.

Study subjects

For discovery and replication analysis, we utilized the data from the UKBB, a large cohort study that has collected extensive phenotypic and genetic data on approximately half a million participants across the United Kingdom (8). Upon accessing the UKBB resource under application number 44891, 181,034 women who self-reported as “white” and “British” with very similar genotype principal components (PC; the “white British ancestry subset” designated by UKBB) were included in our analysis (Supplementary Fig. S1). Breast cancer incidence (UKBB Data-Field: 40006; ICD-10 codes: C50.0–C50.9) and other clinically relevant variables were extracted from the UKBB portal. Women selected for this study had ages between 40 and 71 years old at the time of enrollment, of whom 5,487 were diagnosed with breast cancer before enrollment at an average age of 52.9 years [standard deviation (SD) = 7.4 years]. This group was used in our discovery analysis. During follow-up after UKBB enrollment, 3,524 additional women developed breast cancer with an average age of 61.2 years (SD = 7.7 years). This group was used in our replication analysis. The remaining 172,023 women without a breast cancer had an average age of 65.0 years (SD = 7.9 years) at the last follow-up with a median follow-up time of 8.5 years.

To evaluate the generalizability of our UKBB findings (“generalizability analysis”), we used subjects in the DRIVE study, which started in 2010 as part of the GAME-ON initiative (https://epi.grants.cancer.gov/gameon/) to detect new genetic loci for five common cancer types using the Illumina custom OncoArray (9). The DRIVE study comprised 17 studies and collected 60,015 participants from 15 countries. We accessed the DRIVE OncoArray data set through the NIH dbGaP (Study Accession: phs001265.v1.p1) and removed 4,669 (7.8%) participants based on our genotype quality control (QC) and limited the sample to those of European genetic ancestry (see below for details). A final set of 55,346 women, including 30,064 breast cancer cases and 25,282 breast cancer–free controls, was assembled as an independent, slightly broader genetic-ancestry population than UKBB's “white British” for assessing the generalizability of our findings from the UKBB discovery and replication analyses. All women had ages between 18 and 91 years old at the recruitment, and the breast cancer case group had an average age of onset of 61.7 years (SD = 10.7 years). For both UKBB and DRIVE, the control group was not necessarily cancer free, and all participants provided written informed consent under local IRB-approved protocols.

The UKBB-phased genotype data

The current study utilized the phased genotype data provided by the UKBB as the output of whole-chromosome phasing during the genotype imputation process. Details regarding the UKBB's Axiom Array, haplotype estimation, and genotype imputation have been previously described by UKBB scientists (8). Briefly, the phased data downloaded from UKBB included 487,409 participants phased on a per-chromosome basis at 658,720 biallelic autosomal genotyped variants with SHAPEIT3 (10) using the 1,000 Genome Phase 3 haplotypes (11) as the reference. Multiallelic variants were removed by the UKBB's QC (8). We converted the supplied phased data in the BGEN v1.2 format to a new data structure, called GDS (12), to facilitate genome-wide haplotype analysis. Afterward, we created a subset of the phased data for the 181,034 women selected for our analysis. Women were excluded if one: (i) was identified as an outlier in heterozygosity; (ii) showed any sex chromosome aneuploidies; (iii) was related to other participants, as inferred by KING (13); or (iv) withdrew from the UKBB before this study began. Because the phased data had a 100% genotyping rate on all variant sites, the only exclusions we made were 12,274 variants that had zero minor allele frequency (MAF) or failed the Hardy–Weinberg equilibrium (HWE) test (i.e., P < 1 × 10−10) in our analytic subset. A final set of 646,446 variants, including 643,307 SNPs and 3,139 insertion/deletions, remained for analysis. The above data processing was implemented with PLINK (14) and R package “SeqArray” (12).

The DRIVE OncoArray data and genotype imputation

We obtained the DRIVE OncoArray data, containing 528,620 variants genotyped in 60,015 breast cancer cases and controls, from dbGaP under our approved project #28544. Given the limited overlap of variants between the UKBB's Axiom Array and the DRIVE's OncoArray, we first filtered the DRIVE genotype data through variant- and sample-QCs, as described below, and then performed genotype imputation to improve DRIVE coverage of the UKBB's genotyped variants. Variant-QC filters included (i) missing genotyping rate >10%; (ii) MAF = 0, and (iii) HWE test P < 1×10−10. Samples were excluded if one: (i) had a genotyping rate <90%; (ii) failed in the PLINK's sex check (i.e., F-score above 0.3); (iii) was related to any other samples with second-degree or closer relationship (Supplementary Fig. S2A), as detected by KING (13); or (iv) separated from European-ancestry samples in the 1000 Genomes Phase 3 data (11), according to the first two PCs computed by flashPCA (15). Women identified as of European ancestry in this study had PC1 ≤ 0.0025 and PC2 ≥ −0.007, as indicated by vertical and horizontal dotted lines in Supplementary Fig. S2B. The cleaned data contained 433,297 variants and 55,450 unrelated women of European genetic ancestry. Besides, we performed the second PCA on the 55,450 women and the derived top 10 PCs were used to control the fine-scale population structure.

Genome-wide imputation of the cleaned DRIVE genotype data was carried out using the online Trans-Omics for Precision Medicine (TOPMed) Imputation Server (see URL below), which automatically executes QC, phasing with Eagle2 (16), and genotype imputation with Minimac4 (17) on a per-chromosome basis. The pipeline was built on a large reference panel of 194,512 haplotypes and 308 million variants generated by the TOPMed program of the US National Heart, Lung, and Blood Institute (18). Due to the sample limit of the server, the QCed DRIVE OncoArray data were split randomly into three subsets. Each subset contained 15,000 to 20,000 samples and was submitted for genome-wide imputation. After obtaining three sets of imputed data, we calculated imputation score for each variant among the 55,450 women according to the Minimac4 formula (17): | $Rsq\ = \ ( {1/2n} ) \times \mathop \sum _{i\ = \ 1}^{2n} {( {HD{S}_i - \hat{p}} )}^2/[ {\hat{p}( {1 - \hat{p}} )} ]$ |⁠, where 2n denotes the total number of phased chromosomes with the variant (among n women), HDSi indicates the estimated haploid alternate allele dosage (between 0 and 1) for the variant on the ith chromosome, and | $\hat{p}$ | is the observed alternate allele frequency. Rsq ranges from 0 to 1 with values close to 1 indicating high imputation quality. Variants with Rsq ≥ 0.8 were used for our haplotype analysis. Moreover, 104 women of unknown age (code = 888) were removed before the analysis. In the end, we generated a subset of 615,181 phased imputed variants on 55,346 women, containing 95% of UKBB-phased variants, for the generalizability analysis.

Genome-wide haplotype analysis: discovery and replication

Genome-wide haplotype analysis for breast cancer risk was carried out using the UKBB-phased genotypes under the following discovery-replication framework. In the discovery stage, haplotypes were directly constructed from a set of consecutive genotyped variants in windows of fixed size W = 5, 10, 20, 30, 50, 100, 250, and 500 variants. For each chromosome, we shifted the window by one variant in the 5′ → 3′ direction. As such, our analysis was designed to cover every set of 5–500 consecutive genotyped variants. Each haplotype was coded as 0, 1, or 2 copies per person in the same way as SNPs, and its additive effect was tested statistically. The number of haplotypes observed in a window varied depending on the window size and the underlying linkage disequilibrium (LD) pattern. In the discovery analysis, each haplotype was tested for association with breast cancer risk in the preenrollment portion of the UKBB data, consisting of 5,487 breast cancer cases and 175,547 women free of breast cancer at study enrollment. To reduce the unnecessary computational burden, our discovery analysis tested only the haplotypes with frequency ≥ 0.1% (10 per 10,000 chromosomes) in the 5,487 preenrollment cases. For each haplotype, the hazard ratio (HR) of developing breast cancer was estimated by multivariable Cox regression with age as the time axis and right censoring at the study enrollment, controlling for the first 10 genotype PCs. The analysis was implemented using R packages “survival” (see URL below) and “SeqArray” (12). Statistical significance for the discovery analysis was set at Wald test P values of Cox regression (Cox P) < 5×10−8. The genome-wide scan was performed in parallel on 22 autosomes with chunks of 500 windows for each fixed size. All runs were completed on our institutional high-performance computing cluster with 20 CPU cores per chromosome and 4 Gb memory per core.

In the replication analysis, haplotypes that met the statistical significance threshold of the discovery analysis were evaluated in the postenrollment part of the UKBB data, excluding the 5,487 preenrollment breast cancer cases. This replication analysis included 3,524 postenrollment-onset breast cancer cases and 172,023 women without a breast cancer at the last UKBB follow-up before our UKBB data access. Haplotypes’ breast cancer HRs were estimated by the same multivariable Cox model, taking age as the time axis starting from the study enrollment to the last follow-up and controlling for the first 10 genotype PCs. The replication statistical significance threshold was set at Cox P < 0.01: see below for an assessment of false discovery/replication by the statistical significance thresholds we used.

Lead haplotype among correlated haplotypes

Because of our use of overlapping windows, many discovered/replicated haplotypes are highly correlated/redundant. To consolidate each set of these correlated haplotypes into a single risk locus (or region) represented by a “lead haplotype,” we merged replicated haplotypes by LD-based clumping using PLINK (14) with an LD threshold of r2 > 0.1 and a physical distance threshold of 500 kb. The position of the first variant on each replicated haplotype was used as the haplotype's genomic position. Independent risk loci/region were determined as the mutually exclusive clumps formed by the clumping procedure. We calculated the overall HR and Cox likelihood-ratio test (LRT) P value for each replicated haplotype, using the entire set of UKBB data (combining the discovery and replication data into a total of 9,011 breast cancer cases and 172,023 breast cancer–free women). Within each risk locus/region, the haplotype exhibiting the smallest P value was considered the “lead haplotype.”

Conditional analysis of rare haplotypes

To test whether the identified rare haplotypes were independent of known (common) risk variants, we searched for any GWAS variants reaching P < 5 × 10−8 in the recent meta-GWAS on the overall breast cancer (2) within ± 500 Kb of each rare haplotype. Then, we estimated the rare haplotypes’ effects conditioned on their respective GWAS variants most strongly associated with breast cancer risk, in the 181,034 UKBB women by Cox regression. Because many GWAS variants were not present in the UKBB-phased data, we imputed those variants by using the TOPMed Imputation Server and used well-imputed variants (i.e., minimac Rsq ≥ 0.80) for the conditional analysis. To explore whether the rare haplotypes delineated a specific combination of single variants, we applied haplotype analysis using known GWAS variants to identify any GWAS variants–derived haplotypes associated with breast cancer risk in 181,034 UKBB women. Statistically significant haplotypes were identified by a forward stepwise regression at Cox P < 0.05. Afterward, we tested the coexistence between the rare haplotypes we found in the original haplotype analysis and the GWAS variants–derived haplotypes by the Fisher exact test.

Exploration of false discovery rate by permutation experiment

Because we used highly overlapping windows and tested many correlated/redundant haplotypes, it is not clear whether our statistical significance thresholds, P < 5 × 10−8 for discovery and P < 0.01 for replication, adequately controlled the false discovery rate. To assess the extent of potential false discovery in the discovery-replication framework with the UKBB data, we conducted a permutation experiment. Thirteen permuted data sets were generated in which the phenotype vectors (i.e., the vectors of breast cancer diagnosis status, age variables, and genotype PCs) were randomly shuffled among the 181,034 women and used with their original unshuffled genetic data vectors. This simulated the null hypothesis condition where there is no association between breast cancer and haplotypes, except by chance, while keeping the original associations of age and the PCs with breast cancer. For each permuted data set, we repeated the same two-stage haplotype analysis we conducted above, except for W = 500 variants (there was no replicated haplotype of W = 500 in the original analysis). The number of haplotypes that met the same statistical significance thresholds (P < 5×10−8 for discovery and P < 0.01 for replication) and the number of independent risk loci/region after the LD-based clumping were used as approximate estimates of the numbers of false discoveries we might expect under the null hypothesis of no association.

Individual variant association with breast cancer risk

We performed a standard GWAS analysis to assess individual variants’ associations with breast cancer risk using the UKBB data. For each variant, the breast cancer–genotype association was assessed under the additive model using multivariable Cox regression, in the UKBB data of 9,011 breast cancer cases and 172,023 breast cancer–free women. Age was taken as the time axis and the model controlled for the first 10 genotype PCs. The GWAS was completed parallelly on 22 autosomes in chunks of 200 variants using 10 CPU cores and 4 Gb memory per core.

Generalizability analysis

In assessing the generalizability of the replicated findings, a large, independent breast cancer data set with phased genotype data that contained the UKBB genotyped variants would have been ideal. In the absence of such a data set, we used the cleaned DRIVE breast cancer case–control data for generalizability evaluation with statistical phasing and imputation performed as described above. Specifically, we assessed the lead haplotype of each risk locus/region identified by our UKBB two-stage discovery/replication analysis in the QCed DRIVE imputed data including 30,064 breast cancer cases and 25,282 breast cancer–free controls.

Because statistical phasing and imputation of the DRIVE data are subject to errors, the following strategy was used. First, we utilized pairs of “HDS” values (“estimated phased haploid alternative allele dosages”) available for each variant of each woman in the final phased/imputed DRIVE data. The two HDS values for a variant of a woman denote the two probabilities of the alternative allele on her two phased haplotypes. The alternative allele was called when the HDS value was above 0.5; otherwise, the reference allele was called. Second, we reduced the number of variants required to define each lead haplotype to avoid the impact of potential imputation errors in unnecessary variants. Specifically, for the ith replicated lead haplotype defined by Wi consecutive UKBB variants, we included the subset of Wi variants that passed the imputation-accuracy filter (Rsq ≥ 0.80) and applied tree-based recursive partitioning, implemented by R package “rpart,” to reduce the number of variants necessary to define the lead haplotype to miWi variants, with the complexity parameter (CP) 10−10 and the minsplit parameter 40 (and the other parameters at their default values). At one locus/region (locus #19) where the reduced variant set did not approximate the lead haplotype well, we relaxed the minimac filter to Rsq ≥ 0.3 for common variants with a MAF ≥ 1% and reran the variant reduction for the new lead haplotypes.

The evaluation scheme above (hereafter referred to as “original haplotype calls”) was our primary scheme for defining each DRIVE woman's diplotypes of lead haplotype i. which can be expressed as:

where “Allele” is the code (0 or 1) for the allele of the lead haplotype, and the first index “1” or “2” of HDS denotes the woman's first or second chromosomes. Note that |HDS1j − Allelej| and |HDS2j − Allelej| indicate imputation uncertainty of the jth variant of the reduced variant set of the ith lead haplotype, with possible values between 0 and 1. Values close to 0 and 1 indicate high confidence in the agreement or disagreement, respectively, of the imputed allele with the allele in the lead haplotype, with the boundary 0.5 corresponding to an ambiguous imputation result.

To account for the imputation uncertainty in the generalizability analysis more conservatively, we supplement our primary scheme with the following subschemes for defining diplotypes in DRIVE. The secondary scheme, “high-confidence haplotype calls,” considered only variants imputed with high confidence by using a smaller value of c (considering c in the range 0–0.01 with an increment of 0.001 and 0.01–0.50 with an increment of 0.01) and reporting the most statistically significant association. The tertiary scheme, “high-confidence haplotype calls with one exception,” removed one variant from the reduced variant set of the lead haplotype and applied the secondary scheme to the remaining variants in the set (with the excluded variant still satisfying |HDS − Allele| < 1).

The primary, secondary, and tertiary schemes were applied sequentially. The association between each imputed lead haplotype and breast cancer case/control status in DRIVE was assessed by logistic regression, adjusting for age and the first 10 genotype PCs, where LRT P < 0.05 served as the statistical significance measure of the association. The Fisher exact test was also conducted to supplement the LRT as the frequencies of lead haplotypes were low.

Functional annotation

To explore the biological functions of each breast cancer–risk haplotype, we annotated variants on the rare lead haplotypes that met all three statistical significance thresholds of the discovery, replication, and generalizability evaluation, using a variety of functional data sets (see Data Availability below for URLs). UCSC LiftOver chain files were used to convert genomic coordinates between hg19 (i.e., DRIVE OncoArray data) and hg38 (i.e., DRIVE TOPMed-imputed data). Gene-based annotations, such as nearest genes and functional consequences, were extracted using WGSA v0.85 (19) with GENECODE v37. Regulatory variants were annotated by searching the following resources: (i) cis-eQTLs reported in the GTEx v8 database, passing 5% false discovery rate in the breast mammary tissue (20); (ii) human active enhancers in the HACER (21) and enhanceratlas2 (22), limiting to five breast cancer–related cell/tissues (i.e., MCF7, MCF10A, HCC1806, HMEC, and T47D); (iii) enhancers experientially characterized by a STARR-seq in the MCF7 cell line (ENCODE accession: ENCSR547SBZ), where significant enhancer peaks were defined by P < 0.05; (iv) topologically associating domains (TADs) and chromatin loops predicted by the 3D Genome browser, where variants with putative regulatory roles were considered within 20 Kb of a TAD boundary (23) or a chromatin-loop anchor regions where DNA segments physically interact (24); and (v) DNA-binding motifs from HaploReg v4 (25), where we considered variants to be regulatory if the absolute change of the reported LOD score between reference and alternative alleles (|LODalt − LODref|) was above 3, with a positive value of LOD change indicating that the alternate allele likely resulted in stronger motif binding and a negative change indicating that the alternative allele likely caused weaker motif binding. We combined the above annotations to summarize potential functional features at the haplotype level.

Data availability statement

The data generated in this study are available within the article and its Supplementary Data files. Individual-level genotype and phenotype data are publicly available by submitting request to the UK Biobank and dbGaP (study accession: phs001265.v1.p1). Analytic scripts and codes are available at https://github.com/fwplace/HapProj.

Public databases and software used in this study include:

Genome-wide haplotype analysis using the UKBB-phased data

The overall design of our genome-wide haplotype analysis is shown in Fig. 1A. The discovery analysis included 181,034 female participants who were classified as “white British” by the UKBB (see Methods for inclusion/exclusion criteria). All women were biologically unrelated and had a homogeneous genetic background classified as a specific subgroup of European genetic ancestry in comparison with the 1000 Genomes Phase 3 European population samples (Supplementary Fig. S1). The set of phased haplotypes from UKBB for the 181,034 women contained 646,446 genotyped variants, enabling the construction of haplotypes genome-wide through sliding windows of 5, 10, 20, 30, 50, 100, 250, and 500 consecutive genotyped variants. For each window size, 4.8–88.4 million haplotypes that had frequency ≥ 0.1% (i.e., 10 per 10,000 chromosomes) in 5,487 preenrollment breast cancer cases were identified and individually tested for breast cancer-risk associations by Cox regression with age as the time axis and right censoring at study enrollment, adjusting for the first 10 genotype PCs. The genome-wide scans detected 5,858 haplotypes across the eight fixed-size sliding windows at P < 5×10−8 (Table 1).

Figure 1.

The overall design of genome-wide haplotype analysis for breast cancer. A, Flowchart of the two-stage haplotype analysis using the UK Biobank data set and generalizability evaluation using the DRIVE data set. B, Counts of common (frequency ≥ 1%) and rare (frequency < 1%) that met the discovery (P < 5×10−8) and replication (P < 0.01) statistical significance threshold in the UK Biobank phased data under various window sizes of W = 5–500 consecutive genotyped variants.

Figure 1.

The overall design of genome-wide haplotype analysis for breast cancer. A, Flowchart of the two-stage haplotype analysis using the UK Biobank data set and generalizability evaluation using the DRIVE data set. B, Counts of common (frequency ≥ 1%) and rare (frequency < 1%) that met the discovery (P < 5×10−8) and replication (P < 0.01) statistical significance threshold in the UK Biobank phased data under various window sizes of W = 5–500 consecutive genotyped variants.

Close modal
Table 1.

Summary of the two-stage haplotype analysis results using the UK Biobank phased data.

Window sizeaEffective windowsGenomic lengthDiscovery analysisReplication analysis
Coverage, %bTotal testscP < 5×10−8P < 0.01No. of locid
W = 5 646,328 17.2 Kb 41.272 4,802,633 90 81 
W = 10 646,248 38.8 Kb 20.420 12,237,248 117 80 
W = 20 646,028 81.8 Kb 8.350 34,466,021 215 61 
W = 30 645,808 124.9 Kb 4.338 58,110,973 473 21 
W = 50 645,368 211.0 Kb 1.611 88,427,869 995 15 
W = 100 644,263 426.4 Kb 0.338 85,988,703 1,769 117 
W = 250 618,760 1.08 Mb 0.034 25,950,002 1,673 61 
W = 500 283,807 2.52 Mb 0.007 3,426,133 526 
Window sizeaEffective windowsGenomic lengthDiscovery analysisReplication analysis
Coverage, %bTotal testscP < 5×10−8P < 0.01No. of locid
W = 5 646,328 17.2 Kb 41.272 4,802,633 90 81 
W = 10 646,248 38.8 Kb 20.420 12,237,248 117 80 
W = 20 646,028 81.8 Kb 8.350 34,466,021 215 61 
W = 30 645,808 124.9 Kb 4.338 58,110,973 473 21 
W = 50 645,368 211.0 Kb 1.611 88,427,869 995 15 
W = 100 644,263 426.4 Kb 0.338 85,988,703 1,769 117 
W = 250 618,760 1.08 Mb 0.034 25,950,002 1,673 61 
W = 500 283,807 2.52 Mb 0.007 3,426,133 526 

aW, the number of consecutive genotyped variants that define the haplotypes. Each haplotype window is shifted by one variant.

bCoverage represents the number of haplotypes analyzed over the total number of haplotypes found across the genome given a fixed window size.

cThe number of haplotypes meeting the criterion of frequency ≥0.1% among 5,487 preenrollment breast cancer cases.

dThe independent loci/regions were determined by PLINK LD clumping with r2 > 0.10 and a maximum distance of 500 kb between the lead haplotype and the other correlated haplotypes.

The 5,858 haplotypes from the discovery analysis of the retrospective component of the UKBB data were evaluated in the following replication analysis among the breast cancer–free women at their UKBB enrollment (i.e., the noncases in the discovery analysis), following them prospectively for breast cancer incidence starting from the enrollment. Note that although the replication analysis used the same UKBB cohort, it is independent of the discovery analysis because all women in the replication analysis were noncases in the discovery analysis, were breast cancer free at the start of the prospective replication analysis, and the replication analysis utilized case/noncase data that emerged after the UKBB study enrollment, i.e., after the completion of case/noncase data utilized in the retrospective discovery analysis. Using the same Cox regression with 3,524 postenrollment breast cancer cases among 175,547 women breast cancer free at the enrollment, 436 haplotypes replicated at P < 0.01 (Table 1). These replicated haplotypes were split into two mutually exclusive groups: one containing 243 common short haplotypes with a window size of 5 to 30 variants, and the other containing 193 rare long haplotypes with a window size of 50 to 250 variants (Fig. 1B). For windows of 500 variants, no haplotype replicated.

To explore the false discovery rate of our discovery-replication analyses of haplotypes with the UKBB-phased data, we ran 13 permutation experiments by randomly permuting the breast cancer status of 181,034 women together with age and genotype PCs, while keeping the original genetic data, to create the null hypothesis condition of no breast cancer–haplotype association. Then we performed the identical genome-wide discovery and subsequent replication analyses of the permuted data set with the same statistical analysis and significance thresholds. Although the permuted discovery analysis yielded similar numbers of genome-wide significant haplotypes as the original discovery analysis, indicating that many of the “original-analysis discoveries” are likely false positives, the subsequent replication analysis showed very few replications, indicating the replicated haplotypes of the original analysis are mostly valid (Supplementary Table S1).

Prioritization of 13 rare haplotype loci for breast cancer risk

To remove redundancies in the replicated haplotypes, we applied LD clumping to the 436 replicated haplotypes to consolidate each set of correlated/overlapping haplotypes into a single risk locus/region. After LD clumping, the replicated haplotypes were grouped into 20 distinct genetic loci/regions, each being represented by a “lead haplotype” whose breast cancer–association P value, computed by the combined Cox-regression analysis of 9,011 (preenrollment + postenrollment) breast cancer cases and 172,023 breast cancer–free controls, was the smallest among the replicated haplotypes of the same locus/region (Supplementary Table S2). Of the 20 risk loci ranked by the replication P values of the lead haplotypes, seven (loci #1–4, 7, 8, and 10) were mapped by common haplotypes with a window size of 5 variants and genomic length of 5.2 to 42.2 kb. The seven common lead haplotypes (i.e., those of FGFR2 on 10q26.3, CASC16 on 16q12.1, DIRC3-AS1 on 2q35, LINC01488/CCND1 on 11q13.3, C5orf67/MAP3K1 on 5q11.2, NEK10 on 3p24.1, and CASC8/CASC21 on 8q24.21) conferred a modest breast cancer risk among 181,034 women (HR = 1.11–1.30, P = 2.3×10−67–4.3×10−12). These common haplotypes appeared to be tagged by their respective lead variants of standard GWAS analyses (HR = 1.11–1.30, P = 4.1×10−66–2.6×10−5) or be in tight LD with known GWAS hits (Supplementary Table S3).

In contrast, the remaining 13 loci (loci #5, 6, 9, and 11–20) were mapped by rare haplotypes with larger window sizes covering longer genomic regions and gene regulatory elements (Table 2 and Fig. 2AN). The 13 rare lead haplotypes exhibited relatively large breast cancer risk with HRs of 2.84 to 6.10 in the discovery analysis and HRs of 2.08 to 5.61 in the replication analysis. From the standard GWAS analysis using the UKBB-phased data, no variants within ±500 kb of the 13 rare haplotypes were found to pass the genome-wide significance threshold (i.e., P < 5×10−8) for association with breast cancer risk, with the lowest P values ranging between 4.1×10−4 and 0.040. The rare haplotype on 22q12.1 (locus #5), formed by 250 consecutive genotyped variants spanning multiple genes (e.g., MN1, PITPNB, TTC28, CHEK2, and KREMEN), demonstrated the most statistically significant association with breast cancer risk (frequency = 0.13%, HR = 3.49, P = 1.5×10−15; Fig. 2A). The lead SNP rs35313550 on the rare haplotype individually conferred a modest breast cancer risk (HR = 0.85, P = 4.1×10−4) among the 181,034 UKBB women and has been reported in the recent meta-GWAS [odds ratio (OR) = 1.19, P = 1.0×10−18; ref. 2]. By searching the meta-GWAS data (2), we found 244 GWAS variants in the region of locus #5, of which 217 were well imputed (i.e., minimac Rsq ≥ 0.80) in UKBB women. The rare haplotype remained associated with breast cancer risk (HR = 2.96, P = 1.0 × 10−18) when conditioning on the meta-GWAS's most significant variant rs62237573 (GWAS-reported P = 2.0×10−41; Supplementary Tables S4 and S5) among the 217. To explore if the rare haplotype delineated a specific combination of single variants, we performed haplotype analysis using nine of the 217 variants (“GWAS9”) that were on the rare 250-variant haplotype. Three GWAS9-derived haplotypes were significantly associated with breast cancer risk at P < 0.05, and the strongest haplotype association was with a common haplotype GWAS9.h7 (HR = 1.18, P = 0.001; Supplementary Table S6). The rare haplotype we reported coexisted with GWAS9.h7 in 249 women, accounting for 3.5% of the GWAS9.h7 carriers. These 249 women were at a 3-fold higher risk than those with GWAS9.h7 only (HR = 3.31 vs. 1.10; Supplementary Table S7). Next, we expanded the analysis to the 217 single variants in the region (“GWAS217”). Nine GWAS217-derived haplotypes were detected at P < 0.05, and the strongest association was a rare haplotype GWAS217.h57 (HR = 2.96, P < 10−20). The rare haplotype coexisted with the GWAS217.h57 in 246 women, accounting for 43.6% of the GWAS217.h57 carriers. Women presented with both rare haplotypes had a 1.4-fold higher risk than those with GWAS217.h57 only (HR = 3.52 vs. 2.52; Supplementary Tables S6 and S7).

Table 2.

Identification of 13 rare haplotype loci for breast cancer risk in the UK Biobank phased data.

Identification of 13 rare haplotype loci for breast cancer risk in the UK Biobank phased data.
Identification of 13 rare haplotype loci for breast cancer risk in the UK Biobank phased data.
Figure 2.

Genetic association plots of 13 breast cancer rare haplotypes identified in the UK Biobank data. A–M, Plots of the 13 loci with rare haplotypes that met the discovery (P < 5×10−8) and replication (P < 0.01) statistical significance thresholds in the UK Biobank phased data (also listed in Table 2). Each panel is labeled by its lead haplotype (“Chromosome: Start position – End position”). The left y-axis indicates association P values for the original haplotype (orange segment) the reduced haplotype (purple segment), and individual variants from the standard GWAS analysis (black dots), calculated by Cox regression using all “white British” UK Biobank women. The genomic coordinates shown on the x-axis are based on GRCh37. The brown line shows the recombination rate computed by Bherer et al. (26) among European females, as indicated on the right y-axis. The bottom annotation tracks show locations of TADs (green segments), chromatin loops (magenta curves), and genes collected from 3D Genomes Browser and GENECODE v37. N, the fraction of functionally annotated variants on rare haplotypes (see Materials and Methods for details).

Figure 2.

Genetic association plots of 13 breast cancer rare haplotypes identified in the UK Biobank data. A–M, Plots of the 13 loci with rare haplotypes that met the discovery (P < 5×10−8) and replication (P < 0.01) statistical significance thresholds in the UK Biobank phased data (also listed in Table 2). Each panel is labeled by its lead haplotype (“Chromosome: Start position – End position”). The left y-axis indicates association P values for the original haplotype (orange segment) the reduced haplotype (purple segment), and individual variants from the standard GWAS analysis (black dots), calculated by Cox regression using all “white British” UK Biobank women. The genomic coordinates shown on the x-axis are based on GRCh37. The brown line shows the recombination rate computed by Bherer et al. (26) among European females, as indicated on the right y-axis. The bottom annotation tracks show locations of TADs (green segments), chromatin loops (magenta curves), and genes collected from 3D Genomes Browser and GENECODE v37. N, the fraction of functionally annotated variants on rare haplotypes (see Materials and Methods for details).

Close modal

Similarly, locus #15, represented by a 100-variant haplotype near FRY/BRCA2 (length = 361 kb, HR = 4.11, P = 2.8 × 10−10; Fig. 2H), mapped to a 361-Kb region containing 4 GWAS variants (2). The rare haplotype we reported in locus #15 remained associated with breast cancer risk (HR = 3.76, P = 2.1×10−8) conditioning on the most significant single variant rs11571833 (GWAS-reported P = 2.6 × 10−17) of the 4 in the region (2). Haplotype analysis using the 4 single variants (“GWAS4”) detected a rare haplotype GWAS4.h3 modestly associated with breast cancer risk (HR = 1.24, P = 0.008). The rare 100-variant haplotype coexisted with GWAS4.h3 in 106 women who were at a 3.7-fold higher risk than those with GWAS4.h3 only (Supplementary Tables S8 and S9). Three rare haplotypes (loci #14, 16, and 18) contained variants previously reported at suggestive level (2, 3, 27, 28): GRIN2A/ATF7IP2 on 16p13.2 (50 variants, length = 106 kb, HR = 2.93, P = 2.2×10−10), AC107463.1/LINC02485 on 4q28.3 (100 variants, length = 656 kb, HR = 4.42, P = 9.4×10−10), and KIAA1217 on 10p12.2 (100 variants, length = 523 kb, HR = 4.44, P = 2.6×10−10; Fig. 2G, I, and K). To our knowledge, the remaining rare haplotypes (loci #6–13, 17, and 19–20) have not been identified for breast cancer risk, they are C1QTNF3-AMACR/RAI14 on 5p13.2 (100 variants, length = 658 kb, HR = 5.91, P = 5.0×10−14); LSAMP on 3q13.31 (100 variants, length = 428 kb, HR = 4.54, P = 4.4×10−11); FANC1/POLG on 15q26.1 (250 variants, length = 772 kb, HR = 5.18, P = 4.9×10−11); GOT2/SLC38A7 on 16q21 (100 variants, length = 519 kb, HR = 4.75, P = 1.5×10−10); LINC02141/AC00981.2 on 16q21 (100 variants, length = 487 kb, HR = 4.70, P = 2.0×10−10); AC073062.1/AC016730.1 on 2p24.3 (50 variants, length = 211 kb, HR = 2.53, P = 2.1×10−9); MYB/AHI1 on 6q23.3 (100 variants, length = 493Kb, HR = 3.68, P = 7.5×10−11); and PARD3 on 10p11.21 (250 variants, length = 1.1 Mb, HR = 3.62, P = 3.1×10−11; Fig. 2BF, J, and LM).

To characterize the 13 rare loci, we analyzed the underlying LD-block patterns using a graphical-model–based LD partition method Big-LD (29). The Big-LD detected LD blocks (formed by a subset of highly correlated variants) and orphan variants (those not associated with the other variants in the region) for each of the 13 rare loci (Fig. 3AM). The frequencies of haplotype segments formed by variants in single LD blocks are often higher than those of the original rare haplotypes, indicating that the rare lead haplotype may have arisen from multiple LD blocks via recombination, except for locus #13, which might have arisen from a single small LD block (Fig. 3F). We tried to identify critical regions of each rare lead haplotype by excluding 20% of the consecutive variants and evaluating breast cancer–risk associations with truncated haplotypes that were formed by the remaining variants. We found that exclusion of variants on 5′ and 3′ ends led to a reduction of statistical significance of the identified rare haplotypes’ associations with breast cancer risk (Supplementary Fig. S3 and Supplementary Table S10). Moreover, functional annotation with public databases revealed that a large proportion of variants on the rare lead haplotypes exhibited putative roles in regulating gene expression, enhancer activity, or 3D chromatin interactions (Supplementary Table S11). Of note, many annotated variants resided in the anchor regions of human chromatin loops and were predicted to alter the bindings of DNA motifs (Fig. 2N).

Figure 3.

LD heat map plots of 13 rare haplotypes identified in the UK Biobank data. A–M, 13 lead risk haplotypes in Table 2. In each panel, the upper heat map plot displays pairwise D’ (top triangle) and r2 (bottom triangle) values computed using phased genotypes of 181,034 women. The heat map color key is shown in the last row. The bottom plot depicts the estimated LD structure calculated by the Big-LD algorithm [Kim and colleagues (29)] with |r| ≥ 0.50and a maximal distance of 1 Mb between any two correlated variants. The y-axis shows the frequencies of truncated rare haplotypes (whose alleles match those of the original haplotype; cyan segments) and orphan variants (which were not in LD with other variants; blue dots). The x-axis represents the SNP index, and the vertical gray lines indicate the reduced variant sets used in the generalizability evaluation.

Figure 3.

LD heat map plots of 13 rare haplotypes identified in the UK Biobank data. A–M, 13 lead risk haplotypes in Table 2. In each panel, the upper heat map plot displays pairwise D’ (top triangle) and r2 (bottom triangle) values computed using phased genotypes of 181,034 women. The heat map color key is shown in the last row. The bottom plot depicts the estimated LD structure calculated by the Big-LD algorithm [Kim and colleagues (29)] with |r| ≥ 0.50and a maximal distance of 1 Mb between any two correlated variants. The y-axis shows the frequencies of truncated rare haplotypes (whose alleles match those of the original haplotype; cyan segments) and orphan variants (which were not in LD with other variants; blue dots). The x-axis represents the SNP index, and the vertical gray lines indicate the reduced variant sets used in the generalizability evaluation.

Close modal

Generalizability evaluation of 13 rare lead haplotypes for breast cancer risk using the DRIVE OncoArray data

We sought to assess the generalizability of the associations of 13 rare lead haplotypes with breast cancer risk in an independent sample of 30,064 breast cancer cases and 25,282 breast cancer–free controls from the DRIVE study. All 55,346 women included in this analysis were of European genetic ancestry (i.e., those who had PC1 ≤0.0025 and PC2 ≥ −0.007 and were separated by dotted lines in Supplementary Fig. S2B), for which the “white British” of the UKBB cohort is a homogeneous subgroup (Supplementary Fig. S1), and thus we refer this analysis “generalizability evaluation” to a broader group. DRIVE samples were genotyped using the Illumina OncoArray. Because only 12% of genotyped variants in the UKBB-phased data were included in the OncoArray, we performed a genome-wide imputation of the DRIVE OncoArray genotypes to improve coverage using the TOPMed Imputation Server (see Methods for more details). After postimputation QC, we obtained high-quality phased, imputed genotypes on 1,304 (i.e., minimac Rsq ≥ 0.80) of the 1,650 UKBB variants that formed the 13 rare lead haplotypes (Supplementary Fig. S4A). Although MAFs of these individual variants were highly consistent between UKBB and DRIVE (Supplementary Fig. S4B), our analysis of haplotypes was subject to potential phasing/imputing errors. To alleviate this problem, we took the following strategy.

First, using the UKBB data, we reduced the set of variants within the window of each rare lead haplotype to a minimum subset required to define the lead haplotype by limiting to well-imputed variants (i.e., minimac Rsq ≥0.80 in DRIVE) only and applying recursive partitioning, implemented in the R package “rpart,” so that phasing/imputation errors of unnecessary variants would not affect the analysis. As shown in Fig. 2AM, 13 rare lead haplotypes were well approximated by their minimum subsets consisting of 3 to 17 variants, much smaller numbers of variants than what their original haplotype windows contained. We reevaluated the associations of the reduced rare lead haplotypes with breast cancer risk in the UKBB data and confirmed little change in their associations from the original lead haplotype (i.e., P values became slightly less significant due to the approximation; Fig. 2AM). The exception was locus #19, where the lead haplotype could not be approximated. For this locus, we relaxed the Rsq filter to ≥0.30 for common variants (i.e., MAF ≥1%): this resulted in a reduced rare lead haplotype of 4 variants, which yielded a similar association as the original lead haplotype (Fig. 2L). With LD analysis, we found consistently small r2 but relatively large D’ values among variants in the minimum subsets (Supplementary Fig. S5A and S5B).

Next, we evaluated the associations of the reduced rare lead haplotypes with breast cancer risk using the DRIVE data. Using the “Original haplotype calls" scheme (see Materials and Methods), two reduced rare lead haplotypes [locus ID #5: 8 variants, length = 1.40 Mb, OR = 1.50, likelihood ratio test (LRT) P = 0.012; locus ID #13: 3 variants, length = 98 kb, OR = 2.93, LRT P = 0.020] showed associations with breast cancer risk, adjusting for age and the first 10 genotype PCs (Table 3). For the remaining 11 reduced rare lead haplotypes with LRT P > 0.05, we explored two modified haplotype calls where risk-haplotype carriers were called using a subset of variants in the minimal set with the fewest imputation errors. Using the “high-confidence haplotype calls” scheme (see Materials and Methods), two additional reduced rare lead haplotypes of locus ID #17 (7 variants, length = 201 kb, OR = 2.19, LRT P = 0.036) and locus ID #18 (5 variants, length = 458 kb, OR = 7.35, LRT P = 0.019) showed breast cancer–risk associations. Furthermore, allowing one variant to have possibly been imputed poorly, the “high-confidence haplotype calls with one exception” scheme (see Materials and Methods), two additional reduced rare lead haplotypes on locus ID #15 (4 variants, length = 318 kb, OR = 1.48, LRT P = 0.007) and locus ID #16 (5 variants, length = 232 kb, OR = 1.54, LRT P = 0.030) showed associations with breast cancer risk (Table 2). By allowing a one-variant exception, the frequencies of these two lead haplotypes increased, resulting in attenuated but statistically significant effect sizes on breast cancer risk. Taken together, 6 of the 13 reduced rare lead haplotypes showed successful generalizability in the DRIVE imputed data.

Table 3.

Generalizability evaluation results of 6 reduced lead rare haplotypes in the DRIVE data with TOPMed imputation.

UK biobank combined analysisGeneralizability analysis using the DRIVE imputed data
LocusLead haplotypea# VariantsbFrq0cFrq1cHRdPCoxeExemptionfCgFrq0cFrq1cORdPFisherePLRTe
Analysis of the original haplotype calls in the Minimac4 output 
22:28150815–29568762 12.208 31.073 2.48 1.3 × 10−11 11.273 17.130 1.5050 0.013 0.012012 
13 16:60209379–60696127 2.471 9.988 3.76 2.1 × 10−8 0.989 3.160 2.9393 0.014 0.0200 
Extended analysis I with high-confidence haplotype calls 
17 2:13709034–13920338 15.986 32.183 2.10 1.8 × 10−8 0.020 1.780 3.991 2.1919 0.036 0.03636 
18 10:23973011–24495586 2.587 10.543 3.78 7.3 × 10−9 0.002 0.198 1.330 7.3535 0.045 0.0199 
Extended analysis II with high-confidence haplotype calls and 1-variant exemption 
15 13:32552124–32913077 2.994 11.098 3.57 1.3 × 10−8 0.050 14.833 21.288 1.48 0.016 0.0077 
16 4:135906079–136561977 2.645 9.433 3.26 1.2 × 10−6 0.410 7.515 11.642 1.54 0.033 0.030030 
UK biobank combined analysisGeneralizability analysis using the DRIVE imputed data
LocusLead haplotypea# VariantsbFrq0cFrq1cHRdPCoxeExemptionfCgFrq0cFrq1cORdPFisherePLRTe
Analysis of the original haplotype calls in the Minimac4 output 
22:28150815–29568762 12.208 31.073 2.48 1.3 × 10−11 11.273 17.130 1.5050 0.013 0.012012 
13 16:60209379–60696127 2.471 9.988 3.76 2.1 × 10−8 0.989 3.160 2.9393 0.014 0.0200 
Extended analysis I with high-confidence haplotype calls 
17 2:13709034–13920338 15.986 32.183 2.10 1.8 × 10−8 0.020 1.780 3.991 2.1919 0.036 0.03636 
18 10:23973011–24495586 2.587 10.543 3.78 7.3 × 10−9 0.002 0.198 1.330 7.3535 0.045 0.0199 
Extended analysis II with high-confidence haplotype calls and 1-variant exemption 
15 13:32552124–32913077 2.994 11.098 3.57 1.3 × 10−8 0.050 14.833 21.288 1.48 0.016 0.0077 
16 4:135906079–136561977 2.645 9.433 3.26 1.2 × 10−6 0.410 7.515 11.642 1.54 0.033 0.030030 

aThe lead haplotype is labeled as “Chromosome: Start position – end position” based on hg19.

bThe number of imputed variants (Rsq > 0.80 in the TOPMed imputation) among the variants in the haplotype window of size W after reduction using recursive partitioning (minsplit size = 40; CP = 1 × 10−10).

cFrq 0 and Frq1, haplotype frequencies (per 10,000) in breast cancer (BCa) controls and BCa cases of the respective data set, respectively.

dHR/OR, hazard ratio/odds ratio of BCa comparing women with vs. without the haplotype.

ePcox/PFisher/PLRT,P value from Cox regression (with age taken as the time axis)/Fisher exact test/likelihood-ratio test from logistic regression for testing the null hypothesis of no breast cancer-haplotype association (HR or OR = 1.0), adjusting for 10 PCs in the Cox and logistic regression.

fThe index of the exempted variant on the reduced haplotype. /: not applicable or data not available.

gc, HDS error threshold used to define high-confidence haplotypes, subject to the constraint ∑HDSE < c and 0 ≤ c ≤ 0.50.

Through systematically assessing breast cancer–haplotype associations genome-wide in one of the largest publicly available cohorts with GWAS SNP-array data, the UKBB, our analysis identified and replicated 13 rare haplotype associations with breast cancer risk, of which six breast cancer–haplotype associations were found generalizable in DRIVE case–control data. Note that the study is far from a comprehensive assessment of breast cancer–haplotype associations, especially because it used only ∼670,000 autosomal variants genotyped on the UKBB Axiom Array (see below for a more comprehensive discussion of the limitations of our study). Nevertheless, our results have critical implications on the germline autosomal genetic contributions to breast cancer risk.

The 13 replicated, rare-risk haplotypes span multiple LD blocks (Fig. 3AM) and contain variants whose minor alleles are relatively common. But the combination of alleles across the LD blocks, i.e., the haplotype, is rare. LD analysis showed consistently low r2 but relatively high D' values among the alleles of these haplotypes in both UKBB and DRIVE data sets (Supplementary Fig. S5A and S5B). These observations indicate that it is not the individual common alleles (or their chromosomal segments) in regions with occasional recombination (high D’ values), but their specific combinations occurring infrequently (low frequency and r2), that are associated with the large elevation of breast cancer risk. Consistent with this finding, the breast cancer–risk-associated haplotypes identified using smaller windows were common, located in a single LD block, and often tagged by well-known, common, single risk variants. Of note, two (locus IDs: #5 and 15) of the 13 rare loci contain known risk variants reported in the recent meta-GWAS on overall breast cancer. The two rare haplotypes we identified are independent of single known variants but highly correlate with the rare haplotypes formed by known GWAS variants. In other words, the rare haplotype carriers represent a more vulnerable subset (HR = 3.31–4.12) of risk women defined by known variants (Supplementary Tables S4–S9). Our analysis demonstrates the association of this specific configuration of common germline variants with breast cancer risk, as well as the feasibility of utilizing existing GWAS SNP-array data sets in assessing haplotype–disease associations for identifying such rare risk haplotypes.

To support the notion that specific combinations of common alleles (or their segments) are risk elevating, the functional annotations showed many of the variants on the 13 replicated haplotypes have indications for regulatory roles for gene expression, enhancer activity, DNA-motif change, and 3D chromatin interactions. The specific rare combinations of alleles may represent “unfortunate” recombination results with regulatory consequences influencing breast cancer risk. For example, the haplotype on 22q12.1 (locus #5; Fig. 2A), one of the two rare risk haplotypes that were found generalizable in DRIVE and mapped to known breast cancer–risk genes (the other was locus #15 discussed below), spans a 1.4 Mb region with dozens of genes, the majority of which have variants reported by GWAS for their associations with breast cancer risk or breast cancer–specific mortality (3, 27), including MN1 encoding a transcription regulator and an oncogene (30) and CHEK2 involving in the DNA damage repair and cell cycle by interacting with BRCA1/2 and other proteins (31). However, it remains elusive whether these GWAS risk variants influence breast cancer risk through common or independent mechanisms. Our exploration of potential regulatory roles for the 250 variants on the haplotype found (Supplementary Tables S11–S12): (i) 12 are cis-eQTLs of four genes (TTC28, TTC28-AS1, KREMEN1, and CTA-292E10.6/lnc-CCDC117–2) in the GTEx breast mammary tissue (20); (ii) 34 are enhancer variants and possibly regulate 10 genes (MN1, TTC28, CHEK2, ZNRF3, XBP1, HSCB, CCDC117, EWSR1, RHBDD3, and KREMEN1) via enhancer–promoter interactions in breast cancer–related cells (21, 22); (iii) 15 are predicted to affect the binding of 17 unique DNA motifs (25); (iv) 69 reside in the anchor regions of chromatin loops or TADs boundaries, reflecting their role in regulating local chromatin structures (23, 24). The rare haplotype on 13q13.1 (locus #15) mapped to a 361-kb region with multiple genes such as FRY, ZAR1L, and BRCA2 (Fig. 2H). Besides, five genes (N4BP2L1/2, PDS5B, RP11-37E23.5, and RP1-257C22.2) downstream of BRCA2 were also predicted to be targets of eQTLs or enhancer variants on the haplotype in locus #15. FRY encodes a microtubule-binding protein that is essential to mammary gland development and affects breast cancer cell functions (32). ZAR1 L encodes an RNA regulator that can inhibit BRCA2 transcription in a cell cycle–dependent manner (33). APRIN (PDS5 cohesin-associated factor B) encoded by PDS5B is a regulator of the cohesin complex and can interact with BRCA2 in the DNA damage repair and carcinogenesis (34). In summary, the functional annotations showed the following shared features across the 13 replicated, rare risk haplotypes: (i) most regions overlap TAD boundaries and are enriched in chromatin loops; (ii) most of their variants are predicted to alter motif binding (17–83 unique DNA motifs with stronger binding and 8 to 73 unique motifs with weaker binding); and (iii) variants located near their 5′ or 3′ ends are more critical to the observed risk effects (Supplementary Fig. S3). These features support our speculation that the rare risk haplotypes we identified represent long-range interactions with regulatory consequences influencing breast cancer risk.

Alternatively, the rare combination of common alleles may constitute a haplotype on which a risk-altering mutation arose, and the haplotypes we identified may be approximate tags of such mutations. For example, upon searching the TOPMed-imputed data generated on 181,034 UKBB women, we found a BRCA2 nonsense variant K3326X (rs11571833) that coexisted in all women (107 heterozygotes and one homozygote) carrying the rare haplotype of locus #15. In contrast, the BRCA2 variant was present in only 1.7% (3,186 heterozygotes and 15 homozygotes) of 180,926 women who were noncarriers of the rare haplotype (Fisher exact P < 10−20). The corresponding frequencies in the DRIVE TOPMed-imputed data were 63.0% of the rare haplotype carriers versus 1.7% of the noncarriers (Fisher exact P < 10−20). The K3326X is recognized to be associated with increased breast cancer risk (35), but its biological role has not been functionally characterized (36). We hypothesize that the rare risk haplotype of locus #15 captures the risk-increasing effect of the K3326X itself, which might be seen only in the presence of other mutations in the region. In line with the K3326X coexistence with the rare haplotype of locus #15, one relatively common missense variant in MN1 (Q382H or rs45589338), known to be associated with breast cancer risk (3), was found to correlate with the rare haplotype of locus #5 (UKBB: 94.9% of haplotype carriers vs. 7.5% of noncarriers, Fisher exact P < 10−20; DRIVE: 94.3% of haplotype carriers and 7.4% of noncarriers, Fisher exact P < 10−20). In addition, a relatively common missense variant in NPAP1 L (E347K or rs9746819) coexisted with the rare risk haplotype of locus #13 (UKBB: 100% of haplotype carriers vs. 8.3% of noncarriers, Fisher exact P < 10−20; DRIVE: 100% of haplotype carriers vs. 8.5% of noncarriers, Fisher exact P < 10−20). Note that the imputation quality influences the detection of variants, especially those of low frequencies, and we might have been unable to detect certain variants for this reason. Further investigation with deep sequencing of mutational changes in rare haplotype carriers is warranted to distinguish the two possible causal factors the risk haplotypes represent.

The number of risk haplotypes we found was small despite our genome-wide analysis. To interpret this result, we estimated the statistical power of the entire three-stage analysis (discovery/replication with the UKBB data and generalizability evaluation with the DRIVE data). The approximate power for each of the six generalizable, rare risk haplotypes was in the range of 1.3% to 30.6% (Supplementary Table S13). These power estimates suggest that there are additional 2 to 78 generalizable, rare risk haplotypes we did not detect with similar frequencies and risk effects as those that passed all three stages of our analysis. In addition, the UKBB Axiom Array contained a relatively small number of variants. Thus, the haplotypes we examined are a small subset of all haplotypes that can be considered by WGS. Extrapolation of our results suggests that WGS data on sufficiently large cohorts and case–control studies should lead to the discovery of additional rare risk haplotypes as well as rare risk variants.

Aside from the two risk haplotypes (loci #5 and #15) above, only weak evidence exists for the two other generalizable, rare risk haplotypes (loci #16 and #18) with respect to their relevance to breast cancer risk (27, 28). Standard GWAS analyses including ours did not reveal breast cancer–risk associations at P < 0.001 for any variant in loci #16 and #18. The rare haplotype at locus #16 mapped to a 656-Kb region where no protein-coding genes resided (Fig. 2I). Its lead variant rs116648919 was in LD with rs6837069 (D’ = 1.0 and r2 < 0.01; Table 2), which was found to increase breast cancer–specific mortality (OR = 1.92, P = 1×10−6) in a 2019 GWAS (27). The rare haplotype at locus #18 mapped to a 523-Kb region of KIAA1217 (Fig. 2K). Its lead variant rs10764446, located in intron 2 of KIAA1217, is in modest LD (D’ = 0.85 and r2 = 0.53) with another KIAA1217 intronic variant rs11013833, which was reported to associate with the breast–colorectal cancer phenotype (OR = 1.37, P = 6×10−6) in a 2018 GWAS (28). This gene belongs to the large human KIAA gene family, and its expression is highest in female breast mammary tissue in tissue-level RNA sequence data in GTEx (20). An early RNA sequence analysis of breast tumor detected a translocation t(10;14)(p12;q32) in estrogen receptor (ER)–positive breast cancer tumor, resulting in SERPINA1 mRNA fused into the 5′-end of KIAA1217 and therefore causing complete depletion of KIAA1217 at protein-level without any change to its mRNA level remained unchanged (37). These results may suggest the roles of KIAA1217 for normal breast mammary cell functions, which may explain the rare haplotype's association with breast cancer risk. Although only weak evidence exists for loci #16 and #18 with respect to their associations with breast cancer risk (2, 3, 27, 28), the frequency and effect sizes we observed here for their risk haplotypes (HR estimates exceeding 3.4 in our replication analysis) support the rare-variant/haplotype model over the common-variant counterpart. We could not find any previous evidence linking breast cancer and the remaining two generalizable, rare risk haplotypes on 16q21 (locus #13; Fig. 2F) and 2p24.3 (locus #17; Fig. 2J).

Our haplotype analysis relied on phasing accuracy. Although both UKBB and DRIVE are large studies, phasing/imputation uncertainty in calling haplotypes, especially rare haplotypes, is a major concern. However, phasing/imputation errors should affect both breast cancer cases and breast cancer–free women indiscriminately, lowering the power of all three stages of our analysis. Although this issue might explain why seven (locus IDs: # 6, 9, 11, 12, 14, 19, and 20) of the 13 replicated, rare haplotypes were not found generalizable in DRIVE, it should not negate the validity of the six found to be generalizable. The above problem can be partially addressed by incorporating phase information from WGS reads into haplotype estimation or by using advanced sequencing technologies such as long-read or single-cell sequencing (38, 39). Our crude approach to this problem was to use well-imputed variants only, possibly combining with exempting one variant from a haplotype membership in the DRIVE analysis, to account for potential errors in its phase or dosage. By exempting one variant, rare haplotypes were called more liberally with increased frequencies that did not precisely match the original haplotypes, contributing to attenuated OR estimates.

Our analysis was restricted to women of “white British” in the discovery and replication, and women of broader European genetic ancestry in the generalizability evaluation. These were a methodological necessity because populations of different ancestries differ in allele frequencies and LD structures, which could confound the breast cancer–haplotype associations. No population of non-European ancestries has a large breast cancer GWAS data set of the size that is comparable to UKBB and DRIVE. This disparity is an important challenge our field must tackle. Also, we did not consider breast cancer subtypes (e.g., hormone receptor subtypes): risk haplotypes specific to a particular breast cancer subtype (2) may exist.

In conclusion, we have leveraged a large “white British” GWAS cohort data to discover and replicate 13 long, rare haplotypes associated with breast cancer risk through a genome-wide two-stage haplotype analysis, of which six were found generalizable in an independent case–control study of women with European genetic ancestry. Our findings highlight the contribution of rare haplotypes to breast cancer risk and suggest that the rare risk effects are possible of a regulatory nature characterized by long-range interactions. We expect that the ability to assess rare-variant or haplotype associations will be greatly improved when large-scale sequencing data become available. Further efforts are required to characterize the exact causal mechanisms of the rare risk haplotypes. Finally, the existence of rare breast cancer–risk haplotypes is consistent with the patterns of breast cancer incidence observed in large-scale epidemiologic twin and family studies (Yasui and colleagues, submitted as a separate manuscript) and demonstrates an important contribution of rare haplotypes to breast cancer risk.

C. Im reports grants from the NCI during the conduct of the study. No disclosures were reported by the other authors.

F. Wang: Conceptualization, formal analysis, validation, investigation, visualization, writing–original draft. W. Moon: Resources, formal analysis. W. Letsou: Writing–review and editing. Y. Sapkota: Writing–review and editing. Z. Wang: Writing–review and editing. C. Im: Writing–review and editing. J.L. Baedke: Writing–review and editing. L. Robison: Writing–review and editing. Y. Yasui: Conceptualization, resources, supervision, funding acquisition, validation, investigation, visualization, methodology, writing–original draft.

This work is supported by NCI Grant R01 CA216354 (W. Moon, W. Letsou, Y. Sapkota, C. Im, J. Baedke, L. Robison, and Y. Yasui), American Lebanese Syrian Associated Charities (F. Wang, W. Moon, W. Letsou, Y. Sapkota, Z. Wang, J. Baedke, L. Robison, and Y. Yasui), and Alberta Machine Intelligence Institute (C. Im and Y. Yasui). The authors thank the high-performance computing facility at St. Jude Children's Research Hospital for its computational support. This study has been conducted using the UK Biobank Resource under Application Number 44891. OncoArray genotyping and phenotype data harmonization for the Discovery, Biology, and Risk of Inherited Variants in Breast Cancer (DRIVE) breast-cancer case control samples was supported by X01 HG007491 and U19 CA148065 and by Cancer Research UK (C1287/A16563).

The publication costs of this article were defrayed in part by the payment of publication fees. Therefore, and solely to indicate this fact, this article is hereby marked “advertisement” in accordance with 18 USC section 1734.

Note: Supplementary data for this article are available at Cancer Research Online (http://cancerres.aacrjournals.org/).

1.
Schork
NJ
,
Murray
SS
,
Frazer
KA
,
Topol
EJ
.
Common vs. rare allele hypotheses for complex diseases
.
Curr Opin Genet Dev
2009
;
19
:
212
9
.
2.
Zhang
H
,
Ahearn
TU
,
Lecarpentier
J
,
Barnes
D
,
Beesley
J
,
Qi
G
, et al
.
Genome-wide association study identifies 32 novel breast cancer susceptibility loci from overall and subtype-specific analyses
.
Nat Genet
2020
;
52
:
572
81
.
3.
Michailidou
K
,
Lindstrom
S
,
Dennis
J
,
Beesley
J
,
Hui
S
,
Kar
S
, et al
.
Association analysis identifies 65 new breast cancer risk loci
.
Nature
2017
;
551
:
92
4
.
4.
Khera
AV
,
Chaffin
M
,
Aragam
KG
,
Haas
ME
,
Roselli
C
,
Choi
SH
, et al
.
Genome-wide polygenic scores for common diseases identify individuals with risk equivalent to monogenic mutations
.
Nat Genet
2018
;
50
:
1219
24
.
5.
Moller
S
,
Mucci
LA
,
Harris
JR
,
Scheike
T
,
Holst
K
,
Halekoh
U
, et al
.
The heritability of breast cancer among women in the Nordic twin study of cancer
.
Cancer Epidemiol Biomarkers Prev
2016
;
25
:
145
50
.
6.
McClellan
J
,
King
MC
.
Genetic heterogeneity in human disease
.
Cell
2010
;
141
:
210
7
.
7.
Pasaniuc
B
,
Rohland
N
,
McLaren
PJ
,
Garimella
K
,
Zaitlen
N
,
Li
H
, et al
.
Extremely low-coverage sequencing and imputation increases power for genome-wide association studies
.
Nat Genet
2012
;
44
:
631
5
.
8.
Bycroft
C
,
Freeman
C
,
Petkova
D
,
Band
G
,
Elliott
LT
,
Sharp
K
, et al
.
The UK Biobank resource with deep phenotyping and genomic data
.
Nature
2018
;
562
:
203
9
.
9.
Amos
CI
,
Dennis
J
,
Wang
Z
,
Byun
J
,
Schumacher
FR
,
Gayther
SA
, et al
.
The Oncoarray consortium: a network for understanding the genetic architecture of common cancers
.
Cancer Epidemiol Biomarkers Prev
2017
;
26
:
126
35
.
10.
O'Connell
J
,
Sharp
K
,
Shrine
N
,
Wain
L
,
Hall
I
,
Tobin
M
, et al
.
Haplotype estimation for biobank-scale data sets
.
Nat Genet
2016
;
48
:
817
20
.
11.
Genomes Project
C
,
Auton
A
,
Brooks
LD
,
Durbin
RM
,
Garrison
EP
,
Kang
HM
, et al
.
A global reference for human genetic variation
.
Nature
2015
;
526
:
68
74
.
12.
Zheng
X
,
Gogarten
SM
,
Lawrence
M
,
Stilp
A
,
Conomos
MP
,
Weir
BS
, et al
.
SeqArray—a storage-efficient high-performance data format for WGS variant calls
.
Bioinformatics
2017
;
33
:
2251
7
.
13.
Manichaikul
A
,
Mychaleckyj
JC
,
Rich
SS
,
Daly
K
,
Sale
M
,
Chen
WM
.
Robust relationship inference in genome-wide association studies
.
Bioinformatics
2010
;
26
:
2867
73
.
14.
Purcell
S
,
Neale
B
,
Todd-Brown
K
,
Thomas
L
,
Ferreira
MA
,
Bender
D
, et al
.
PLINK: a tool set for whole-genome association and population-based linkage analyses
.
Am J Hum Genet
2007
;
81
:
559
75
.
15.
Abraham
G
,
Qiu
Y
,
Inouye
M
.
FlashPCA2: principal component analysis of Biobank-scale genotype datasets
.
Bioinformatics
2017
;
33
:
2776
8
.
16.
Loh
PR
,
Danecek
P
,
Palamara
PF
,
Fuchsberger
C
,
Reshef
YA
,
Finucane
HK
, et al
.
Reference-based phasing using the Haplotype Reference Consortium panel
.
Nat Genet
2016
;
48
:
1443
8
.
17.
Das
S
,
Forer
L
,
Schonherr
S
,
Sidore
C
,
Locke
AE
,
Kwong
A
, et al
.
Next-generation genotype imputation service and methods
.
Nat Genet
2016
;
48
:
1284
7
.
18.
Taliun
D
,
Harris
DN
,
Kessler
MD
,
Carlson
J
,
Szpiech
ZA
,
Torres
R
, et al
.
Sequencing of 53,831 diverse genomes from the NHLBI TOPMed Program
.
Nature
2021
;
590
:
290
9
.
19.
Liu
X
,
White
S
,
Peng
B
,
Johnson
AD
,
Brody
JA
,
Li
AH
, et al
.
WGSA: an annotation pipeline for human genome sequencing studies
.
J Med Genet
2016
;
53
:
111
2
.
20.
Consortium
GT
.
The GTEx Consortium atlas of genetic regulatory effects across human tissues
.
Science
2020
;
369
:
1318
30
.
21.
Wang
J
,
Dai
XZ
,
Berry
LD
,
Cogan
JD
,
Liu
Q
,
Shyr
Y
.
HACER: an atlas of human active enhancers to interpret regulatory variants
.
Nucleic Acids Res
2019
;
47
:
D106
-
D12
.
22.
Gao
T
,
Qian
J
.
EnhancerAtlas 2.0: an updated resource with enhancer annotation in 586 tissue/cell types across nine species
.
Nucleic Acids Res
2020
;
48
:
D58
64
.
23.
McArthur
E
,
Capra
JA
.
Topologically associating domain boundaries that are stable across diverse cell types are evolutionarily constrained and enriched for heritability
.
Am J Hum Genet
2021
;
108
:
269
83
.
24.
Salameh
TJ
,
Wang
X
,
Song
F
,
Zhang
B
,
Wright
SM
,
Khunsriraksakul
C
, et al
.
A supervised learning framework for chromatin loop detection in genome-wide contact maps
.
Nat Commun
2020
;
11
:
3428
.
25.
Ward
LD
,
Kellis
M
.
HaploReg v4: systematic mining of putative causal variants, cell types, regulators and target genes for human complex traits and disease
.
Nucleic Acids Res
2016
;
44
:
D877
D81
.
26.
Bherer
C
,
Campbell
CL
,
Auton
A
.
Refined genetic maps reveal sexual dimorphism in human meiotic recombination at multiple scales
.
Nat Commun
2017
;
8
:
14994
.
27.
Escala-Garcia
M
,
Guo
Q
,
Dork
T
,
Canisius
S
,
Keeman
R
,
Dennis
J
, et al
.
Genome-wide association study of germline variants and breast cancer-specific mortality
.
Br J Cancer
2019
;
120
:
647
57
.
28.
Pande
M
,
Joon
A
,
Brewster
AM
,
Chen
WV
,
Hopper
JL
,
Eng
C
, et al
.
Genetic susceptibility markers for a breast-colorectal cancer phenotype: Exploratory results from genome-wide association studies
.
PLoS One
2018
;
13
:
e0196245
.
29.
Kim
SA
,
Cho
CS
,
Kim
SR
,
Bull
SB
,
Yoo
YJ
.
A new haplotype block detection method for dense genome sequencing data based on interval graph modeling of clusters of highly correlated SNPs
.
Bioinformatics
2018
;
34
:
388
97
.
30.
Meester-Smoor
MA
,
Molijn
AC
,
Zhao
Y
,
Groen
NA
,
Groffen
CA
,
Boogaard
M
, et al
.
The MN1 oncoprotein activates transcription of the IGFBP5 promoter through a CACCC-rich consensus sequence
.
J Mol Endocrinol
2007
;
38
:
113
25
.
31.
Apostolou
P
,
Papasotiriou
I
.
Current perspectives on CHEK2 mutations in breast cancer
.
Breast Cancer (Dove Med Press)
2017
;
9
:
331
5
.
32.
Liu
Y
,
Chen
X
,
Gong
Z
,
Zhang
H
,
Fei
F
,
Tang
X
, et al
.
Fry is required for mammary gland development during pregnant periods and affects the morphology and growth of breast cancer cells
.
Front Oncol
2019
;
9
:
1279
.
33.
Misra
S
,
Sharma
S
,
Agarwal
A
,
Khedkar
SV
,
Tripathi
MK
,
Mittal
MK
, et al
.
Cell cycle-dependent regulation of the bi-directional overlapping promoter of human BRCA2/ZAR2 genes in breast cancer cells
.
Mol Cancer
2010
;
9
:
50
.
34.
Brough
R
,
Bajrami
I
,
Vatcheva
R
,
Natrajan
R
,
Reis-Filho
JS
,
Lord
CJ
, et al
.
APRIN is a cell cycle specific BRCA2-interacting protein required for genome integrity and a predictor of outcome after chemotherapy in breast cancer
.
EMBO J
2012
;
31
:
1160
76
.
35.
Meeks
HD
,
Song
H
,
Michailidou
K
,
Bolla
MK
,
Dennis
J
,
Wang
Q
, et al
.
BRCA2 polymorphic stop codon K3326X and the risk of breast, prostate, and ovarian cancers
.
J Natl Cancer Inst
2016
;
108
:
djv315
.
36.
Kuznetsov
SG
,
Liu
P
,
Sharan
SK
.
Mouse embryonic stem cell-based functional assay to evaluate mutations in BRCA2
.
Nat Med
2008
;
14
:
875
81
.
37.
Asmann
YW
,
Necela
BM
,
Kalari
KR
,
Hossain
A
,
Baker
TR
,
Carr
JM
, et al
.
Detection of redundant fusion transcripts as biomarkers or disease- specific therapeutic targets in breast cancer
.
Cancer Res
2012
;
72
:
1921
8
.
38.
Kronenberg
ZN
,
Rhie
A
,
Koren
S
,
Concepcion
GT
,
Peluso
P
,
Munson
KM
, et al
.
Extended haplotype-phasing of long-read de novo genome assemblies using Hi-C
.
Nat Commun
2021
;
12
:
1935
.
39.
Porubsky
D
,
Sanders
AD
,
van Wietmarschen
N
,
Falconer
E
,
Hills
M
,
Spierings
DC
, et al
.
Direct chromosome-length haplotyping by single-cell sequencing
.
Genome Res
2016
;
26
:
1565
74
.