Abstract
Over 20 susceptibility single-nucleotide polymorphisms (SNP) have been identified for esophageal adenocarcinoma (EAC) and its precursor, Barrett esophagus (BE), explaining a small portion of heritability.
Using genetic data from 4,323 BE and 4,116 EAC patients aggregated by international consortia including the Barrett's and Esophageal Adenocarcinoma Consortium (BEACON), we conducted a comprehensive transcriptome-wide association study (TWAS) for BE/EAC, leveraging Genotype Tissue Expression (GTEx) gene-expression data from six tissue types of plausible relevance to EAC etiology: mucosa and muscularis from the esophagus, gastroesophageal (GE) junction, stomach, whole blood, and visceral adipose. Two analytical approaches were taken: standard TWAS using the predicted gene expression from local expression quantitative trait loci (eQTL), and set-based SKAT association using selected eQTLs that predict the gene expression.
Although the standard approach did not identify significant signals, the eQTL set–based approach identified eight novel associations, three of which were validated in independent external data (eQTL SNP sets for EXOC3, ZNF641, and HSP90AA1).
This study identified novel genetic susceptibility loci for EAC and BE using an eQTL set–based genetic association approach.
This study expanded the pool of genetic susceptibility loci for EAC and BE, suggesting the potential of the eQTL set–based genetic association approach as an alternative method for TWAS analysis.
Introduction
Incidence of esophageal adenocarcinoma (EAC) has risen sharply over recent decades (1–4), and it is now the predominant subtype of esophageal cancer in the USA and many other Western countries. Patients diagnosed with advanced EAC have a 5-year survival rate below 20% (5–10). Progress has been made in identifying genetic and environmental risk factors for EAC and its epithelial precursor lesion, Barrett esophagus (BE; ref. 11). GE reflux (12, 13), obesity (14, 15), and tobacco smoking (16, 17) collectively explain up to ∼75% of cancer risk (18–20). Although over 20 susceptibility single-nucleotide polymorphisms (SNP) have been identified through genome-wide association studies (GWAS) in the Barrett's and Esophageal Adenocarcinoma Consortium (BEACON) and related efforts (21–29), these loci explain only a small portion of overall heritability (h2g estimated as 0.25 for EAC; 0.35 for BE; ref. 30), and few have been linked specifically to progression to cancer (22, 29).
One of the notable methodological advances in the post-GWAS era is integrating the transcriptome into genetic association analyses (31, 32). Evidence is abundant that trait-associated SNPs are more likely to be expression quantitative trait loci (eQTL; ref. 33), which are pervasive in the human genome (34–38). Motivated by the premise that eQTL may influence disease phenotypes by altering gene-expression levels, association approaches leveraging eQTL and transcriptome data in genotype-tissue expression (GTEx; ref. 38), namely, transcriptome-wide association studies (TWAS), have become a mainstream approach in post-GWAS analyses, leading to the discovery of multiple novel susceptibility genes (39–42), for prostate (43), ovarian (44), breast (45), and colorectal cancers (46, 47). Notably, the initial TWAS method—PrediXcan (31)—first builds a genetic prediction model for gene expression, then assesses genetically predicted gene expression for its association with a trait of interest, much resembling the classic instrumental variable regression approach in econometrics. In the same vein, newer methods for TWAS (32, 48) removed the requirement of individual-level genetic data by exploiting GWAS summary statistics and genetic correlation data from external data such as the 1000 Genomes Project. Recognizing a portion of eQTL regulation of gene expression can be conservative across tissues, new methodological development for TWAS has been focused on leveraging multiple tissues available in GTEx for improving power of genetic prediction and subsequent association (49, 50). On the other hand, it has been noted with caution recently that TWAS can be prone to spurious results with expression data from non–trait-related tissues or cell types, and that the best practice may be choosing the most mechanistically related tissue(s) available (51).
In our view, the main challenge to apply TWAS to the EAC genetic research is that there is not yet a large set of BE samples, the mechanistically relevant tissue for EAC development, with both germline genotypes and transcriptome data available for eQTL mapping. Although the inherited genetic component of risk for BE largely coincides with that for EAC (30), the cellular origin of BE remains controversial, with hypotheses ranging from residual embryonic cells at the GE junction to undifferentiated gastric cells in the cardia (52–54). Although the GTEx Project collected four upper gastrointestinal tract tissues, including mucosa and muscularis from the esophagus, GE junction, and stomach, a limitation of bulk RNA-sequencing data is that transcriptome profiles of rarer constituent cell types (such as progenitor cells) may not be well delineated.
In this work, we conducted a comprehensive TWAS study for BE/EAC, leveraging six GTEx (V8) tissue types of plausible relevance to EAC etiology: mucosa and muscularis from the esophagus, GE junction, stomach, visceral adipose, and whole blood. The inclusion of the latter two tissues is in recognition that tissues beyond the esophagogastric mucosa are likely to contribute biologically to the origins of BE/EAC. Abdominal obesity is a risk factor for these conditions, which not only affects reflux severity but also increases levels of systemic inflammation through the release of secreted mediators (55–57). Chronic inflammation is considered an important driver of BE/EAC pathogenesis, and the roles and contributions of circulating and infiltrating immune cells are under active investigation (58). We selected eQTLs collectively predicting RNA-sequencing–based expression and built prediction models. The eQTLs predicting protein-coding genes were assessed for gene-level associations with BE/EAC risk using a discovery data set (BEACON/Cambridge GWAS), and top signals identified were then advanced for evaluation using an independent GWAS data set from Bonn, Germany. We used two methods to assess gene-set associations for selected eQTL: (i) standard PrediXcan (31), computing a linear combination prediction of gene expression, and (ii) the sequence kernel association test (SKAT; ref. 59), testing gene-set association among selected eQTLs that predict gene expressions. Originally developed for rare-variant association tests, SKAT was used here to assemble genetic associations from eQTL without using the prediction weights derived from an extant gene-expression data set, e.g., GTEx. An eQTL-based aggregate association strategy has been reported previously (60), though the previous method used the sum of 1-df chi-square values for the individual eQTLs. The following rationales motivate the eQTL aggregation strategy: (i) the existing GTEx tissue types may not capture the cellular origin of BE; (ii) even if the etiologically relevant cell types were contained in one of GTEx tissues, genetically predicted gene-expression derived from bulk tissue RNA-seq profiles may not adequately represent the genetic component of gene expression in rarer yet etiologically relevant cell types within that tissue. Therefore, we hypothesize that the gene-expression prediction weights for eQTLs derived from GTEx may not be always appropriate for the targeted genetic risk prediction for BE and EAC, and we postulate that a more flexible set-based global test of selected eQTLs may improve the likelihood of capturing genetic associations with disease risk which are otherwise obscured when evaluating surrogate gene-expression measures from bulk tissue.
Materials and Methods
Individual-level data and summary statistics from existing GWAS
Genome-wide association data from three genetic studies were obtained for this analysis. Given that our analytic plan encompassed a multitude of correlated analyses and included exploratory methodologic comparisons, e.g., two analysis strategies for six tissues and three trait comparisons of interest (BE vs. control, EAC vs. control and BE/EAC vs. control), a discovery–validation approach was adopted to better control the potential false-positive results. For the discovery set, individual-level genotype data were available from the BEACON consortium (dbGaP phs000869.v1.p1; 2,413 BE cases, 1,512 EAC cases, and 6,718 control participants) and the Cambridge GWAS (873 BE cases, 995 EAC cases, and 3,408 control participants); for validation, SNP summary statistics were available from the Bonn GWAS (1,037 BE cases, 1,609 EAC cases, and 3,537 control participants). After quality control, the discovery set included 702,492 SNPs on autosomal chromosomes. An additional 4,541 controls of European ancestry were obtained from the database of Genotypes and Phenotypes (dbGaP; phs000187.v1.p1, phs000196.v2.p1, and phs000524.v1.p1) and merged with the BEACON discovery data to increase statistical power to detect risk loci. The Michigan imputation server (61) was used to impute genotype data on chromosomes 1–22, with the most accurate and largest panel—the Haplotype Reference Consortium (HRC; Version r1.1 2016) for European (EUR) as the population reference. Imputed genotype data included 5,312,829 SNPs with imputation quality score > 0.4, MAF > 0.05, call rate >95% and Hardy–Weinberg equilibrium P > 1e−5. For the Bonn data set, imputation was previously carried out using the 1000 Genomes Phase1 EUR reference panel, and imputed genotype data included a total of ∼9 million SNPs with minor allele frequency >0.001.
GTEx germline sequencing data and RNA-seq transcriptome data for eQTL prediction of gene expression
GTEx data (V8) from subjects of European ancestry were used in this analysis. RNA-seq gene-expression data were retrieved from six tissues of plausible biological relevance to EAC development (esophagus GE junction: n = 275, esophagus–mucosa: n = 411, esophagus–muscularis: n = 385, stomach: n = 260, adipose–visceral: n = 393, and whole blood: n = 558). Transcripts per million data were downloaded, and the trimmed mean of M values (TMM) normalization method was implemented in edgeR (62). For each gene in a tissue, gene-expression values were standardized across samples. SNP genotypes were obtained from whole-genome sequencing data for ∼46,569,000 variants.
The expression levels for a gene were modeled using an ElasticNet linear model with local SNPs in a 1 Mb region flanking the transcription starting site (TSS) of the gene, and covariates including the top four genotype principal components, top 15 Probabilistic Estimation of Expression Residuals (PEER) factors, sex, age, sequencing platform indicator (Illumina HiSeq 2000 or HiSeq X), and sequencing protocol indicator (PCR based or PCR-free). The elastic net model was implemented using the R package glmnet (63). Highly correlated SNPs with Pearson correlation >0.9 were removed before running the elastic net model. The penalty parameter was selected by the minimum ten-fold cross-validation error. The ten-fold cross-validated R2 for genetically predicted gene expression was used to summarize the strength of genetic prediction. The distribution of R2 for predicting gene expressions in a tissue is displayed by a violin plot. Genes with estimated R2 > 0.01 (correlation >0.1) for a tissue entered subsequent genetic association analysis, using the SNPs with nonzero estimated coefficients identified as eQTLs.
eQTL set–based association analysis in the discovery set (Beacon and Cambridge individual-level data)
For each of six tissues and three trait comparisons (BE vs. control, EAC vs. control, BE/EAC combined vs. control), gene-set association analyses were conducted by the following two approaches. First, in the standard TWAS approach, predicted gene expression from the GTEx-derived ElasticNet model was assessed for its association with the trait by a logistic model, adjusting for sex, age, and the top six genotype principal components. Second, the selected eQTLs from the GTEx-derived ElasticNet model were assessed for their collective association with the trait by SKAT (59), adjusting for the same set of covariates. Manhattan plots were drawn to show P values for gene SNP sets by chromosome. False discovery rate [Benjamini–Hochberg false discovery rate (FDR)] was used to account for multiple testing. For genes of interest for discovery, individual SNP–trait associations were also assessed and plotted using LocusZoom software. To determine whether an identified gene-set association is caused by a previously identified risk SNP in the neighborhood, a SKAT model was also fitted to include the known GWAS SNP in the region.
Validation of the discovered eQTL set–based associations in the Bonn data set
Gene-level eQTL SNP sets putatively associated with a trait were next evaluated using Bonn GWAS summary data. The SKAT association statistics for the gene sets were approximated by a score-statistic method (64), using univariate summary statistics and the genetic correlation matrix computed from European ancestry participants of the 1000 Genomes Project. For a few SNPs in the discovery set but missing in the validation set due to different imputation panels, we used the closest SNPs within 50 bp and with correlation >0.6, whenever available, as the proxy to minimize the impact of missing SNPs. To account for multiple testing in the validation stage, the Hochberg adjusted P value for controlling family-wise error rate was used.
Data availability
The BEACON data with supplemented controls were obtained from dbGaP (phs000869.v1.p1, phs000187.v1.p1, phs000196.v2.p1, and phs000524.v1.p1). The GTEx genotype and gene-expression data were obtained from dbGaP (phs000424.v8.p2).
Results
cis-eQTL predicted gene expressions in six etiologically relevant tissues
Transcriptome data and germline whole-genome sequencing data from GTEx (V8) were analyzed for building genetic prediction models for gene expressions in each of six etiologically relevant tissues for BE/EAC: esophageal mucosa (n = 411), esophageal muscularis (n = 385), GE junction (n = 275), stomach (n = 260), adipose–visceral (n = 393), and whole blood (n = 558). Common SNPs (MAF > 0.05) located within ±500 kb of the TSS of a gene were identified from GTEx whole-genome sequencing data and selected to predict the transcript abundance by the ElasticNet method. Figure 1A shows the violin plots of R2 for genes with at least 1 SNP being selected and R2 ≥ 0.01 (correlation of observed and predicted gene expression ≥0.1) in the six tissues. The four tissues in the upper GI tract (esophageal mucosa and muscularis, GE junction, and stomach) have a greater number of predictable genes and higher R2 in this subset: esophageal mucosa has the largest number of genes with R2 ≥ 0.01 (n = 7,463); GE junction has the highest median (0.037) despite the smaller sample size for junction tissue. There is substantial variability among the numbers of “genetically predictable” genes across tissues (5,160 in blood ∼7,463 in esophageal mucosa), and the genes shared between tissues. The latter is exemplified by a Venn diagram in Fig. 1B, which shows the overlapping set between the three esophageal tissues. Between any two tissue types, 35%–45% of genes are not shared, underscoring the significance of both cross-tissue and tissue-specific genetic regulation of gene expression.
eQTL set–based association analysis identified susceptibility loci for BE and EAC in BEACON/Cambridge discovery set
The selected eQTLs in the genetic prediction models were assessed for association with BE, EAC, or BE/EAC as a combined trait in BEACON/Cambridge discovery set, using two methods: (i) the standard TWAS approach, where the predicted gene expression from the ElasticNet model was assessed for its association with the trait in a logistic regression model; (ii) the gene-set association method SKAT, using the selected eQTLs from the ElasticNet model. Across six tissues and for three trait association comparisons, there are a total of 116,853 gene-set associations being tested. Because of high correlations between P values from the same genes across different tissues and trait comparisons, the Bonferroni procedure can be overly conservative. Instead, FDR) was used to adjust for multiple testing, in part because it can effectively account for correlation. Figure 2 shows the Manhattan plots of P values for the three comparisons using the two methods (standard TWAS on the bottom of each panel and SKAT-eQTL on the top). No genes analyzed by the standard TWAS approach satisfied FDR < 0.05 (minimum FDR = 0.187, e.g., EXOC3, BARX1, and LDAH). In contrast, the SKAT-eQTL method identified a total of 21 genes with significant associations at FDR < 0.05 (red dotted line in Fig. 2), representing a mix of novel and known loci. Table 1 shows eight novel eQTL set–based associations in six loci that either have not been reported previously or are independent of the known GWAS SNP in conditional analysis. Table 2 shows 13 loci that have been previously linked to susceptibility, containing putative genes including LDAH, BARX1, ALDH1A2, and CRTC1.
Locus . | Gene . | Tissue . | Trait . | # eQTLs . | R2 . | Most significant GWAS SNP . | Distance (Mba) . | TWAS-Pb . | SKAT-Pc . | FDRd . | P-adje . |
---|---|---|---|---|---|---|---|---|---|---|---|
5p15.33 | EXOC3 | Adipose | BE/EA | 28 | 0.042 | rs9918259 | 0.22 | 1.81 × 10–5 | 8.24 × 10–6 | 0.037 | 2.98 × 10–3 |
6q14.1 | SENP6 | Blood | BE/EA | 51 | 0.046 | rs76014404 | 13.92 | 2.23 × 10–3 | 7.12 × 10–6 | 0.036 | 5.32 × 10–6 |
11q13.4 | KRTAP5-8 | Adipose | EA | 23 | 0.047 | rs4930068 | 69.26 | 0.27 | 1.41 × 10–5 | 0.049 | 3.28 × 10–5 |
12q13.11 | ZNF641 | Junction | BE/EA | 4 | 0.014 | rs1247942 | 65.90 | 5.16 × 10–6 | 3.81 × 10–6 | 0.025 | 2.26 × 10–6 |
14q32.31 | HSP90AA1 | Blood | BE | 94 | 0.040 | — | — | 0.66 | 8.49 × 10–6 | 0.037 | — |
16q23.1 | CFDP1 | Stomach | BE/EA | 5 | 0.011 | rs1979654 | 11.07 | 3.98 × 10–3 | 4.29 × 10–6 | 0.026 | 2.67 × 10–6 |
16q23.1 | CHST5 | Junction | BE | 21 | 0.026 | rs1979654 | 10.84 | 0.36 | 2.45 × 10–6 | 0.022 | 3.14 × 10–6 |
16q23.1 | BCAR1 | Blood | BE | 335 | 0.021 | rs1979654 | 11.14 | 0.13 | 6.08 × 10–6 | 0.034 | 1.12 × 10–5 |
Locus . | Gene . | Tissue . | Trait . | # eQTLs . | R2 . | Most significant GWAS SNP . | Distance (Mba) . | TWAS-Pb . | SKAT-Pc . | FDRd . | P-adje . |
---|---|---|---|---|---|---|---|---|---|---|---|
5p15.33 | EXOC3 | Adipose | BE/EA | 28 | 0.042 | rs9918259 | 0.22 | 1.81 × 10–5 | 8.24 × 10–6 | 0.037 | 2.98 × 10–3 |
6q14.1 | SENP6 | Blood | BE/EA | 51 | 0.046 | rs76014404 | 13.92 | 2.23 × 10–3 | 7.12 × 10–6 | 0.036 | 5.32 × 10–6 |
11q13.4 | KRTAP5-8 | Adipose | EA | 23 | 0.047 | rs4930068 | 69.26 | 0.27 | 1.41 × 10–5 | 0.049 | 3.28 × 10–5 |
12q13.11 | ZNF641 | Junction | BE/EA | 4 | 0.014 | rs1247942 | 65.90 | 5.16 × 10–6 | 3.81 × 10–6 | 0.025 | 2.26 × 10–6 |
14q32.31 | HSP90AA1 | Blood | BE | 94 | 0.040 | — | — | 0.66 | 8.49 × 10–6 | 0.037 | — |
16q23.1 | CFDP1 | Stomach | BE/EA | 5 | 0.011 | rs1979654 | 11.07 | 3.98 × 10–3 | 4.29 × 10–6 | 0.026 | 2.67 × 10–6 |
16q23.1 | CHST5 | Junction | BE | 21 | 0.026 | rs1979654 | 10.84 | 0.36 | 2.45 × 10–6 | 0.022 | 3.14 × 10–6 |
16q23.1 | BCAR1 | Blood | BE | 335 | 0.021 | rs1979654 | 11.14 | 0.13 | 6.08 × 10–6 | 0.034 | 1.12 × 10–5 |
aDistance between the gene and the most significant GWAS risk SNP identified from previous GWAS.
bP value for association analyses in standard TWAS.
cP value for eQTL gene-set SKAT association.
dFDR based on the pooled set of P values for eQTL gene-set SKAT associations across three trait comparisons and six tissues.
eP value derived from the SKAT model adjusting for GWAS risk SNPs.
One consistent theme in Tables 1 and 2 is that SKAT-eQTL produced uniformly smaller P values than the standard TWAS method for these eQTL set associations, suggesting that the weighted linear combination of eQTL for predicting these gene expressions may not always be powerful to capture genetic associations. For example, EXOC3 is located in locus 5p15.33, 220 kb away from the known risk SNP rs9918259, with its flanking region containing 28 selected cis-eQTLs in adipose tissue, explaining 4.2% of the variability in its gene expression. When assessed by SKAT, this set of SNPs was significantly associated with BE/EAC with P 8.24 × 10–6 (FDR = 0.0365). The standard TWAS analysis yielded a larger P value, 1.81 × 10–5. Figure 3 shows the regional plots of the three novel loci that were discovered in BEACON/Cambridge discovery set and validated in Bonn data. Specifically, Fig. 3A shows a cluster of cis-eQTLs, located in the EXOC3 gene, that were individually associated with BE/EAC at a moderate level of significance (P 10–2–10–4). Previous meta-analysis identified rs9918259 as a risk SNP in CEP72/TPPP. Adjusting for rs9918259 in the SKAT regression model attenuated the P value for EXOC3 gene-set association from 8.24 × 10–6 to 2.98 × 10–3, suggesting that the SNP set of eQTL predicting EXOC3 gene expression may add new evidence for association at this locus.
Locus . | Gene . | Tissue . | Trait . | # eQTLs . | R2 . | Most significant GWAS SNP . | Distance (Mba) . | TWAS-Pb . | SKAT-Pc . | FDRd . | P-adje . |
---|---|---|---|---|---|---|---|---|---|---|---|
2p24.1 | LDAH | Junction | BE/EA | 25 | 0.179 | rs7255 | 0.005 | 1.94 × 10–4 | 7.11 × 10–8 | 2.37 × 10–3 | 0.429 |
2p24.1 | GDF7 | Blood | BE/EA | 15 | 0.079 | rs3072 | 0.012 | 0.146 | 3.38 × 10–7 | 5.67 × 10–3 | 0.641 |
3q27.1 | YEATS2 | Stomach | EA | 6 | 0.018 | rs9823696 | 0.368 | 0.160 | 5.18 × 10–6 | 0.030 | 0.942 |
3q27.1 | ABCF3 | Adipose | EA | 43 | 0.063 | rs9823696 | 0.120 | 0.223 | 1.45 × 10–5 | 0.049 | 0.339 |
9q22.32 | BARX1 | Adipose | BE/EA | 5 | 0.028 | rs11789015 | 0.002 | 4.35 × 10–6 | 3.03 × 10–6 | 0.022 | 0.386 |
15q21.3 | ALDH1A2 | Adipose | BE/EA | 19 | 0.023 | rs66725070 | 0.022 | 3.58 × 10–3 | 7.87 × 10–7 | 0.012 | 0.506 |
15q21.3 | AQP9 | Blood | BE/EA | 206 | 0.018 | rs2464469 | 0.068 | 0.573 | 8.08 × 10–6 | 0.037 | 0.054 |
19p13.11 | JUND | Blood | BE/EA | 347 | 0.011 | rs10419226 | 0.413 | 0.221 | 2.62 × 10–6 | 0.022 | 0.041 |
19p13.11 | SSBP4 | Blood | BE | 44 | 0.011 | rs10419226 | 0.273 | 0.904 | 1.82 × 10–6 | 0.019 | 0.006 |
19p13.11 | ISYNA1 | Blood | BE/EA | 206 | 0.028 | rs10419226 | 0.258 | 0.679 | 1.35 × 10–5 | 0.046 | 0.085 |
19p13.11 | KLHL26 | Blood | BE/EA | 32 | 0.015 | rs10419226 | 0.055 | 0.212 | 3.21 × 10–7 | 5.67 × 10–3 | 0.428 |
19p13.11 | CRTC1 | Adipose | BE/EA | 21 | 0.061 | rs10419226 | 0.009 | 0.130 | 6.61 × 10–8 | 2.37 × 10–3 | 0.002 |
19p13.11 | TMEM161A | Mucosa | BE/EA | 4 | 0.013 | rs10423674 | 0.412 | 4.25 × 10–3 | 9.02 × 10–10 | 1.07 × 10–4 | 0.203 |
Locus . | Gene . | Tissue . | Trait . | # eQTLs . | R2 . | Most significant GWAS SNP . | Distance (Mba) . | TWAS-Pb . | SKAT-Pc . | FDRd . | P-adje . |
---|---|---|---|---|---|---|---|---|---|---|---|
2p24.1 | LDAH | Junction | BE/EA | 25 | 0.179 | rs7255 | 0.005 | 1.94 × 10–4 | 7.11 × 10–8 | 2.37 × 10–3 | 0.429 |
2p24.1 | GDF7 | Blood | BE/EA | 15 | 0.079 | rs3072 | 0.012 | 0.146 | 3.38 × 10–7 | 5.67 × 10–3 | 0.641 |
3q27.1 | YEATS2 | Stomach | EA | 6 | 0.018 | rs9823696 | 0.368 | 0.160 | 5.18 × 10–6 | 0.030 | 0.942 |
3q27.1 | ABCF3 | Adipose | EA | 43 | 0.063 | rs9823696 | 0.120 | 0.223 | 1.45 × 10–5 | 0.049 | 0.339 |
9q22.32 | BARX1 | Adipose | BE/EA | 5 | 0.028 | rs11789015 | 0.002 | 4.35 × 10–6 | 3.03 × 10–6 | 0.022 | 0.386 |
15q21.3 | ALDH1A2 | Adipose | BE/EA | 19 | 0.023 | rs66725070 | 0.022 | 3.58 × 10–3 | 7.87 × 10–7 | 0.012 | 0.506 |
15q21.3 | AQP9 | Blood | BE/EA | 206 | 0.018 | rs2464469 | 0.068 | 0.573 | 8.08 × 10–6 | 0.037 | 0.054 |
19p13.11 | JUND | Blood | BE/EA | 347 | 0.011 | rs10419226 | 0.413 | 0.221 | 2.62 × 10–6 | 0.022 | 0.041 |
19p13.11 | SSBP4 | Blood | BE | 44 | 0.011 | rs10419226 | 0.273 | 0.904 | 1.82 × 10–6 | 0.019 | 0.006 |
19p13.11 | ISYNA1 | Blood | BE/EA | 206 | 0.028 | rs10419226 | 0.258 | 0.679 | 1.35 × 10–5 | 0.046 | 0.085 |
19p13.11 | KLHL26 | Blood | BE/EA | 32 | 0.015 | rs10419226 | 0.055 | 0.212 | 3.21 × 10–7 | 5.67 × 10–3 | 0.428 |
19p13.11 | CRTC1 | Adipose | BE/EA | 21 | 0.061 | rs10419226 | 0.009 | 0.130 | 6.61 × 10–8 | 2.37 × 10–3 | 0.002 |
19p13.11 | TMEM161A | Mucosa | BE/EA | 4 | 0.013 | rs10423674 | 0.412 | 4.25 × 10–3 | 9.02 × 10–10 | 1.07 × 10–4 | 0.203 |
aDistance between the gene and the most significant GWAS risk SNP identified from previous GWAS.
bP value derived from association analyses in standard TWAS.
cP value derived from association analyses in SKAT-TWAS.
dFDR based on P value derived from association analyses in SKAT-TWAS.
eP value derived from association analyses in SKAT-TWAS after adjusting for GWAS risk SNP.
The remaining seven eQTL sets in Table 1 are all >10 Mb away from the closest existing GWAS SNP; adjusting for existing GWAS SNPs did not reduce the gene-set association significance. Of top interest is HSP90AA1 (65, 66), which is located on chromosome 14, with no GWAS risk SNPs previously identified. This gene was identified by the association of 94 eQTLs in blood with BE. The regional plot in Fig. 3E shows widespread individual eQTL associations over a 1-Mb window around this gene, exemplifying the power of gene-set association in aggregating signals that may not reach genome-wide significance individually. Four eQTLs of the ZNF641 (67) gene were identified in esophageal junction, collectively associated with BE/EAC (P = 3.81 × 10–6), and also individually associated with BE/EAC with less significance (Fig. 3E). Table 2 shows loci identified by eQTL gene-set association analyses which have been previously linked to risk of BE and EAC. All are located within 0.5 Mb of an existing GWAS SNP; all but three (CRTC1, SSBP4, and JUND) became nonsignificant when adjusting for the closest GWAS SNP(s). Six eQTL sets including CRTC1, SSBP4, and JUND are from 19q13.11, a locus harboring three risk SNPs in CRTC1 that have been consistently detected in previous GWAS efforts. The top association in this locus is THEM161A (P = 9.02 × 10–10). Adjusting for the three known risk SNPs in the region (rs10423674, rs10419226, and rs199620551) does not completely remove association significance for CRTC1, JUND, and SSBP4, suggesting that this region may have independent risk alleles other than the three known risk SNPs. The locus 3q27.1 contains rs9823696, the only previous risk SNP linked to EAC but not BE. Rather than HTR3C and ABCC5, the nearest genes to this GWAS variant, gene-set analysis implicated YEATS2 and ABCF3. The remaining genes in Table 2 were reported in prior GWAS as candidate risk genes based on their proximity to index SNPs: LDAH/GDF7 (2p24.1), BARX1 (9q22.32), and ALDH1A2/AQP9 (15q21.3).
Three eQTL set associations in Table 1 were replicated using Bonn GWAS data
We evaluated all association signals identified in Table 1 using the SKAT method and summary statistics from the Bonn GWAS (with 1000 Genomes EUR LD structure; Table 3). Because the trait association analyses were based on HRC imputation, and the available Bonn summary statistics were imputed based on 1000 Genomes data (V3), a small number of SNPs were missing for some of the eight genes. eQTL SNP sets for three genes had evidence of replication in the Bonn data, using the Hochberg adjusted P < 0.05 as the threshold for confirming candidate associations: EXOC3, ZNF641, and HSP90AA1. In particular, EXOC3 (P = 0.0000185) and ZNF641 (P = 0.00378) remained significant even using the stringent Bonferroni correction. Twelve of 26 eQTLs for EXOC3 had a univariate P < 0.05 (minimum P 1.69 × 10–7). Three of four eQTLs for ZNF641 also have univariate P < 0.05.
Locus . | Gene . | #eQTLs . | Trait . | #snps in Bonn . | #snps (P < 0.05)a . | MinPb . | Bonn_Pc . | Hochberg_P . |
---|---|---|---|---|---|---|---|---|
5p15.33 | EXOC3 | 28 | BE/EA | 26 | 12 | 1.69 × 10–7 | 1.85 × 10–5 | 1.48 × 10–4 |
6q14.1 | SENP6 | 51 | BE/EA | 48 | 3 | 3.55 × 10–4 | 5.01 × 10–2 | 0.189 |
11q13.4 | KRTAP5-8 | 23 | EA | 20 | 2 | 8.02 × 10–4 | 0.162 | 0.324 |
12q13.11 | ZNF641 | 4 | BE/EA | 4 | 3 | 4.26 × 10–3 | 3.78 × 10–3 | 0.026 |
14q32.31 | HSP90AA1 | 94 | BE | 89 | 14 | 0.012 | 8.28 × 10–3 | 0.049 |
16q23.1 | CFDP1 | 5 | BE/EA | 5 | 1 | 0.028 | 0.053 | 0.189 |
16q23.1 | CHST5 | 21 | BE | 20 | 2 | 9.81 × 10–3 | 0.063 | 0.189 |
16q23.1 | BCAR1 | 335 | BE | 328 | 11 | 4.63 × 10–3 | 0.464 | 0.464 |
Locus . | Gene . | #eQTLs . | Trait . | #snps in Bonn . | #snps (P < 0.05)a . | MinPb . | Bonn_Pc . | Hochberg_P . |
---|---|---|---|---|---|---|---|---|
5p15.33 | EXOC3 | 28 | BE/EA | 26 | 12 | 1.69 × 10–7 | 1.85 × 10–5 | 1.48 × 10–4 |
6q14.1 | SENP6 | 51 | BE/EA | 48 | 3 | 3.55 × 10–4 | 5.01 × 10–2 | 0.189 |
11q13.4 | KRTAP5-8 | 23 | EA | 20 | 2 | 8.02 × 10–4 | 0.162 | 0.324 |
12q13.11 | ZNF641 | 4 | BE/EA | 4 | 3 | 4.26 × 10–3 | 3.78 × 10–3 | 0.026 |
14q32.31 | HSP90AA1 | 94 | BE | 89 | 14 | 0.012 | 8.28 × 10–3 | 0.049 |
16q23.1 | CFDP1 | 5 | BE/EA | 5 | 1 | 0.028 | 0.053 | 0.189 |
16q23.1 | CHST5 | 21 | BE | 20 | 2 | 9.81 × 10–3 | 0.063 | 0.189 |
16q23.1 | BCAR1 | 335 | BE | 328 | 11 | 4.63 × 10–3 | 0.464 | 0.464 |
aNumber of SNPs with univariate P < 0.05 in Bonn data.
bMinimum univariate P value in Bonn data.
cValidation SKAT P value in Bonn data.
Discussion
Advanced esophageal adenocarcinoma is a deadly disease with rising incidence in Western countries. International efforts including BEACON and other European studies have identified ∼20 susceptibility SNPs, though collectively these risk SNPs explain only a small portion of the genetic heritability. Polygenic risk scores based on GWAS hits have not been able to significantly improve prediction beyond environmental risk factors. In our view, current genetic studies have reached a plateau in discovery, largely due to the rarity of the cancer and the limited available sample sizes. In this work, we conducted TWAS for EAC and BE, using data from six etiologically relevant GTEx tissues. The standard TWAS method using predicted gene expression was compared with a novel approach that assessed gene-level eQTL set associations by SKAT. Individual-level genetic data from BEACON/Cambridge were used as the discovery set and summary statistics of genetic data from Bonn were used as the validation set. Using a significance threshold of FDR < 0.05 in the discovery set, standard TWAS identified no associations, while the SKAT approach yielded 13 eQTL set associations in 11 loci. Among them, 8 eQTL set associations (5 loci) are novel findings, either representing novel susceptibility regions without previously identified risk SNPs, or in the case of EXOC3, a novel signal independent of known risk SNPs in the neighborhood. Among the genes from the known loci, our results suggest that there are potentially susceptibility genes at 19p13.11 independent of the three known risk SNPs. Notably, the loci identified by eQTL set associations largely did not overlap with loci previously reported for gastrooesophageal reflux disease (GERD), a major risk factor for BE/EAC and a clinical trait by itself. For eQTLs in Tables 1 and 2, only one eQTL rs9636202 (ISYNA1) was found to be a GERD risk locus previously reported (68). These results represent a significant advance in identifying novel inherited genetic risk associations for BE/EAC, since the publication of the first GWAS meta-analysis in 2016 (22). Although we underscore the importance of discovering new candidate risk loci for a rare cancer with limited study samples available, we also acknowledge that caution is needed in interpreting eQTL set associations. Functional laboratory studies are essential to identify causal variants and genes that are driving observed associations.
One interesting observation is that there is no finding from the standard approach, while SKAT-eQTL produced 13 significant eQTL set associations. All set associations in Tables 1 and 2 have smaller P values from SKAT-eQTL, some substantially more significant (e.g., HSP90AA1 and KRTAP5-8, for which P values of the standard TWAS are greater than 0.05). This analysis is a single observation of applying two methods applied to GWAS data sets and, therefore, does not establish the power comparison between the two methods. A formal power comparison using simulated genetic data sets is necessary and will be pursued in future work. Here we merely conjecture why an eQTL set ensemble association may be advantageous in the context of BE/EAC studies. The prevailing theory for TWAS using imputed gene expression is that the genetic component of gene expression for genes relevant to disease etiology can be accurately predicted, using etiologically relevant tissue samples. As noted previously, however, a complicating factor is that most tissues are comprised of multiple cell types in varying abundance, and only a subset of these cell types—often representing a small fraction of overall cells—may be the most relevant to disease development. This indeed may be the case for BE/EAC, as candidates for the cell-of-origin include subpopulations of precursors in the GE junction or gastric cardia. That said, cancer development is a multifaceted process involving not only stem-like precursor cells but also the tissue microenvironment, comprised of and influenced by multiple constituent cell types. It is quite possible that for this reason, a more flexible gene-set association method, such as SKAT, may outperform standard TWAS. If this hypothesis is true, one may envision that the SKAT–eQTL association approach could be applied successfully on a wider scale and similarly help accelerate the discovery of novel loci for other types of cancers.
Compared with the standard GWAS analysis for individual SNP associations, the eQTL set–based aggregation approach provides better power in detecting loci that contain multiple eQTL association signals. In Supplementary Fig. S1, we show the locus zoom plots for the individual SNP associations combining the discovery and validation set from the eight loci identified in Table 1. None of the SNPs in these regions reached the P value cutoff of 5 × 10–8. However, the set-based method can assemble and detect multiple eQTL associations in a locus with single variants reaching moderate levels of significance.
Rarely used in TWAS, the discovery–validation design we adopted is partly driven by the multitude of analyses we intended to conduct using two analytical approaches and six potentially relevant tissues, and for three trait comparisons of interest, as well as the lack of access to individual-level genetic data in Bonn study. The latter is a limitation of the current analysis. Confirmation of discovery–stage associations, using an independent validation data set, reduces the possibility of false-positive discoveries. As noted before, though motivated by the goal of deciphering causal pathways of disease etiology, the original TWAS approach is not immune to spurious findings. This is particularly true for the SKAT–eQTL approach—it is essentially a gene-set association method assembling risk associations that are individually unlikely to survive the multiple-testing penalty. Three novel eQTL set associations (EXOC3, SENP6, and HSP90AA1) were validated by the Bonn summary data, each of which has multiple individual risk SNPs with a moderate level of association (Fig. 3).
We end our discussion with several other limitations and future work. First, while delivering several novel susceptibility signals, the power of our TWAS is limited by the sample size of BE/EAC cases due to its rarity. Compared with existing TWAS for other cancers, our sample size is much smaller. New population-based genetic studies are needed to improve power and further advance EAC genetic research. Second, our analyses were restricted to European ancestry participants. Although the incidence of EAC in whites is much higher than that in African Americans, future studies are needed for eQTL and TWAS in broader populations. Third, although our genetic findings are promising, experimental studies are needed to understand the mechanisms of genetic risk predisposition.
Authors' Disclosures
P. Gharahkhani reports grants from NHMRC (Investigator Grant) during the conduct of the study. R.C. Fitzgerald reports a patent for Cytosponge licensed to Medtronic; and Rebecca Fitzgerald is named on patents related to Cytosponge and related assays which have been licensed by the Medical Research Council to Covidien GI Solutions (now Medtronic) and is a co-founder of CYTED Ltd. These are not directly involved in the topic of this paper. D.A. Corley reports grants from NIH during the conduct of the study. L. Bernstein reports grants from NCI during the conduct of the study. N.J. Shaheen reports grants from NIH during the conduct of the study; grants from Lucid, Steris, Medtronic, Pentax, CDx Medical, Interpace Diagnostics, and Phathom, personal fees from Aqua, Cook Medical, Castle, Exact outside the submitted work. P.D. Pharoah reports grants from Cancer Research UK during the conduct of the study. P.G. Iyer reports grants from Exact Sciences, Pentax Medical, Cernostics, other support from Exact Sciences, CDx Medical, Ambu, and Cernostics outside the submitted work. M. Venerito reports personal fees from Nordic Pharma, Merck Serono, Bayer Vital, Lilly, grants and personal fees from Sirtex, Ipsen, BMS, MSD, Eisai, Amgen, and AstraZeneca outside the submitted work. M.M. Nöthen reports grants from Deutsche Forschungsgemeinschaft during the conduct of the study; personal fees from Life&Brain GmbH outside the submitted work. H. Manner reports grants from Erbe Elektromedizin, Germany outside the submitted work. D.C. Whiteman reports grants from the National Health and Medical Research Council of Australia during the conduct of the study. S. MacGregor reports grants from the Australian National Health and Medical Research Council during the conduct of the study. T.L. Vaughan reports grants from the Fred Hutch Cancer Center during the conduct of the study. No disclosures were reported by the other authors.
Authors' Contributions
X. Wang: Formal analysis, visualization, methodology, writing–original draft. P. Gharahkhani: Resources, investigation, writing–review and editing. D.M. Levine: Resources, investigation, writing–review and editing. R.C. Fitzgerald: Resources, investigation, writing–review and editing. I. Gockel: Resources, investigation, writing–review and editing. D.A. Corley: Resources, investigation, writing–review and editing. H.A. Risch: Resources, investigation, writing–review and editing. L. Bernstein: Resources, investigation, writing–review and editing. W.-H. Chow: Resources, investigation, writing–review and editing. L. Onstad: Resources, investigation, writing–review and editing. N.J. Shaheen: Resources, investigation, writing–review and editing. J. Lagergren: Resources, investigation, writing–review and editing. L.J. Hardie: Resources, investigation, writing–review and editing. A.H. Wu: Resources, investigation, writing–review and editing. P.D.P. Pharoah: Resources, investigation, writing–review and editing. G. Liu: Resources, investigation, writing–original draft. L.A. Anderson: Resources, investigation, writing–review and editing. P.G. Iyer: Resources, investigation, writing–review and editing. M.D. Gammon: Resources, investigation, writing–review and editing. C. Caldas: Resources, investigation, writing–review and editing. W. Ye: Resources, investigation, writing–review and editing. H. Barr: Resources, investigation, writing–review and editing. P. Moayyedi: Resources, investigation, writing–review and editing. R. Harrison: Resources, investigation, writing–review and editing. R.G.P. Watson: Resources, investigation, writing–review and editing. S. Attwood: Resources, investigation, writing–review and editing. L. Chegwidden: Resources, investigation, writing–review and editing. S.B. Love: Resources, investigation, writing–review and editing. D. MacDonald: Resources, investigation, writing–review and editing. J. deCaestecker: Resources, investigation, writing–review and editing. H. Prenen: Resources, investigation, writing–review and editing. K. Ott: Resources, investigation, writing–review and editing. S. Moebus: Resources, investigation, writing–review and editing. M. Venerito: Resources, investigation, writing–review and editing. H. Lang: Resources, investigation, writing–review and editing. R. Mayershofer: Resources, investigation, writing–review and editing. M. Knapp: Resources, investigation, writing–review and editing. L. Veits: Resources, investigation, writing–review and editing. C. Gerges: Resources, investigation, writing–review and editing. J. Weismüller: Resources, investigation, writing–review and editing. M. Reeh: Resources, validation, writing–review and editing. M.M. Nöthen: Resources, investigation, writing–review and editing. J.R. Izbicki: Resources, investigation, writing–review and editing. H. Manner: Resources, investigation, writing–review and editing. H. Neuhaus: Resources, investigation, writing–review and editing. T. Rösch: Resources, investigation, writing–review and editing. A.C. Böhmer: Resources, investigation, writing–review and editing. A.H. Hölscher: Resources, investigation, writing–review and editing. M. Anders: Resources, investigation, writing–review and editing. O. Pech: Resources, investigation, writing–review and editing. B. Schumacher: Resources, investigation, writing–review and editing. C. Schmidt: Resources, investigation, writing–review and editing. T. Schmidt: Resources, investigation, writing–review and editing. T. Noder: Resources, investigation, writing–review and editing. D. Lorenz: Resources, investigation, writing–review and editing. M. Vieth: Resources, investigation, writing–review and editing. A. May: Resources, investigation, writing–review and editing. T. Hess: Resources, investigation, writing–review and editing. N. Kreuser: Resources, investigation, writing–review and editing. J. Becker: Resources, investigation, writing–review and editing. C. Ell: Resources, investigation, writing–review and editing. I. Tomlinson: Resources, investigation, writing–review and editing. C. Palles: Resources, investigation, writing–review and editing. J.A. Jankowski: Resources, investigation, writing–review and editing. D.C. Whiteman: Resources, investigation, writing–review and editing. S. MacGregor: Resources, investigation, writing–review and editing. J. Schumacher: Resources, investigation, writing–review and editing. T.L. Vaughan: Supervision, writing–review and editing. M.F. Buas: Data curation, supervision, writing–review and editing. J.Y. Dai: Conceptualization, formal analysis, supervision, funding acquisition, methodology, writing–original draft.
Acknowledgments
J.Y. Dai and X. Wang were supported by the NCI grant R01 CA222833. M.F. Buas was supported by NCI grant R03CA223731. Computation is in part supported by NIH grant S10OD028685 to Fred Hutch Cancer Research Center. A full list of acknowledgments for consortia data used for this study is provided in the Supplementary Data.
The costs of publication of this article were defrayed in part by the payment of page charges. This article must therefore be hereby marked advertisement in accordance with 18 U.S.C. Section 1734 solely to indicate this fact.
Note: Supplementary data for this article are available at Cancer Epidemiology, Biomarkers & Prevention Online (http://cebp.aacrjournals.org/).