Abstract
Background: Colorectal cancer (CRC) in densely affected families without Lynch Syndrome may be due to mutations in undiscovered genetic loci. Familial linkage analyses have yielded disparate results; the use of exome sequencing in coding regions may identify novel segregating variants.
Methods: We completed exome sequencing on 40 affected cases from 16 multicase pedigrees to identify novel loci. Variants shared among all sequenced cases within each family were identified and filtered to exclude common variants and single-nucleotide variants (SNV) predicted to be benign.
Results: We identified 32 nonsense or splice-site SNVs, 375 missense SNVs, 1,394 synonymous or noncoding SNVs, and 50 indels in the 16 families. Of particular interest are two validated and replicated missense variants in CENPE and KIF23, which are both located within previously reported CRC linkage regions, on chromosomes 4 and 15, respectively.
Conclusions: Whole-exome sequencing identified DNA variants in multiple genes. Additional sequencing of these genes in additional samples will further elucidate the role of variants in these regions in CRC susceptibility.
Impact: Exome sequencing of familial CRC cases can identify novel rare variants that may influence disease risk. Cancer Epidemiol Biomarkers Prev; 22(7); 1239–51. ©2013 AACR.
Introduction
Colorectal cancer (CRC) is the third most common cancer and the third leading cause of cancer-related death in the United States for both men and women (1). Family history is a consistent risk factor (2); without CRC family history, the lifetime risk for an individual is 5% to 6%, but 10% to 15% if a first-degree relative has CRC (3–5) and 30% to 100% in familial genetic syndromes (6). Lynch Syndrome represents up to 5% of CRCs and results from germline mutations that affect DNA mismatch repair (MMR) genes MLH1, MSH2, MSH6, and PMS2. Tumors from these patients show a defective MMR (dMMR) phenotype manifested by DNA microsatellite instability (MSI) and absence of MMR protein expression (7, 8).
Beyond the known familial genetic syndromes, linkage studies have implicated several additional regions in CRC susceptibility, including 3q21-24, 4q21, 7q31, 8q13, 8q23, 8q24, 9q22-31, 10p14, 11q23, 12q24, 15q22, and 18q21 (9–21). Genome-wide association studies (GWAS) of CRC have reported evidence of many common risk variants in several genetic regions, including chromosomes 1q41, 3q26, 6p21, 6q25, 8q23, 8q24, 9p24, 10p14, 11q13, 12q13, 12q24, 14q22, 15q13, 16q22, 18q21, 19q13, 20p12, 20q13, and Xp22 (14, 15, 19, 21–30). However, results from linkage studies have not been consistent, and GWAS are not ideal for the identification of rare variants. Hypothesizing that coding regions may harbor rare variants segregating with susceptibility, we sequenced the exomes of 40 affected individuals from 16 CRC families. To our knowledge, this represents the first family-based application of massively parallel sequencing in this disease (31–36).
Materials and Methods
Study participants
We used the Colon Cancer Family Registry (Colon CFR), a National Cancer Institute (NCI)–supported consortium established to create an infrastructure for interdisciplinary studies of the genetic and molecular epidemiology of CRC (37–39). Families were enrolled via the Mayo Clinic (Rochester, MN), Memorial University of Newfoundland (St. John's, NL, Canada), the University of Southern California (Los Angeles, CA), or the University of Melbourne (Melbourne, VIC, Australia; ref. 37). Risk-factor data, blood samples, and pathology reports were collected on participants using standardized core protocols, and germline DNA was isolated from blood. We reviewed 66 pedigrees with 2 or more invasive CRC cases and no evidence of Lynch syndrome, MUTYH mutations (37), or familial adenomatous polyposis. Sixteen families were selected on the presumption of a genetic predisposition to disease, based on (i) large numbers of affected relatives and (ii) younger ages at diagnosis (Table 1). Forty affected individuals were chosen for sequencing based on genetic relatedness (preferring distant relatives), including 3 cases per family where possible (Supplementary Fig. S1). All aspects of this work received Institutional Review Board approval under the policies of the Colon CFR.
Table 1. Family characteristics, sequencing conditions, and number of shared variants identified
Family | N affected | Mean age at diagnosis (range) | N sequenced (relation) | Library capture | Sequencing platform | Nonsense, splice site: N variants (N private) | Indel: N variants (N private) | Missense: N variants (N private) | Other: N variants (N private) | Total: N variants (N private) |
---|---|---|---|---|---|---|---|---|---|---|
1 | 8 | 48.8 (25–72) | 2 (First cousins) | 36 Mb | GAIIx | 1 (0) | 4 (4) | 15 (14) | 39 (32) | 59 (50) |
2 | 4 | 49.0 (31–68) | 3 (Siblings) | 36 Mb | GAIIx | 3 (3) | 5 (5) | 22 (20) | 74 (57) | 104 (85) |
3 | 6 | 64.2 (51–69) | 3 (Siblings) | 36 Mb | GAIIx | 1 (0) | 2 (2) | 16 (16) | 47 (36) | 66 (54) |
4 | 6 | 58.0 (44–89)a | 2 (First cousins once removed) | 36 Mb | GAIIx | 2 (2) | 8 (7) | 15 (11) | 51 (38) | 76 (58) |
5 | 6 | 61.0 (50–79)a | 3 (Siblings) | 36 Mb | GAIIx | 1 (1) | 4 (2) | 28 (28) | 61 (54) | 94 (85) |
6 | 3 | 50.8 (39–71)a | 3 (Siblings) | 36 Mb | GAIIx | 3 (3) | 3 (3) | 17 (15) | 52 (45) | 75 (66) |
7 | 4 | 66.0 (48–73) | 3 (2 Siblings, 1 first cousin) | 36 Mb | GAIIx | 1 (1) | 1 (1) | 2 (1) | 8 (7) | 12 (10) |
8 | 6 | 64.2 (50–79) | 3 (Avuncular pair, first cousin) | 36 Mb | GAIIxb | 1 (0) | 2 (1) | 8 (6) | 42 (22) | 53 (29) |
9 | 5 | 51.0 (40–68)a | 3 (2 Siblings, 1 first cousin) | 36 Mb | GAIIxb | 3 (1) | 3 (1) | 7 (4) | 51 (27) | 64 (33) |
10 | 5 | 50.2 (28–66) | 2 (Avuncular pair) | 36 Mb | GAIIxb | 1 (0) | 8 (6) | 37 (33) | 173 (115) | 219 (154) |
11 | 6 | 66.3 (54–81)a | 2 (First cousins) | 36 Mb | HiSeq 2000 | 4 (2) | 6 (4) | 30 (26) | 120 (74) | 160 (106) |
12 | 3 | 49.7 (42–56) | 2 (Avuncular pair) | 36 Mb | HiSeq 2000 | 2 (2) | 9 (8) | 59 (57) | 169 (112) | 239 (179) |
13 | 5 | 61.4 (53–72) | 2 (First cousins) | 36 Mb | HiSeq 2000 | 2 (0) | 4 (3) | 19 (16) | 125 (79) | 150 (98) |
14 | 3 | 49.7 (34–63) | 2 (Avuncular pair) | 36 Mb | HiSeq 2000 | 9 (7) | 7 (5) | 28 (25) | 161 (111) | 205 (148) |
15 | 5 | 57.4 (45–67) | 2 (Avuncular pair) | 36 Mb | GAIIx, HiSeq 2000b | 5 (3) | 4 (2) | 43 (35) | 173 (140) | 225 (180) |
16 | 5 | 59.2 (56–67) | 3 (Siblings) | 50 Mb | HiSeq 2000 | 5 (5) | 8 (8) | 53 (51) | 291 (240) | 357 (304) |
a Age at diagnosis was unknown for 1 or 2 individuals; these individuals were excluded from the calculation.
b Samples in these families were sequenced twice and the results were pooled for analysis.
Library preparation, target capture, and sequencing
Because of the rapid pace of technological advances during the course of our experiments, library capture and sequencing conditions varied by family (Table 1). Exome capture was completed using Agilent's 36 Mb (n = 37 individuals) or 50 Mb (n = 3) All Human Exon chip. Libraries were sequenced once on Illumina's GAIIx (n = 19), twice on a GAIIx (n = 8), once on a HiSeq 2000 (n = 11), or once each on a GAIIx and HiSeq 2000 (n = 2). Each sample was run in a single lane of a flow cell; samples run twice were sequenced in separate flow cells, and BAM files from the separate runs were merged for analysis.
Libraries were prepared following manufacturers' protocols (Illumina and Agilent). Briefly, 3 μg of genomic DNA was fragmented to 150 to 200 bp using the Covaris E210 sonicator. The ends were repaired, and an A base was added to the 3′ ends. Paired-end DNA adapters (Illumina) with a single T base overhang at the 3′ end were ligated, and the resulting constructs were purified using AMPure SPRI beads from Agencourt. Adapter-modified DNA fragments were enriched by 4 cycles of PCR using PE 1.0 forward and PE 2.0 reverse primers (Illumina). Concentration and size distributions were determined on an Agilent Bioanalyzer DNA 1000 chip. Whole-exon capture used the protocol for Agilent's SureSelect Human All Exon kit (36 or 50 Mb). Five hundred nanograms of the prepared library was incubated with whole-exon biotinylated RNA capture baits supplied in the kit for 24 hours at 65°C. Captured DNA:RNA hybrids were recovered using Dynabeads MyOne Streptavidin T1 from Dynal, and DNA was eluted from the beads and purified using AMPure XP (Beckman). Purified capture products were amplified using the SureSelect GA PCR primers (Agilent) for 12 cycles. Libraries were loaded onto paired-end flow cells at concentrations of 6 to 8 pmol/L (GAIIx) or 4 to 5 pmol/L (HiSeq 2000) to generate cluster densities of 250,000 to 350,000 per tile (GAIIx) or 300,000 to 500,000 per mm² (HiSeq 2000) following Illumina's standard protocol using the Illumina cluster station and paired-end cluster kit version 4 (GAIIx) or the Illumina cBot and HiSeq paired-end cluster kit version 1 (HiSeq 2000).
Illumina GAIIx flow cells were sequenced as 101 × 2 paired-end indexed reads using SBS sequencing kit version 4 and SCS version 2.5 data collection software; base-calling used Illumina's Pipeline version 1.5.1. Illumina HiSeq 2000 flow cells were sequenced as 101 × 2 paired-end reads using TruSeq SBS sequencing kit version 1 and HiSeq 2000 data collection version 1.1.37.0; base-calling used Illumina's RTA version 1.7.45.0. Results from samples run in duplicate or triplicate were pooled for analysis.
Bioinformatics
Sequences were analyzed using TREAT (Targeted RE-sequencing and Annotation Tool) for sequence alignment, variant calling, functional prediction, and variant annotation (40). Reads were aligned to the human reference genome using BWA, and duplicate read levels were evaluated using the SAMtools rmdup method (41–43). The BWA alignment was improved using Genome Analysis Toolkit (GATK; ref. 44) local realignment. Single-nucleotide variants (SNV) were called using SNVMix (45), and indels were called by GATK with default parameter settings. A SNVMix posterior probability threshold of 0.8 was chosen for filtering based on analysis of a CEU (Utah residents with ancestry from northern and western Europe) sample sequenced by the 1000 Genomes Project (46). Variants located within the target regions were retained. SIFT (47) and SeattleSeq (http://snp.gs.washington.edu/SeattleSeqAnnotation131/) provided functional annotation. Read depths at each variant position and the average mapping quality score were generated by curating the BAM pile-up files using SAMtools (42). Potential splice variants were defined as those within 2 bp of exon–intron boundaries; eSplices were those within coding regions. We excluded reads and variants with poor mapping or quality scores (Phred-scaled quality score < 20 or probability score < 0.8) and required a minimum number of high-quality reads supporting alternative alleles (10 for SNVs and 3 for indels). During read alignment, we identified several reads that aligned to off-target coding regions with high mapping scores and expanded the target region to 80 Mb to include high-quality reads in the Agilent capture definition.
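The quality filters in this paragraph can be sketched as a simple predicate. This is a minimal illustration under stated assumptions, not the TREAT implementation: the `Call` record, its field names, and both helper functions are hypothetical, the probability threshold is applied to SNVs only, and exon coordinates are assumed to be inclusive.

```python
from dataclasses import dataclass

# Thresholds quoted in the text; everything else here is illustrative.
MIN_QUALITY = 20                          # Phred-scaled quality score
MIN_PROBABILITY = 0.8                     # SNVMix posterior probability
MIN_ALT_READS = {"snv": 10, "indel": 3}   # reads supporting the alternative allele


@dataclass
class Call:
    kind: str           # "snv" or "indel"
    quality: float      # Phred-scaled quality
    probability: float  # caller posterior probability (ignored for indels here)
    alt_reads: int      # high-quality reads supporting the alternative allele
    in_target: bool     # inside the capture target definition


def passes_filters(call: Call) -> bool:
    """Return True when a call survives the quality filters described above."""
    if not call.in_target:
        return False
    if call.quality < MIN_QUALITY:
        return False
    if call.kind == "snv" and call.probability < MIN_PROBABILITY:
        return False
    return call.alt_reads >= MIN_ALT_READS[call.kind]


def is_splice_candidate(position: int, exon_start: int, exon_end: int) -> bool:
    """Flag positions within 2 bp of either exon boundary (assumed inclusive)."""
    return min(abs(position - exon_start), abs(position - exon_end)) <= 2
```

For example, `passes_filters(Call("snv", 30, 0.95, 9, True))` is rejected because only 9 reads support the alternative allele, whereas the same call with 10 supporting reads is retained.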
Variant filtering and analysis
Shared variants (shared among all sequenced cases within each family) with minor allele frequency (MAF) < 0.05 (dbSNP Build 130) were identified and categorized as a nonsense or splice SNV, missense SNV, other SNV (synonymous or noncoding), or frame-shifting indel variant. We took 2 analysis approaches based on 2 questions: what novel genes harbor variants that may cause predisposition to CRC, and what variants can be found in genes and regions previously associated with CRC? First, as an agnostic approach, we excluded variants not likely to be disease-causing, based on the following criteria: (i) shared in 4 or more pedigrees (likely representing artifacts, reference sequence annotation errors, or newly identified common variants); (ii) MAF ≥ 0.01 in CEU populations (HapMap, 1000 Genomes, and Beijing Genome Institute, Hangzhou, China); (iii) annotation errors of nonsense and splice-site variants (variants incorrectly identified as nonsense or splice site were correctly categorized, then subjected to the standard exclusion criteria); (iv) prediction of pathogenicity (missense and indel variants predicted to be benign by PolyPhen or tolerated by SIFT); (v) indel variants in splice sites that did not alter the splice site. Second, we examined variants in a priori candidate genomic regions using less stringent filtering criteria. All variants shared among family members, even those with a low probability of causing disease, were included.
The candidate regions included: (i) 27 known CRC susceptibility genes (AKT1, APC, AXIN1, AXIN2, BLM, BMPR1A, BRCA1, BRCA2, CHEK2, GALNT12, MCC, MLH1, MLH3, MSH2, MSH3, MSH6, MUTYH, MYH11, PMS1, PMS2, PTEN, SMAD4, SMAD7, STK11, TGFB1, TGFBR2, and TP53); (ii) previously identified linkage regions (3q21-24, 4q21, 7q31, 8q13, 8q23, 8q24, 9q22-31, 10p14, 11q23, 12q24, 15q22, and 18q21); and (iii) GWAS regions (1q41, 3q26.2, 8q24, 9p24, 10p14, 11q23, 12q13, 14q22, 16q22, 18q21, 19q13, and 20p12; refs. 9–27). In 15 families, we examined regions with family-specific dominant or recessive logarithm of odds (LOD) scores > 1.3 found in multipoint linkage analysis using MERLIN version 1.1.2 (48) and genotype data from Affymetrix Linkage 2.0 or Illumina Linkage Panel 12 arrays, as described previously (12). Expected sharing was calculated as described by Feng and colleagues and compared with actual sharing for each family (49).
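The within-family sharing step and two of the agnostic exclusion criteria (sharing in 4 or more pedigrees, and MAF ≥ 0.01) can be sketched as set operations. This is an illustrative sketch only: the function names, the tuple representation of a variant, and the choice to treat variants absent from the frequency table as MAF 0 are assumptions, and the annotation-based criteria (iii to v) are not modeled.

```python
from collections import defaultdict


def shared_within_family(calls_by_sample):
    """Variants present in every sequenced case of one family.

    calls_by_sample maps a sample ID to a set of variant keys,
    e.g. (chrom, pos, ref, alt).
    """
    sample_sets = [set(calls) for calls in calls_by_sample.values()]
    return set.intersection(*sample_sets) if sample_sets else set()


def agnostic_filter(shared_by_family, maf, max_maf=0.01, max_families=3):
    """Drop variants shared in 4+ pedigrees or with MAF >= max_maf."""
    n_families = defaultdict(int)
    for variants in shared_by_family.values():
        for variant in variants:
            n_families[variant] += 1
    return {
        family: {
            v for v in variants
            # Assumption: a variant missing from the MAF table is treated as MAF 0.
            if n_families[v] <= max_families and maf.get(v, 0.0) < max_maf
        }
        for family, variants in shared_by_family.items()
    }
```

A variant seen in the shared set of four families is removed from all of them, which mirrors the rationale that such recurrent calls are more likely artifacts or unrecognized common variants than family-specific risk alleles.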
Technical validation and replication of select variants
Variants prioritized from the whole-exome sequencing were validated using Sanger sequencing. Primers were designed using the GRCh37/hg19 reference assembly for selected nonsense, splice site, and missense variants identified in families 1, 2, 3, 8, 9, 12, and 16. Sequencing was carried out on 16 individuals within the families who had undergone whole-exome sequencing (WES) to validate the variants, and on 31 available relatives with DNA to examine segregation of the variant among additional CRC- and polyp-affected and unaffected relatives. Briefly, 25 ng of leukocyte DNA was amplified in a 15 μL PCR containing 7.5 μL of GoTaq master mix (Promega) and 5 pmol/L of each primer (available on request). Reactions were cycled on a Bio-Rad iCycler (Bio-Rad) using the following profile: 94°C for 2 minutes, followed by 45 cycles of 94°C for 15 seconds, 60°C for 15 seconds, and 72°C for 15 seconds, with a final extension at 72°C for 5 minutes. PCR reactions were subsequently cleaned up using Montage PCR96 Cleanup plates (Millipore) according to the manufacturer's guidelines. PCR product (0.5 μL) was then used in an 8 μL sequencing reaction comprising 0.4 μL BigDye Terminator v3.1, 1.4 μL 5× reaction buffer, and 1.5 pmol/L of either primer. Reactions were cycled at 96°C for 1 minute, followed by 25 cycles of 96°C for 10 seconds, 50°C for 5 seconds, and 60°C for 90 seconds. Before running on an ABI3100 genetic analyzer (Applied Biosystems), sequencing reactions were cleaned up using Xterminator reagent (Applied Biosystems) according to the manufacturer's instructions. Resultant sequences were analyzed using SeqMan software (DNASTAR).
Nonparametric linkage analyses used MERLIN version 1.1.2 and nonparametric Kong & Cox LOD (NPL) scores were computed for validated SNVs (48).
Results
Comparison of exome capture and sequencing platforms
We completed germline exome sequencing of 40 cases from 16 CRC families (Table 1). Cases were selected to be distant relatives to decrease the number of shared, nonsusceptibility variants. Sequencing technologies advanced rapidly during our work; both capture and sequencing technologies were updated, providing an opportunity for comparison across platforms. As expected, more variants were identified in samples captured with the 50-Mb chip than with the 36-Mb chip. Most variant types increased modestly, with 3 notable exceptions: intergenic indels increased by 21.6-fold, and indels near the 3′ and 5′ ends of a gene increased by 15- and 4.3-fold, respectively, when using the 50-Mb compared with the 36-Mb chip (Supplementary Table S1). These increases likely reflect the expanded target of the 50-Mb chip; similar increases are expected for future versions targeting more untranslated regions (UTR). Samples run twice on a GAIIx showed an approximately 2-fold increase in the number of reads and variants identified compared with those run once (Supplementary Fig. S2A and Table 1). Samples sequenced twice on a GAIIx also reached coverage similar to that of samples run once on a HiSeq 2000 (Supplementary Fig. S2B).
Agnostic search for novel loci identifies candidate genes
As described in the Materials and Methods, variant filtering was applied to the full whole-exome sequence dataset. We found that, on average, affected cases within families shared 33 nonsense, splice-site, indel, or missense variants (Table 1). The number of these shared variants differed greatly by family and platform, from as few as 4 in family 7 to as many as 70 in family 12. Across all variant categories, the majority of shared variants (75.8%) were synonymous or noncoding (intronic or intergenic). Missense variants represented 18.5% of all variants, whereas indels and nonsense or splice-site variants represented 3.6% and 2.1% of variants, respectively. On average, related cases shared approximately 3 nonsense or splice-site variants within a family, of which 2 were private.
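The summary figures in this paragraph can be recomputed from the per-family counts in Table 1. The sketch below transcribes those counts and takes the per-family average and range to refer to the nonsense or splice-site, indel, and missense columns combined; that reading, and the variable names, are assumptions made for illustration. The recomputed category shares agree with the quoted percentages to within about 0.1 percentage point.

```python
# Per-family shared-variant counts transcribed from Table 1 (families 1-16).
nonsense_splice = [1, 3, 1, 2, 1, 3, 1, 1, 3, 1, 4, 2, 2, 9, 5, 5]
indel = [4, 5, 2, 8, 4, 3, 1, 2, 3, 8, 6, 9, 4, 7, 4, 8]
missense = [15, 22, 16, 15, 28, 17, 2, 8, 7, 37, 30, 59, 19, 28, 43, 53]
other = [39, 74, 47, 51, 61, 52, 8, 42, 51, 173, 120, 169, 125, 161, 173, 291]

counts = {
    "nonsense_splice": nonsense_splice,
    "indel": indel,
    "missense": missense,
    "other": other,
}

# Share of each category among all shared variants, in percent.
total = sum(sum(values) for values in counts.values())
share = {name: 100 * sum(values) / total for name, values in counts.items()}

# Shared variants per family, excluding the "Other" column.
per_family = [n + i + m for n, i, m in zip(nonsense_splice, indel, missense)]
mean_per_family = sum(per_family) / len(per_family)

# Families are numbered from 1 in Table 1.
family_min = per_family.index(min(per_family)) + 1
family_max = per_family.index(max(per_family)) + 1

for name, value in share.items():
    print(f"{name}: {value:.1f}%")
print(f"mean per family: {mean_per_family:.1f}")
print(f"min {min(per_family)} (family {family_min}), max {max(per_family)} (family {family_max})")
```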
Thirty-five nonsense or splice-site variants shared among affected family members were identified, each in a unique gene (Table 2). Although most of these variants were found in only one family, multiple families shared the one variant observed in each of SHROOM3, CDC27, ARSD, H2BFM, and TMC2.
Table 2. Shared nonsense and splice site SNVs
Chr | Position | rsID | Gene | DNA change | AA change | Families |
---|---|---|---|---|---|---|
1 | 85,546,961 | — | WDR63 | G/T | GLU>stop | 6 |
1 | 148,932,885 | rs1048214 | LOC645166 | C/T | GLN>stop | 5 |
1 | 152,277,622 | — | FLG | G/T | SER>stop | 9 |
2 | 166,904,221 | — | SCN1A | G/T | TYR>stop | 16 |
3 | 40,231,748 | — | MYRIP | A/T | LYS>stop | 15 |
3 | 75,787,516 | — | ZNF717 | C/G | Splice Site | 4 |
4 | 77,660,829 | rs73826426 | SHROOM3 | C/A | TYR>stop | 8, 10 |
6 | 49,425,475 | — | MUT | G/A | ARG>stop | 6 |
6 | 121,560,230 | — | C6orf170 | G/A | ARG>stop | 2 |
6 | 168,226,602 | — | C6orf124 | C/T | TRP>stop | 7 |
7 | 76,751,068 | rs71555938 | CCDC146 | G/C | TYR>stop | 16 |
10 | 123,846,924 | — | TACC2 | C/T | GLN>stop | 4 |
10 | 135,490,903 | rs36130162 | DUX1 | C/T | ARG>stop | 16 |
11 | 1,857,515 | — | SYT8 | G/A | Splice Site | 14 |
12 | 4,870,307 | rs61758971 | GALNT8 | C/T | GLN>stop | 14 |
12 | 109,690,964 | — | ACACB | G/A | Splice Site | 14 |
13 | 24,243,246 | — | TNFRSF19 | C/T | ARG>stop | 12 |
14 | 21,779,981 | — | RPGRIP1 | A/G | Splice Site | 16 |
14 | 58,832,019 | rs62621193 | ARID4A | G/A | Splice Site | 14 |
15 | 75,562,499 | — | GOLGA6C | C/T | GLN>stop | 14 |
15 | 78,807,407 | — | AGPHD1 | T/A | TYR>stop | 14 |
16 | 24,873,990 | — | SLC5A11 | G/A | TRP>stop | 2 |
16 | 84,495,318 | rs4782970 | ATP2C2 | A/C | Splice Site | 15 |
17 | 45,234,277 | — | CDC27 | A/C | Splice Site | 13, 14 |
18 | 14,513,786 | rs8095431 | POTEC | T/C | Splice Site | 4 |
19 | 11,943,225 | — | ZNF440 | C/T | ARG>stop | 12 |
19 | 40,195,184 | — | LGALS14 | G/A | Splice Site | 16 |
19 | 43,699,204 | — | PSG4 | C/A | GLU>stop | 4 |
19 | 56,459,551 | — | NLRP8 | C/T | ARG>stop | 6 |
20 | 2,597,716 | — | TMC2 | A/T | Splice Site | 1, 9 |
20 | 30,226,904 | — | COX4I2 | T/G | Splice Site | 14 |
22 | 44,287,073 | — | PNPLA5 | G/A | GLN>stop | 2 |
X | 2,832,668 | — | ARSD | A/G | Stop>GLN | 9, 15 |
X | 55,185,663 | — | FAM104B | T/G | Splice Site | 15 |
X | 103,294,760 | rs2301384 | H2BFM | C/T | GLN>stop | 11, 13, 14 |
There were 375 missense variants and 70 indels shared among affected family members after filtering. Three hundred fifty-eight of the missense SNVs (95%) and 62 of the indels (89%) were private, 10 missense SNVs and 8 indels were present in 2 families, and 7 missense SNVs were found in 3 families (Table 3). Two genes had 2 variants each (CTBP2 and MUC6); the variants were shared between the same families. In both genes, the variants were less than 50 bp apart and likely due to an inherited haplotype in the families. Seventeen genes had more than 1 missense SNV, including 6 variants in CDC27 (Supplementary Table S2). Private missense and indel variants are shown in Supplementary Tables S3 and S4, respectively.
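The haplotype observation for CTBP2 and MUC6 rests on two checks: the paired variants lie within 50 bp of each other, and they are shared by the same set of families. The sketch below applies both checks to a few rows transcribed from the table of variants shared in multiple families; the function name and the row layout are hypothetical, and passing these checks only suggests, rather than demonstrates, a shared haplotype.

```python
from itertools import combinations

# (gene, chromosome, position, families) transcribed from the multi-family table.
rows = [
    ("CTBP2", "10", 126_678_112, (2, 15)),
    ("CTBP2", "10", 126_678_148, (2, 15)),
    ("MUC6", "11", 1_017_337, (4, 16)),
    ("MUC6", "11", 1_017_338, (4, 16)),
    ("IGSF3", "1", 117_142_736, (1, 4, 12)),
]


def candidate_haplotype_pairs(rows, max_distance=50):
    """Return (gene, distance in bp) for nearby variant pairs seen in the same families."""
    pairs = []
    for a, b in combinations(rows, 2):
        same_gene = a[0] == b[0] and a[1] == b[1]
        same_families = set(a[3]) == set(b[3])
        distance = abs(a[2] - b[2])
        if same_gene and same_families and distance < max_distance:
            pairs.append((a[0], distance))
    return pairs


print(candidate_haplotype_pairs(rows))  # [('CTBP2', 36), ('MUC6', 1)]
```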
Table 3. Missense and indel variants shared in multiple families
Chr | Position | rsID | Gene | DNA change | AA change | Families |
---|---|---|---|---|---|---|
1 | 117,142,736 | — | IGSF3 | A/G | ILE>THR | 1, 4, 12 |
1 | 145,293,515 | rs12565078 | NBPF10 | A/G | ASN>SER | 11, 15 |
1 | 148,023,040 | — | NBPF14 | G/C | SER>CYS | 4, 7, 9 |
1 | 154,171,908 | — | C1orf189 | TC/- | FS | 9, 15 |
6 | 31,324,603 | rs66519358 | HLA-B | T/- | FS | 9, 10 |
7 | 76,619,625 | rs2302541 | PMS2P11 | C/T | ARG>CYS | 12, 14, 15 |
7 | 99,434,077 | rs61469810 | CYP3A43 | A/- | FS | 8, 11 |
10 | 118,215,310 | — | PNLIPRP3 | -/A | FS | 4, 14 |
10 | 126,673,560 | — | ZRANB1 | -/A | FS | 5, 15 |
10 | 126,678,112 | — | CTBP2 | T/C | ASN>SER | 2, 15 |
10 | 126,678,148 | — | CTBP2 | G/C | ALA>GLY | 2, 15 |
11 | 1,017,337 | — | MUC6 | T/C | THR>ALA | 4, 16 |
11 | 1,017,338 | — | MUC6 | C/A | GLN>HIS | 4, 16 |
11 | 5,172,795 | — | OR52A1 | -/C | FS | 5, 11 |
11 | 71,529,890 | — | FAM86C and DEFB108Ba | A/T | ILE>LYS | 6, 15 |
12 | 9,581,791 | rs4763566 | DDX12 | T/C | LYS>GLU | 11, 13, 15 |
14 | 19,378,312 | rs61969158 | OR11H12 | T/G | VAL>GLY | 11, 13, 14 |
16 | 85,132,883 | — | FAM92B | A/C | PHE>VAL | 6, 10 |
19 | 41,622,107 | rs11399890 | CYP2F1 | -/C | FS | 10, 14 |
19 | 44,778,796 | — | ZNF233 | T/- | FS | 12, 13 |
19 | 58,385,748 | — | ZNF814 | G/A | ALA>VAL | 9, 10 |
22 | 18,846,088 | rs9605845 | GGT3P and DGCR6a | A/G | MET>THR | 8, 13 |
X | 2,832,715 | rs73632953 | ARSD | T/C | LYS>ARG | 9, 15 |
X | 55,185,656 | rs5003001 | FAM104B | C/A | ARG>ILE | 8, 10, 15 |
X | 68,725,640 | rs1171942 | FAM155B | T/C | LEU>PRO | 10, 11, 14 |
a Intergenic variant; the 2 closest genes are listed.
Synonymous, intronic, and intergenic variants were the most abundant, with 1,394 shared among affected family members after filtering. Most of these were private (86%), whereas 191 were detected in 2 or more families. Over half of the variants were in genes without any other variant present (n = 837); the remaining 557 variants were found in 152 genes (range, 2–31 variants/gene). Summarizing across variant types, 46 genes had at least 4 variants (Table 4).
Table 4. Genes with multiple variants shared in affected family members: number of variants
Gene(s) | Nonsense and splice | Missense | Indel | Other | Total |
---|---|---|---|---|---|
ZNF717 | 1 | — | — | 30 | 31 |
ANKRD30BL | — | — | — | 25 | 25 |
FRG1B, NCAPG2 | — | — | — | 18 | 18 |
CDC27 | 1 | 6 | — | 8 | 15 |
CTBP2 | — | 4 | — | 9 | 13 |
KIR2DS4 | — | — | — | 10 | 10 |
ROCK1P1, TTTY23/GYG2P1a | — | — | — | 9 | 9 |
HLA-DRB1 | — | 1 | 1 | 6 | 8 |
MST1P2, MUC12 | — | — | — | 8 | 8 |
ARSD | 1 | 1 | — | 5 | 7 |
ACHE, BAGE/BAGE4, KCNJ12, MST1P9 | — | — | — | 7 | 7 |
FAM104B | 1 | 2 | — | 4 | 7 |
BCL8, KIR3DL3, LOC642846, MUC3A, NBPF10, RACGAP1P | — | — | — | 6 | 6 |
AQP7P1, SIGLEC16, CROCCP2, FANK1, KIR2DL1, KRT16P2/TNFRSF13Ba, KRTAP5-4 | — | — | — | 5 | 5 |
C6orf10 | — | 2 | — | 3 | 5 |
NBPF12 | — | 3 | — | 2 | 5 |
CFTR | — | 3 | — | 1 | 4 |
ADAM6, HLA-DRB5, HLA-DRB6, HSP90AB2P, MED12, NBPF1, NBPF9, PCDHB17, POLA1, RPGR, TBC1D3P2, WASH2P | — | — | — | 4 | 4 |
a Intergenic variant; the 2 closest genes are listed.
Pedigree structures of 5 families suggested recessive inheritance; these families were separately investigated to identify genes with homozygous variant alleles or compound heterozygosity (Supplementary Table S5). In family 2, 5 genes were identified with multiple variants. Three had only noncoding variants, whereas one (CTBP2) harbored 2 missense variants and a 5′-UTR variant and another gene (PDE4DIP) harbored 2 indels. In family 3, 3 genes had multiple variants. One gene contained only noncoding variants; the remaining genes had a missense and a noncoding variant (PYROXD1) or a missense variant and an indel (PTPN9). In family 6, 3 genes harbored multiple variants; however, all variants were noncoding. In family 11, 19 genes had multiple variants. Fourteen of these genes had only noncoding variants, one had 2 indels (HLA-DQA1), 2 had missense variants (DDX12, MUC2), and the remaining 2 had a combination of variants (NBPF10, ZNF717). In family 13, 11 genes harbored multiple variants; variants in 10 of the genes were noncoding, whereas GGT3P had one missense and one noncoding variant.
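The recessive screen described here amounts to grouping each family's shared variants by gene and flagging genes that carry a homozygous variant or more than one distinct variant. A minimal sketch follows, assuming a hypothetical input of `(gene, variant_id, is_homozygous)` tuples; the variant identifiers in the usage example are placeholders, not variants from the study. Exome data alone do not establish phase, so genes with multiple variants are only candidates for compound heterozygosity.

```python
from collections import defaultdict


def recessive_candidates(shared_variants):
    """Group one family's shared variants by gene and flag recessive candidates.

    shared_variants: iterable of (gene, variant_id, is_homozygous) tuples.
    Returns genes with any homozygous variant and genes with 2+ distinct variants.
    """
    variants_by_gene = defaultdict(set)
    homozygous_genes = set()
    for gene, variant_id, is_homozygous in shared_variants:
        variants_by_gene[gene].add(variant_id)
        if is_homozygous:
            homozygous_genes.add(gene)
    multiple = {gene for gene, ids in variants_by_gene.items() if len(ids) >= 2}
    return {"homozygous": homozygous_genes, "multiple_variants": multiple}


# Placeholder identifiers for illustration only.
example = [
    ("GENE_A", "varA1", False),
    ("GENE_A", "varA2", False),
    ("GENE_B", "varB1", True),
    ("GENE_C", "varC1", False),
]
result = recessive_candidates(example)
```

In this example `GENE_A` is returned as a multiple-variant candidate and `GENE_B` as a homozygous candidate, while `GENE_C` is not flagged.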
Technical validation of select variants
Thirty-one variants identified in 7 families were selected for technical validation and segregation studies by Sanger sequencing. Additional variants in the families were not tested because of the presence of homologous sequences, because the variant had been identified as a common sequencing error, or because the gene was hypervariable. Of the 31 variants tested, 27 were validated in the previously exome-sequenced individuals and 4 were found to be false positives, including 2 nonsense variants (SCN1A and SHROOM3) and 2 indels (B3GNT6 and RBMX; Table 5).
Validation of identified variants
GRCh37 Chr:Position | Gene name | dbSNP130 | Variant type | Family | Technical validation of exome sequenced CRC-affected carriers |
---|---|---|---|---|---|
20:2,597,716 | TMC2 | — | Splice site | 1 | 2/2 |
4:100,130,075 | ADH6 | rs149932401 | Missense | 1 | 2/2 |
4:104,030,143 | CENPE | — | Missense | 1 | 2/2 |
19:48,735,017 | CARD8 | rs146319637 | Frameshift | 1 | 2/2 |
4:57,204,689 | AASDH | — | Frameshift | 1 | 2/2 |
6:121,560,230 | C6orf170 | — | Nonsense | 2 | 3/3 |
16:24,873,990 | SLC5A11 | — | Nonsense | 2 | 3/3 |
22:44,287,073 | PNPLA5 | — | Nonsense | 2 | 3/3 |
3:187,003,786 | MASP1 | — | Missense | 2 | 3/3 |
3:186,331,094 | AHSG | — | Missense | 2 | 3/3 |
12:70,088,219 | BEST3 | — | Frameshift | 2 | 3/3 |
11:64,543,927 | SF1 | rs34514973 | Frameshift | 2 | 3/3 |
19:55,327,891 | KIR3DL1 | rs71367103 | Upstream | 3 | 3/3 |
15:75,798,025 | PTPN9 | — | Frameshift | 3 | 3/3 |
4:77,660,829 | SHROOM3 | rs73826426 | Nonsense | 8, 9 | 0/3 |
13:24,243,246 | TNFRSF19 | — | Nonsense | 12 | 2/2 |
19:11,943,225 | ZNF440 | — | Nonsense | 12 | 2/2 |
19:44,778,796 | ZNF233 | — | Frameshift | 12 | 2/2 |
16:336,700 | PDIA2 | rs201624048 | Frameshift | 12 | 2/2 |
2:196,661,361 | DNAH7 | — | Frameshift | 12 | 2/2 |
7:128,587,351 | IRF5 | rs60344245 | Deletion | 12 | 2/2 |
7:15,601,409 | AGMO | — | Frameshift | 12 | 2/2 |
20:34,215,234 | CPNE1 | rs76294482 | Frameshift | 12 | 2/2 |
19:40,195,184 | LGALS14 | — | Nonsense | 16 | 3/3 |
2:166,904,221 | SCN1Aa | — | Nonsense | 16 | 0/3 |
14:21,779,981 | RPGRIP1 | — | Splice site | 16 | 3/3 |
15:69,732,770 | KIF23 | — | Missense | 16 | 3/3 |
3:178,960,766 | KCNMB3 | rs143962239 | Frameshift | 16 | 3/3 |
9:43,844,264 | CNTNAP3B | — | Frameshift | 16 | 3/3 |
11:76,751,603 | B3GNT6a | — | Frameshift | 16 | 0/3 |
X:135,960,146 | RBMXa | — | Frameshift | 16 | 0/3 |
aVariants in bold were not validated and considered false positives.
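As a quick arithmetic check on the validation counts reported above (31 variants tested, 27 validated), the fraction of tested calls that failed Sanger validation can be computed directly. The function name is illustrative, and the figure describes only the variants selected for validation, not the full call set.

```python
def false_positive_fraction(n_tested, n_validated):
    """Fraction of Sanger-tested variant calls that failed validation."""
    if n_tested <= 0:
        raise ValueError("n_tested must be positive")
    if not 0 <= n_validated <= n_tested:
        raise ValueError("n_validated must lie between 0 and n_tested")
    return (n_tested - n_validated) / n_tested


# Counts taken from the validation results in the text
fraction = false_positive_fraction(n_tested=31, n_validated=27)
print(f"{fraction:.1%}")  # 12.9%
```

The result, about 13%, matches the false discovery rate quoted in the Discussion.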
Segregation analysis of validated variants
For all validated variants, additional affected and nonaffected family members, as well as family members with polyps, were Sanger sequenced to determine cosegregation (Table 6). Only one variant (PTPN9) was not replicated in any of the new samples; others were replicated in 1 to 6 additional family members. Several of the variants seemed to segregate with affection status, such as TMC2, ADH6, CENPE, AASDH, C6orf170, AHSG, SF1, RPGRIP1, and KIF23. Particularly interesting are the variants in CENPE and KIF23. Both are very rare; the KIF23 variant is seen only once in the ESP database of European Americans, whereas the variant in CENPE is not present in any public database. Nonparametric linkage (NPL) LOD scores were calculated for validated SNVs. The maximum possible NPL score was less than 2.5 for all families (Supplementary Table S6). No variants had an observed NPL LOD score greater than 1, possibly because of the few individuals and families with available data for analysis.
Replication and segregation of validated variants
GRCh37 Chr:Position | Gene name | dbSNP130 | Variant type | Family | Segregation: additional CRC-affected carriers | Segregation: unaffected carriers | Segregation: polyp-affected carriers |
---|---|---|---|---|---|---|---|
20:2,597,716 | TMC2 | — | Splice site | 1 | 2/3a | 0/1 | 3/4 |
4:100,130,075 | ADH6 | rs149932401 | Missense | 1 | 3/3a | 0/1 | 3/4 |
4:104,030,143 | CENPE | — | Missense | 1 | 3/3a | 0/1 | 3/4b |
19:48,735,017 | CARD8 | rs146319637 | Frameshift | 1 | 2/3a | 1/1c | 2/4 |
4:57,204,689 | AASDH | — | Frameshift | 1 | 2/3a | 0/1 | 4/4 |
6:121,560,230 | C6orf170 | — | Nonsense | 2 | 0/0 | 0/2 | 4/7 |
16:24,873,990 | SLC5A11 | — | Nonsense | 2 | 1/1a | 1/2 | 5/7 |
22:44,287,073 | PNPLA5 | — | Nonsense | 2 | 1/1a | 1/2 | 3/7 |
3:187,003,786 | MASP1 | — | Missense | 2 | 0/0 | 1/2 | 2/7 |
3:186,331,094 | AHSG | — | Missense | 2 | 1/1a | 1/2 | 3/7 |
12:70,088,219 | BEST3 | — | Frameshift | 2 | 1/1a | 0/2 | 5/7 |
11:64,543,927 | SF1 | rs34514973 | Frameshift | 2 | 1/1a | 1/2 | 4/7 |
19:55,327,891 | KIR3DL1 | rs71367103 | Upstream | 3 | 2/2 | 1/1 | n/a |
15:75,798,025 | PTPN9 | — | Frameshift | 3 | 0/2 | 0/1 | n/a |
13:24,243,246 | TNFRSF19 | — | Nonsense | 12 | 1/1 | 1/4 | 1/1 |
19:11,943,225 | ZNF440 | — | Nonsense | 12 | 1/1 | 0/4 | 0/1 |
19:44,778,796 | ZNF233 | — | Frameshift | 12 | 1/1b | 3/4 | 1/1 |
16:336,700 | PDIA2 | rs201624048 | Frameshift | 12 | 1/1 | 1/4 | 0/1 |
2:196,661,361 | DNAH7 | — | Frameshift | 12 | 1/1 | 1/4 | 1/1 |
7:128,587,351 | IRF5 | rs60344245 | Deletion | 12 | 1/1 | 4/4 | 1/1 |
7:15,601,409 | AGMO | — | Frameshift | 12 | 1/1 | 1/4 | 0/1 |
20:34,215,234 | CPNE1 | rs76294482 | Frameshift | 12 | 0/1 | 3/4 | 0/1 |
19:40,195,184 | LGALS14 | — | Nonsense | 16 | 0/1 | 1/3 | n/a |
14:21,779,981 | RPGRIP1 | — | Splice site | 16 | 1/1 | 0/3 | n/a |
15:69,732,770 | KIF23 | — | Missense | 16 | 1/1 | 0/3 | n/a |
3:178,960,766 | KCNMB3 | rs143962239 | Frameshift | 16 | 0/1 | 2/3 | n/a |
9:43,844,264 | CNTNAP3B | — | Frameshift | 16 | 1/1 | 3/3 | n/a |
NOTE: All individuals with available DNA in each family (excluding the original WES samples) were tested for each variant (family 1, n = 8; family 2, n = 10; family 3, n = 3; family 12, n = 6; family 16, n = 5). Only results of successful sequencing are reported in the table.
aIncludes one obligate carrier.
bAt least 1 individual is homozygous for the variant.
cHas stomach cancer.
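The carrier fractions in Table 6 can be read programmatically. The sketch below parses cells such as `3/3a` or `n/a` and applies one deliberately strict rule: every additional CRC-affected relative tested carries the variant and no unaffected relative does. That rule is an assumption made for illustration, not the criterion the authors used, and it ignores the polyp column.

```python
import re


def parse_fraction(cell):
    """Parse 'k/n' with an optional footnote letter; return (k, n), or None for 'n/a'."""
    match = re.fullmatch(r"(\d+)/(\d+)[a-z]?", cell.strip())
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


def segregates(affected_cell, unaffected_cell):
    """Illustrative strict rule: all tested affected carry, no unaffected carry."""
    affected = parse_fraction(affected_cell)
    unaffected = parse_fraction(unaffected_cell)
    if affected is None or unaffected is None:
        return False
    carriers_aff, tested_aff = affected
    carriers_unaff, _ = unaffected
    return tested_aff > 0 and carriers_aff == tested_aff and carriers_unaff == 0


# A few rows from Table 6: (additional CRC-affected carriers, unaffected carriers)
rows = {
    "ADH6": ("3/3a", "0/1"),
    "CARD8": ("2/3a", "1/1c"),
    "PTPN9": ("0/2", "0/1"),
    "KIF23": ("1/1", "0/3"),
}
flags = {gene: segregates(*cells) for gene, cells in rows.items()}
print(flags)
```

Because the rule is strict, it would not reproduce the list of segregating variants given in the text exactly; for example, a row with `0/0` additional affected carriers or `2/3` carriers is rejected here.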
Search of the known susceptibility genes and regions also identifies CENPE and KIF23
In addition to the agnostic search for novel loci, we investigated 27 known or suspected high-risk and familial CRC genes and several candidate regions. As in our previous analyses, we required that all affected family members share the variants. However, we did not exclude any variants beyond that, as the genes and regions we were targeting are well-documented risk regions and we did not want to overlook any potential candidate variants. Our selection of non-MMR families was effective: no shared variants were observed in MLH1, MLH3, MSH2, MSH3, MSH6, or PMS1. Two SNPs in PMS2 were identified in families 5, 6, and 9 (Supplementary Table S7). However, both SNPs were common and expected to be tolerated. BRCA2 had missense (n = 3) and synonymous or intronic (n = 3) variants with a MAF between 1% and 5%; 5 of the SNVs were found only in family 12, increasing the likelihood that the region containing the variants was inherited as a haplotype block. MCC harbored 2 SNVs, a missense variant resulting in a glycine-to-arginine substitution in family 13, and a noncoding variant found in 5 families. BRCA1 harbored a glutamine-to-arginine substitution (rs1799950) predicted to be damaging; the same rare allele has been associated with a decreased risk of developing breast cancer, but it has not been described in colon cancer previously (50). APC, AXIN2, GALNT12, MYH11, and TP53 each had one noncoding variant. No SNVs or indels were shared among affected family members in the remaining 12 HCC genes (Supplementary Table S7).
Previously, we reported 4 regions linked to CRC with heterogeneity logarithm of odds scores greater than 3.0 in 356 families, including 15 of the currently studied families (12). Two regions, 4q21.1 and 15q22.31, harbored variants (Supplementary Table S8). The 4q21.1 region contained 5 shared variants, including the SHROOM3 nonsense variant, which was found to be a false positive, and 4 noncoding variants. The 15q22.31 region contained 2 missense variants (CGNL1 and KIF23) and 5 noncoding variants. No variants were found in the other linkage regions examined. Family-specific linkage analysis yielded LOD scores greater than 1.3 in 10 regions in 5 of the families sequenced (Supplementary Table S9). None of the regions contained a gene with a shared nonsense or splice-site variant, and 6 of the regions harbored only noncoding variants. The linkage peak on chromosome 4 harbored 2 missense SNVs in family 1, one each in ADH6 and CENPE. In family 2, 2 linkage peaks on chromosomes 1 and 3 harbored missense variants in WDR47, AHSG, and MASP1. The linkage peaks in family 5 contained 2 missense variants (CFTR and ZC3HC1) and 2 noncoding variants. In addition, although variants responsible for disease in densely affected families are expected to differ from modest-penetrance variants, we investigated SNVs within the ±500 kb regions surrounding SNPs shown to be associated with CRC risk in GWAS (14, 15, 19, 21–27). One indel variant was found in TPD52L3 within the 9p24 region in family 5; however, this family does not carry the identified risk allele at rs719725. Twenty-one missense and 45 synonymous or noncoding variants were also identified in the GWAS regions; however, many were common (MAF > 5%) and not likely to contribute to CRC genesis (Supplementary Table S10). Thus, we were able to identify 2 variants of interest (CENPE and KIF23) in the regions previously implicated in CRC risk that were also identified by our earlier agnostic search.
These 2 variants were validated and replicated in the affected families, strengthening the results.
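The ±500 kb windows around GWAS SNPs mentioned above amount to a simple interval test on chromosome and position. The following is a minimal sketch under stated assumptions: the SNP names and coordinates are hypothetical placeholders, not the GWAS loci examined in this study, and all positions are assumed to be on the same genome build.

```python
def gwas_hits(chrom, pos, gwas_snps, flank=500_000):
    """Return names of GWAS SNPs whose +/- `flank` bp window contains the variant."""
    return sorted(
        name
        for name, (snp_chrom, snp_pos) in gwas_snps.items()
        if snp_chrom == chrom and abs(pos - snp_pos) <= flank
    )


# Hypothetical GWAS index SNPs: name -> (chromosome, position)
gwas_snps = {
    "snp_A": ("9", 6_000_000),
    "snp_B": ("15", 30_000_000),
}

print(gwas_hits("9", 6_400_000, gwas_snps))   # ['snp_A']  (400 kb away)
print(gwas_hits("9", 6_600_000, gwas_snps))   # []         (600 kb away)
print(gwas_hits("15", 6_000_000, gwas_snps))  # []         (different chromosome)
```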
Discussion
This analysis of whole-exome sequence data in 16 high-risk CRC families shows the use of massively parallel exome sequencing to identify novel candidate genes for complex diseases. We have enumerated potential novel variants as well as those in prior candidate genes and regions. After excluding variants not shared among affected family members, common variants (MAF ≥ 0.01), and those expected to be benign or tolerated, several remained, including protein-truncating mutations in genes involved in cell shape and motility (ZRANB1), mitosis (CDC27, CENPE, DDX12, HAUS6/FAM29A, HIST1H2BE, KIF23, TACC2, and ZC3HC1), transcription regulation (CTBP2, IRF5, MED12, RNF111, SF1, TLE1, TLE4, TRIP4), and the immune response (BTNL2, BAGE, CARD8, FANK1, KIR2DL1, KIR2DS4, KIR3DL3, MASP1, and NLRP8; refs. 51–77), as well as numerous missense and indel variants.
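The three exclusion steps named above (sharing among affected relatives, MAF ≥ 0.01, predicted benign or tolerated) can be written as a single predicate. This is a sketch rather than the study's actual pipeline: the field names, the prediction labels, and the choice to treat variants with no frequency data as rare are all assumptions made for illustration.

```python
BENIGN_LABELS = {"benign", "tolerated"}


def keep_variant(variant, n_sequenced, maf_cutoff=0.01):
    """Keep variants that are shared by all sequenced cases, rare, and not predicted benign."""
    shared = variant["n_carriers"] == n_sequenced
    maf = variant.get("maf")
    # Assumption: a variant absent from frequency databases counts as rare.
    rare = maf is None or maf < maf_cutoff
    not_benign = variant.get("prediction") not in BENIGN_LABELS
    return shared and rare and not_benign


# Hypothetical variants from a family with 3 sequenced cases
candidates = [
    {"id": "var1", "n_carriers": 3, "maf": None, "prediction": "damaging"},
    {"id": "var2", "n_carriers": 2, "maf": None, "prediction": "damaging"},
    {"id": "var3", "n_carriers": 3, "maf": 0.04, "prediction": "damaging"},
    {"id": "var4", "n_carriers": 3, "maf": 0.001, "prediction": "tolerated"},
]
kept = [v["id"] for v in candidates if keep_variant(v, n_sequenced=3)]
print(kept)  # ['var1']
```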
It is likely that some of the identified genes are causal. We divided the variants into 3 categories, based on the likelihood of causing a loss of protein or protein function: the most likely to be causal (nonsense and splice site), those with an elevated risk of being deleterious (missense) and those with the lowest likelihood of being damaging (synonymous and noncoding). Given the lengthy list of candidate genes, the possibility of false-positive results, and the paucity of functional information, additional targeted sequencing studies in a large set of independent cases and controls are warranted. Targeted sequencing of novel candidate genes (e.g., those with nonsense or splice-site variants) in at least 1,000 familial CRC cases would be an informative next step.
This study represents only one point in the journey of identifying genetic predisposition to CRC. CRC is highly heterogeneous and polygenic, unlike the very distinctive Mendelian diseases for which whole-exome sequencing has been successful. Studies directed at identifying candidate susceptibility genes for familial CRC are not readily yielding causal variants (78–81). We debated family selection strategies. Without defined criteria for optimal selection methods, identifying the families and individuals best suited for exome sequencing proved more challenging than expected. We based selection on what we believed would maximize the chance of including families with a high-risk genetic predisposition, following widely held tenets: multiple, closely related affected relatives and younger ages at diagnosis (49, 82). To investigate this further, we compared the observed proportion of shared variants to the expected proportion of shared variants. Significant increases in nonsense, missense, and indel variants (Supplementary Table S11) strengthened our belief that the methods for choosing families were suitable.
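One way to picture the observed-versus-expected sharing comparison is with the standard result that, in an outbred pedigree, a rare heterozygous variant in one person is expected to be carried by a relative with probability equal to twice their kinship coefficient (1/2 for siblings, 1/4 for avuncular pairs, 1/8 for first cousins). The sketch below compares a hypothetical observed count against that expectation with an exact binomial upper tail. It is not the method behind Supplementary Table S11, which is not described here, and the binomial model treats variants as independent, which linked variants are not.

```python
from math import comb

# Expected probability that a relative carries a given rare heterozygous variant
# (2 x kinship coefficient, assuming no inbreeding).
EXPECTED_SHARING = {"siblings": 0.5, "avuncular": 0.25, "first_cousins": 0.125}


def upper_tail(k_shared, n_variants, p_expected):
    """Exact P(X >= k_shared) for X ~ Binomial(n_variants, p_expected)."""
    return sum(
        comb(n_variants, i) * p_expected**i * (1 - p_expected) ** (n_variants - i)
        for i in range(k_shared, n_variants + 1)
    )


# Hypothetical counts: 40 of 100 rare variants shared by an avuncular pair
p_value = upper_tail(k_shared=40, n_variants=100,
                     p_expected=EXPECTED_SHARING["avuncular"])
print(f"P(X >= 40) = {p_value:.2e}")
```

A small tail probability would indicate more sharing than relatedness alone predicts, which is the direction of the enrichment reported in the text.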
For our families, we chose to sequence 2 individuals when distantly related cases were available and 3 individuals when only closely related individuals (siblings) were available, following the recommendations of Feng and colleagues for studying complex diseases by sequencing (49). Model-based approaches, such as estimation of expected LOD scores, could also have been used (83). Missing information on earlier generations meant that sequenced samples may be connected through unaffected relatives (in one avuncular pair, additional data became available confirming this to be the case), showing the challenges of incomplete penetrance and phenocopies in studying cancer and complex traits. Sequencing of nonaffected family members to help distinguish between causal and benign variants was also discussed; however, the penetrance for CRC in families that meet Amsterdam criteria (84), but do not have MMR defects (type X), is lower than for Lynch Syndrome (85). This makes sequencing of unaffected relatives less useful than in disorders with complete or very high penetrance.
Our study has weaknesses. First, it involves only a small number of families, chosen to include those with no evidence of Lynch syndrome, MUTYH mutations, or familial adenomatous polyposis. Several other candidate families were considered; however, funding was only available for a small number of families. As sequencing costs decline, combining with other collections will be more feasible and needed to identify additional genes of interest. Second, for each family, only a limited number of affected individuals were available for sequencing. The relationships of those selected were preferentially chosen to be cousins or avuncular. However, for several families, the only individuals with DNA available were siblings, reducing power to detect causal variants. Third, we had a false discovery rate of approximately 13%, which is higher than expected on the basis of previous studies. This may in part be due to the fact that rare variants, such as the ones we chose to validate, have a lower rate of validation than more common variants (86), highlighting the critical need for Sanger validation. It is interesting to note that the 3 false-positive variants identified were all in the same family (family 12), which was the only one using the 50 Mb capture system. Multiple factors may have contributed to the false positives identified in this family, including degraded DNA for the individuals tested, increased target size resulting in localized areas of decreased coverage, or misalignment due to poor probe design. Fourth, samples were not all subject to the same capture or sequencing conditions, resulting in increased coverage for samples sequenced toward the study's end. It is possible that some variants detected in later families were present in the earlier families, but went undetected, skewing our perceptions of the allele frequencies.
Differences in capture and sequencing technologies also likely affect public databases; numerous variants identified had little available frequency information. Finally, the most appropriate method to filter for causal variants in complex diseases is unknown. We first narrowed the number of variants by filtering those not shared among the affected family members, which may have excluded causal variants that do not perfectly cosegregate with disease. We used several strategies, including examining candidate genes and regions, looking for genes with multiple variants, and agnostic searching for novel loci. We restricted our search to rare variants, hypothesizing that genes important for the development of CRC will harbor several private variants.
In summary, we have completed exome sequencing of 40 familial CRC cases from 16 families and identified and technically validated several candidate CRC variants. Follow-up studies to determine the frequency of variants in many of the identified genes are currently underway. Further sequencing and functional studies will be needed to confirm the identified genes and determine their role in the genesis of CRC.
Disclosure of Potential Conflicts of Interest
No potential conflicts of interest were disclosed.
Disclaimer
The content of this article does not necessarily reflect the views or policies of the National Cancer Institute or any of the collaborating centers in the CFRs, nor does mention of trade names, commercial products, or organizations imply endorsement by the U.S. Government or the CFR.
Authors' Contributions
Conception and design: M.A. Jenkins, R.W. Haile, M.O. Woods, S.N. Gallinger, J.D. Potter, S.N. Thibodeau, E.L. Goode
Development of methodology: M.S. DeRycke, S.M. Riska, B.W. Eckloff, M.A. Jenkins, M.O. Woods, S.N. Thibodeau
Acquisition of data (provided animals, acquired and managed patients, provided facilities, etc.): S.R. Gunawardena, B.W. Eckloff, J.M. Cunningham, M.S. Cicek, D. Buchanan, M. Clendenning, R.W. Haile, S.N. Gallinger, G. Casey, J.D. Potter, P.A. Newcomb, L. Le Marchand, N.M. Lindor, S.N. Thibodeau, E.L. Goode
Analysis and interpretation of data (e.g., statistical analysis, biostatistics, computational analysis): M.S. DeRycke, S.R. Gunawardena, S. Middha, Y.W. Asmann, D.J. Schaid, S.K. McDonnell, S.M. Riska, B.L. Fridley, D.J. Serie, M.S. Cicek, D.J. Duggan, M. Clendenning, R.W. Haile, S.N. Thibodeau, E.L. Goode
Writing, review, and/or revision of the manuscript: M.S. DeRycke, S.R. Gunawardena, S. Middha, Y.W. Asmann, S.K. McDonnell, J.M. Cunningham, M.S. Cicek, M.A. Jenkins, D.J. Duggan, D. Buchanan, M. Clendenning, R.W. Haile, M.O. Woods, S.N. Gallinger, G. Casey, J.D. Potter, P.A. Newcomb, L. Le Marchand, N.M. Lindor, S.N. Thibodeau, E.L. Goode
Administrative, technical, or material support (i.e., reporting or organizing data, constructing databases): M.S. DeRycke, S.R. Gunawardena, W.R. Bamlet, N.M. Lindor, S.N. Thibodeau
Study supervision: S.N. Thibodeau, E.L. Goode
Grant Support
This work was supported by the National Cancer Institute, NIH under RFA # CA-95-011 and through cooperative agreements with members of the Colon CFR and Principal Investigators. Collaborating centers include the Australian Colorectal CFR (UO1 CA097735), the Familial Colorectal Neoplasia Collaborative Group (UO1 CA074799), Mayo Clinic Cooperative Family Registry for Colon Cancer Studies (UO1 CA074800), Ontario Registry for Studies of Familial Colorectal Cancer (UO1 CA074783), and University of California, Irvine Informatics Center (UO1 CA078296).
The costs of publication of this article were defrayed in part by the payment of page charges. This article must therefore be hereby marked advertisement in accordance with 18 U.S.C. Section 1734 solely to indicate this fact.