Abstract
The role of DNA repair in the initiation, promotion, and progression of malignancy suggests that variations in DNA repair genes confer altered cancer risk. Accordingly, DNA repair gene variants have been studied extensively in the context of cancer predisposition. Single nucleotide polymorphisms (SNPs) are the most common genetic variations in the human genome. A fraction of SNPs are located within genes and are therefore likely to alter gene expression and function. SNPs that change the encoded amino acid sequence of proteins (non-synonymous SNPs; nsSNPs) are potential determinants of genetic disease. However, because not all amino acid substitutions are expected to change protein function, the functional consequences of amino acid substitutions need to be predicted a priori, before their possible association with a trait is studied together with other genetic and environmental factors. Here we report the analysis of nsSNPs in 88 DNA repair genes and their functional evaluation based on the conservation of amino acids among the protein family members. Our analysis demonstrated that >30% of the variants of DNA repair proteins are highly likely to affect the function of the proteins drastically. Three nsSNPs that were predicted to have functional consequences (XRCC1-R399Q, XRCC3-T241M, XRCC1-R280H) have already been found to be associated with cancer risk. The strategy developed and applied in this study has the potential to identify functional protein variants of the DNA repair pathway that may be associated with cancer predisposition.
Introduction
Nuclear DNA is under constant DNA damage stress induced by both endogenous (such as reactive oxygen species) and exogenous sources (such as irradiation). Proper recognition and repair of the DNA damage are essential for normal homeostasis and functioning of multicellular organisms (1, 2). DNA repair activities are maintained by the presence of five different DNA damage sensor and repair mechanisms (homologous recombinational repair, non-homologous end-joining, nucleotide excision repair, base excision repair, and mismatch repair). Defects in the DNA repair pathways are often associated with excessive cell death (by apoptosis) or transformation of the cells (1, 2), and variations in DNA repair genes were hypothesized to modify individual and population cancer risk (3).
To date, much success has been obtained in the identification of high-penetrance cancer predisposition genes using linkage analysis. However, the remaining challenge is to identify alleles that confer low to moderate cancer risk. It is hypothesized that genetic variation contributes to the susceptibility to complex traits such as cancer (4–6). Molecular epidemiological and genetic approaches use single nucleotide polymorphisms (SNPs) in the human genome to study disease susceptibility. Because genome-wide scans are still challenging, a candidate gene/pathway approach may often prove more efficient. Given the enormous number of SNPs, systematic prioritization on the basis of biological function and relevance to cancer will accelerate the identification of such susceptibility alleles (4).
SNPs are the most common form of genetic variation in the human genome (5–8). They are relatively stably inherited genomic variations with an estimated density of 1 in 1000 bp. SNPs are usually bi-allelic, their occurrence rates vary across genomic regions, and their allelic frequencies may differ among ethnic groups. A fraction of SNPs alter the encoded amino acid sequence (non-synonymous SNPs; nsSNPs) and have the potential to affect the structure, function, and interactions of proteins. Thus, nsSNPs are excellent candidates for candidate-gene association studies (7). However, because not all nsSNPs are anticipated to have functional consequences, it is essential to develop strategies to select the variations that may alter or disrupt the proper functions of the proteins. Studying the functional consequences of genetic variants has been challenging due to the enormous number of variants present in the genome. Although there is an increasing effort to establish in vivo functional strategies for studying the effects of variants, such strategies are still far from being available for a large number of variants of interest. Recently, several approaches have been developed and used to study the nature of the genetic variants (9–15). Among these, computational tools provide an efficient and high-throughput way to select variants for in vivo functional analyses and/or population studies. SIFT (Sorting Intolerant From Tolerant) (10, 11) is a powerful tool that predicts the functional importance of an amino acid based on the alignment of highly similar proteins (orthologous, paralogous, or both) with the protein of interest. The predictions rely on whether an amino acid is conserved (or substituted only by similar amino acids) in the protein family, which can suggest its importance for the function/structure of the protein.
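The conservation idea behind such predictions can be illustrated with a toy sketch. This is not SIFT's actual scoring, which builds position-specific probabilities from a database search; it only shows how the residues observed at one aligned position can be tallied. The function names and the example column are hypothetical.

```python
from collections import Counter


def column_profile(column):
    """Relative frequency of each residue in one alignment column (gaps ignored)."""
    residues = [aa for aa in column if aa != "-"]
    if not residues:
        return {}
    counts = Counter(residues)
    return {aa: n / len(residues) for aa, n in counts.items()}


def seen_in_family(column, variant_aa):
    """True if the variant residue occurs at this position in any aligned protein."""
    return variant_aa in column_profile(column)


# Hypothetical column: the same position taken from eight aligned homologs.
column = "RRRRRKRR"
profile = column_profile(column)
print(profile)
print(seen_in_family(column, "K"), seen_in_family(column, "Q"))
```

In this toy column, a substitution to lysine is seen in the family, whereas a substitution to glutamine is not; a conservation-based predictor would treat the latter as more likely to be deleterious.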
Here, using the public SNP databases, we have identified a wide range of DNA repair nsSNPs, and we have carried out a computational study to characterize the evolutionary importance of these DNA repair nsSNPs. This study has the potential to provide a pool of functional SNPs, which may play important roles in the predisposition to cancer as well as other DNA repair-associated genetic diseases.
Methods
Database Mining for SNPs
The list of DNA repair pathway genes studied was obtained from the CGAP-GAI web site (Internet address: http://lpgws.nci.nih.gov/html-cgap/cgl/DNA_damage.html).
dbSNP. Internet address: http://www.ncbi.nlm.nih.gov/SNP/.
HGVbase. Internet address: http://hgvbase.cgb.ki.se/.
CGAP-GAI. Internet address: http://lpgws.nci.nih.gov/.
SNP500. Internet address: http://snp500cancer.nci.nih.gov/home.cfm.
GeneSNP. Internet address: http://www.genome.utah.edu/genesnps/.
M. Edmenson, K. Buetow. The BLAST against gene transcripts tool (unpublished). Internet address: http://lpgws.nci.nih.gov:80/perl/blast2.
NCBI BLAST. Internet address: http://www.ncbi.nlm.nih.gov/BLAST/.
UniGene. Internet address: http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=unigene.
Mutation Data Set
Mutations with known functional consequences were retrieved from the SWISS-PROT database (Internet address: http://us.expasy.org/sprot/).
Evolutionary Conservation Analysis
Protein conservation analysis was performed using the SIFT tool (Internet address: http://blocks.fhcrc.org/sift/SIFT_seq_submit2.html).
Statistical Analysis
The statistical analyses were done using a χ² test (22) with the Yates correction for 2 × 2 tables. The test was conducted at the α = 0.05 level of significance. It was applied to examine possible significant differences in the evolutionary conservation status of the altered amino acids between the mutation and DNA repair nsSNP data sets, and between the rare and common DNA repair nsSNPs.
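A minimal sketch of a Yates-corrected χ² test on a 2 × 2 table is given below, using only the Python standard library. The function name is hypothetical, and the way the counts are grouped is an assumption: damaging plus possibly damaging versus tolerated plus possibly tolerated, taken from the SIFT prediction tables of this paper, with the rare/common comparison restricted to validated nsSNPs that have reliable predictions. The authors' exact tabulation may have differed.

```python
import math


def yates_chi2_2x2(a, b, c, d):
    """Yates-corrected chi-square statistic and P value (1 df) for [[a, b], [c, d]]."""
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        raise ValueError("every row and column total must be non-zero")
    # Continuity correction: shrink |ad - bc| by n/2, but never below zero.
    diff = max(abs(a * d - b * c) - n / 2.0, 0.0)
    chi2 = n * diff ** 2 / denom
    # Upper tail of the chi-square distribution with 1 df: erfc(sqrt(chi2 / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p_value


# Assumed grouping: (damaging + possibly damaging, tolerated + possibly tolerated).
# Mutations (132 + 44, 41 + 13) versus DNA repair nsSNPs (11 + 28, 28 + 39).
chi2_mut, p_mut = yates_chi2_2x2(176, 54, 39, 67)

# Rare (4 + 11, 12 + 22) versus common (0 + 3, 5 + 5) validated nsSNPs.
chi2_freq, p_freq = yates_chi2_2x2(15, 34, 3, 10)

print(f"mutations vs nsSNPs: chi2 = {chi2_mut:.2f}, P = {p_mut:.2e}")
print(f"rare vs common:      chi2 = {chi2_freq:.3f}, P = {p_freq:.2f}")
```

Under these assumed groupings, the first comparison is significant at α = 0.05 and the second is not.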
Results
We compiled a total of over 1000 SNP entries from 88 DNA repair genes using five web-based public SNP databases (see “Methods”). Extensive manual inspection of all SNP entries showed at least one gene-specific nsSNP in 51.1% (45 of 88) of the proteins (a total of 150 nsSNPs resulting in an amino acid substitution). Four of the nsSNPs were unique to the CGAP-GAI database. No nsSNP was unique to the SNP500 database. Most of the nsSNPs were found in dbSNP (n = 128, 85.3%), GeneSNP (n = 105, 70.0%), and HGVbase (n = 89, 59.3%). The average number of nsSNPs for genes with at least one nsSNP was 3.3. Among all the genes studied, ATM had the highest number of nsSNPs (n = 19).
In this study, we used a modified interpretation of the SIFT algorithm results to define the nature of the variations (see “Methods”). To determine the sensitivity of the modified SIFT interpretation, we used a panel of 231 missense mutations supported by functional evidence (see “Methods”; Table 1). For all but one mutation, the alignment contained at least six proteins (n = 230). Of the mutations in this group, 57.39% were predicted as damaging and 19.13% as possibly damaging, whereas 17.83% and 5.65% were predicted as tolerated and possibly tolerated, respectively. Thus, the sensitivity of the modified SIFT predictions (damaging together with possibly damaging) reported in this study was 76.52%.
SIFT prediction | Mutations (n = 230), n (%) | nsSNPs* (n = 106), n (%) | Validated nsSNPs† (n = 68), n (%) |
---|---|---|---|
Damaging | 132 (57.39) | 11 (10.38) | 5 (7.36) |
Possibly damaging | 44 (19.13) | 28 (26.41) | 15 (22.06) |
Possibly tolerated | 13 (5.65) | 39 (36.80) | 30 (44.11) |
Tolerated | 41 (17.83) | 28 (26.41) | 18 (26.47) |
Note: This table contains the variations (mutations and nsSNPs) for which a reliable SIFT prediction was available (≥6 similar proteins in the alignment).
*Includes all the nsSNPs independent of their validation status.
†Includes validated nsSNPs only.
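The sensitivity figure can be reproduced from the mutation column of Table 1. The short sketch below is only an arithmetic check; the variable names are illustrative and not part of the original analysis.

```python
# Counts from the mutation column of Table 1 (n = 230).
mutation_counts = {
    "damaging": 132,
    "possibly damaging": 44,
    "possibly tolerated": 13,
    "tolerated": 41,
}

total = sum(mutation_counts.values())

# Sensitivity as defined in the text: damaging together with possibly damaging.
flagged = mutation_counts["damaging"] + mutation_counts["possibly damaging"]
sensitivity = flagged / total

for label, count in mutation_counts.items():
    print(f"{label}: {100 * count / total:.2f}%")
print(f"sensitivity: {100 * sensitivity:.2f}%")
```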
We also applied the modified SIFT predictions to our panel of 150 nsSNPs in DNA repair genes. For 44 of the 150 variants, the predictions were based on an alignment of fewer than six sequences and were considered inconclusive (NP nsSNPs). Reliable predictions were obtained for 106 (70.6%) nsSNPs, and the results are shown in Table 1. Within this group, 11 (10.38%) nsSNPs were predicted to damage protein function. Twenty-eight of the 106 variants (26.41%) were predicted as possibly damaging, indicating that they are likely to have functional consequences as well. On the other hand, 67 nsSNPs (63.2%) were predicted as either tolerated or possibly tolerated by our SIFT analysis. SIFT detected a significantly higher proportion of damaging alterations (including the possibly damaging alterations) in the mutation panel than in the DNA repair nsSNP panel (P < 0.0001) (Table 1).
Frequency information of 102 (68.0%) of 150 nsSNPs (footnote 13)
A small number of nsSNPs were screened in population(s) but could not be detected; we still report them because they may have escaped validation by being either ethnic group-specific or rare nsSNPs.
SIFT prediction | Rare nsSNPs (≤5%) (n = 78)*, n (%) | Common nsSNPs (>5%) (n = 17)*, n (%) | Rare nsSNPs (≤5%) (n = 49)†, n (%) | Common nsSNPs (>5%) (n = 13)†, n (%) |
---|---|---|---|---|
NP | 29 (37.18) | 4 (23.52) | | |
Damaging | 4 (5.13) | 0 (0.00) | 4 (8.16) | 0 (0.00) |
Possibly damaging | 11 (14.11) | 3 (17.64) | 11 (22.46) | 3 (23.08) |
Tolerated | 12 (15.38) | 5 (29.41) | 12 (24.49) | 5 (38.46) |
Possibly tolerated | 22 (28.20) | 5 (29.41) | 22 (44.89) | 5 (38.46) |
Note: n stands for number of SNPs. The percentages of the SIFT predictions within rare and common nsSNPs are given within parentheses.
*All validated nsSNPs regardless of their SIFT results.
†All validated nsSNPs with reliable SIFT predictions.
Among the rare nsSNPs, 4 were predicted as damaging and 11 as possibly damaging (Table 3). None of the 17 nsSNPs with allelic frequencies above 5% were predicted to be damaging, whereas 3 of them (IGHMBP2-T671A; XRCC1-R399Q; XRCC3-T241M) were predicted to be possibly damaging (Table 3). The ERCC4-P379S (HGVbase SNP ID: SNP000000067; Ref. 23) and XRCC1-R280H (SNP000000031/rs25489; see also GeneSNP entry) variants were predicted as damaging and possibly damaging by SIFT analysis, respectively, although their reported minor allele frequencies were inconsistent (Table 3).
Gene symbol | SNP ID | nsSNP | Frequency range* | SIFT prediction
---|---|---|---|---|
ATM | rs1800059 | S1691R | s | possibly damaging |
ATM | rs1800060 | V2079I | s | possibly damaging |
ATM | rs1800061 | G2287A | s | possibly damaging |
ATM | rs1137889 | N3003D | s | possibly damaging |
ERCC1 | rs3188420 | P77H | s | possibly damaging |
ERCC3 | SNP000063371/rs1805162 | G402C | 1 | damaging |
ERCC4 | SNP000064450/rs2020961 | A168V | 1 | damaging |
ERCC4 | SNP000000067 | P379S | 1/2† | damaging |
ERCC4 | SNP000002737 | I706T | 1 | possibly damaging |
ERCC4 | SNP000002795 | E875G | 1 | possibly damaging |
ERCC5 | rs1047769 | M254V | 1 | possibly damaging |
FANCA | SNP000002991/rs1800282 | V6D | s | possibly damaging |
FANCC | SNP000003086/rs1800364 | L190F | s | possibly damaging |
FANCC | SNP000003087/rs1800365 | D195V | s | possibly damaging |
IGHMBP2 | SNP000012785/rs622082 | T671A | 2 | possibly damaging |
LIG1 | GAI 876498 | P884R | s | damaging |
LIG3 | SNP000010631/rs1802880 | D592V | s | damaging |
LIG4 | rs2232640 | E461G | s | possibly damaging |
MLH1 | SNP000002820/rs1800149 | L729V | 1 | possibly damaging |
MLH1 | SNP000064598/rs2020873 | H718Y | 1 | possibly damaging |
NTHL1 | SNP000064449/rs1805378 | I176T | 1 | damaging |
NTHL1 | SNP001026567 | D239Y | 1 | damaging |
PCNA | GAI 864449 | Q38H | s | damaging |
PCNA | rs1050525 | S39R | s | damaging |
POLB | E0448_302 | P242R | 1 | possibly damaging |
RAD23A | rs2242518 | Q261R | s | damaging |
RAD50 | rs3187395 | E925K | s | possibly damaging |
RAD51 | rs1056742 | K313Q | s | damaging |
TOP2A | SNP000012935/rs1804539 | S1471F | s | possibly damaging |
WRN | SNP001026663/rs3087414 | S1079L | 1 | possibly damaging |
XRCC1 | SNP001026358/rs2307186 | R7L | 1 | possibly damaging |
XRCC1 | SNP000064197/rs25496 | V72A | 1 | possibly damaging |
XRCC1 | SNP001026365/rs2307191 | P161L | 1 | possibly damaging |
XRCC1 | SNP000000031/rs25489 | R280H | 1/2† | possibly damaging |
XRCC1 | rs2271980 | V381M | s | possibly damaging |
XRCC1 | SNP000000032/rs25487 | R399Q | 2 | possibly damaging |
XRCC3 | SNP000000060 | T241M | 2 | possibly damaging |
XRCC3 | SNP000064617/rs1805380 | L463F | 1 | possibly damaging |
XRCC3 | GAI 891410 | P485L | s | possibly damaging |
Note: SNP, rs, and GAI prefixes stand for the SNP identifiers in HGVbase, dbSNP, and CGAP-GAI databases, respectively.
*1: nsSNPs with minor allele frequencies of ≤5%, 2: nsSNPs with minor allele frequencies of >5%.
†nsSNPs whose minor allele frequencies were reported as ≤5% in some independent SNP submissions and as >5% in others.
Discussion
To enrich the SNP information for each gene studied, we used five different public SNP databases. While dbSNP and HGVbase contain SNP information for almost all kinds of genes, the SNP500, CGAP-GAI, and GeneSNP databases focus on candidate genes/pathways that may play a role in cancer susceptibility. The majority of the nsSNPs (97.33%) were found in the dbSNP, HGVbase, and GeneSNP databases. All nsSNPs reported here were curated using a highly stringent SNP extraction procedure to eliminate false annotations of the SNPs. Although such a stringent procedure reduces the sensitivity of SNP mining, we strongly suggest evaluating SNP information using the same or similar approaches described in this study to increase the specificity of the curated data.
From over 1000 entries in the SNP databases, we extracted a total of 150 nsSNPs resulting in an amino acid substitution in 51.1% (45 of 88) of the DNA repair genes analyzed in this study. The number of SNPs in these genes is likely to increase as more SNPs are discovered and the SNP databases continue to be updated. Several factors may lead to underestimation of the number of SNPs in genes of interest. For example, a considerable number of SNPs in these databases have not been validated to distinguish them from sequencing errors, and thus these nsSNPs represent “suspected” or “non-proven” SNPs. For suspected SNPs, which are described on the basis of DNA/RNA sequence alignments, there may be a bias toward genetic variations near the 3′ end of the transcripts as well as toward abundant transcripts, common variations, and variations in less complex regions of the genome (24–26). Therefore, sequencing the entire coding region of the genes of interest in a significant number of DNA samples may reveal additional SNPs in the genes. Sequencing might especially help to determine whether the genes found to have no nsSNPs in this study are really devoid of nsSNPs. This information could be useful for assessing the conservation status of the genes, or the different mutation/recombination rates at genomic regions containing the genes of interest (7, 8, 26).
Protein conservation analyses based on the alignment of similar proteins (either among species or within species) can reveal the amino acids that are important for the function and probably for the structure of the protein families. Although such analyses would not identify newly evolved critical amino acids with a particular function, or amino acids that are under positive selection under today's conditions, they may still be critical in identifying evolutionarily conserved residues along the proteins. SIFT (10, 11) is an automated tool that calculates the conservation scores of each amino acid residue along the given protein sequences. Originally, the prediction sensitivity of SIFT for damaging amino acid substitutions was found to be 69% (10). Our SIFT predictions reported in this paper differ in two respects from the approach of Ng and Henikoff (11). First, we modified the SIFT predictions by considering only predictions that are based on at least six protein sequences in the alignment at the amino acid position of interest. Second, Ng and Henikoff (11) did not accept any predictions whenever the median sequence conservation was >3.25. A median sequence conservation score >3.25 indicates that the proteins in the alignment have not yet diverged, so the predictions would not be as reliable as those obtained from an alignment of diverged proteins, in which conserved residues are more easily identified (11). However, considering that 19.13% of the mutations were also found with median sequence conservation scores of >3.25 (Table 1), we preferred to include such predictions in our results, stating only that they were either “possibly tolerated” or “possibly damaging.”
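The modified interpretation described above can be sketched as a small decision rule. The six-sequence minimum and the 3.25 median conservation threshold are taken from the text. The score cutoff of 0.05 separating damaging from tolerated substitutions is an assumption based on the commonly used SIFT default and is not stated in this paper; the function name and parameters are hypothetical.

```python
def classify_variant(sift_score, n_sequences, median_conservation,
                     score_cutoff=0.05, min_sequences=6,
                     conservation_cutoff=3.25):
    """Return the modified SIFT category for one amino acid substitution."""
    # Fewer than six aligned sequences at this position: no prediction (NP).
    if n_sequences < min_sequences:
        return "NP"
    # Assumed SIFT default: scores below the cutoff count as damaging.
    call = "damaging" if sift_score < score_cutoff else "tolerated"
    # Poorly diverged alignments give a qualified ("possibly") prediction.
    if median_conservation > conservation_cutoff:
        return "possibly " + call
    return call


# Hypothetical inputs, for illustration only.
print(classify_variant(0.01, 12, 3.00))
print(classify_variant(0.01, 12, 3.40))
print(classify_variant(0.40, 12, 3.40))
print(classify_variant(0.40, 4, 3.00))
```

Keeping the thresholds as parameters makes it easy to see how the five categories used in the tables (damaging, possibly damaging, tolerated, possibly tolerated, NP) follow from three inputs.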
The sensitivity of the modified SIFT prediction system was tested on a mutation set with experimentally determined functional consequences (see “Methods”). According to our results, approximately 57.39% of the mutations in our set occurred at amino acids that are conserved within the protein family (median sequence conservation score 2.75–3.25). On the other hand, 19.13% of the mutations occurred either in regions of proteins that are highly conserved, or in proteins for which homologous proteins were available only from closely related species (median sequence conservation score >3.25). Further analyses may be performed to investigate the latter possibility. The mutations that were not detected by SIFT as damaging could be those that occurred at query-specific functional residues, or variations in linkage disequilibrium with as yet unidentified causative mutations (10). As far as DNA repair genes are concerned, over one third of the nsSNPs were predicted to have functional consequences (i.e., found damaging or possibly damaging). Eleven DNA repair nsSNPs were found damaging, suggesting that they are excellent candidates for disease-predisposition studies. Another 28 nsSNPs were predicted as possibly damaging. We suggest that, along with the damaging nsSNPs, these possibly damaging nsSNPs may also be good candidates for functional and association studies.
We were not able to make predictions for 44 DNA repair nsSNPs, due to the lack of sufficient sequence information available from homologous proteins (<6 proteins in the alignment at the position of the nsSNPs). As these analyses are based on the availability of the similar proteins in the public databases, we believe that as the number of curated proteins increases in protein databases, the predictions will become possible for these nsSNPs, and the reliability of the predictions for other nsSNPs will also improve.
Classification of the proven (validated) nsSNPs based on allele frequencies showed that only 16.2% of the nsSNPs were present in the population(s) with an allele frequency of >5%, suggesting that most of the nsSNPs presented here are actually rare nsSNPs. These nsSNPs may be rare because they are either under negative selection, or newly evolved and thus not yet fixed in the population. None of the common nsSNPs investigated in this study were predicted to be damaging, whereas three of them were predicted to be possibly damaging (Table 3). We were unable to find any published reports regarding the analysis of the IGHMBP2-T671A variant, which was found to be possibly damaging in this study. The IGHMBP2 (immunoglobulin μ binding protein 2) protein is presumably involved in a variety of cellular functions such as immunoglobulin-class switching, pre-mRNA processing, and transcription, and mutations in this protein have been shown to result in a neurodegenerative disease (27). On the other hand, the XRCC1-R399Q and XRCC3-T241M variants have been intensively studied in the context of cancer association. The XRCC1-R399Q SNP was shown to be associated with altered breast (28, 29) and lung (30) cancer risk. XRCC3-T241M has also been shown to confer increased risk of breast cancer (footnote 14)
J. C. Figueiredo, J. A. Knight, L. Briollais, I. L. Andrulis, H. Ozcelik. Polymorphisms XRCC1-R399Q and XRCC3-T241M and the risk of breast cancer at the Ontario site of the breast cancer family registry, in press.
J. C. Figueiredo, N. Diaz-Granados, J. A. Knight, S. Savas, L. Briollais, H. Ozcelik. XRCC1-R399Q and XRCC3-T241M: a systematic review of biological importance and role in cancer, in preparation.
Mutations that reduce the fitness of individuals will be subject to purifying selection, which eventually eliminates them from the gene pool of a population, so they never reach high frequencies (38) unless they confer a selective advantage because of disease resistance in their carriers (39). Therefore, we analyzed the common and rare DNA repair nsSNPs for their conservation status. We could not detect any statistically significant difference (P > 0.05, Table 3). Thus, it is tempting to speculate that some deleterious nsSNPs with moderate to high frequencies do not reduce the fitness of individuals. In this context, the nature of proteins with such deleterious variations can be explained if (a) the protein's function can be compensated by other proteins, (b) the protein's function is required only under certain environmental exposures/conditions, or (c) the protein is rapidly evolving and thus accumulates more mutations without affecting the fitness of the individual. Alternatively, these new substitutions may be either neutral or even positively selected. Analysis of a much larger data set will be helpful to fully characterize the relation between frequency and conservation status of genetic variations.
Genetic variation has been suggested to alter disease-susceptibility risk. SNPs, being the most common variation in the human genome, have been extensively studied in the context of disease predisposition. SNPs that alter important molecular features such as the expression, function, structure, stability, and interaction of candidate proteins are excellent candidates for studying a possible association between, or direct involvement of, a SNP and a phenotype. However, both the enormous number of SNPs and the search for biologically relevant SNPs in candidate gene approaches require the application of reliable and logical selection systems. Here we have presented results obtained using a highly stringent SNP mining strategy and a modified interpretation of the previously developed SIFT tool to select DNA repair nsSNPs that alter amino acids conserved within the protein family. Our results suggest that more than one third of the nsSNPs in the DNA repair genes are likely to have functional consequences. These nsSNPs are excellent candidates for cancer association as well as for experimental functional studies. In addition, these genetic variations are likely to be critical in studies aiming to elucidate the disparity in cancer-treatment responses among patients as well as to improve the effectiveness of cancer treatments (40).
Grant support: Grant (BCTR0100627) from Susan Komen Breast Cancer Foundation, USA and “CIHR Strategic Training Program Grant—The Samuel Lunenfeld Research Institute Training Program: Applying Genomics to Human Health” fellowship (S. Savas).
The costs of publication of this article were defrayed in part by the payment of page charges. This article must therefore be hereby marked advertisement in accordance with 18 U.S.C. Section 1734 solely to indicate this fact.
Acknowledgments
We thank the groups that have developed the databases and the web-based tools used in this study. We are indebted to Michael Edmenson and Pauline Ng for their invaluable assistance with the BLAST against gene transcripts and SIFT tools, respectively.