Risk of cancer is a complex function of exposures (external and endogenous) and genetic susceptibility to these exposures. For each exposure, the function of multiple pathways may be relevant. Because a pathway can include dozens of genes, and most genes have multiple variant single nucleotide polymorphisms (SNPs), hundreds of SNPs are candidate susceptibility factors. Discerning which variants affect function and cancer susceptibility is a major challenge. The goal of this study was to identify polymorphisms in base excision repair (BER) and antioxidant (AO) genes that affect the amount of DNA damage in untreated cells, primarily damage from oxidative metabolism, and the ability to repair damage induced by exposure to ionizing radiation (IR). Using the alkaline Comet assay, we quantified single strand breaks and abasic sites in untreated cells, and in cells immediately after and 15 minutes after exposure to 5 Gy IR, in 80 lymphoblastoid cell lines of the DNA Polymorphism Discovery Resource (DPDR). Comet distributed moment was the parameter used to quantify damage. Genotypes of the cell lines, known from extensive resequencing studies of the DPDR, included 174 variants of 34 BER genes (159 amino acid substitution SNPs and 15 SNPs in ∼150 bp of upstream sequence) and 17 amino acid substitution SNPs of 5 AO genes. Random Forests Regression (RFR) was used to identify SNPs that were important for predicting two phenotypes: endogenous (background) damage and the percentage of damage repaired 15 min after IR. Using a stringent cutoff for importance, 5 SNPs from 4 genes were most influential for background damage (from greatest impact to least: POLD1 Arg19His, XRCC1 Arg399Gln, XRCC1 Val72Ala, LIG3 Arg780His, RPA4 Ala33Thr) and 7 SNPs were most influential for repair of IR damage (RPA4 Ala33Thr, XRCC1 Val72Ala, MPG 5′ UTR, PNKP Pro20Ser, RAD23B Ala249Val, POLE Phe695Ile, LPO Thr105Ile).
Notably, the frequencies of these important SNPs range from 0.01 to 0.42, and the SIFT scores of these variants independently predict an impact on protein function. In summary, RFR efficiently identified 10 distinct polymorphisms out of 191 (RPA4 Ala33Thr and XRCC1 Val72Ala were influential for both phenotypes) that help predict two phenotypes relevant to cancer risk. This study demonstrates that heuristic machine learning techniques can identify functionally important genetic variation by analyzing relationships between phenotypes and complex genotypes. This work was performed under the auspices of the U.S. Department of Energy by the University of California, Lawrence Livermore National Laboratory under Contract No. W-7405-Eng-48.
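
The counts reported above can be reconciled with a short tally. The sketch below is only an arithmetic check on the figures and SNP names given in the abstract; the variable names are illustrative and are not part of the study's analysis, and the code does not reproduce the Random Forests Regression itself.

```python
# SNP lists as reported in the abstract, ordered from greatest to least impact.
background_damage = [
    "POLD1 Arg19His",
    "XRCC1 Arg399Gln",
    "XRCC1 Val72Ala",
    "LIG3 Arg780His",
    "RPA4 Ala33Thr",
]
ir_repair = [
    "RPA4 Ala33Thr",
    "XRCC1 Val72Ala",
    "MPG 5' UTR",
    "PNKP Pro20Ser",
    "RAD23B Ala249Val",
    "POLE Phe695Ile",
    "LPO Thr105Ile",
]

# Variant counts from the abstract: BER = coding SNPs + upstream SNPs; AO = coding SNPs.
ber_variants = 159 + 15
ao_variants = 17
total_variants = ber_variants + ao_variants

# SNPs influential for both phenotypes, and the distinct SNPs across both lists.
shared = sorted(set(background_damage) & set(ir_repair))
distinct = sorted(set(background_damage) | set(ir_repair))

# Gene symbol is the first token of each SNP label.
genes_background = sorted({snp.split()[0] for snp in background_damage})

print("total variants screened:", total_variants)
print("SNPs shared by both phenotypes:", shared)
print("distinct influential SNPs:", len(distinct))
print("genes behind background-damage SNPs:", genes_background)
```

The two lists contain 12 entries in total, but two SNPs appear in both, which is why the summary count is 10 of 191.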

[Proc Amer Assoc Cancer Res, Volume 45, 2004]