Distinguishing driver from passenger somatic mutations, and thereby pinpointing the genetic alterations that lead to oncogenesis, remains a significant challenge. To meet this challenge, computational tools have been developed as filters that prune the bulk of somatic mutations down to a shortlist of high-priority, functional candidates for experimental validation. Most tools search for genes or pathways with mutation rates higher than expected by chance, mutations in conserved regions, or genes with neighboring mutations on the linear DNA or protein sequence. Recently, there has been a shift toward using tertiary/quaternary protein structures to identify mutations that cluster close to each other in 3D space. Such enrichment of mutations can indicate specific domains that are critical to normal protein function and that, when mutated, can drive tumor initiation and progression. HotSpot3D, a protein structure-based tool, identifies clusters enriched with proximal mutations within proteins. Though HotSpot3D has been valuable in identifying clusters of residues that are important to cancer, it does not distinguish the driving potential or structural impact of different mutations within a cluster, nor does it consider the physical impact of different amino acid substitutions at the same site. The power of HotSpot3D to distinguish driver mutations from passenger mutations can be improved if spatial clustering also considers the physical/biological features proximal to mutations in significant clusters, as well as the specific amino acid substitutions involved.
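To illustrate why 3D proximity can reveal clusters that the linear sequence hides, the following is a minimal sketch of distance-based grouping of mutated residues. It is not HotSpot3D's actual algorithm: the function name `proximal_clusters`, the single-linkage grouping, the distance cutoff, and the toy coordinates are all assumptions made for illustration.

```python
from itertools import combinations
from math import dist


def proximal_clusters(coords, cutoff=10.0):
    """Group mutated residues that lie within `cutoff` angstroms of each other.

    coords: dict mapping residue position -> (x, y, z) coordinates.
    Returns sorted clusters (single-linkage connected components) that
    contain at least two residues. Hypothetical helper, not HotSpot3D code.
    """
    parent = {residue: residue for residue in coords}

    def find(residue):
        # Follow parent links to the cluster root, compressing the path.
        while parent[residue] != residue:
            parent[residue] = parent[parent[residue]]
            residue = parent[residue]
        return residue

    # Merge every pair of residues closer than the cutoff.
    for a, b in combinations(coords, 2):
        if dist(coords[a], coords[b]) <= cutoff:
            parent[find(a)] = find(b)

    groups = {}
    for residue in coords:
        groups.setdefault(find(residue), []).append(residue)
    return sorted(sorted(g) for g in groups.values() if len(g) > 1)


# Made-up coordinates: residues 12, 61 and 146 are far apart in sequence
# but close in space; residue 300 is distant from all of them.
toy = {
    12: (0.0, 0.0, 0.0),
    61: (3.0, 4.0, 0.0),
    146: (6.0, 8.0, 0.0),
    300: (50.0, 50.0, 50.0),
}
clusters = proximal_clusters(toy, cutoff=6.0)
print(clusters)
```

In this toy case the three spatially close residues form one cluster even though their sequence positions are far apart, which a purely linear neighborhood search would miss.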
We have created a machine learning algorithm that further prioritizes putative driver mutations found in HotSpot3D clusters by incorporating structural/biological features such as proximity of mutations to functional sites (active sites, phosphorylation sites, disulfide bonds, etc.), solvent accessibility, physicochemical property changes caused by mutations, free energy changes caused by mutations, conservation of residue sites, secondary structure state of residue sites, and expression/phosphorylation changes in samples containing the mutations. We have curated experimentally validated mutations identified as neutral or oncogenic from various databases to serve as our training sets. The algorithm can be trained on the curated mutations within various protein subclasses (homologous proteins, oncogenes, tumor suppressors, etc.) to identify, for each subclass, the distinct structural feature signatures specific to driver mutations. This tool will aid in revealing putative driver mutations in genes not previously linked with cancer and will help pinpoint which mutations in known cancer genes are driving the disease. Specifically, we are interested in applying the algorithm to druggable protein families such as G protein-coupled receptors, kinases, and nuclear hormone receptors to better understand their role in tumor initiation and progression.
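The feature-based prioritization described above can be sketched as a small supervised classifier. The abstract does not specify the model, so this example substitutes plain logistic regression trained by gradient descent; the three features, their scaling to [0, 1], the function names, and the toy training rows are all hypothetical and are not the curated training data.

```python
from math import exp

# Hypothetical feature order; each value is assumed to be scaled to [0, 1].
FEATURES = ("dist_to_functional_site", "solvent_accessibility", "conservation")


def sigmoid(z):
    return 1.0 / (1.0 + exp(-z))


def train(rows, labels, lr=0.5, epochs=2000):
    """Fit logistic regression weights with batch gradient descent."""
    weights = [0.0] * len(FEATURES)
    bias = 0.0
    n = len(rows)
    for _ in range(epochs):
        grad_w = [0.0] * len(FEATURES)
        grad_b = 0.0
        for x, y in zip(rows, labels):
            # Prediction error for this mutation (predicted minus true label).
            err = sigmoid(sum(w * v for w, v in zip(weights, x)) + bias) - y
            for i, v in enumerate(x):
                grad_w[i] += err * v
            grad_b += err
        weights = [w - lr * g / n for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / n
    return weights, bias


def driver_score(x, weights, bias):
    """Return a score in (0, 1); higher means more driver-like."""
    return sigmoid(sum(w * v for w, v in zip(weights, x)) + bias)


# Toy rows (illustrative only): label 1 = oncogenic, 0 = neutral.
rows = [(0.1, 0.2, 0.9), (0.2, 0.1, 0.8), (0.9, 0.8, 0.2), (0.8, 0.9, 0.1)]
labels = [1, 1, 0, 0]
weights, bias = train(rows, labels)

# Score two unseen, made-up mutations from the same cluster.
near_site_conserved = driver_score((0.15, 0.15, 0.85), weights, bias)
far_exposed_variable = driver_score((0.85, 0.85, 0.15), weights, bias)
print(near_site_conserved, far_exposed_variable)
```

Training a separate model per protein subclass, as the text suggests, would amount to calling `train` on each subclass's curated mutations and comparing the resulting weight vectors as that subclass's feature signature.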

Citation Format: Sohini Sengupta, Adam Scott, Amila Weerasinghe, Dan C. Zhou, Matthew A. Wyczalkowski, Reyka G. Jayasinghe, Ken Chen, Gordon Mills, Mike C. Wendl, John Dipersio, Li Ding. Utilizing biological and protein structure-guided features to improve driver mutation discovery [abstract]. In: Proceedings of the American Association for Cancer Research Annual Meeting 2018; 2018 Apr 14-18; Chicago, IL. Philadelphia (PA): AACR; Cancer Res 2018;78(13 Suppl):Abstract nr 2357.