Abstract
Researchers have developed a machine learning method that could help advance research on tumorigenesis. Using large databases of human tumors, the team developed machine learning models that can identify driver and passenger mutations in specific cancer genes and determine the location and key features of cancer drivers.
Using large databases of human tumors, a team of researchers recently developed machine learning models that can identify driver and passenger mutations in specific cancer genes and determine the location and key features of cancer drivers (Nature 2021;596:428–32). The researchers used this information to compose blueprints of potential driver mutations in cancer genes that could improve understanding of tumorigenesis.
Núria López-Bigas, PhD, ICREA research professor at the Institute for Research in Biomedicine in Barcelona, Spain, and her team have been identifying somatic mutations in cancer genomes stored in large databases and using this information to create a compendium of cancer genes (Nat Rev Cancer 2020;20:555–72). However, they have been unable to differentiate driver and passenger mutations or learn more about the alterations. Differentiating the mutations and determining how often they occur and where they are located on a gene “are key to understanding the role of these genes in cancer,” López-Bigas says. “We'd like to identify which mutations in [a] gene drive tumorigenesis and learn which features define a driver mutation in that gene and that tissue.”
To this end, the team analyzed somatic mutations in 28,000 human tumors across 66 cancer types and used the mutations to identify 568 cancer genes involved in tumorigenesis. “We wanted to learn directly from human tumor data because they are natural experiments of tumorigenic mutations run many times,” López-Bigas explains. The researchers then built 185 tissue-specific machine learning models that were trained on sets of mutations containing either mostly driver or mostly passenger mutations. They found that the models, which together form their “BoostDM” method of in silico saturation mutagenesis, could differentiate driver and passenger mutations in specific cancer genes and tissue types.
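For readers unfamiliar with the term, the "saturation" step of in silico saturation mutagenesis can be illustrated with a minimal Python sketch: enumerate every possible single-nucleotide substitution in a coding sequence and record its protein-level consequence. This is not the BoostDM code. The function name, the output fields, and the nine-base toy sequence are all assumptions made for illustration, and the trained tissue-specific models that would then score each candidate as a likely driver or passenger are not shown.

```python
from itertools import product

# Standard genetic code, with codons ordered T, C, A, G at each position.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def saturate(cds):
    """Yield every single-nucleotide substitution in a coding sequence.

    `cds` must contain only A, C, G, T and have a length divisible by 3.
    Hypothetical helper for illustration; not part of any published tool.
    """
    cds = cds.upper()
    if len(cds) % 3 != 0:
        raise ValueError("coding sequence length must be a multiple of 3")
    for pos, ref in enumerate(cds):
        offset = pos % 3                      # position within the codon
        codon_start = pos - offset
        ref_codon = cds[codon_start:codon_start + 3]
        ref_aa = CODON_TABLE[ref_codon]
        for alt in BASES:
            if alt == ref:
                continue
            alt_codon = ref_codon[:offset] + alt + ref_codon[offset + 1:]
            alt_aa = CODON_TABLE[alt_codon]
            if alt_aa == ref_aa:
                consequence = "synonymous"
            elif alt_aa == "*":
                consequence = "nonsense"
            elif ref_aa == "*":
                consequence = "stop_lost"
            else:
                consequence = "missense"
            yield {
                "nt_pos": pos + 1,            # 1-based nucleotide position
                "ref": ref,
                "alt": alt,
                "aa_pos": pos // 3 + 1,       # 1-based residue position
                "ref_aa": ref_aa,
                "alt_aa": alt_aa,
                "consequence": consequence,
            }


# Toy sequence (Met-Gly-Gly), invented for this example; not a real gene.
TOY_CDS = "ATGGGTGGC"
candidates = list(saturate(TOY_CDS))
```

Each position yields three alternative bases, so a coding sequence of n nucleotides produces 3n candidates. In an approach like the one described above, a model trained for a given gene and tissue would be applied to every candidate in this list, not only to mutations already observed in tumors.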
The models were also able to classify rare mutations in oncogenes—and the models for TP53, KRAS, NRAS, HRAS, and PTEN performed better than experimental saturation mutagenesis assays and existing computational methods. The models also pinpointed the most important features of driver mutations and where mutations were located on the genes—enabling the researchers to create gene blueprints that map “hotspots” where driver mutations are clustered. These hotspots may reveal more about how the mechanisms of tumorigenesis differ for the same gene in different tissues. For example, the models indicated that EGFR driver mutations are clustered in different places for lung cancer and glioblastoma.
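One simple way to picture a hotspot map is as a sliding-window count of predicted driver mutations along the protein. The Python sketch below is an assumption about how such clustering could be summarized, not the procedure the researchers used to draw their blueprints. The function names, window size, threshold, and residue positions are all hypothetical, and the two toy "tissues" do not represent EGFR or any real data.

```python
from collections import Counter


def driver_density(driver_positions, protein_length, window=5):
    """Count predicted drivers in a window centred on each residue (1-based)."""
    counts = Counter(driver_positions)
    half = window // 2
    density = []
    for residue in range(1, protein_length + 1):
        lo = max(1, residue - half)
        hi = min(protein_length, residue + half)
        density.append(sum(counts[p] for p in range(lo, hi + 1)))
    return density


def hotspots(density, threshold):
    """Return 1-based (start, end) residue ranges where density >= threshold."""
    regions = []
    start = None
    for residue, value in enumerate(density, start=1):
        if value >= threshold and start is None:
            start = residue                       # a hotspot opens here
        elif value < threshold and start is not None:
            regions.append((start, residue - 1))  # the hotspot closed
            start = None
    if start is not None:                         # hotspot runs to the end
        regions.append((start, len(density)))
    return regions


# Invented residue positions of predicted drivers for the same toy gene
# in two tissues; a position is repeated once per distinct mutation there.
tissue_a = [12, 12, 13, 12, 13]
tissue_b = [40, 41, 41, 42]

hotspots_a = hotspots(driver_density(tissue_a, 50, window=3), threshold=3)
hotspots_b = hotspots(driver_density(tissue_b, 50, window=3), threshold=3)
```

With these toy inputs the two tissues produce hotspots in different parts of the same gene, which is the kind of tissue-specific contrast the blueprints are meant to surface.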
“What I see as most important is the demonstration that if we have enough data, good quality models can be created with this type of approach,” López-Bigas says, adding that she wants to build additional models as more genomic information becomes available. She also aims to add context to the models beyond tissue type—for example, the treatments patients received and whether mutations co-occur.
López-Bigas and her team have made the data and models freely available to researchers at https://www.intogen.org/boostdm/search. They also used what they learned from BoostDM to update Cancer Genome Interpreter, a tool they previously developed to assist in interpreting mutations of uncertain significance in cancer genes (https://www.cancergenomeinterpreter.org/home).
“I think it's a great advance—it's a foundational proof-of-concept study,” says Samuel Aparicio, PhD, of The University of British Columbia in Canada, who was not involved in the work. Historically, Aparicio says, researchers have lacked understanding of the genes and mutational processes that give rise to cancer. “This method gives us some idea of whether there are specific features associated with specific genes,” he adds. “It broadens our knowledge, potentially, of what constitutes a cancer driver.”
However, more work is needed to fully understand the function of the driver mutations identified in the study. For example, Aparicio envisions follow-up experiments in which researchers alter the genes containing these mutations in different ways and analyze the outcomes. He is also curious to know whether the approach can be applied to other aspects of the genome, such as noncoding sequences or epigenetic variations. “It's a little less clear exactly how one would extend this into those spaces, but there are many creative ways to think about it.” –Catherine Caruso
For more news on cancer research, visit Cancer Discovery online at http://cancerdiscovery.aacrjournals.org/CDNews.