Cancer is a remarkably adaptable and formidable foe. It exploits many biological mechanisms to confuse and subvert normal physiologic and cellular processes, to adapt to therapies, and to evade the immune system. Decades of research and significant national and international investments have dramatically increased our knowledge of the disease. That knowledge has improved cancer diagnosis, treatment, and management, and with them the outcomes for many patients.
For example, in melanoma, the V600E mutation in the BRAF gene is now targetable by a specific therapy. BRAF is a serine/threonine protein kinase that activates the MAP kinase (MAPK)/ERK signaling pathway, and BRAF inhibitors such as vemurafenib and dabrafenib, as well as MEK inhibitors, have shown dramatic responses in patients carrying the mutation. However, even these successes have led to new questions. Colorectal cancers carrying the same mutation are resistant to BRAF inhibitors, suggesting that the mutation interacts in a complex way with other elements in the cell and that those interactions may be cell lineage dependent. Exploring mechanisms of resistance to these inhibitors has led to a better understanding of the MAPK/ERK signaling pathway, which in turn has led to the identification of new potential therapeutic targets.
The accelerating progress in cancer research has been driven by rapid developments in technology. We have seen profound advances in sequencing technology, in techniques for assaying proteins and metabolites, in imaging capabilities, and in establishing electronic health records. At the same time, advances in mobile computing, pervasive availability of the Internet, and social media have opened new possibilities for understanding the factors that contribute to cancer, as well as the outcomes of treatment, at a population scale across the country and even the world. But we often lose sight of the fact that all of these technologies produce only one thing: data. This unprecedented influx of data has advanced our understanding only because of parallel advances in data management, analysis, and interpretation, all of which are increasingly recognized as essential elements of an integrated cancer research program.
Indeed, several national initiatives have raised awareness of the need for robust investment both in cancer informatics development and in training the next generation of cancer data scientists. The Beau Biden Cancer Moonshot (https://www.cancer.gov/brp) identified enhanced data sharing as one of three key elements necessary to accelerate cancer research. The Precision Medicine Initiative highlighted the importance of data-driven cancer research, translational research, and its application to decision making in cancer treatment (https://www.cancer.gov/research/key-initiatives/precision-medicine). And the National Strategic Computing Initiative highlighted the importance of computing as a national competitive asset and included a focus on applying computing in biomedical research. Articles in the mainstream media, such as that by Siddhartha Mukherjee in the New Yorker in April of 2017 (http://www.newyorker.com/magazine/2017/04/03/ai-versus-md), have emphasized the growing importance of computing, machine learning, and data in biomedicine.
The NCI (Rockville, MD) recognized the need to invest in informatics. In 2011, it established a funding opportunity, Informatics Technology for Cancer Research (ITCR) (https://itcr.cancer.gov), designed to support new algorithms, new methodologies, and the maturation of tools and techniques necessary to harness the power of data and computation for cancer researchers. Since its inception, the ITCR has funded 49 applications that support cancer informatics in areas that include DNA sequence analysis and interpretation, extraction of information from clinical records, systems biology and network medicine, proteomics, metabolomics, emerging fields such as radiomics, and application of machine learning and artificial intelligence to a host of research problems. A guiding principle of the ITCR program is that projects address relevant needs in cancer research so that resultant algorithms and tools provide value not only to data scientists, but also to the broader community of basic, clinical, and translational scientists.
In addition to the ITCR, the NCI also launched the Cancer Genomics Cloud Pilot (CGCP) program to address problems associated with the scope and scale of modern cancer data. Although a single human genome sequence can be represented in about 3 gigabytes of disk space (roughly 3 billion bases at one byte per base), the terabytes and petabytes of information that modern cancer studies generate make transporting and replicating data across thousands of research laboratories infeasible. The CGCP was designed to take advantage of modern, robust, scalable cloud computing technologies by storing data in a commercial cloud infrastructure and allowing cancer researchers to bring their methods to the data to perform analyses.
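To give a rough sense of why moving data at this scale is impractical, the following back-of-the-envelope sketch estimates how long a single copy of a dataset would take to transfer. The 1 gigabit-per-second sustained link speed is an assumption chosen for illustration, not a figure from any NCI program, and the function name is hypothetical; sizes use decimal units.

```python
def transfer_days(size_bytes: float, link_bits_per_second: float) -> float:
    """Days needed to move size_bytes over a link with the given sustained throughput."""
    seconds = size_bytes * 8 / link_bits_per_second
    return seconds / 86_400  # seconds per day


TERABYTE = 10**12            # bytes, decimal units
PETABYTE = 10**15            # bytes, decimal units
GIGABIT_PER_SECOND = 10**9   # assumed sustained rate, for illustration only

for label, size in [("1 TB", TERABYTE), ("1 PB", PETABYTE)]:
    days = transfer_days(size, GIGABIT_PER_SECOND)
    print(f"{label}: {days:.2f} days ({days * 24:.1f} hours)")
```

Under these assumptions a terabyte moves in a couple of hours, but a petabyte takes roughly three months per copy, before accounting for storage costs or repeating the transfer at each laboratory. Bringing analysis methods to data held in one place avoids that multiplication.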
Advancing our understanding of cancer also requires that we share data and establish cohorts that are large enough to draw meaningful conclusions. The NCI established the Genomic Data Commons (GDC, https://gdc.cancer.gov), positioning it as a central resource for sharing genomic, imaging, proteomic, and phenotype (clinical data for human specimens) information. The GDC includes defined data standards (https://gdc.cancer.gov/about-data/data-standards) to help assure consistent, harmonized access to data together with well-characterized primary analyses for various types of genomic data. This includes whole-genome sequencing, whole-exome sequencing, deep targeted sequencing, RNA-seq, methyl-seq, and other sequence-centric datasets. The GDC went live in June of 2016 and currently provides centralized access to large genomically focused datasets, including The Cancer Genome Atlas (TCGA) and TARGET. In addition, sequencing data from 18,000 FoundationOne tests, the Multiple Myeloma Research Foundation Compass study, and the AACR Project GENIE dataset will soon be available through the GDC.
Together, the ITCR, GDC, and CGCP programs represent an investment in the future of data-driven cancer research. This special issue of Cancer Research is designed to highlight some of the resources and discoveries that have been made possible by these NCI programs. Most of the articles appearing here are short “application notes” introducing a tool or resource and providing a short vignette demonstrating its application. Each research team was also asked to provide a brief video that could be included online to either provide more background or to serve as a brief tutorial. In addition, research articles demonstrate the utility of these tools in gaining new insight into cancer.
We hope that these collected works provide readers of Cancer Research with new tools that they can incorporate into their work, either to explore existing public datasets such as TCGA or to analyze data that they are generating. We also hope that these articles provide incentive for broader collaboration between cancer data scientists and laboratory, translational, and clinical scientists. Cancer is a complex disease, and conquering it will require bringing all our collective skills to bear. Further, we expect that this issue will be the beginning of many more computational resource papers that will be published and highlighted in future issues of Cancer Research.
Disclosure of Potential Conflicts of Interest
J. Quackenbush is the co-founder and former board chair at Genospace, LLC. No potential conflicts of interest were disclosed by the other authors.