Since 2014, the NCI has launched a series of data commons as part of the Cancer Research Data Commons (CRDC) ecosystem housing genomic, proteomic, imaging, and clinical data to support cancer research and promote data sharing of NCI-funded studies. This review describes each data commons (Genomic Data Commons, Proteomic Data Commons, Integrated Canine Data Commons, Cancer Data Service, Imaging Data Commons, and Clinical and Translational Data Commons), including their unique and shared features, accomplishments, and challenges. Also discussed is how the CRDC data commons implement Findable, Accessible, Interoperable, Reusable (FAIR) principles and promote data sharing in support of the new NIH Data Management and Sharing Policy.

See related articles by Brady et al., p. 1384, Pot et al., p. 1396, and Kim et al., p. 1404

The completion of the Human Genome Project in 2003 ushered in an unprecedented era of growth and discovery in individualized medicine. As the nation's preeminent driver of cancer research, the NCI has been at the forefront of precision medicine funding and research, thereby generating petabytes of genomic, transcriptomic, epigenomic, proteomic, and imaging data. To maximize the government's investment in cancer research, the NCI has developed a series of cloud-based Data Commons (DC), known collectively as the Cancer Research Data Commons (CRDC), to collect, analyze, and share data from NIH-funded biomedical research and clinical studies. Data commons collocate data, storage, and computing infrastructure with core services and commonly used tools and applications for managing, analyzing, and sharing data to create an interoperable resource for the research community (1). The CRDC is a key component of the national cancer data ecosystem that was developed to enable stakeholders across the cancer research and care continuum to contribute and exchange data as part of a learning health care system. In contrast to more traditional download-centric data repositories, the CRDC collocates data with highly scalable cloud computing infrastructure and analysis tools, thereby making it possible to share data without the need for large data downloads, which can present a resource challenge to many researchers.

CRDC intends to serve the entire cancer research community. It serves data submitters by providing submission guides, templates, and data dictionaries; developers by sharing source code in GitHub repositories; and data users by improving user interfaces and providing multiple data access mechanisms, such as downloading, online analysis, and cloud computing, to meet the needs of a spectrum of user groups such as clinicians, oncologists, and informatics experts. The DCs have grown organically to meet the needs of the research community. The Genomic Data Commons (GDC) was the first DC in CRDC, focusing on sharing data from major genomic studies such as The Cancer Genome Atlas (TCGA). This was followed by the Proteomic Data Commons (PDC) to share data from proteomic studies such as the Clinical Proteomic Tumor Analysis Consortium (CPTAC). Data type–specific DCs provide a unique set of tools and features to meet the unique needs of a given data type. With the growing number of DCs going live, there is a concerted effort to centralize shared DC services such as data standards, data models, indexing, and system security to improve overall efficiency and interoperability within CRDC (2, 3). In this review, we introduce each DC by highlighting its accomplishments, data, tools, and challenges, followed by cross-cutting topics and interoperability. We also discuss the impact of NIH's newly implemented Data Management and Sharing Policy on the CRDC.

Below we introduce each DC, including available data types and tools, and highlight accomplishments and challenges. Key features of each DC are summarized in Table 1. For each DC, there is a dedicated Tools section describing a set of specialized tools and web resources. We also provide Supplementary Data (Supplementary Table S1), listing all tools and web resources for all DCs.

Table 1.

Key features of each data commons.


The GDC (https://portal.gdc.cancer.gov/; ref. 4) project began in May 2014 and launched in June 2016. To date, the GDC has released 8.83 petabytes (PB) of data from 44K+ participants and 22 programs. GDC's user base has grown over the last five years from an average of 40K+ unique visitors per month in 2018 to 70K+ unique visitors per month in 2022, spanning more than 90 countries.

Accomplishments

Notable accomplishments include:

  • (i) Provided uniform workflows supporting DNA, RNA, and miRNA alignments against a common reference genome (GRCh38)

  • (ii) Provided standardized analytic pipelines that generate point mutations, small indels, DNA structural variations, and DNA copy-number changes. The data processed through these workflows and pipelines are harmonized, enabling cross-study analyses

  • (iii) Provided data access tools such as the GDC Data Portal for interactively exploring and accessing data, the GDC Data Transfer Tool (DTT) for downloading large data sets, and the GDC Application Programming Interface (API) for programmatic access

  • (iv) Developed a submission system for uploading data to the GDC using data standards defined in the GDC Data Dictionary, which maintains 700+ clinical, biospecimen, and molecular properties

  • (v) Provided access to information and supplementary files from publications associated with NCI programs for which data is maintained in the GDC

Additional information on the GDC is available on the GDC documentation site (https://docs.gdc.cancer.gov/).

Data

The GDC works closely with experts in the cancer research community to uniformly process raw sequence data and apply state-of-the-art methods for generating higher level data. Both raw and higher level data are available via the GDC Application Programming Interface (API) and data portal. Examples of raw sequencing data include Binary Alignment Map (BAM) files from whole-genome sequencing (WGS), whole-exome sequencing (WXS), bulk RNA sequencing (RNA-seq), single-cell RNA-seq (scRNA-seq) and miRNA sequencing (miRNA-seq) platforms. Examples of higher level data include raw variant calls (Variant Call Format, VCF), masked somatic variant calls (Mutation Annotation Format, MAF), DNA structural variations, DNA copy-number variations, gene expression quantifications, splice junction quantifications, and transcript fusions. GDC also hosts methylation array data, slide image data, as well as associated clinical and biospecimen data.
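As an illustration of the programmatic access mentioned above, the following is a minimal sketch of how a request to the GDC API `files` endpoint might be assembled in Python. The helper names (`build_gdc_file_params`, `build_gdc_file_url`), the chosen project, and the selected fields are illustrative assumptions, not part of the GDC documentation; the GDC documentation site remains the authoritative reference for filter syntax and field names. The sketch only builds the request and does not contact the service.

```python
import json
from urllib.parse import urlencode

# Public GDC API endpoint for file metadata queries.
GDC_FILES_ENDPOINT = "https://api.gdc.cancer.gov/files"


def build_gdc_file_params(project_id, data_format, size=10):
    """Build query parameters selecting open-access files of one format in one project.

    The helper name and defaults are hypothetical; the nested "op"/"content"
    structure follows the filter format described in the GDC API documentation.
    """
    filters = {
        "op": "and",
        "content": [
            {"op": "in", "content": {"field": "cases.project.project_id", "value": [project_id]}},
            {"op": "in", "content": {"field": "data_format", "value": [data_format]}},
            {"op": "in", "content": {"field": "access", "value": ["open"]}},
        ],
    }
    return {
        "filters": json.dumps(filters),  # filters are passed as a JSON string
        "fields": "file_id,file_name,data_category,file_size",
        "format": "JSON",
        "size": str(size),
    }


def build_gdc_file_url(project_id, data_format, size=10):
    """Return the full GET URL for the query built above."""
    return GDC_FILES_ENDPOINT + "?" + urlencode(build_gdc_file_params(project_id, data_format, size))


if __name__ == "__main__":
    # Example: masked somatic mutation (MAF) files from one TCGA project.
    print(build_gdc_file_url("TCGA-BRCA", "MAF"))
```

The resulting URL can be passed to any HTTP client. Controlled-access files would additionally require an authentication token, which this sketch does not cover.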

Tools

Data made available through the GDC allows researchers and clinicians to study how genomic features affect clinical outcomes using the following approaches:

  • (i) Researchers can download data from GDC to perform scientific use case-driven analysis or use the web-based analysis tools to view mutation frequency by cancer types, plot high-impact mutations, visualize mutations for protein-coding regions, perform survival analysis, perform cohort comparisons and much more

  • (ii) Researchers can analyze GDC data made available in the cloud using tools provided by the Cloud Resources (5)

Highlights and challenges

The GDC is currently in a redesign phase with an emphasis on a cohort-centric design that will allow researchers to define custom cohorts for use across analysis tools. Key updates of the redesign will include an Analysis Tool Framework (ATF) for application interoperability, tool modularization, and new scientific analysis tools. These updates will allow third party scientific analysis tools to operate within the GDC thus expanding the cancer knowledge network. In addition to the redesign, the GDC plans to provide additional workflows for WGS data.

While the current GDC data model supports longitudinal data, challenges include improving the ability to explore and analyze longitudinal data using the GDC Data Portal.

The PDC (https://proteomic.datacommons.cancer.gov/pdc/; ref. 6) project began in September 2017 and launched in March 2020 with the goal of providing open access to cancer-related proteomic datasets. The PDC also facilitates connections to complementary multiomic datasets (genomics and imaging data), all of which are derived from accompanying samples. The PDC primarily hosts mass spectrometry–based proteomic data generated from large consortia such as CPTAC, International Cancer Proteogenomics Consortium (ICPC), and Applied Proteogenomics Organizational Learning and Outcomes (APOLLO). Since launch, the PDC has released approximately 37 TB of data from more than 3,000 participants and 130+ studies. Data sets include proteome, phosphoproteome, glycoproteome, acetylome, and ubiquitylome data using data-dependent acquisition (DDA) or data-independent acquisition (DIA) mass spectrometry–based approaches (7, 8), including links to accompanying genomic and imaging data. The PDC consistently attracts an average of 5,000 users per month across more than 150 countries.

Accomplishments

Notable accomplishments include:

  • (i) Established links to external resources such as GDC, Imaging Data Commons (IDC), the Cancer Imaging Archive (TCIA), and the database of Genotypes and Phenotypes (dbGaP), providing convenient access to complementary omics data for individual studies and cases within multiomic programs

  • (ii) Created a searchable publication page showcasing studies featuring PDC data, complete with links to related studies and Supplementary Data

  • (iii) Developed a dedicated Pan-Cancer Analysis Page, providing easy access to publications, data, and supplementary materials from the CPTAC (https://proteomics.cancer.gov/programs/cptac) programs' comprehensive proteogenomic characterization of prevalent cancer types, achieved through extensive proteomic and genomic analysis

Data

PDC distributes multiple types of files, including those submitted by the original data submitters and harmonized data generated through the PDC Common Data Analysis Pipeline (CDAP; ref. 9). Raw data include both mass spectrometer specific proprietary format and HUPO Proteome Standards Initiative compliant mzML format. In addition, PDC also releases peptide spectrum matches, protein assembly, and supplementary data such as descriptive protocols, as well as harmonized clinical, biospecimen, experimental metadata, and other useful information.

Tools

The PDC tools and resources allow exploration of cohorts of cancer patients from multiple programs, including:

  • (i) An interactive web portal for easy data exploration, complemented by a GraphQL-based API interface for efficient programmatic access

  • (ii) Protein identification and quantitation data from the CDAP visualized using Morpheus (https://software.broadinstitute.org/morpheus/), a versatile heatmap viewer, allowing hierarchical clustering using comprehensive clinical metadata

  • (iii) Access to all PDC data is available through all three CRDC Cloud Resources (5): Seven Bridges’ Cancer Genomics Cloud (SB-CGC), Broad's FireCloud, and the Institute of Systems Biology's Cancer Gateway in the Cloud (ISB-CGC), eliminating the need for data download and streamlining the analysis process

  • (iv) A variety of proteomics tools including JBrowse (https://pdc.cancer.gov/jbrowse), a tool for exploring proteomic data in the context of clinical and genomics data (10); PepQuery (http://pepquery2.pepquery.org), to identify and validate known and novel peptides of interest (11); and cProSite for comparing protein abundance between tumors and normal adjacent tissues (https://cprosite.ccr.cancer.gov)
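To make the GraphQL-based programmatic access listed above more concrete, the sketch below shows the general shape of a GraphQL request in Python using only the standard library. The endpoint URL, the query name `studies`, its argument, and the returned fields are all assumptions made for illustration and are not taken from the PDC schema; the PDC API documentation should be consulted for the actual query names. The code constructs the request object but does not send it.

```python
import json
import urllib.request

# Assumed endpoint, for illustration only; check the PDC documentation.
PDC_GRAPHQL_URL = "https://pdc.cancer.gov/graphql"

# Hypothetical query: the operation, argument, and field names are placeholders
# that show GraphQL syntax, not the real PDC schema.
STUDY_QUERY = """
query Studies($program: String!) {
  studies(program_name: $program) {
    study_id
    study_name
  }
}
"""


def build_graphql_request(url, query, variables):
    """Wrap a GraphQL query and its variables in an HTTP POST request."""
    body = json.dumps({"query": query, "variables": variables}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


if __name__ == "__main__":
    req = build_graphql_request(PDC_GRAPHQL_URL, STUDY_QUERY, {"program": "CPTAC"})
    # Sending is left to the caller, e.g. urllib.request.urlopen(req).
    print(req.get_method(), req.full_url)
```

A single GraphQL request names exactly the fields it needs, which is one reason this style of API suits metadata exploration across many studies.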

Highlights and challenges

PDC has made significant progress in increasing interoperability with all of the NCI Cloud Resources to reduce the need for data downloading and enhance the overall speed and scalability of data analysis workflows. In addition, to continue to support the international user community, PDC has also developed a Data Download Client tool to improve data access.

Challenges include mitigating the costs of a growing amount of data downloads while continuing to encourage data utilization. As such, the PDC encourages the use of analytical and visualization tools for data analysis in the cloud in part by placing limits on excessive downloads.

The ICDC (https://caninecommons.cancer.gov/#/) project began in September 2018 and launched in August 2020 to further research on human cancers by enabling comparative analysis with canine cancers via access to pet canine health care and clinical trial data.

Accomplishments

  • (i) Released 11 studies from three programs, including NCI's Comparative Oncology Program, comprising 600+ canine cases and 900+ samples for a total of approximately 35 TBs of data

  • (ii) Enabled real-time interoperability between the ICDC and the IDC and TCIA, increasing findability of imaging data for canine cases

Data

One important aspect of the ICDC data is that all data is open access, including aligned sequencing data. Examples of this data include BAM files from WGS, WXS, RNA-seq, as well as DNA methylation sequencing (Methyl-Seq). Other data types include pathology reports, clinical data and study protocols. Some studies include supplementary data provided by data submitters such as pharmacokinetic data, cell line information, charts, graphs, sequencing metrics, and other useful information.

Tools

Below are key tools available to ICDC users:

  • (i) Web-based tools to build synthetic cohorts and explore data

  • (ii) JBrowse: genomic and transcriptomic files related to cases of interest can be selected and viewed through a single click to inspect sequencing reads at the nucleotide level, sequencing metrics, strand information, variants of interest, and more

  • (iii) Genomic and transcriptomic data from the ICDC can be exported to the Seven Bridges’ Cancer Genomics Cloud (SB-CGC) for analysis without the need for any downloading. File contents are streamed on demand from cloud storage

Highlights and challenges

The ICDC recently launched a new tool called the Data Model Navigator that enables users to intuitively navigate the graph-based data model to visualize the nodes, relationships, properties, values, and controlled vocabularies. A current area of focus is helping researchers overcome data submission challenges to encourage further contributions.

The Cancer Data Service (CDS; https://dataservice.datacommons.cancer.gov/#/) project began in September 2018 and its first dataset was made publicly available on SB-CGC in December 2020. CDS provides secure cloud-based storage and data sharing capabilities for multiple data types, in their originally submitted format, to facilitate secondary data sharing with the public. CDS hosts datasets that do not meet submission criteria for other CRDC DCs, including not having sufficient metadata to support data harmonization. CDS hosts both open and controlled access data from NCI programs such as the Human Tumor Atlas Network (HTAN), Patient-derived Xenografts Development and Trial Centers Research Network (PDXNet), and the Childhood Cancer Data Initiative (CCDI).

Accomplishments

Notable accomplishments include:

  • (i) The CDS has processed 22 data releases, sharing a total of approximately 400 TB of genomic and imaging data

  • (ii) A CDS portal for exploring data through faceted search was launched in June 2023

Data

CDS strives to be data type agnostic, and is open to accepting a wide range of data types. While the CDS requires standardized, validated metadata to allow for search across datasets in the CDS Portal and the SB-CGC, CDS does not harmonize other submitted data (e.g., BAM files, DICOM images) and releases data “as-submitted”. Currently, the CDS hosts genomics and imaging data, with plans to include additional data types as required. Examples of data types currently hosted in CDS include: WGS, WXS, RNA-seq, targeted sequencing data, bisulfite sequencing data, imaging data, and clinical data.

Tools

Users can access and analyze CDS data using hundreds of prebuilt workflows and tools on the SB-CGC (5).

Highlights and challenges

The CDS Portal enables data exploration across different data types and is a source for extensive metadata and raw data. Being data type agnostic, the CDS data model that underlies effective data exploration must be flexible to accept both existing and new data types, and to define minimum required metadata, which can prove challenging. An additional challenge is metadata submitted to CDS that does not satisfy NCI's vocabulary standards or is missing required data elements. To address this challenge, the CDS is updating submission requirements and implementing extensive validation steps during data submission and release.

The IDC (https://portal.imaging.datacommons.cancer.gov/) project (12, 13) began in July 2019 and launched in June 2021 to host publicly available cancer imaging data including a broad range of radiology, digital pathology, and microscopy imaging types, such as radiology collections from TCIA and others. IDC hosts images and image-derived data in the Digital Imaging and Communications in Medicine (DICOM) format (14) and harmonizes alternative formats into DICOM. Specific examples of data that are harmonized from vendor-specific or research formats include digital pathology and fluorescence microscopy images, image annotations and image-derived measurements.

Accomplishments

As of data release v15, IDC has released more than 67 TB of open imaging data from 63K+ cases, spanning 135+ collections. Other accomplishments include:

  • (i) Hosted most public radiology collections curated by TCIA. Collections omitted are primarily those not prioritized for ingestion by the IDC stakeholders and not harmonized in DICOM representation

  • (ii) Harmonized digital pathology and fluorescence imaging collections into DICOM Slide Microscopy object representation, utilizing the DICOM-TIFF dual personality representation (15), achieving interoperability with the off-the-shelf archival, search, and visualization tools

  • (iii) Image analysis results and annotations are also harmonized into DICOM representation. Examples include volumetric regions of interest corresponding to anatomic structures and tumor areas, annotations of the individual image slices with respect to the presence of certain anatomic landmarks, and quantitative features extracted from the images (e.g., volume of the region and its shape characteristics)

  • (iv) Curated clinical data into metadata tables searchable using a Structured Query Language (SQL) interface for building analysis cohorts

  • (v) Provided use cases (representative demonstrative examples of the utility of the resource in addressing specific needs of the cancer imaging community) accompanied by publicly available reproducible analysis notebooks, written reports, and analysis artifacts (16, 17)

  • (vi) Collaborated with public dataset initiatives of major cloud providers (Google Cloud Platform and Amazon Web Services) enabling fee-free egress and hosting of IDC data, improving sustainability, and providing seamless access to cloud-native AI/ML platforms (e.g., Vertex AI on GCP and SageMaker on AWS)

Data

IDC currently houses deidentified open access image data. Examples of data in IDC include radiology image modalities (CT, MR, PET) from clinical, preclinical, canine, and phantom images; digital pathology images of hematoxylin and eosin (H&E)-stained tissue from clinical and preclinical studies; fluorescence microscopy images collected from HTAN initiative. IDC also provides clinical data and image-derived data such as annotations generated by experts or automated analysis techniques and definitions of the regions of interest (e.g., outlines of the anatomic organs or tumors), annotations of findings, measurements, and parametric maps.

Tools

Tools available to IDC users include:

  • (i) IDC-maintained tools

  • (ii) IDC search portal integrated with image visualization tools

  • (iii) Customized instance of the Open Health Imaging Foundation (OHIF) radiology viewer for visualization of radiologic modalities images and image-derived data (18)

  • (iv) Customized instance of the Slim microscopy viewer (19) for visualization of digital pathology and microscopy images, and related image-derived data

  • (v) OpenSlide (20) DICOM support for reading the DICOM Slide Microscopy format within this widely used library, which provides a common interface to access a variety of image formats

  • (vi) Bio-Formats (21) DICOM support for reading and writing the DICOM Slide Microscopy format

  • (vii) Tools for harmonization of research and vendor-specific formats (i.e., TIFF, SVS, NRRD, NIFTI) into DICOM

  • (viii) Collaborative tools

  • (ix) Google Healthcare API and BigQuery: metadata accompanying IDC data, as available in DICOM files, is automatically extracted, versioned, and made available for searching using Structured Query Language (SQL) queries

  • (x) Google Cloud Platform (GCP): colocation of data within Google Cloud Platform enables scalable access to a variety of components within GCP, enabling the use of popular desktop applications, such as 3D Slicer (22), or batch image analysis tools, such as automatic segmentation using nnU-Net family of algorithms (23)

  • (xi) Other tools include Google Data Studio, used to build custom dashboards for data exploration, and Google Colab to streamline prototyping and dissemination of analysis workflows

  • (xii) MHub (https://mhub.ai): a repository of self-contained deep-learning models trained for a wide variety of applications in the medical and medical imaging domain. AI tools in MHub are curated with standardization and integration with IDC in mind, to simplify application of those tools to IDC data and integration of the analysis results back into IDC
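The SQL-based cohort selection described in the list above can be sketched as follows. Because IDC metadata is queried through BigQuery, which requires a cloud account, this example substitutes Python's built-in `sqlite3` module and a small invented table so that it runs anywhere. The table name `series`, its rows, and the collection names are hypothetical; the column names are modeled loosely on DICOM attributes and are assumptions, not the IDC schema.

```python
import sqlite3


def build_toy_index():
    """Create an in-memory stand-in for an imaging metadata table (invented rows)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE series ("
        "PatientID TEXT, collection_id TEXT, Modality TEXT, SeriesInstanceUID TEXT)"
    )
    rows = [
        ("P001", "collection_a", "CT", "1.2.3.1"),
        ("P001", "collection_a", "SEG", "1.2.3.2"),
        ("P002", "collection_a", "MR", "1.2.3.3"),
        ("P003", "collection_b", "CT", "1.2.3.4"),
    ]
    conn.executemany("INSERT INTO series VALUES (?, ?, ?, ?)", rows)
    return conn


# Cohort definition: patients in one collection with at least one series
# of the requested modality, plus the number of matching series.
COHORT_SQL = """
SELECT PatientID, COUNT(*) AS n_series
FROM series
WHERE Modality = ? AND collection_id = ?
GROUP BY PatientID
ORDER BY PatientID
"""


def select_cohort(conn, modality, collection_id):
    """Run the cohort query with bound parameters and return (PatientID, n_series) rows."""
    return conn.execute(COHORT_SQL, (modality, collection_id)).fetchall()


if __name__ == "__main__":
    conn = build_toy_index()
    print(select_cohort(conn, "CT", "collection_a"))
```

The same SELECT/WHERE/GROUP BY pattern would carry over to a cloud SQL engine, with the table reference and parameter syntax adapted to that engine.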

Highlights and challenges

Recent highlights include:

  • (i) Demonstrations of integrations of various image analysis tools and workflows via Google Colab to simplify access to data and experimentation using the tools. The IDC team is currently developing demonstration use cases illustrating end-to-end analysis and visualizations (16)

  • (ii) Demonstrations and use cases analyzing IDC data using CRDC Cloud Resources including best practices and integrations of custom analysis tools to simplify cloud resource use

Challenges the IDC has experienced around data ingestion are: (i) datasets not harmonized to supported standards; (ii) datasets missing metadata required for harmonization. Importing these retrospective datasets is time consuming and sometimes not feasible due to missing information. As with other DCs, the IDC is establishing best practices for using cloud computing with cancer imaging data.

The Clinical and Translational Data Commons (CTDC; https://clinical.datacommons.cancer.gov; not available until launch) project began in July 2021 and is scheduled to launch in early 2024. The CTDC platform will increase researcher access to clinical trial data as well as translational study data to maximize their impact by contributing to the development of a learning health care system that improves clinical outcomes and quality of life for individuals diagnosed with cancer.

Accomplishments

A major goal of the CTDC is to democratize data access by making deidentified clinical study data accessible to as broad a user base as possible. As such, the CTDC will offer:

Highlights and challenges

CTDC's debut will include previously unavailable deidentified clinical and molecular data from the Cancer Moonshot Biobank (CMB, https://moonshotbiobank.cancer.gov), with additional datasets from other high-impact studies and programs soon after, including data from immuno-oncology studies, childhood cancer studies, and more. The CTDC will allow data filtering by several characteristics including, but not limited to, diagnosis, demographics, and biospecimen type.

A major challenge for the CTDC will be the ongoing harmonization of the ever-expanding collection of clinical study datasets it will house. CTDC's agile data model was designed in alignment with the cancer Data Standards Registry and Repository (https://cadsr.cancer.gov/onedata/Home.jsp) to promote efficient updating to accommodate future, as yet unknown data sources. In addition, data elements will include references to Clinical Data Interchange Standards Consortium (https://www.cdisc.org/) Study Data Tabulation Model (https://www.cdisc.org/standards/foundational/sdtm), when applicable, to facilitate integration and cross-referencing across clinical study datasets.

Data access

One of the primary aims of the CRDC is to make the data hosted by each of its DCs Findable, Accessible, Interoperable, and Reusable (FAIR) as shown in Fig. 1. The CRDC uses a federated approach to data management. Although there are significant efforts underway to identify and standardize common elements and vocabularies across each data commons, this is a large effort that will take some time to complete and will require scheduled and consistent grooming as new data types continue to emerge. Data governance is currently managed by individual data commons; however, the CRDC is moving toward implementing a centralized data governance process with input and collaboration from all CRDC stakeholders.

Figure 1.

CRDC implements FAIR principles to advance cancer research.


Technical infrastructure

A federated approach to data management necessitates a shared software architecture capable of elastic scalability. A main component of the shared architecture is the NCI's Data Commons Framework (DCF), a set of open-source software services based upon the Gen3 platform (https://gen3.org) that enable data object indexing as well as authentication and authorization. In addition, some of the data commons leverage the Bento Framework (https://github.com/bento-platform), a set of open-source software services developed by the Frederick National Laboratory for Cancer Research that provides out-of-the-box shared functionality, including an intuitive user interface, a navigable graph-based data model, faceted search capabilities, tooling to support data submitters and consumers, a next-generation genome browser for viewing genomic files, and a GraphQL-based API to support programmatic access.

Common features

Repositories across the CRDC were engineered with a level of continuity in mind, resulting in a similar set of features and tooling intended to support FAIR data. To make data FAIR (Fig. 1), the first step is to ensure that the data are Findable in an intuitive way. To this end, each of the data commons implements facet-based filtering and enables users to build cohorts of interest by selecting elements such as disease type, tumor grade, demographics, and data types. Once a user has found data of interest, the next step is to make it Accessible. CRDC data includes open as well as controlled access data. Controlled-access data requires users to first apply for access through the dbGaP or other mechanisms, upon which this authorization is synced through the DCF services, granting users access in a secure fashion. Once a user has found and accessed the data, the next step is to ensure the data commons are Interoperable, making it possible to integrate data from multiple data commons (e.g., genomic, proteomic, imaging) by leveraging common identifiers and data standards. Finally, the last step is to ensure all of this data is Reusable. The CRDC leverages globally unique identifiers and centralized servers to ensure that files are not copied from one cloud storage bucket to another when moving data from the data commons into the NCI Cloud Resources used for analysis. Pointers to respective files are used to stream the data on demand from their cloud location, eliminating the need to download files, which minimizes egress and ingress costs. Some of the CRDC components being operated on cloud have dependencies on cloud providers and NIH STRIDES for costs of data transfers, data downloads, and computing resources.
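The facet-based filtering described above can be illustrated with a short sketch. It assumes the common convention that values selected within one facet are combined with OR, while different facets are combined with AND; the case records, facet names, and function names are invented for the example and do not reflect any data commons' actual data model.

```python
from collections import Counter

# Invented case records; facet names are illustrative only.
CASES = [
    {"case_id": "C1", "disease_type": "Glioma", "sex": "female", "data_type": "WGS"},
    {"case_id": "C2", "disease_type": "Glioma", "sex": "male", "data_type": "RNA-seq"},
    {"case_id": "C3", "disease_type": "Osteosarcoma", "sex": "female", "data_type": "WGS"},
    {"case_id": "C4", "disease_type": "Osteosarcoma", "sex": "male", "data_type": "WGS"},
]


def filter_cases(cases, selections):
    """Keep cases matching every selected facet (AND across facets).

    `selections` maps a facet name to the set of accepted values
    (OR within a facet). An empty mapping keeps all cases.
    """
    return [
        case
        for case in cases
        if all(case.get(facet) in allowed for facet, allowed in selections.items())
    ]


def facet_counts(cases, facet):
    """Count cases per facet value, as shown next to checkboxes in a faceted UI."""
    return Counter(case[facet] for case in cases)


if __name__ == "__main__":
    cohort = filter_cases(
        CASES,
        {"disease_type": {"Glioma", "Osteosarcoma"}, "data_type": {"WGS"}},
    )
    print([case["case_id"] for case in cohort])
    print(dict(facet_counts(cohort, "sex")))
```

Recomputing the per-value counts on the filtered cohort is what lets a user see how each additional selection would narrow the cohort before applying it.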

The CRDC provides a comprehensive solution to address a range of data sharing needs across the cancer research community. For example, while the CDS provides data sharing of “as-submitted” data with minimal metadata requirements, significantly simplifying the submission and release process, domain-specific commons like GDC require richer metadata and a harmonization process before data release. Although data submission and release in the GDC may be more burdensome than in CDS, it can enhance the FAIR-ness of datasets that are selected for inclusion in the GDC. Each DC publishes detailed user guides on its website regarding submission requirements, processes, and the roles and responsibilities of submitters and DC staff members. The GDC data submission guide (https://docs.gdc.cancer.gov/Data_Submission_Portal/Users_Guide/Data_Submission_Overview/) is one example. Supplementary Table S1 lists URLs of data submission guidelines for each DC.

With the new NIH Data Management and Sharing Policy (https://sharing.nih.gov/), it is expected that there will be a spectrum of issues and uncertainties related (in particular) to the quality and timelines of submissions and data volume. CRDC leadership will work with key stakeholders of the individual data commons and the cancer research community to transparently define and refine procedures and policies related to data submission, access, and sharing.

Interoperability with other NIH data commons

In 2019, the NIH Cloud Platform Interoperability (NCPI; https://datascience.nih.gov/nih-cloud-platform-interoperability-effort) initiative was established by multiple NIH institutes to develop and implement guidelines and technical standards to empower end-user analyses across participating cloud-based platforms and facilitate the realization of a trans-NIH federated data ecosystem. The NCPI facilitates interoperability among the data and analysis platforms established by the NCI, National Human Genome Research Institute (NHGRI), National Heart, Lung, and Blood Institute (NHLBI), National Center for Biotechnology Information (NCBI), and the NIH Common Fund (24). The NCI's CRDC has contributed significantly to these efforts. One key use case that demonstrated interoperability between NHGRI's AnVIL (https://anvilproject.org/) and the CRDC is the “LINE-1 Retrotransposon Expression” work, which utilized the Global Alliance for Genomics and Health standard Data Repository Service (https://www.ga4gh.org/news_item/drs-api-enabling-cloud-based-data-access-and-retrieval/) to access data across two cloud platforms. Briefly, this project integrated genomic and proteomic data from CRDC (GDC and PDC) with normal tissue expression data from AnVIL (GTEx) and tested the hypothesis that the activity of a specific retrotransposon, LINE-1, differs between tumors and normal cells (25). Details regarding this project are available at https://www.ncpi-acc.org/. The CRDC continues to identify novel use cases to further expand analytical capabilities and demonstrate platform interoperability.
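To give a sense of how the Data Repository Service (DRS) standard supports cross-platform access, the sketch below converts a hostname-based DRS URI into the HTTPS address of the corresponding object record, following the URI convention in the GA4GH DRS v1 specification. The function name and the example host `drs.example.org` are hypothetical, and compact identifier-based DRS URIs, which need a separate resolver lookup, are deliberately out of scope.

```python
from urllib.parse import urlparse


def drs_uri_to_url(drs_uri):
    """Map a hostname-based DRS URI to its object metadata URL.

    Convention from the GA4GH DRS v1 specification:
        drs://<hostname>/<object_id>
        -> https://<hostname>/ga4gh/drs/v1/objects/<object_id>
    Compact identifier-based URIs are not handled and raise ValueError.
    """
    parsed = urlparse(drs_uri)
    object_id = parsed.path.lstrip("/")
    if parsed.scheme != "drs" or not parsed.netloc or not object_id:
        raise ValueError(f"not a hostname-based DRS URI: {drs_uri!r}")
    return f"https://{parsed.netloc}/ga4gh/drs/v1/objects/{object_id}"


if __name__ == "__main__":
    # Hypothetical host and object identifier.
    print(drs_uri_to_url("drs://drs.example.org/314159"))
```

Because the URI, rather than a copy of the file, is what moves between platforms, an analysis workspace on one cloud can resolve the record and stream the bytes from wherever the data commons stores them.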

Discussion and next steps

By collocating data with computing infrastructure and analysis tools, the CRDC promotes data sharing by:

  • (i) Lowering the barrier of entry to data access. Users can explore and analyze data in the cloud, eliminating the need to have their own storage and computing resources

  • (ii) Improving interoperability and enhancing data integration. Users can create their own third-party tools that connect with the data commons through APIs; one example is the R package TCGAbiolinks (https://bioconductor.org/packages/release/bioc/html/TCGAbiolinks.html)

  • (iii) Utilizing commercial cloud's enormous computing power to perform compute-intensive tasks

  • (iv) Providing users with options to use harmonized higher-level data such as somatic mutation calls, reducing the burden of processing raw data
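
As an illustration of points (ii) and (iv), the short Python sketch below builds a request to the GDC API file endpoint that asks for harmonized somatic mutation files from a single project. It is a minimal example under stated assumptions rather than a recommended client: the function names are hypothetical, the project (TCGA-BRCA), data type, and returned fields are chosen only for illustration, and the sketch constructs the request URL without sending it.

```python
import json
from urllib.parse import urlencode

# Public GDC API endpoint for file metadata.
GDC_FILES_ENDPOINT = "https://api.gdc.cancer.gov/files"


def build_gdc_file_query(project_id: str, data_type: str, size: int = 10) -> dict:
    """Build query parameters selecting files of one data type in one project."""
    filters = {
        "op": "and",
        "content": [
            {
                "op": "in",
                "content": {"field": "cases.project.project_id", "value": [project_id]},
            },
            {
                "op": "in",
                "content": {"field": "data_type", "value": [data_type]},
            },
        ],
    }
    return {
        # The API expects the filter expression as a JSON-encoded string.
        "filters": json.dumps(filters),
        "fields": "file_id,file_name,data_category,data_type",
        "format": "JSON",
        "size": str(size),
    }


def build_gdc_file_url(params: dict) -> str:
    """Return the full GET URL for the file endpoint with encoded parameters."""
    return f"{GDC_FILES_ENDPOINT}?{urlencode(params)}"


# Illustrative values: harmonized mutation calls for one TCGA project.
params = build_gdc_file_query("TCGA-BRCA", "Masked Somatic Mutation")
print(build_gdc_file_url(params))
```

The resulting URL can be fetched with any HTTP client; packages such as TCGAbiolinks wrap this kind of query so that users do not have to assemble filters by hand.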

The CRDC actively collects feedback from users and continues to improve usability within and across the data commons. The primary focus areas are (i) automating the data submission process to reduce the burden on data submitters; (ii) standardizing terminology to improve interoperability and data reusability; (iii) lowering the barrier of entry to data access by building intuitive, self-explanatory user interfaces that are useful for all members of the cancer research community; and (iv) implementing a centralized data governance framework.

In addition to the six existing data commons described in this manuscript, the NCI is currently exploring ways to meet the evolving needs of cancer researchers. Research data types to be supported in the future include immuno-oncology and population science data. Together, the CRDC data commons serve as the foundation for the national cancer data ecosystem, promoting data sharing and accelerating cancer research.

R.L. Grossman reports grants from NIH/NCI, NIH/NHLBI, and the NIH HEAL Initiative during the conduct of the study. J. Otridge reports other support from NCI during the conduct of the study. R.R. Thangudu reports other support from Leidos Biomedical Research during the conduct of the study. J.S. Barnholtz-Sloan reports other support from NIH/NCI during the conduct of the study. No disclosures were reported by the other authors.

The content of this publication does not necessarily reflect the views or policies of the Department of Health and Human Services, nor does mention of trade names, commercial products, or organizations imply endorsement by the US Government.

The authors would like to thank Warren Kibbe, Juli Klemm, Elizabeth Hsu, Martin Ferguson, and David Pot for their review and thoughtful contributions. The full list of CRDC Program consortium members can be found in the Supplementary Data.

Note: Supplementary data for this article are available at Cancer Research Online (http://cancerres.aacrjournals.org/).

1. Grossman RL, Heath A, Murphy M, Patterson M, Wells W. A case for data commons: toward data science as a service. Comput Sci Eng 2016;18:10–20.
2. Brady A, Charbonneau A, Grossman RL, Creasy HH, Renner R, Pihl T, et al. NCI Cancer Research Data Commons: Core Standards and Services. Cancer Res 2024;84:1384–7.
3. Kim E, Davidsen T, Davis-Dusenbery BN, Baumann A, Maggio A, Chen Z, et al. NCI Cancer Research Data Commons: lessons learned and future state. Cancer Res 2024;84:1404–9.
4. Heath AP, Ferretti V, Agrawal S, An M, Angelakos JC, Arya R, et al. The NCI genomic data commons. Nat Genet 2021;53:257–62.
5. Pot D, Worman Z, Baumann A, Pathak S, Beck R, Beck E, et al. NCI Cancer Research Data Commons: cloud-based analytic resources. Cancer Res 2024;84:1396–403.
6. Thangudu RR, Rudnick PA, Holck M, Singhal D, MacCoss MJ, Edwards NJ, et al. Proteomic Data Commons: A resource for proteogenomic analysis [abstract]. In: Proceedings of the Annual Meeting of the American Association for Cancer Research 2020; 2020 Apr 27–28 and Jun 22–24. Philadelphia (PA): AACR; 2020. Abstract nr LB-242.
7. Matthiesen R, Bunkenborg J. Introduction to mass spectrometry-based proteomics. Methods Mol Biol 2013;1007:1–45.
8. Pino LK, Just SC, MacCoss MJ, Searle BC. Acquiring and analyzing data independent acquisition proteomics experiments without spectrum libraries. Mol Cell Proteomics 2020;19:1088–103.
9. Rudnick PA, Markey SP, Roth J, Mirokhin Y, Yan X, Tchekhovskoi DV, et al. A description of the clinical proteomic tumor analysis consortium (CPTAC) common data analysis pipeline. J Proteome Res 2016;15:1023–32.
10. Skinner ME, Uzilov AV, Stein LD, Mungall CJ, Holmes IH. JBrowse: a next-generation genome browser. Genome Res 2009;19:1630–8.
11. Wen B, Wang X, Zhang B. PepQuery enables fast, accurate, and convenient proteomic validation of novel genomic alterations. Genome Res 2019;29:485–93.
12. Fedorov A, Longabaugh WJR, Pot D, Clunie DA, Pieper S, Aerts HJWL, et al. NCI Imaging Data Commons. Cancer Res 2021;81:4188–93.
13. Fedorov A, Longabaugh WJR, Pot D, Clunie DA, Pieper SD, Gibbs DL, et al. National cancer institute imaging data commons: toward transparency, reproducibility, and scalability in imaging artificial intelligence. Radiographics 2023;43:e230180.
14. Bidgood WD, Horii SC, Prior FW, Van Syckle DE. Understanding and using DICOM, the data interchange standard for biomedical imaging. J Am Med Inform Assoc 1997;4:199–212.
15. Clunie DA. Dual-personality DICOM-TIFF for whole slide images: a migration technique for legacy software. J Pathol Inform 2019;10:12.
16. Schacherer DP, Herrmann MD, Clunie DA, Höfener H, Clifford W, Longabaugh WJR, et al. The NCI imaging data commons as a platform for reproducible research in computational pathology. Comput Methods Programs Biomed 2023;242:107839.
17. Krishnaswamy D, Bontempi D, Thiriveedhi V, Punzo D, Clunie D, Bridge CP, et al. Enrichment of the NLST and NSCLC-Radiomics computed tomography collections with AI-derived annotations. Sci Data 2024;11:25.
18. Ziegler E, Urban T, Brown D, Petts J, Pieper SD, Lewis R, et al. Open health imaging foundation viewer: an extensible open-source framework for building web-based imaging applications to support cancer research. JCO Clin Cancer Inform 2020;4:336–45.
19. Gorman C, Punzo D, Octaviano I, Pieper S, Longabaugh WJR, Clunie DA, et al. Interoperable slide microscopy viewer and annotation tool for imaging data science and computational pathology. Nat Commun 2023;14:1572.
20. Goode A, Gilbert B, Harkes J, Jukic D, Satyanarayanan M. OpenSlide: a vendor-neutral software foundation for digital pathology. J Pathol Inform 2013;4:27.
21. Moore J, Linkert M, Blackburn C, Carroll M, Ferguson RK, Flynn H, et al. OMERO and Bio-Formats 5: flexible access to large bioimaging datasets at scale. In: Ourselin S, Styner MA, editors. Medical Imaging 2015: Image Processing [Internet]; 2015. Available from: https://www.spiedigitallibrary.org/conference-proceedings-of-spie/9413/941307/OMERO-and-Bio-Formats-5–flexible-access-to-large/10.1117/12.2086370.short.
22. Fedorov A, Beichel R, Kalpathy-Cramer J, Finet J, Fillion-Robin JC, Pujol S, et al. 3D slicer as an image computing platform for the quantitative imaging network. Magn Reson Imaging 2012;30:1323–41.
23. Isensee F, Jaeger PF, Kohl SAA, Petersen J, Maier-Hein KH. nnU-Net: a self-configuring method for deep learning-based biomedical image segmentation. Nat Methods 2021;18:203–11.
24. Grossman RL, Boyles RR, Davis-Dusenbery BN, Haddock A, Heath AP, O'Connor BD, et al. A framework for the interoperability of cloud platforms: towards FAIR data in SAFE environments. Sci Data 2024;11:241.
25. McKerrow W, Wang X, Mendez-Dorantes C, Mita P, Cao S, Grivainis M, et al. LINE-1 expression in cancer correlates with p53 mutation, copy number alteration, and S phase checkpoint. Proc Natl Acad Sci U S A 2022;119:e2115999119.
This open access article is distributed under the Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International (CC BY-NC-ND 4.0) license.