The goal of the International Cancer Genome Consortium (ICGC) is to analyze the genomes of at least 500 tumor samples, with matched controls, from each of 50 different cancer types and subtypes, building a comprehensive catalogue of somatic abnormalities for the benefit of the research community. ICGC members will generate an amount of data close to that of 50,000 human genome projects. To date, the consortium has received commitments for 107 projects to study more than 27,000 tumor genomes.
The ICGC Data Coordination Center (DCC) is responsible for collecting, curating, aggregating, and disseminating the data generated by the consortium’s member projects. Given the size and the complexity of the ICGC data, these tasks represent significant scientific and technological challenges that require a performant, robust software infrastructure. Key to this infrastructure is the ability to scale as data grows.
Using state-of-the-art Big Data, bioinformatics and cloud computing technologies, we developed a suite of web-based applications and microservices that enable member projects to submit their data and validate each submission against the rules defined in the submission specification. Following validation, the data is processed, annotated and loaded into the data portal using a modular Extract-Transform-Load (ETL) pipeline. The submission, ETL and portal systems are built on scalable, distributed technologies such as Hadoop, Spark, MongoDB and Elasticsearch. Spark is used to validate, join, index, and harmonize annotations on submitted variants, while Elasticsearch powers our variant query engine, API and portal displays.
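The validate, transform and load flow described above can be illustrated with a minimal sketch. It is not the DCC implementation: it uses plain Python in place of Spark and an in-memory dictionary in place of a search index. The field names, the validation rules and the functions `validate`, `transform` and `run_pipeline` are all hypothetical and do not reflect the actual ICGC submission specification.

```python
# Hypothetical rule set for illustration only; NOT the real ICGC submission specification.
REQUIRED_FIELDS = ("donor_id", "chromosome", "position", "reference", "alternate")
VALID_BASES = set("ACGT")


def validate(record):
    """Return a list of rule violations for one submitted variant record."""
    errors = ["missing field: " + name for name in REQUIRED_FIELDS if not record.get(name)]
    if errors:
        return errors
    position = str(record["position"])
    if not position.isdigit() or int(position) < 1:
        errors.append("position must be a positive integer")
    for name in ("reference", "alternate"):
        if not set(str(record[name]).upper()) <= VALID_BASES:
            errors.append(name + " contains non-ACGT characters")
    return errors


def transform(record):
    """Normalize a validated record and attach a simple derived annotation."""
    chromosome = str(record["chromosome"])
    if chromosome.startswith("chr"):
        chromosome = chromosome[3:]
    ref = str(record["reference"]).upper()
    alt = str(record["alternate"]).upper()
    return {
        "donor_id": record["donor_id"],
        "chromosome": chromosome,
        "position": int(record["position"]),
        "reference": ref,
        "alternate": alt,
        # Crude stand-in for real variant annotation.
        "mutation_type": "substitution" if len(ref) == len(alt) else "indel",
    }


def run_pipeline(submission):
    """Validate each record, then transform and 'load' it into an in-memory index.

    Returns (index, rejected): index maps a variant key to the donors carrying it,
    rejected maps 1-based row numbers to their validation errors.
    """
    index = {}
    rejected = {}
    for row_number, record in enumerate(submission, start=1):
        errors = validate(record)
        if errors:
            rejected[row_number] = errors
            continue
        doc = transform(record)
        key = "{}:{}:{}>{}".format(
            doc["chromosome"], doc["position"], doc["reference"], doc["alternate"]
        )
        index.setdefault(key, []).append(doc["donor_id"])
    return index, rejected
```

In this sketch, invalid rows are reported back by row number rather than aborting the run, and harmonized records from different donors collapse onto the same variant key. A distributed engine would perform the same grouping at scale.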
All source code is open to the community under the GPLv3 license.
Citation Format: Junjun Zhang, Bob Tiernay, Dusan Andric, Phuong-My Do, Sid Joshi, Vitalii Slobodianyk, Chang Wang, Shane Wilson, Andy Yang, Vincent Ferretti. The ICGC data portal and its underlying open source software architecture [abstract]. In: Proceedings of the American Association for Cancer Research Annual Meeting 2017; 2017 Apr 1-5; Washington, DC. Philadelphia (PA): AACR; Cancer Res 2017;77(13 Suppl):Abstract nr 2602. doi:10.1158/1538-7445.AM2017-2602