To target and personalize cancer therapies to the genomic aberrations present in a particular patient's tumor, researchers need to identify the genes that drive the progression of malignant tumors. This requires analysis of somatic mutations from large samples of patients to identify driver mutations up to the “tail end” of the frequency distribution. Community genomics data sets from the TCGA and ICGC projects represent a valuable resource to which researchers can add their own data to gain statistical power in their analyses. The current obstacle to this approach is the highly fragmented storage of public and private data and the inefficient access to public data. Researchers spend weeks to months downloading hundreds of terabytes of data from central repositories before computations can begin. What is needed is a data “safe haven” where researchers can bring compute to the reference data without incurring bulky data transfers or duplicative storage costs, in an environment that protects the privacy of the patients’ data. In collaboration with the International Cancer Genome Consortium, we developed ShareSeq, a genomic data safe haven platform that provides an informatics solution for storing, handling and analyzing protected identifiable genomic data. This resource leverages Annai-GNOS, the technology that we developed together with UCSC to create and manage the CGHub TCGA repository and that is being used in the ICGC Pan Cancer Analysis of Whole Genomes project, and combines it with a high-performance compute environment and an array of tools to process and analyze genomic data. Built using a walled garden approach, where the data is stored, processed and managed within the security of the system, ShareSeq avoids the complexity of assured end point encryption.
GeneTorrent, our fast and secure file transfer mechanism, enables researchers’ private information to be transferred into the walled garden simply and securely so that it can be combined with the public datasets. ShareSeq differs dramatically from the traditional cloud in four features: (i) formal mechanisms and a service level agreement to store protected identifiable genomic data securely and safely, built into the system from the ground up; (ii) a design specifically intended for genomic computing over large shared data sets, supporting common bioinformatics workflow tools; (iii) fast download of and access to raw genomic information and its metadata; and (iv) access controls leveraging federated authentication systems that Data Access Committees utilize to authorize access to the restricted data. ShareSeq is initially hosting raw, normalized, and processed data from the ICGC, but we envision that over time it will host an increasing number of high-value public reference genomic datasets and add standards-based interfaces promoted by the Global Alliance for Genomics and Health to allow broader data discovery and sharing.
Citation Format: Francisco M. De La Vega, Ying Wu, Tal Shmaya, Thomas Schlumpberger, James Wiley, Akshay Patel, Raja Hayek. A novel data safe haven approach to bring analyses to the International Cancer Genome Consortium data. [abstract]. In: Proceedings of the 106th Annual Meeting of the American Association for Cancer Research; 2015 Apr 18-22; Philadelphia, PA. Philadelphia (PA): AACR; Cancer Res 2015;75(15 Suppl):Abstract nr LB-308. doi:10.1158/1538-7445.AM2015-LB-308