The Cancer Genomics Hub, managed by researchers at the University of California, Santa Cruz, and housed at the San Diego Supercomputer Center, came out of beta testing in May 2012.

A new system housed at the San Diego Supercomputer Center grants computer-savvy researchers easy access to data from The Cancer Genome Atlas (TCGA) and other large cancer-sequencing projects. Called the Cancer Genomics Hub (CGHub), the system came out of beta testing in May and is available at https://cghub.ucsc.edu/.

Providing access to data from thousands of patients is a huge technical challenge, but it's the only way to come up with statistically significant conclusions about the complex genetic changes underlying cancer, says David Haussler, PhD, principal investigator at CGHub, which is being run by researchers at the University of California, Santa Cruz. “It would be a horrible mistake to have one database for lung cancer and one database for breast cancer, with sequencing data siloed inside institutions,” says Haussler.

CGHub can store 5 petabytes (5 million gigabytes) of raw sequence data, scalable up to 20 petabytes. TCGA currently generates about 10 terabytes (10,000 gigabytes) of sequencing data a month. CGHub will also support two other National Cancer Institute (NCI) efforts: the Therapeutically Applicable Research to Generate Effective Treatments program and the Cancer Genome Characterization Initiative.

Haussler expects the fast stream of cancer data to become a deluge. To get the best picture of the molecular basis of cancer, the quality of the DNA sequences gathered by these centers will double, and the amount of data generated will double with it, he says. Add RNA sequence data and sequences from healthy tissues, and the total will reach a terabyte of data per patient.
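The storage figures quoted above can be put side by side with a little arithmetic. The following is only a back-of-the-envelope sketch: it assumes decimal units (1 petabyte = 1,000 terabytes, as the article's gigabyte conversions imply) and a constant intake rate, which Haussler himself expects to be exceeded. The constant and function names are illustrative.

```python
# Back-of-the-envelope capacity estimates using figures quoted in the article.
# Assumptions: decimal units (1 PB = 1,000 TB) and a constant monthly intake.

TB_PER_PB = 1_000

INITIAL_CAPACITY_TB = 5 * TB_PER_PB   # 5 petabytes at launch
MAX_CAPACITY_TB = 20 * TB_PER_PB      # scalable up to 20 petabytes
TCGA_INTAKE_TB_PER_MONTH = 10         # TCGA output at the time of writing
PROJECTED_TB_PER_PATIENT = 1          # Haussler's projected total per patient


def months_to_fill(capacity_tb: float, intake_tb_per_month: float) -> float:
    """Months until the capacity is reached at a constant intake rate."""
    return capacity_tb / intake_tb_per_month


def patients_that_fit(capacity_tb: float, tb_per_patient: float) -> int:
    """Number of whole patient data sets that fit in the given capacity."""
    return int(capacity_tb // tb_per_patient)


if __name__ == "__main__":
    for label, capacity in [("initial", INITIAL_CAPACITY_TB),
                            ("maximum", MAX_CAPACITY_TB)]:
        months = months_to_fill(capacity, TCGA_INTAKE_TB_PER_MONTH)
        patients = patients_that_fit(capacity, PROJECTED_TB_PER_PATIENT)
        print(f"{label}: {months:.0f} months at the current rate, "
              f"{patients} patients at 1 TB each")
```

Under these assumptions the initial 5 petabytes would take 500 months to fill at the current TCGA rate, yet would hold data for only 5,000 patients at a terabyte each, which is why the projected growth per patient matters more than the current monthly figure.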

At these levels, researchers have to worry not only about storage capacity but also about data-transfer rates. They won't be able to browse through large amounts of information in CGHub and will instead have to choose carefully what to retrieve; downloading a single genome may take 8 hours. Heavy users will be encouraged to house their server racks inside the supercomputing center to speed these transfers.

Accessing the information in the database will require computer savvy and, to protect patient privacy, approval from the NIH. Haussler says the best way for cancer researchers lacking bioinformatics expertise to sort through the data is to have developers in their labs write scripts that will automatically pull sequence data from the hub. Such a script might watch for new genomes from patients with a particular type of cancer, and then automatically download the data.
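The kind of script Haussler describes might look roughly like the sketch below. It is not based on CGHub's actual interface: `list_genome_ids` and `download` are hypothetical stand-ins that a lab's developer would replace with whatever query and transfer tools the hub provides, and the download step is assumed to handle the NIH-approved credentials. The polling interval is likewise an arbitrary choice.

```python
import time
from typing import Callable, Iterable, List, Optional, Set


def poll_once(
    list_genome_ids: Callable[[str], Iterable[str]],
    download: Callable[[str], None],
    cancer_type: str,
    seen: Set[str],
) -> List[str]:
    """Download every genome of the given cancer type not fetched before.

    list_genome_ids and download are hypothetical callables supplied by the
    caller; they are placeholders, not real CGHub functions.
    """
    downloaded = []
    for genome_id in list_genome_ids(cancer_type):
        if genome_id in seen:
            continue
        download(genome_id)
        seen.add(genome_id)  # recorded only after the download returned
        downloaded.append(genome_id)
    return downloaded


def watch(
    list_genome_ids: Callable[[str], Iterable[str]],
    download: Callable[[str], None],
    cancer_type: str,
    interval_seconds: float = 3600,
    max_polls: Optional[int] = None,
) -> Set[str]:
    """Poll repeatedly; max_polls=None means run until interrupted."""
    seen: Set[str] = set()
    polls = 0
    while max_polls is None or polls < max_polls:
        poll_once(list_genome_ids, download, cancer_type, seen)
        polls += 1
        if max_polls is None or polls < max_polls:
            time.sleep(interval_seconds)
    return seen


if __name__ == "__main__":
    # Toy stand-ins so the sketch runs without any network access.
    fake_catalog = {"lung": ["genome-001", "genome-002"]}
    watch(
        lambda cancer_type: fake_catalog.get(cancer_type, []),
        lambda genome_id: print("would download", genome_id),
        "lung",
        interval_seconds=0,
        max_polls=2,
    )
```

Keeping the query and download steps as injected functions means the watching logic can be tested against a fake catalog, as in the usage example, before it is pointed at real data that may take hours per genome to transfer.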

CGHub will become more user-friendly as developers in Santa Cruz and elsewhere build analytical tools on top of it. But, Haussler says, it's important to give scientists and engineers access to the raw data now. “We need to make it a priority to share our data,” he says.

The targeted nanoparticle BIND-014, which has performed well in very early clinical trials and in animal studies, is a polymer sphere loaded with docetaxel and coated on the surface with polyethylene glycol molecules. [Digizyme, Inc.]

For more news on cancer research, visit Cancer Discovery online at http://CDnews.aacrjournals.org.