The ENCODE (Encyclopedia of DNA Elements) project has delivered a first systematic look at non–protein-coding stretches of the human genome, analyzing regions of gene transcription, transcription factor association, chromatin structure, and histone modification across 147 cell lines.

For years we've known that the non-protein-coding stretches that make up about 98% of our DNA aren't just junk. Now the ENCODE (Encyclopedia of DNA Elements) project has taken a first systematic look at these little-understood portions of the genome. Bringing together 32 research groups and more than 400 researchers, the collaboration analyzed regions of transcription, transcription factor association, chromatin structure, and histone modification across 147 human cell lines.

When the ENCODE group published no fewer than 30 papers in September, one finding that drew much popular attention was that about 80% of the genome can be described as functional—participating in at least one biochemical RNA- or chromatin-associated event in at least one cell type.

However, “it's the depth of characterization of the genome regulation that's so important,” says Peter Campbell, MD, PhD, head of Cancer Genetics and Genomics at the Wellcome Sanger Institute in Cambridge, UK, who was not directly involved in the project. “DNA varies within 3 dimensions of space and also changes over time, with cell differentiation, cell division, and so on. ENCODE is really the first attempt to deeply characterize this dynamic environment in a systematic way.”

“One of the most exciting and surprising themes is that there is a much larger number of regulatory elements that appear to control gene expression from a distance,” says Bradley Bernstein, MD, PhD, associate professor of pathology at Massachusetts General Hospital and Harvard Medical School in Boston and senior associate member of the Broad Institute in Cambridge, MA. “ENCODE has identified a huge number of such regulatory elements, and through various functional genomic assays begun to identify their chromatin structures, the transcription factors that bind them, and their likely target genes.”

Cutting across a broad swath of normal and disease cell lines, the giant project has created an unprecedented resource for cancer researchers. “Cancer ultimately is a disease of the genome,” points out John Stamatoyannopoulos, PhD, associate professor of genome sciences and medicine at the University of Washington School of Medicine in Seattle and an ENCODE group leader. “Whether your research is interested in gene pathways, genetic changes, deletions, or duplications, all of these now have additional layers of information that can be brought to bear by ENCODE, literally by just dialing them up in genome browsers.”

One fruitful area of study will be gene deletions in cancer, where ENCODE shows that “just about every one of these non-gene regions has a deletion in it, with maps that connect it to genes,” he says. “That actually increases greatly the value of existing data that had been generated, such as cancer genomes.”

“This kind of data can be brought out not only in areas of gene regulation, but in a lot of alternative transcripts and other features of gene structures that are different in immortal cells versus normal cells,” he adds.

Among many other promising examples, ENCODE data show that in cancer and other malignancies, “the same core group of about 25 different transcription factors is being disrupted,” says Stamatoyannopoulos. “A number of these transcription factors were known to have roles in cancer, but other ones were not. Also, they are connected in ways that suggest that there are certain regulatory genetic backgrounds that may make individuals more susceptible not only to individual kinds of cancers but perhaps to malignancies generically. This is not by any stretch of imagination a conclusive mechanistic finding, but it's a hypothesis that now can be logically pursued.”

Proponents say that the enormous, fully public ENCODE data sets will often help researchers cut to the chase in determining what regulates a specific gene, simply by looking up data rather than spending a year or 2 performing experiments to figure it out.

“ENCODE data will be quite important for understanding transcriptional regulation and how perturbing pathways might relate to downstream effects,” says Wellcome Sanger's Campbell. “Even if you don't go near a wet lab, you could make some really interesting correlations between various bits and pieces of data to try to understand some quite fundamental aspects of cell biology, from replication through to high-level chromatin structure and transcriptional networks.”

“It's a massive, high-quality, high-throughput data set,” says Christina Leslie, PhD, associate member and lab head in the computational biology program at Memorial Sloan-Kettering Cancer Center in New York. “If you have an idea and you know how to use the data, you can do studies with this already existing massive resource, and that's exciting.”

“But, just like having the genome doesn't tell you where the regulatory elements are, having a lot of ENCODE data on transcription-factor binding and DNase hypersensitive sites and other assays doesn't tell you how it all works together,” she emphasizes. “Now that we have all these layers of information on top of the genome, and all the cell-type–specific information, it's really time to start asking biological questions and trying to answer them. There's so much uncharted territory and we're just beginning to make sense of it.” – Eric Bender