Multiomics experiments are increasingly commonplace in biomedical research and add layers of complexity to experimental design, data integration, and analysis. R and Bioconductor provide a generic framework for statistical analysis and visualization, as well as specialized data classes for a variety of high-throughput data types, but methods are lacking for integrative analysis of multiomics experiments. The MultiAssayExperiment software package, implemented in R and leveraging Bioconductor software and design principles, provides for the coordinated representation of, storage of, and operation on multiple diverse genomics data. We provide the unrestricted multiple ‘omics data for each cancer tissue in The Cancer Genome Atlas as ready-to-analyze MultiAssayExperiment objects and demonstrate in these and other datasets how the software simplifies data representation, statistical analysis, and visualization. The MultiAssayExperiment Bioconductor package reduces major obstacles to efficient, scalable, and reproducible statistical analysis of multiomics data and enhances data science applications of multiple omics datasets. Cancer Res; 77(21); e39–42. ©2017 AACR.

Multiassay experiments collect multiple, complementary data types for a set of specimens. Bioconductor (1) provides classes to ensure coherence between a single assay and patient data during data analysis, such as eSet and SummarizedExperiment (2). However, novel challenges arise in data representation, management, and analysis of multiassay experiments (3) that cannot be addressed by these or other single-assay data architectures. These include (i) coordination of different assays on, for example, genes, miRNAs, or genomic ranges; (ii) coordination of missing or replicated assays; (iii) sample identifiers that differ between assays; (iv) reshaping data to fit the variety of existing statistical and visualization packages; (v) doing the above in a concise and reproducible way that is amenable to new assay types and data classes.

The need for a unified data model for multiomics experiments has been recognized in other projects, such as MultiDataSet (4) and CNAmet (5). Our developments are motivated by an interest in bridging effective single-assay application programming interface (API) elements, including endomorphic feature and sample subset operations, to multiomic contexts of arbitrary complexity and volume (Supplementary Table S1). A main concern in our work is to allow data analysts and developers to simplify the management of both traditional in-memory assay stores for smaller datasets, and out-of-memory assay stores for very large data in such formats as HDF5 (6), tabix-indexed variant call format (VCF; ref. 7), or Google BigTable (8).

MultiAssayExperiment provides data structures and methods for representing, manipulating, and integrating multiassay genomic experiments. It integrates an open-ended set of R and Bioconductor single-assay data classes, while abstracting the complexity of back-end data objects and providing a sufficient set of data manipulation, extraction, and reshaping methods to interface with most R/Bioconductor data analysis and visualization tools. We demonstrate its use by representing unrestricted data from The Cancer Genome Atlas as a single MultiAssayExperiment object per cancer type and demonstrating greatly simplified multiassay analyses with these and other public multiomics datasets.

MultiAssayExperiment (https://bioconductor.org/packages/MultiAssayExperiment) introduces a Bioconductor object-oriented S4 class, defining a general data structure for representing multiomics experiments. This data class has three key components: (i) colData, a “primary” dataset containing patient or cell line–level characteristics, such as pathology and histology; (ii) ExperimentList, a list of results from complementary experiments; and (iii) sampleMap, a map that relates these elements (Fig. 1A). ExperimentList data elements may be of any data class that has standard methods for basic subsetting (single square bracket “[”) and dimension names and sizes [“dimnames()” and “dim()”]. Key methods available for manipulating the MultiAssayExperiment data class include:

  • (i) A constructor function and associated validity checks that simplify creating MultiAssayExperiment objects while allowing for flexibility in representing complex experiments.

  • (ii) Subsetting operations allowing data selection by genomic identifiers or ranges, clinical/pathologic variables, available complete data (subsets that include no missing values), and by specific assays.

  • (iii) Robust and intuitive extraction and replacement operations for components of the MultiAssayExperiment.
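
The three components and the role of the validity checks can be pictured with a small conceptual sketch. The Python below is not the package's R API: the class name `MiniMAE`, its fields, the method `patients_with`, and all toy identifiers are hypothetical, and the checks shown are illustrative assumptions rather than an enumeration of what the real constructor verifies. Each assay is reduced to its list of column (sample) names to keep the example short.

```python
from dataclasses import dataclass


@dataclass
class MiniMAE:
    """Toy stand-in for the three components (hypothetical; not the R class)."""
    col_data: dict     # "colData": patient ID -> clinical variables
    experiments: dict  # "ExperimentList": assay name -> list of column (sample) names
    sample_map: list   # "sampleMap": (assay, primary, colname) triples

    def __post_init__(self):
        self.validate()

    def validate(self):
        """Illustrative consistency checks between the three components."""
        mapped = set()
        for assay, primary, colname in self.sample_map:
            if assay not in self.experiments:
                raise ValueError(f"unknown assay in sample map: {assay!r}")
            if primary not in self.col_data:
                raise ValueError(f"unknown patient in sample map: {primary!r}")
            if colname not in self.experiments[assay]:
                raise ValueError(f"{colname!r} is not a column of {assay!r}")
            mapped.add((assay, colname))
        for assay, colnames in self.experiments.items():
            for colname in colnames:
                if (assay, colname) not in mapped:
                    raise ValueError(f"column {colname!r} of {assay!r} is unmapped")

    def patients_with(self, assay):
        """Patients that have at least one sample in the given assay."""
        return {p for a, p, _ in self.sample_map if a == assay}


# Toy data: P2 has a replicated RNA sample, P3 has no RNA, P2 has no methylation.
col_data = {"P1": {"stage": "I"}, "P2": {"stage": "III"}, "P3": {"stage": "II"}}
experiments = {
    "rna": ["P1-rna", "P2-rna-a", "P2-rna-b"],
    "methylation": ["P1-meth", "P3-meth"],
}
sample_map = [
    ("rna", "P1", "P1-rna"),
    ("rna", "P2", "P2-rna-a"),
    ("rna", "P2", "P2-rna-b"),
    ("methylation", "P1", "P1-meth"),
    ("methylation", "P3", "P3-meth"),
]

mae = MiniMAE(col_data, experiments, sample_map)
print(sorted(mae.patients_with("rna")))
print(sorted(mae.patients_with("rna") & mae.patients_with("methylation")))
```

Keeping the map as a separate table, rather than requiring assay column names to equal patient identifiers, is what lets missing assays, replicates, and assay-specific sample identifiers coexist in one object.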

The MultiAssayExperiment API is based wherever possible on SummarizedExperiment while supporting heterogeneous multiomics experiments. MultiAssayExperiment design, constructor, subsetting, extraction, and helper methods, as well as methods and code for the examples demonstrated here, are detailed in the Supplementary Methods.

The MultiAssayExperiment class and methods (Table 1) provide a flexible framework for integrating and analyzing complementary assays on an overlapping set of samples. It integrates any data class that supports basic subsetting and dimension names, so that many data classes are supported by default without additional accommodations. The MultiAssayExperiment class (Fig. 1A) ensures correct alignment of assays and patients, provides coordinated subsetting of samples and features while maintaining correct alignment, and enables simple integration of data types to formats amenable to analysis by existing tools. Basic usage is outlined in Supplementary Video S1 (https://www.youtube.com/watch?v=w6HWAHaDpyk&feature=youtu.be) and in the QuickStartMultiAssay vignette accompanying the package.
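
As a conceptual illustration of coordinated subsetting and reshaping, the following Python sketch keeps a subset of patients across every assay and then flattens the result into one long table of records. It is a minimal sketch under stated assumptions, not the package's R implementation: the function names `subset_by_patients` and `to_long`, the dictionary layout of each assay, and all toy values are hypothetical.

```python
def subset_by_patients(col_data, experiments, sample_map, keep):
    """Keep the given patients everywhere, dropping unmatched assay columns."""
    keep = set(keep)
    new_col_data = {p: v for p, v in col_data.items() if p in keep}
    new_map = [(a, p, c) for a, p, c in sample_map if p in keep]
    new_experiments = {}
    for assay, exp in experiments.items():
        wanted = {c for a, _, c in new_map if a == assay}
        idx = [j for j, c in enumerate(exp["cols"]) if c in wanted]
        new_experiments[assay] = {
            "rows": list(exp["rows"]),
            "cols": [exp["cols"][j] for j in idx],
            "values": [[row[j] for j in idx] for row in exp["values"]],
        }
    return new_col_data, new_experiments, new_map


def to_long(col_data, experiments, sample_map, clinical=()):
    """Flatten all assays into one list of records, one per measured value."""
    patient_of = {(a, c): p for a, p, c in sample_map}
    records = []
    for assay, exp in experiments.items():
        for i, rowname in enumerate(exp["rows"]):
            for j, colname in enumerate(exp["cols"]):
                primary = patient_of[(assay, colname)]
                record = {"assay": assay, "primary": primary, "rowname": rowname,
                          "colname": colname, "value": exp["values"][i][j]}
                for var in clinical:
                    record[var] = col_data[primary][var]
                records.append(record)
    return records


# Toy data (made up for illustration).
col_data = {"P1": {"stage": "I"}, "P2": {"stage": "III"}, "P3": {"stage": "II"}}
experiments = {
    "rna": {"rows": ["TP53", "EGFR"], "cols": ["r1", "r2"],
            "values": [[5.0, 7.5], [2.0, 3.5]]},
    "protein": {"rows": ["TP53"], "cols": ["x1", "x2", "x3"],
                "values": [[0.1, 0.4, 0.9]]},
}
sample_map = [("rna", "P1", "r1"), ("rna", "P2", "r2"),
              ("protein", "P1", "x1"), ("protein", "P2", "x2"),
              ("protein", "P3", "x3")]

sub = subset_by_patients(col_data, experiments, sample_map, keep=["P2", "P3"])
long_records = to_long(*sub, clinical=["stage"])
for record in long_records:
    print(record)
```

Because the subset is driven by the sample map, the clinical table, every assay, and the map itself stay aligned after the operation, and the long records can be handed to tools that expect one row per observation.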

We coordinated over 300 assays from over 11,000 patients of 33 different cancer types from The Cancer Genome Atlas as one MultiAssayExperiment per cancer type (Supplementary Table S2). These data objects link each assay to its patient of origin, allowing more straightforward selection of cases with complete data for assays of interest, and integration of data across assays and between assays and clinical data. We demonstrate applications of MultiAssayExperiment for visualizing the overlap in assays performed for adrenocortical carcinoma patients (Fig. 1B), confirming recently reported correlations between somatic mutation and copy number burden in colorectal cancer and breast cancer (Fig. 1C), identifying an SNP/methylation quantitative trait locus using remotely stored tabix-indexed VCF files for the 1000 Genomes Project (Fig. 1D), multiassay gene set analysis for ovarian cancer (Supplementary Figs. S1 and S2), and calculating correlations between copy number, gene expression, and protein expression in the NCI-60 cell lines (Supplementary Fig. S3). Code chunks and fully reproducible scripts illustrate the simple and powerful flexibility provided by MultiAssayExperiment.
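
Two of these analyses reduce to simple operations on the sample map: counting patients per combination of available assays (the input to an UpSet-style overlap plot) and correlating two per-patient summaries over the patients that have both assays. The Python sketch below illustrates the idea only; it is not the code used for the figures, the function names are hypothetical, and every value is made up, so the printed correlation says nothing about the direction or strength of the reported result.

```python
from collections import Counter
from math import sqrt


def assays_by_patient(sample_map):
    """Map each patient to the set of assays with at least one sample."""
    found = {}
    for assay, primary, _ in sample_map:
        found.setdefault(primary, set()).add(assay)
    return found


def overlap_counts(sample_map):
    """Count patients per exact combination of assays."""
    return Counter(frozenset(a) for a in assays_by_patient(sample_map).values())


def complete_cases(sample_map, assays):
    """Patients that have every one of the requested assays."""
    needed = set(assays)
    return sorted(p for p, a in assays_by_patient(sample_map).items() if needed <= a)


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


# Toy sample map and per-patient summaries (all values made up).
sample_map = [
    ("mutation", "P1", "m1"), ("copy_number", "P1", "c1"),
    ("mutation", "P2", "m2"), ("copy_number", "P2", "c2"),
    ("mutation", "P3", "m3"), ("copy_number", "P3", "c3"),
    ("mutation", "P4", "m4"),
]
mutation_burden = {"P1": 120, "P2": 45, "P3": 300, "P4": 80}
copy_number_burden = {"P1": 0.30, "P2": 0.55, "P3": 0.10}

for combo, n in overlap_counts(sample_map).items():
    print(sorted(combo), n)

# Restrict to patients with both assays before correlating.
cases = complete_cases(sample_map, ["mutation", "copy_number"])
r = pearson([mutation_burden[p] for p in cases],
            [copy_number_burden[p] for p in cases])
print(cases, round(r, 3))
```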

MultiAssayExperiment enables coordinated management and extraction of complex multiassay experiments and clinical data, with the same ease of user-level coding as for a single experiment. Its extensible design supports any assay data class meeting basic requirements, including out-of-memory representations for very large datasets. We have confirmed “out-of-the-box” compatibility with on-disk data representations, including the DelayedMatrix class via an HDF5 backend (6), and the VcfStack class based on the GenomicFiles infrastructure. Future work will focus on higher level visualization, integration, and analysis tools using MultiAssayExperiment as a building block. This project will receive long-term support as a necessary element of multiassay data representation and analysis in Bioconductor.
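
The reason an out-of-memory backend can be supported without special handling is that the container only needs dimension names, dimensions, and a subsetting operation from each assay. The Python sketch below illustrates that idea with a toy assay that defers all reads until values are requested. It is an assumption-laden illustration, not a description of how DelayedMatrix or VcfStack work internally: the class `LazyAssay`, its methods, and the `read_value` callback standing in for a disk read are all hypothetical.

```python
class LazyAssay:
    """Toy assay that reads values only on request (hypothetical backend stand-in)."""

    def __init__(self, rownames, colnames, read_value):
        self.rownames = list(rownames)
        self.colnames = list(colnames)
        self._read_value = read_value  # callable (rowname, colname) -> value

    @property
    def shape(self):
        """Analogue of dim(): (number of features, number of samples)."""
        return (len(self.rownames), len(self.colnames))

    def subset(self, rows=None, cols=None):
        """Analogue of "[": narrow the dimension names without reading values."""
        keep_rows = self.rownames if rows is None else [r for r in self.rownames if r in rows]
        keep_cols = self.colnames if cols is None else [c for c in self.colnames if c in cols]
        return LazyAssay(keep_rows, keep_cols, self._read_value)

    def realize(self):
        """Read the currently selected values into memory as a list of rows."""
        return [[self._read_value(r, c) for c in self.colnames] for r in self.rownames]


reads = []  # records every access to the pretend backing store


def read_value(rowname, colname):
    reads.append((rowname, colname))
    return f"{rowname}:{colname}"  # placeholder for a real disk or network read


assay = LazyAssay(["g1", "g2", "g3"], ["s1", "s2", "s3", "s4"], read_value)
small = assay.subset(rows=["g2"], cols=["s1", "s4"])
print(small.shape, len(reads))  # subsetting alone triggers no reads
values = small.realize()
print(values, len(reads))
```

In this toy design, only the cells that survive subsetting are ever read, which is the property that makes coordinated subsetting practical for very large assays.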

No potential conflicts of interest were disclosed.

Conception and design: M. Ramos, L. Schiffer, P. Chapman, D. Gomez-Cabrero, K.D. Hansen, M. Morgan, V. Carey, L. Waldron

Development of methodology: M. Ramos, L. Schiffer, T. Chan, P. Chapman, K.D. Hansen, M. Morgan, V. Carey, L. Waldron

Acquisition of data (provided animals, acquired and managed patients, provided facilities, etc.): M. Ramos, S.R. Davis, H. Kodali, V. Carey

Analysis and interpretation of data (e.g., statistical analysis, biostatistics, computational analysis): A. Re, R. Azhar, A. Basunia, P. Chapman, S.R. Davis, A.C. Culhane, B. Haibe-Kains, A.S. Mer, M. Riester, V. Carey, L. Waldron

Writing, review, and/or revision of the manuscript: M. Ramos, L. Schiffer, A. Re, P. Chapman, S.R. Davis, D. Gomez-Cabrero, A.C. Culhane, B. Haibe-Kains, H. Kodali, A.S. Mer, M. Riester, V. Carey, L. Waldron

Administrative, technical, or material support (i.e., reporting or organizing data, constructing databases): M. Ramos, A. Basunia, C. Rodriguez, T. Chan, H. Kodali, M.S. Louis

Study supervision: V. Carey, L. Waldron

The authors' work was funded by the NCI of the NIH (U24CA180996 to M. Morgan).

This work was also supported by the CUNY High Performance Computing Center, which is operated by the College of Staten Island and funded, in part, by grants from the City of New York, State of New York, CUNY Research Foundation, and National Science Foundation grants CNS-0958379, CNS-0855217, and ACI 1126113.

1. Huber W, Carey VJ, Gentleman R, Anders S, Carlson M, Carvalho BS, et al. Orchestrating high-throughput genomic analysis with Bioconductor. Nat Methods 2015;12:115–21.

2. Lawrence M, Huber W, Pagès H, Aboyoun P, Carlson M, Gentleman R, et al. Software for computing and annotating genomic ranges. PLoS Comput Biol 2013;9:e1003118.

3. Kannan L, Ramos M, Re A, El-Hachem N, Safikhani Z, Gendoo DMA, et al. Public data and open source tools for multi-assay genomic investigation of disease. Brief Bioinform 2016;17:603–15.

4. Hernandez-Ferrer C, Ruiz-Arenas C, Beltran-Gomila A, González JR. MultiDataSet: an R package for encapsulating multiple data sets with application to omic data integration. BMC Bioinformatics 2017;18:36.

5. Louhimo R, Hautaniemi S. CNAmet: an R package for integrating copy number, methylation and expression data. Bioinformatics 2011;27:887–8.

6. Folk M, Heber G, Koziol Q, Pourmal E, Robinson D. An overview of the HDF5 technology suite and its applications. Proceedings of the EDBT/ICDT 2011 Workshop on Array Databases; 2011 Mar 25; Uppsala, Sweden. New York, NY: ACM; 2011. p. 36–47.

7. Obenchain V, Lawrence M, Carey V, Gogarten S, Shannon P, Morgan M. VariantAnnotation: a Bioconductor package for exploration and annotation of genetic variants. Bioinformatics 2014;30:2076–8.

8. Chang F, Dean J, Ghemawat S, Hsieh WC, Wallach DA, Burrows M, et al. Bigtable: a distributed storage system for structured data. ACM Trans Comput Syst 2008;26:4.

9. Conway JR, Lex A, Gehlenborg N. UpSetR: An R package for the visualization of intersecting sets and their properties. Bioinformatics; 2017 Jun 22. [Epub ahead of print].

10. Davoli T, Uno H, Wooten EC, Elledge SJ. Tumor aneuploidy correlates with markers of immune evasion and with reduced response to immunotherapy. Science 2017;355:pii:eaaf8399.