Abstract
Effective data sharing is key to accelerating research to improve diagnostic precision, treatment efficacy, and long-term survival in pediatric cancer and other childhood catastrophic diseases. We present St. Jude Cloud (https://www.stjude.cloud), a cloud-based data-sharing ecosystem for accessing, analyzing, and visualizing genomic data from >10,000 pediatric patients with cancer and long-term survivors, and >800 pediatric sickle cell patients. Harmonized genomic data totaling 1.25 petabytes are freely available, including 12,104 whole genomes, 7,697 whole exomes, and 2,202 transcriptomes. The resource is expanding rapidly, with regular data uploads from St. Jude's prospective clinical genomics programs. Three interconnected apps within the ecosystem (Genomics Platform, Pediatric Cancer Knowledgebase, and Visualization Community) allow users to perform advanced data analysis in the cloud while simultaneously enriching the Pediatric Cancer Knowledgebase. We demonstrate the value of the ecosystem through use cases that classify 135 pediatric cancer subtypes by gene expression profiling and map mutational signatures across 35 pediatric cancer subtypes.
To advance research and treatment of pediatric cancer, we developed St. Jude Cloud, a data-sharing ecosystem for accessing >1.2 petabytes of raw genomic data from >10,000 pediatric patients and survivors, innovative analysis workflows, integrative multiomics visualizations, and a knowledgebase of published data contributed by the global pediatric cancer community.
This article is highlighted in the In This Issue feature, p. 995
Introduction
Cancer is the number-one cause of death by disease among children, with more than 15,000 new diagnoses within the United States alone each year (1). The advent of high-throughput genomic profiling technology such as massively parallel sequencing has enabled mapping of the entire 3 billion bases of genetic code for individual human genomes, including those of pediatric cancer. Major pediatric cancer genome research initiatives such as the St. Jude/Washington University Pediatric Cancer Genome Project (PCGP; ref. 2) and NCI's Therapeutically Applicable Research to Generate Effective Treatments (TARGET, https://ocg.cancer.gov/programs/target) have profiled thousands of pediatric cancer genomes. The resulting data, made accessible through public data repositories such as the database of Genotypes and Phenotypes (dbGaP) or European Genome-phenome Archive (EGA), have been used to generate new insights into the mechanisms of cancer initiation and progression (3–7), to discover novel targets including those for immunotherapy (8–10), and to build comprehensive genomic landscape maps for the development of precision therapy (11–16).
Data sharing, a prerequisite for genomic research for almost 30 years, is especially important for pediatric cancer, a rare disease with many subtypes driven by diverse and distinct genetic alterations. Based on the annual cancer diagnoses collected from NCI's Surveillance, Epidemiology and End Results (SEER) program for the period from 1990 to 2016 (https://seer.cancer.gov), more than 50% of the pediatric cancer subtypes are rare cancers with an annual incidence of <200 cases in the United States. Therefore, samples acquired by a single institute, a single research initiative, or, in some instances, even a single nation may lack sufficient power for genomic discovery and clinical correlative analysis. In addition, the discovery of structural variations and noncoding variants, which are important classes of driver variants in pediatric cancer (15, 17–19), requires the use of whole-genome sequencing (WGS) to interrogate noncoding regions, which constitute more than 98% of the human genome. This imposes another challenge in sharing pediatric cancer genome data, as the size of WGS data is approximately 10 times larger than that of whole-exome sequencing (WES) data, which profile only the coding regions.
To share pediatric cancer genome data using the established public repository model requires major investments in time, professional support, and computing resources from users and data providers alike. Under this model (Fig. 1A, left), genomic data become available for download after submission to a public repository by a computational professional. To use the data, a researcher needs to (i) prepare and submit a request for data access and wait for approval; (ii) download data from the public repository to a local computing infrastructure; (iii) reprocess for data harmonization and annotation using the current reference knowledgebase; (iv) perform new analysis or integrative analysis by incorporating custom data; and often (v) submit the new data or the results back to the public repository. With continued expansion of the public data repository and user data, integrating public and local data is an iterative process requiring continued upscaling of local computational resources. Cloud-based technology can establish a shared computing infrastructure for data access and computing for all users, which can improve the efficiency of data analysis by removing the barriers on computational infrastructure required for data transfer and hosting so that computing resources can be dedicated to innovative data analysis and novel methods development (Fig. 1A, right).
To accelerate research on pediatric cancer and other childhood catastrophic diseases, we developed St. Jude Cloud (https://www.stjude.cloud), a data-sharing ecosystem featuring both open and controlled access to genomic data of >10,000 pediatric cancers generated from retrospective research projects as well as prospective clinical genomics programs (Fig. 1B) at St. Jude Children's Research Hospital (St. Jude). St. Jude Cloud was built by St. Jude in partnership with DNAnexus and Microsoft to leverage our combined expertise in pediatric cancer genomic research (2, 5, 20, 21), secure genomic data hosting on the cloud, and Azure cloud computing. St. Jude Cloud comprises three interconnected applications: (i) a Genomics Platform that enables controlled access to harmonized raw genomic data as well as end-to-end analysis workflows powered by the innovative algorithms that we developed, tested, and validated on data generated from pediatric patient samples; (ii) an open-access knowledgebase portal, PeCan (Pediatric Cancer), that enables exploration of curated somatic variants of >5,000 pediatric cancer genomes from published literature contributed by St. Jude and other institutions; and (iii) a Visualization Community that enables the scientific community to explore published pediatric cancer landscape maps and integrative views of genomic data, epigenetic data, and clinical information on pediatric cancers (Fig. 1B, bottom). We demonstrate the power of the St. Jude Cloud ecosystem in unveiling important genomic features of pediatric cancer through two use cases: (i) classification of 135 subtypes of pediatric cancer from 1,565 RNA-sequencing (RNA-seq) samples with a workflow that also supports user data integration; and (ii) characterization of mutational burden and signatures using WGS data generated from 35 subtypes of pediatric cancer with a workflow that can also perform custom data analysis and comparison of mutational signatures across different cancer cohorts.
Results
Pediatric Cancer Data Resource on St. Jude Cloud
St. Jude Cloud hosts 12,104 WGS samples, 7,697 WES samples, and 2,202 RNA-seq samples generated from pediatric patients with cancer or long-term survivors of pediatric cancer, making it the largest publicly available genomic data resource for pediatric cancer (Fig. 2A). Current datasets were acquired from research initiatives such as the St. Jude/Washington University PCGP (2), St. Jude Lifetime Cohort Study (SJLIFE; ref. 22), and Childhood Cancer Survivor Study (CCSS; ref. 23), as well as from prospective clinical programs such as the Genomes for Kids (G4K) clinical research study of pediatric patients with cancer (https://clinicaltrials.gov/ct2/show/NCT02530658) and the Real-time Clinical Genomics (RTCG) initiative at St. Jude. Both G4K and RTCG use a three-platform clinical WGS, WES, and transcriptome sequencing of every eligible patient at St. Jude (21). Raw sequence data from all studies were mapped to the latest (GRCh38) human genome assembly using the same analytic process to ensure data harmonization (Methods). In total, 1.25 petabytes (PB) of genomics data are readily available for access in St. Jude Cloud with more than 90% (1.15 PB) of this being WGS.
When considering only WGS, the collective dataset comprises 3,551 paired tumor–normal pediatric cancer samples and 7,746 germline-only samples of long-term survivors enrolled in SJLIFE or CCSS. Major diagnostic categories of the cancer and survivorship genomes, which include pediatric leukemia, lymphoma, central nervous system (CNS) tumors, and >12 types of non-CNS solid tumors (Fig. 2B), are similar except for Hodgkin lymphoma and non-Hodgkin lymphoma. The lymphoma samples constitute 18% of the cases in the survivorship cohort but are underrepresented in the cancer genomes as lymphoma was not selected for pediatric genomic landscape mapping initiatives (e.g., PCGP).
Deposition of WGS, WES, and RNA-seq data generated from RTCG has become an important avenue for expanding the cancer genomic data content on St. Jude Cloud. We have developed a robust pipeline for the monthly data deposition which involves verification of patient consent protocols (and active monitoring for revocation of previous consent), sample deidentification, remapping to the latest genome build, and quality checking, all in accordance with legal and ethical guidelines. Basic clinical annotation is retrieved by querying databases of electronic medical records (EMR), and data are harmonized prior to uploading to St. Jude Cloud for public release (Supplementary Fig. S1). From March 2019 through July 2020, 1,996 WGS, 2,684 WES, and 1,220 RNA-seq datasets were uploaded to St. Jude Cloud (Fig. 2C, left). Importantly, these prospective samples include 51 pediatric cancer samples comprising 27 rare subtypes (Fig. 2C, right) not represented in the retrospective cancer samples on St. Jude Cloud. We anticipate continued expansion of genomic data at this pace on St. Jude Cloud in the future.
End-to-End Genomic Analysis Workflows
To enable researchers with little to no formal computational training to perform sophisticated genomic analysis, we have deployed end-to-end analysis workflows designed with a point-and-click interface for uploading input files and graphically visualizing the results for scientific interpretation (https://platform.stjude.cloud/workflows). Advanced computational users can access a command line interface for batched job submission and runtime parameter optimization. Currently, eight production grade workflows, tested and used by researchers from St. Jude as well as external institutions, have been deployed on St. Jude Cloud. Comprehensive documentation has been developed for these workflows and is updated based on user feedback.
Four of these workflows have integrated cancer genomic analysis algorithms developed using pediatric cancer datasets such as PCGP, and their performance has been iteratively improved by the growing knowledgebase of pediatric cancer. They include: (i) Rapid RNA-seq, which predicts gene fusions using the CICERO algorithm (24) that has discovered targetable fusions in high-risk pediatric leukemia (8), high-grade glioma (HGG; ref. 5), and melanoma (25); (ii) PeCanPIE (26), which classifies germline variant pathogenicity using the Medal Ceremony algorithm that was developed to assess germline susceptibility of pediatric cancer (5) and genetic risk for subsequent neoplasms among survivors of childhood cancer (27); (iii) cis-X, which detects noncoding driver variants and has discovered noncoding drivers in pediatric T-lineage leukemia (28); and (iv) SequencErr, which measures and suppresses next-generation sequencing errors (29).
In addition, we optimized several workflows commonly used by basic research laboratories. These include (i) the chromatin immunoprecipitation sequencing (ChIP-seq) peak calling pipeline, which detects narrow peaks using MACS2 (30) or broad peaks using SICER (31); (ii) the WARDEN pipeline, which performs RNA-seq differential expression using the R packages VOOM for normalization and LIMMA for analysis (32); (iii) the mutational signature pipeline, which finds Catalogue of Somatic Mutations in Cancer (COSMIC) mutational signatures in one or more user-provided variant call format (VCF) files of somatic single-nucleotide variants (SNV) and compares the summary with a user-selected subtype in pediatric cancer (33); and (iv) the RNA-seq expression classification pipeline, which projects user-supplied RNA-seq data onto a t-distributed stochastic neighbor embedding (t-SNE) plot (34) generated from >1,500 RNA-seq samples.
PeCan Knowledgebase
To integrate pediatric cancer genomic data generated by the global research community, we developed PeCan, which assembles somatic variants present at diagnosis or relapse, germline pathogenic variants, and gene expression from the published literature. All data, which are reannotated and curated to ensure quality and consistency, can be explored dynamically using our visualization tool ProteinPaint (35). Currently, PeCan presents data published by PCGP, TARGET, The German Cancer Research Center, Shanghai Children's Medical Center, and The University of Texas Southwestern Medical Center (Supplementary Table S1). Variant distribution and expression pattern for a gene of interest can be queried and visualized for 5,161 cancer samples. Curated pathogenic or likely pathogenic variants can also be queried directly and visualized on PeCanPIE's variant page (26), which presents variant allele frequencies from public databases, results from in silico prediction and pathogenicity prediction algorithms, related literature, and pathogenicity classification determined by the St. Jude Clinical Genomics tumor board.
Data Visualization
Data visualization is critical for integrating multidimensional cancer genomics data so that researchers can gain insight into the molecular mechanisms that initiate and cause the progression of cancer. We developed generalized tools such as ProteinPaint (35) and GenomePaint (https://genomepaint.stjude.cloud) that enable dynamic visualization and custom data upload of genomic variants, gene expression, and sample information using either protein or genome as the primary data axis; the user-curated genomic landscape maps for cancer subtypes or pan-cancer studies can also be exported into image files to create figures suitable for scientific publications. In addition, we developed specialized visualizations to present (i) genome view of chromatin state and gene expression using ChIP-seq and RNA-seq data generated from mouse/human retina (36) or patient-derived xenografts of pediatric solid tumors (37); (ii) subgroup clustering using methylation data in medulloblastoma (14) or gene expression data in B-cell acute lymphoblastic leukemia (B-ALL; ref. 38); and (iii) genotype/phenotype correlation for pediatric sickle cell patients and long-term survivors of pediatric cancer (27, 39). These expert-curated genomic and epigenomic landscape maps are valuable not only for presenting discoveries in published literature but also as an important resource for dynamic data exploration by the broad research community.
St. Jude Cloud Ecosystem
Raw and curated genomic data, analysis, and visualization tools are structured into the following three independent and interconnected applications on St. Jude Cloud to provide a secure, web-based ecosystem for integrative analysis of pediatric cancer genome data: (i) Genomics Platform for accessing data and analysis workflows, (ii) PeCan for exploring a curated knowledgebase of pediatric cancer, and (iii) Visualization Community for exploring published pediatric cancer genomic or epigenomic landscape maps and for visualizing user data using ProteinPaint or GenomePaint.
A user may work with the St. Jude Cloud ecosystem via open, registered, or controlled access. Although PeCan and Visualization Community are accessible in an open and anonymous manner, users must set up a St. Jude Cloud account (i.e., register) to run the analysis workflows or access RNA-seq expression data on the Genomics Platform. In accordance with the community practice for human genomic data protection, access to raw genomic data (e.g., WGS, WES, or RNA-seq) generated from patient samples follows a controlled access model, i.e., it requires the submission of a signed data access agreement that is subsequently reviewed by a data access committee for approval. Since its debut in 2018, the St. Jude Cloud Genomics Platform has accumulated a total of 1,951 registered users. As of July 9, 2020, 210 requests for access to raw genomic data have been granted to researchers at 78 institutes across 18 countries (Supplementary Fig. S2), and the median turnaround time for data access approval is 7 days. Overall, 18.8% (n = 49) of requests for data were rejected. Of these rejections, 67.3% (n = 33) were due to requests for data that did not fit the users' stated research goals, e.g., a request for germline-only or sickle cell datasets from a tumor study. The remaining 32.7% (n = 16) were from for-profit entities for which we are still investigating an appropriate approach for data sharing. There were no instances where a request was rejected for a scientific reason. Today, there are approximately 2,500 unique users per week on average accessing the St. Jude Cloud ecosystem.
Although Genomics Platform, PeCan, and Visualization Community are each a valuable resource for pediatric cancer research in their own right, working across all three within the St. Jude Cloud ecosystem provides a unique user experience that can simultaneously enhance data analysis and enrich the knowledgebase for pediatric cancer. As illustrated in Fig. 3, access to raw genomic data is equivalent to building a virtual research cohort on the St. Jude Cloud ecosystem, which can be accomplished by querying sample features using the data browser of Genomics Platform (a classic approach) or by selecting samples with specific molecular features (e.g., mutations or gene expression level) using PeCan. Upon approval, requested data are made available immediately within a private cloud-based project folder. User data can also be uploaded quickly and securely to the project folder through our data transfer tools, and projects can be shared with collaborators using the underlying DNAnexus Platform. The user may then analyze the data using the workflows on the Genomics Platform, tools provided by the DNAnexus Platform, or their own containerized workflows. Alternatively, data can be downloaded to a user's local computing environment for analysis. Results produced by both local infrastructure and the Genomics Platform can be explored alongside data presented in the curated PeCan knowledgebase using visualization tools such as ProteinPaint or GenomePaint within the Visualization Community. The resulting data, post publication, can be integrated into PeCan to enrich the PeCan knowledgebase, whereas the landscape maps as well as graphs of sample subgroups prepared by researchers using ProteinPaint or other visualization tools can be shared on the Visualization Community for dynamic exploration. We present two use cases below to demonstrate this process.
Use Case 1: Classify Pediatric Cancers by RNA-seq Expression Profiling
Defining cancer subtypes by gene expression has provided important insight into the classification of pediatric (40–42) and adult cancers (43). To accomplish this on St. Jude Cloud, we analyzed gene expression profiles of pediatric brain (n = 447), solid (n = 302), and blood (n = 816) tumors using RNA-seq data from fresh-frozen samples which were generated by either retrospective research projects [e.g., PCGP and a St. Jude pilot clinical study (Clinical Pilot)] or prospective clinical genomics programs (e.g., G4K and RTCG). Gene expression values (Methods) were imported from the Genomics Platform and separated into the three categories of brain, solid, and blood tumors for subtype classification using t-SNE analysis (Fig. 4). A t-SNE analysis of the full dataset was also performed.
On t-SNE plots generated for blood, brain, and solid tumors, major cancer types form distinct clusters as expected (Fig. 4A–C). In brain tumors, subtypes known to have different developmental origins (44), such as the WNT, SHH, and group 3/4 subtypes of medulloblastoma, show clear separation (Fig. 4C). Interestingly, adamantinomatous craniopharyngioma (ACPG), a rare brain cancer derived from pituitary gland embryonic tissue, forms two distinct groups (denoted ACPG groups 1 and 2 on Fig. 4C) which cannot entirely be attributed to differences in tumor purity based on our examination of mutant allele fraction of CTNNB1, differential expression signature, and tumor section slides (Supplementary Table S2A–S2C; Supplementary Fig. S3A–S3E). Solid tumors show tight clusters reflecting the disease tissue type (Fig. 4B). Interestingly, a small number of metastatic osteosarcomas are separated from primary tumors (Fig. 4B, indicated by a circle); contamination of the tumor biopsy with lung tissue at the site of metastasis likely contributed to this expression difference (Supplementary Fig. S4; Supplementary Table S2D). Notably, Wilms tumors also cluster into two distinct groups, one of which consists entirely of samples from bilateral cases (Fig. 4B). This divergence in gene transcription may reflect different genetic causes of bilateral versus unilateral Wilms tumor, likely owing to germline mutations present in the bilateral cases (45). Blood cancers can be differentiated by their lineage with substructures recapitulating the subgroups defined by cytogenetic features or gene fusions/somatic mutations reported previously (ref. 38; Fig. 4A). Notably, examination of KMT2A (also known as MLL) rearranged leukemias (a subset of which is known to be mixed phenotype acute leukemia) reveals they cluster by their cellular lineage (i.e., B cell, T cell, or myeloid; Supplementary Fig. S5A and S5B), indicating their primary lineage has a greater influence than the KMT2A fusion on global gene expression profile.
These t-SNE plots can be explored interactively on the Visualization Community of St. Jude Cloud with options for highlighting one or multiple cancer subtypes or samples of interest defined by a user. Mouseover for an individual sample shows additional information such as age of onset, clinical diagnosis, and molecular driver of the cancer subtype (Supplementary Fig. S5B). They can also serve as reference maps for classifying user-provided patient samples—an application supported by our “RNA-Seq Expression Classification pipeline” on the Genomics Platform (Supplementary Fig. S5A). To demonstrate this utility, we used RNA-seq data of PAWNXH, an unclassified acute myeloid leukemia (AML) sample from Children's Oncology Group. By uploading the aligned RNA-seq BAM file of PAWNXH to St. Jude Cloud (Fig. 4D), a user can run “Rapid RNA-Seq” to perform fusion detection, which identifies a novel gene fusion, ZBTB7A–NUTM1 (Fig. 4E). Notably, NUTM1 fusion oncoprotein is on the FDA's Relevant Pediatric Molecular Target List (https://www.fda.gov/about-fda/oncology-center-excellence/pediatric-oncology) and has also been reported previously in pediatric ALL (38). Analysis by “RNA-Seq Expression Classification” shows that this sample clusters with AML instead of the two ALLs that harbor NUTM1 fusions (Fig. 4F; Supplementary Fig. S5B) in our cohort. This pattern is reminiscent of the KMT2A fusion–positive AMLs and ALLs which cluster primarily by their cellular lineage.
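The idea of placing a user sample among reference samples can be illustrated with a much simpler stand-in than t-SNE. The sketch below assigns a query sample the majority label of its k most correlated reference samples. It is not the method used by the RNA-Seq Expression Classification pipeline; the function names, the four-gene vectors, and all expression values are hypothetical and chosen only so that the example runs.

```python
import math
from collections import Counter


def pearson(x, y):
    """Pearson correlation between two equal-length expression vectors."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("vectors must have the same length (>= 2)")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant vector")
    return cov / math.sqrt(var_x * var_y)


def classify_by_neighbors(query, reference, k=3):
    """Return the majority label among the k reference samples that are
    most correlated with the query. `reference` is a list of
    (label, expression_vector) pairs."""
    ranked = sorted(reference, key=lambda item: pearson(query, item[1]), reverse=True)
    votes = Counter(label for label, _ in ranked[:k])
    return votes.most_common(1)[0][0]


# Hypothetical log-scale expression of four made-up genes (toy values only).
REFERENCE = [
    ("AML", [8.0, 7.5, 1.0, 0.5]),
    ("AML", [7.0, 8.0, 0.5, 1.0]),
    ("AML", [7.5, 7.0, 1.5, 1.0]),
    ("B-ALL", [1.0, 0.5, 8.0, 7.0]),
    ("B-ALL", [0.5, 1.0, 7.5, 8.0]),
    ("B-ALL", [1.5, 1.0, 7.0, 7.5]),
]
QUERY = [7.8, 7.2, 2.0, 0.8]

predicted = classify_by_neighbors(QUERY, REFERENCE, k=3)
print(predicted)
```

With these toy values the query is assigned to the AML group because its three most correlated reference samples all carry that label. A real reference map is built from far more genes and samples, and an embedding such as t-SNE additionally gives a two-dimensional picture of the neighborhood.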
Use Case 2: Mutation Rates and Signatures across Pediatric Blood, Solid, and Brain Cancers
Investigation of mutational burden and signatures can unveil the mutational processes shaping the genomic landscape of pediatric cancer (15, 16, 33) at diagnosis or relapse. To examine mutational burden, we analyzed validated or curated coding and noncoding somatic variants from paired tumor and normal WGS data available for 958 samples of pediatric patients with cancer comprising more than 35 major subtypes of blood, solid, or brain cancers profiled by PCGP, Clinical Pilot, or G4K studies (Fig. 5A, left), 10 of which were not analyzed by previous pan-cancer studies (refs. 15, 16; Methods). Among blood cancers, the median genome-wide somatic mutation rates were 0.21, 0.28, and 0.33 per million bases (Mb) in AML (including AMKL), B-ALL, and T-cell acute lymphoblastic leukemia (T-ALL), respectively. The mutation rate of solid tumors was highly variable by subtype: retinoblastoma had the lowest mutation rate with 0.06 per Mb, whereas osteosarcoma and melanoma had the highest rates with 1.0 and 6.86 per Mb, respectively. Among the brain tumors, craniopharyngioma exhibited the lowest mutation rate with 0.02 per Mb in contrast to HGGs with 0.45 per Mb. Two hypermutators with extremely high mutation burdens were observed among the HGGs owing to mutations in MSH2 or POLE.
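The per-megabase rates quoted above follow from dividing a sample's somatic mutation count by the size of the interrogated genome in megabases and then taking the median within each subtype. The sketch below shows that arithmetic under stated assumptions: the 3,000 Mb denominator is a round figure based on the roughly 3 billion bases of the human genome and is not necessarily the denominator used in this study, and the subtype names and mutation counts are hypothetical.

```python
from statistics import median

# Assumption: ~3 billion bases; the study's actual denominator may differ.
GENOME_SIZE_MB = 3000.0


def mutation_rate_per_mb(n_somatic_mutations, genome_size_mb=GENOME_SIZE_MB):
    """Somatic mutations per million bases for one sample."""
    if genome_size_mb <= 0:
        raise ValueError("genome size must be positive")
    return n_somatic_mutations / genome_size_mb


def median_rate_by_subtype(counts_by_subtype, genome_size_mb=GENOME_SIZE_MB):
    """Median per-Mb mutation rate for each subtype.

    `counts_by_subtype` maps a subtype name to a list of per-sample
    genome-wide somatic mutation counts."""
    return {
        subtype: median([mutation_rate_per_mb(c, genome_size_mb) for c in counts])
        for subtype, counts in counts_by_subtype.items()
    }


# Hypothetical per-sample mutation counts (toy values only).
toy_counts = {
    "subtype_A": [600, 630, 900],
    "subtype_B": [150, 180, 210, 3000],  # last sample mimics a hypermutator
}

rates = median_rate_by_subtype(toy_counts)
for subtype, rate in rates.items():
    print(f"{subtype}: {rate:.3f} mutations per Mb")
```

Using the median rather than the mean keeps a single hypermutated sample, such as the last entry of `subtype_B`, from dominating the subtype summary.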
We detected 22 of the 60 published COSMIC mutational signatures (33) in addition to 2 recently identified therapy-induced signatures (20) in relapsed B-ALL samples (Fig. 5A, right). As expected, age-related signatures (i.e., COSMIC signatures 1 and 5) were present in nearly all pediatric cancers. APOBEC signatures (i.e., COSMIC signatures 2 and 13) were identified in ETV6–RUNX1 B-ALL, osteosarcoma, adrenocortical carcinoma, and thyroid cancer, as previously reported (7, 46–48). Both APOBEC signatures are present in an acute megakaryoblastic leukemia (Supplementary Fig. S6A), which was not reported in previous studies of AML (49, 50). As expected, UV light–induced signature 7 was detected in melanoma and a subset of B-ALLs (Supplementary Fig. S6B) and, interestingly, in a single case of anaplastic large cell lymphoma (a rare subtype of non-Hodgkin lymphoma). This sample was also positive for signature 15, which is associated with defective DNA mismatch repair. Further, the reactive oxygen species (ROS)–associated signature 18 was found in multiple cancer types including neuroblastoma, rhabdomyosarcoma, T-ALL, Ewing sarcoma, and several subtypes of B-ALL.
Therapy-related signatures were detected in several samples collected after treatment. The first was signature 22, found in a single hepatoblastoma tumor of an Asian patient that had a mutation rate >10 times higher than the other hepatoblastoma tumors (Supplementary Fig. S7A). Interestingly, signature 22 is associated with exposure to aristolochic acid, found in a Chinese medicinal herb (Aristolochia fangchi) that is known to be carcinogenic (51). Notably, the relapsed tumor from this patient had increased mutational burden accompanied by acquisition of COSMIC signature 35, which is known to be associated with exposure to cisplatin (Supplementary Fig. S7B), a chemotherapy drug used as part of the standard of care for hepatoblastoma (52). Signatures 35 and 31, also associated with exposure to platinum complexes (53) cisplatin and carboplatin, were found in osteosarcoma and ependymomas as previously reported (54), as well as in retinoblastoma, all of which use cisplatin or carboplatin for treatment. Signature 35 was also detected in one Ewing sarcoma from a patient who had a prior malignancy of ganglioneuroblastoma which was treated with carboplatin. It is notable that two signatures (currently designated as COSMIC signatures 86 and 87) proposed to be induced by ALL treatment were also detected exclusively in relapsed B-ALL samples (Supplementary Fig. S8A and S8B).
The mutational signatures assembled from our cohort can also be compared with mutational signatures in a cohort analyzed by the user, a function supported by the St. Jude Cloud Mutational Signatures tool (Supplementary Fig. S9). For example, we downloaded nine adult AML somatic mutation datasets profiled by WGS from the International Cancer Genome Consortium, performed mutational signature analysis on Genomics Platform, and selected pediatric AML for comparison. The results (Fig. 5B) showed that the ubiquitous and age-related signatures 1 and 5, as well as signature 18 (ROS-associated) were present in both cohorts (33). The adult AML also had signature 31 (cisplatin/carboplatin-induced), contributed by a single sample that also has the highest mutation burden. More than 80% of mutations in this outlier sample are contributed by signature 31, indicating it is likely a therapy-related AML. The two additional signatures present in the pediatric cohort, signatures 36 and 40, are similar to the ROS and age-related signature, respectively.
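Similarity between two mutational signatures is commonly quantified as the cosine similarity of their mutation-type probability vectors. The sketch below is an illustration of that measure only and is not taken from the St. Jude Cloud Mutational Signatures tool. For brevity it uses six-channel toy profiles (one value per base substitution class) rather than the full trinucleotide-context profiles; the reference names and all numbers are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero profiles."""
    if len(a) != len(b):
        raise ValueError("profiles must have the same number of channels")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for an all-zero profile")
    return dot / (norm_a * norm_b)


def most_similar(query, reference_profiles):
    """Name of the reference profile with the highest cosine similarity."""
    return max(
        reference_profiles,
        key=lambda name: cosine_similarity(query, reference_profiles[name]),
    )


# Channel order: C>A, C>G, C>T, T>A, T>C, T>G (toy, hypothetical values).
REFERENCE_PROFILES = {
    "ref_X": [0.70, 0.05, 0.10, 0.05, 0.05, 0.05],
    "ref_Y": [0.10, 0.05, 0.50, 0.05, 0.25, 0.05],
}
QUERY_PROFILE = [0.65, 0.05, 0.15, 0.05, 0.05, 0.05]

best_match = most_similar(QUERY_PROFILE, REFERENCE_PROFILES)
print(best_match, round(cosine_similarity(QUERY_PROFILE, REFERENCE_PROFILES[best_match]), 3))
```

A value close to 1 means the two profiles distribute mutations across channels in nearly the same proportions, which is the sense in which one signature can be described as similar to another.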
Discussion
Pediatric cancer comprises many rare subtypes. Effective sharing of genomic data and a community effort to elucidate etiology are therefore critical to developing effective therapeutic strategies. St. Jude Cloud is designed to provide a data analysis ecosystem that supports multidisciplinary research on pediatric cancer by empowering laboratory scientists, clinical researchers, clinicians, and bioinformatics scientists. The PeCan portal enables navigation of a PeCan knowledgebase assembled from published literature, whereas the Visualization Community enables dynamic exploration of harmonized and curated data in the forms of landscape maps, cancer subgroups, and integrated views of the genome, transcriptome, and epigenome from the same cancer sample. Both apps are designed to be openly accessible to researchers without any formal computational training. Common use cases, such as assessing recurrence of a rare genomic variant or expression status of a gene of interest, are directly enabled by these two St. Jude Cloud apps, obviating the need to download data and perform a custom analysis. If a subset of samples identified through the initial data exploration warrants in-depth investigation, a comprehensive reanalysis can be performed either on the Genomics Platform app or on a user's local computing infrastructure. The complementarity among the three apps within the St. Jude Cloud ecosystem enables the optimal use of computational resources so that researchers can focus on innovative analyses leading to new insights.
User feedback has been critical to informing the trajectory of St. Jude Cloud development. To improve data querying, we developed a data browser within the Genomics Platform, which allows a user to select datasets by study, disease subtype, disease stage (e.g., diagnosis, relapse, or metastasis), sequencing type, and data type. Most recently, RNA-seq feature count data have been made available on the Genomics Platform, as these are commonly used for many downstream analyses. We envision an evolving expansion of our current data offerings to include epigenetic and three-dimensional genome data, new facets of our PeCan knowledgebase, nongenomics data, and a variety of additional visualization tools. A new app has been designed for better integration of orthotopic patient-derived xenograft models that are available on the Childhood Solid Tumor Network (ref. 37; raw genomic data accessible on the Genomics Platform) and Pediatric Brain Tumor Portal (55). Moving forward, the rich data resources on St. Jude Cloud may attract external methods developers to use pediatric cancer data—genomic or other data types—as the primary source for development, further expanding the analytic capability of St. Jude Cloud ecosystem and broadening the user base to researchers specializing in other diseases.
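The faceted selection offered by the data browser (study, disease subtype, disease stage, sequencing type, and data type) amounts to filtering sample records on several fields at once. The sketch below illustrates that idea in isolation; it is not the Genomics Platform API, and the record class, function name, sample identifiers, and field values are all hypothetical.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class SampleRecord:
    """Hypothetical record carrying the five facets named in the text."""
    sample_id: str
    study: str
    disease_subtype: str
    disease_stage: str
    sequencing_type: str
    data_type: str


FACETS = {f.name for f in fields(SampleRecord)} - {"sample_id"}


def select_samples(records, **criteria):
    """Keep records matching every criterion. Each criterion value is
    either a single string or a collection of accepted strings."""
    unknown = set(criteria) - FACETS
    if unknown:
        raise KeyError(f"unknown facet(s): {sorted(unknown)}")
    accepted = {
        facet: {value} if isinstance(value, str) else set(value)
        for facet, value in criteria.items()
    }
    return [
        record
        for record in records
        if all(getattr(record, facet) in values for facet, values in accepted.items())
    ]


# Toy records; identifiers and values are made up for illustration.
RECORDS = [
    SampleRecord("S1", "PCGP", "B-ALL", "diagnosis", "WGS", "BAM"),
    SampleRecord("S2", "PCGP", "B-ALL", "relapse", "RNA-seq", "feature counts"),
    SampleRecord("S3", "G4K", "osteosarcoma", "metastasis", "WGS", "BAM"),
    SampleRecord("S4", "G4K", "osteosarcoma", "diagnosis", "WES", "BAM"),
]

cohort = select_samples(
    RECORDS, disease_subtype="B-ALL", disease_stage={"diagnosis", "relapse"}
)
print([record.sample_id for record in cohort])
```

Accepting either a single value or a set per facet lets one call express both narrow queries (one subtype) and broader ones (several disease stages) without separate code paths.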
A key consideration of our data-sharing strategy is to provide access to the pediatric cancer research community as soon as possible, rather than holding data back for publication (which may take months or years). This is accomplished through the development of the RTCG deposition pipeline, a complex workflow involving verification of patient consent, deidentification, data harmonization, and quality checking. To our knowledge, this is the first instance of an institutional deposition of prospective clinical genomics data—WGS, WES, and RNA-seq—to the scientific research community. The RTCG workflow may serve as a model for other institutions envisioning similar initiatives on sharing data generated from clinical genomics programs with the external community. Currently, the two prospective sequencing projects, RTCG and G4K, have contributed >50% of the raw cancer WGS data on St. Jude Cloud. As of July 9, 2020, these datasets have been made accessible to 78 investigators from 53 institutions who applied for data access prior to publications of RTCG and G4K. RTCG data have expanded substantially from March to July 2020, at the height of the COVID-19 pandemic in the United States (Fig. 2C, left). We anticipate adding approximately 500 additional cases profiled by prospective clinical genomics per year at regular intervals. Data generated from RTCG and G4K are particularly enriched for rare pediatric cancer subtypes (Fig. 2C, right) enabling future research on new therapies that may be incorporated into patient care. New research has already benefited from comparing user data with data hosted on St. Jude Cloud. For example, Keenan and colleagues (56) gained new insight into a rare C11orf95 fusion in ependymoma by uploading and analyzing their RNA-seq samples using the RNA Classification workflow on St. Jude Cloud.
Although St. Jude Cloud currently hosts genomic data generated by St. Jude studies, we envision it will serve as a collaborative research platform for the broader pediatric cancer community in the future. User-uploaded data can be analyzed and explored alongside the wealth of curated and raw pediatric genomic data on St. Jude Cloud, and deposition of user data into St. Jude Cloud requires minimal effort. In this regard, St. Jude Cloud represents a community resource, framework, and significant contribution to the pediatric genomic sequencing data-sharing landscape. We also recognize that contemporary data-sharing models are shifting from centralized to distributed resources that serve specific communities. Such distributed repositories are currently not well connected, and considerable effort is required to move data or tools from one platform to another. The ultimate solution is likely to consist of a federated system for data aggregation, which has also been identified as a priority by participants in the first symposium of The Childhood Cancer Data Initiative (https://www.cancer.gov/news-events/cancer-currents-blog/2019/lowy-ccdi-symposium-childhood-cancer). This is particularly important for rare subtypes of pediatric cancer, as illustrated in our use cases that analyzed subgroup classification in craniopharyngioma and mutational signatures in hepatoblastoma. An important aspect of future work will be the development of a coordinated effort for data federation across other pediatric genomic resources to enable proper study of these rare tumors.
Within the federated data-sharing paradigm, we envision a phased implementation approach. The first phase will likely be geared toward deploying analysis tools within the various genomic cloud platforms by “bringing the tools to the data.” The reasons for this initial approach are twofold: (i) data are typically much costlier to move around or duplicate than tools, a pressing problem within the genomic data-sharing paradigm at present; and (ii) legal and ethical constraints may hinder the movement of data but, generally speaking, rarely apply to analysis tools. We anticipate that the initial focus will involve deploying genomic analysis workflows written in one of the established workflow languages, such as the Common Workflow Language and Workflow Description Language. In parallel, much work is needed by the providers of various cloud-based genomics platforms to robustly support the full specifications of these workflow languages and to optimize the compilation and execution of these workflows on their platforms. The second phase will involve the development and support of common application programming interfaces (API) to exchange information within the federated data ecosystem. The implementation of these APIs will lay the foundation upon which applications can be built to enable sophisticated exploration of cancer data, but this development will not come without challenges. Specifically, permitted data use is not homogeneous across all datasets (e.g., the TARGET data access guidelines do not permit use of their data for methods development, whereas St. Jude Cloud does permit this), and verifying accessibility across multiple platforms for a specific application can be technically challenging to implement. These topics should be addressed by working groups pursuing a federated data ecosystem sooner rather than later.
In summary, St. Jude Cloud offers the largest cloud-based genomic data resource for pediatric cancer. With continued expansion of data content, development of new applications, and exploration of federated data sharing on this data-sharing ecosystem, we anticipate that it will serve as a key community infrastructure to accelerate research that will improve the precision of diagnoses, efficacy of treatments, and long-term survival of pediatric cancer and other childhood catastrophic diseases.
Methods
St. Jude Cloud Genomics Platform
St. Jude Cloud Genomics Platform is a web application for querying, selecting, and accessing raw and curated genomic datasets through a custom-built data browser. Genomic data storage is provided by Microsoft Azure which is accredited to comply with major global security and privacy standards, such as ISO 27001, and has the security and provenance standards required for Health Insurance Portability and Accountability Act (HIPAA)–compliant operation. By leveraging Microsoft Azure, DNAnexus provides an open, flexible, and secure cloud platform for St. Jude Cloud to support operational requirements such as the storage and vending of pediatric genomics data to users, along with an environment supportive of genomics analysis tools. DNAnexus supports a security framework compliant with all of the major data privacy standards [HIPAA, Clinical Laboratory Improvement Amendments (CLIA), Good Clinical Practices (GCP), 21 Code of Federal Regulations (CFR) Parts 22, 58, 493, and European data privacy laws and regulations] and interfaces with the St. Jude Cloud Genomics Platform. Application for data access can be made using our streamlined electronic process via Docusign (for requests made within the United States) or a manual process that requires downloading, filling out, signing, and uploading the data access agreement. Upon approval of a data access request by the relevant data access committee(s), St. Jude Cloud Genomics Platform coordinates the provision of a free copy of the requested data to the user via the DNAnexus API into a secure, private workspace within the DNAnexus platform which can also be used for custom data upload.
As of July 9, 2020, the Tools section of St. Jude Cloud Genomics Platform provides access to eight end-to-end St. Jude Cloud workflows optimized for the DNAnexus environment. When a user wishes to run a St. Jude Cloud workflow, the St. Jude Cloud Genomics Platform creates a new project folder and vends a copy of the tool to this folder where a user may import St. Jude Cloud genomics data or even upload their own datasets. DNAnexus provides both a command line option for batch execution of operations and a graphical user interface for job submission and execution.
Genomic Sequencing Data
We have received written informed consent from all patients that permits hosting of their genomic data and limited clinical information for research purposes. Raw genomic data can be requested and accessed on St. Jude Cloud as mapped next-generation sequencing reads in the BAM (57) file format. The data were generated from paired tumor–normal samples of pediatric patients with cancer, germline-only samples of long-term survivors of pediatric cancer, and germline-only samples of pediatric sickle cell patients as summarized in Fig. 2A. Paired tumor–normal datasets include retrospective data of 1,610 patients from the St. Jude/Washington University PCGP (2), 78 patients from Clinical Pilot (21), prospective data of 309 patients from the G4K study (https://clinicaltrials.gov/ct2/show/NCT02530658), and 1,038 patients from our RTCG initiative. The germline-only dataset of pediatric cancer survivors includes 4,833 participants of SJLIFE (22), a study that brings long-term survivors back to St. Jude Children's Research Hospital for extensive clinical assessments, and 2,912 participants of the CCSS (23), a 31-institution cohort study of long-term survivors. Primary diagnosis of cancer subtypes for both the pediatric cancer and survivorship cohorts is provided both as (i) the value provided at data submission time from the lab or principal investigator (generally unaltered but updated as we receive new information) and as (ii) the harmonized diagnosis value matching the closest classification present in OncoTree (oncotree.mskcc.org). Germline-only data of pediatric sickle cell patients represent 807 patients from the Sickle Cell Genome Project, an initiative that is part of the Sickle Cell Clinical Research and Intervention Program (58).
Each of these studies represents an individual data access unit within St. Jude Cloud and was approved for data sharing by the St. Jude Children's Research Hospital Institutional Review Board (IRB). Further, data are shared only where patient families have consented to research data sharing. For each cohort (i.e., pediatric cancer, survivor, or sickle cell), a data access committee has been formed to assess and subsequently approve or reject data access requests.
The samples presented in this article were based on data released on St. Jude Cloud as of July 9, 2020. Metadata for these samples were updated through November 9, 2020.
Genomic Data Harmonization and Quality Control Check
WGS and WES data were mapped to GRCh38 (GRCh38_no_alt) using BWA-MEM (59) followed by variant calling using GATK 4.0 HaplotypeCaller (60), both reimplemented by Microsoft Genomics Service (https://azure.microsoft.com/mediahandler/files/resourcefiles/accelerate-precision-medicine-with-microsoft-genomics/Accelerate_precision_medicine_with_Microsoft_Genomics.pdf) on Microsoft Azure, to generate BAM and genomic VCF files for each sample. Each type of genomic sequencing data (WGS and WES) is evaluated separately after sequencing and mapping. A quality check process involves confirmation of sequence file integrity using Samtools (57) quickcheck and Picard ValidateSamFile and evaluation of the quality, coverage distribution, and mapping statistics using Samtools flagstat, FASTQC (https://www.bioinformatics.babraham.ac.uk/projects/fastqc), and Qualimap 2 (61) bamqc. The details of the process are described in the respective request for comment (RFC; https://github.com/stjudecloud/rfcs/blob/rfcs/qc-workflow/text/0002-quality-check-workflow.md).
RNA-seq data were mapped to GRCh38_no_alt using a customized workflow (https://stjudecloud.github.io/rfcs/0001-rnaseq-workflow-v2.0.0.html). Briefly, RNA-seq reads were aligned using the STAR aligner in two-pass mode (62) to the human hg38 genome build using gene annotations provided by Gencode v31 gene models (https://www.gencodegenes.org/human/release_31.html). Subsequently, Picard (http://broadinstitute.github.io/picard) SortSam was used to coordinate sort the BAM file, and Picard ValidateSamFile confirmed that the aligned BAM was consistent with the format specification. Finally, gene-level counts were generated using HTSeq-count (63) using Gencode v31 gene models. For quality control (QC) check, we used Qualimap 2 RNA-seq and an in-house “NGSderive strandedness” script (https://github.com/stjudecloud/ngsderive) that infers strandedness using GENCODE v31 gene annotations.
RTCG Protocol
Our IRB-approved RTCG protocol (St. Jude IRB #19–0099) comprises a series of semiautomated steps that enable the transfer of prospective clinical genomics and selected patient clinical data to St. Jude Cloud. Transfer of these data to St. Jude Cloud is permitted only when patient consent is obtained for clinical genomic testing, research use, and St. Jude Cloud data sharing. This process, depicted in Supplementary Fig. S1, begins with patient registration and the assignment of Protected Health Information (PHI)/Medical Record Number (MRN) and entry to our EMR database (EMR DB), after which an initial clinical diagnosis is made by the attending physician. Every St. Jude patient has the option of undergoing clinical genomics sequencing as part of our St. Jude clinical genomics service. If patient consent is obtained, the attending physician places an order with the Clinical Genomics team to perform the three-platform sequencing of WGS, WES, and transcriptome sequencing in our CLIA-certified, College of American Pathologists (CAP)-accredited laboratory (21). The resulting sequence data are transferred to an isolated clinical computing environment for automated analysis, manual curation, and case presentation to our molecular tumor board (MTB), ultimately producing a final case report. Updates to the diagnosis of the patient throughout this process are routine, and we regularly update records based on the most up-to-date information.
Following the initial MTB sign out of a case report, an embargo period of 30 days is maintained to enable updates or corrections of files prior to the transfer of deidentified genomic data to the research computing environment. Further, clinical information is retrieved from the EMR DB and collated within the research computing environment. After an additional embargo period of 90 days, patient genomic data are transferred to St. Jude Cloud upon verification of consent for cloud data sharing. Once within St. Jude Cloud, data harmonization and QC checks are performed as described above prior to public release. Samples are tagged with a rolling publication embargo date, which must pass before the data can be used in any external publication. Importantly, patient consent is periodically rechecked, as updates may require the removal of patient data from the research computing environment and St. Jude Cloud.
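The two embargo periods described above amount to simple date arithmetic. The following Python sketch is illustrative only: the function name is hypothetical, the consent check is reduced to a boolean, and it assumes the 90-day period starts when the 30-day period ends (the text says "additional" but does not give the exact anchoring).

```python
from datetime import date, timedelta
from typing import Optional

CORRECTION_EMBARGO_DAYS = 30  # after MTB sign out, before transfer to research computing
CLOUD_EMBARGO_DAYS = 90       # additional wait before transfer to St. Jude Cloud


def earliest_cloud_transfer(mtb_sign_out: date, consent_verified: bool) -> Optional[date]:
    """Return the earliest possible St. Jude Cloud transfer date.

    Returns None when consent for cloud data sharing is not verified.
    Assumption: the 90-day embargo begins when the 30-day embargo ends.
    """
    if not consent_verified:
        return None
    research_transfer = mtb_sign_out + timedelta(days=CORRECTION_EMBARGO_DAYS)
    return research_transfer + timedelta(days=CLOUD_EMBARGO_DAYS)
```

Under these assumptions, a case signed out on a given day becomes eligible for cloud transfer 120 days later, still subject to the harmonization, QC, and consent rechecks described in the text.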
Identification of Rare Pediatric Cancer Samples among Prospective Clinical Genomics Cohorts
The annual incidence (number of patients per million) of cancer diagnoses (International Classification of Childhood Cancer) between the ages of 0 and 17 years in the United States was calculated using data from the NCI SEER program (https://seer.cancer.gov/csr/1975_2016) for the period of 1990 to 2016. Of these, only International Classification of Disease for Oncology, third edition, histology subgroupings with an estimated number of 200 or fewer new patients per year were considered rare pediatric cancer subtypes. These estimates were calculated by multiplying the annual incidence per million by the size of the United States population between the ages of 0 and 17 years (74.2 million, the 2010 census estimate). These data were used to determine which of the subtypes unique to the prospective clinical genomics (G4K or RTCG) datasets represented rare cancer subtypes for the St. Jude Cloud platform.
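The rarity estimate can be illustrated with a short Python sketch. It assumes that the threshold of 200 applies to the estimated number of new patients per year, obtained by scaling the incidence per million by the population expressed in millions. The function names are hypothetical and no real SEER values are used.

```python
US_POPULATION_0_17_MILLIONS = 74.2    # 2010 census estimate cited in the text
RARE_MAX_NEW_PATIENTS_PER_YEAR = 200  # rarity threshold cited in the text


def estimated_new_patients_per_year(incidence_per_million: float) -> float:
    """Scale an annual incidence (patients per million) to the 0-17 US population."""
    return incidence_per_million * US_POPULATION_0_17_MILLIONS


def is_rare_subtype(incidence_per_million: float) -> bool:
    """Apply the rarity threshold to the estimated yearly patient count."""
    estimate = estimated_new_patients_per_year(incidence_per_million)
    return estimate <= RARE_MAX_NEW_PATIENTS_PER_YEAR
```

For instance, a hypothetical subtype with an incidence of 1 per million would correspond to roughly 74 new patients per year and would be treated as rare under this reading.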
Pediatric Cancer Patient Sample Diagnosis Subtype Curation
The diagnosis subtype annotations for samples of pediatric patients with cancer were normalized to a consistent nomenclature across each of the PCGP, Clinical Pilot, G4K, and RTCG sample collections. For PCGP samples, previous associated publications were consulted to ensure accuracy of diagnosis subtype assignment within St. Jude Cloud. For patient samples from Clinical Pilot, G4K, and RTCG, clinical genomics pathology reports were used to assign or verify diagnosis subtype annotations. Upon arriving at a concise set of diagnosis subtype annotations across all patient samples on St. Jude Cloud, diagnosis subtype abbreviations were assigned (Supplementary Table S3) along with the closest matching OncoTree (oncotree.mskcc.org) Identifier.
Expression Analysis of Pediatric Cancer
St. Jude Cloud tumor RNA-seq expression count data were generated using HTSeq version 0.11.2 (63) in conjunction with GENCODE (release 31) gene annotations based on the August 2019 release. Of these, only diagnostic, relapse, and metastatic samples from fresh-frozen tissue (i.e., excluding formalin-fixed, paraffin-embedded samples) were included. We removed samples where the associated RNA-seq data involved multiple read lengths or the computationally derived strandedness (InferExperiment; ref. 64) was unclear (samples sequenced using a stranded protocol having less than 80% reverse-oriented stranded read pairs were deemed “unclear strandedness”). When patient sample RNA-seq data were available in both PCGP and Clinical Pilot studies, we only considered the Clinical Pilot data. The analysis only included RNA-seq generated from Illumina GAIIX, HiSeq2000, HiSeq2500, HiSeq4000, NextSeq, or NovaSeq6000 sequencing platforms. These QC steps resulted in a total of 1,574 qualified RNA-seq samples which could be queried using the data browser on the St. Jude Cloud Genomics Platform. Once selected, HTSeq feature count files for each of these samples were imported into the St. Jude Cloud “RNA-Seq Expression Classification” tool for analysis. Briefly, this tool first reads gene features from a GENCODE gene model (release 31) and then aggregates the feature counts from the HTSeq files into a single matrix for all samples under consideration. Next, covariate information is retrieved from sample metadata and added to the matrix. Filters are then applied to remove nonprotein coding genes and genes exhibiting low expression (<10 read count). This tool also enables subgrouping of samples into “blood,” “solid,” and “brain” tumor categories (Supplementary Table S3) of which there were a total of 816, 302, and 447 samples, respectively (note the sum difference with above-mentioned 1,574 is from 9 germ cell tumors not considered in this analysis). 
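The aggregation and filtering steps can be sketched in Python as follows. This is not the code of the RNA-Seq Expression Classification tool. The function names are hypothetical, and the sketch assumes that the low-expression filter drops a gene whose total count across samples is below 10, a detail the text does not specify.

```python
from typing import Dict, List, Set

MIN_READ_COUNT = 10  # low-expression cutoff cited in the text


def build_count_matrix(per_sample_counts: Dict[str, Dict[str, int]]) -> Dict[str, List[int]]:
    """Aggregate per-sample {gene: count} tables into {gene: [count per sample]}.

    Columns follow the sorted sample names; genes missing from a sample count as 0.
    """
    samples = sorted(per_sample_counts)
    genes = sorted({gene for counts in per_sample_counts.values() for gene in counts})
    return {gene: [per_sample_counts[s].get(gene, 0) for s in samples] for gene in genes}


def filter_genes(
    matrix: Dict[str, List[int]],
    protein_coding: Set[str],
    min_count: int = MIN_READ_COUNT,
) -> Dict[str, List[int]]:
    """Keep protein-coding genes whose summed count reaches min_count (assumed rule)."""
    return {
        gene: row
        for gene, row in matrix.items()
        if gene in protein_coding and sum(row) >= min_count
    }
```

In practice, the set of protein-coding genes would come from the GENCODE release 31 gene model named in the text.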
Gene expression analysis was performed with R (3.5.2) using the DESeq2, Rtsne, and sva packages. Gene expression within each of the blood, solid, and brain tumor categories was normalized using the variance stabilizing transformation of DESeq2 (65), and batch effects [read length (bp), library strandedness (stranded forward, stranded reverse, and unstranded), RNA selection method (PolyA vs. Total RNA), and read pairing (single- vs. paired-end)] were removed using ComBat (sva package; ref. 66). The top 1,000 most variably expressed genes based on median absolute deviation were then selected from each of the three major cancer types, after which two-dimensional t-SNE was performed according to ref. 42 using a perplexity parameter of 20. Two-dimensional plots for each cancer type were generated using an interactive t-SNE plot viewer we developed (Supplementary Fig. S5A). The gene expression analysis methodology described above has been incorporated into the St. Jude Cloud RNA-Seq Expression Classification workflow.
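The gene selection step can be illustrated in Python, although the analysis itself was carried out in R. The sketch below ranks genes by median absolute deviation (MAD) and keeps the top n. The function names are hypothetical, the input is assumed to be already normalized and batch corrected, and the t-SNE step is not shown.

```python
from statistics import median
from typing import Dict, List, Sequence


def median_absolute_deviation(values: Sequence[float]) -> float:
    """Median of the absolute deviations from the median (unscaled MAD)."""
    center = median(values)
    return median([abs(v - center) for v in values])


def top_variable_genes(expression: Dict[str, List[float]], n: int = 1000) -> List[str]:
    """Return the n genes with the largest MAD across samples, most variable first."""
    ranked = sorted(
        expression,
        key=lambda gene: median_absolute_deviation(expression[gene]),
        reverse=True,
    )
    return ranked[:n]
```

The resulting gene list would then define the columns passed to the dimensionality reduction step.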
Differential gene expression analysis for comparison of both osteosarcoma and craniopharyngioma subgroups was performed using the WARDEN pipeline on St. Jude Cloud. Here, aligned BAM files were first converted to FASTQ files using bedtools bamtofastq (67). FASTQ files were submitted to WARDEN using default parameters. ENRICHR (68, 69) was used to perform gene set enrichment analysis using BioGPS Human Gene Atlas, WikiPathways 2019, and GO Molecular Function 2018 gene categories. Volcano plots were generated using STATA/MP 15.1. ACPG sample tissue section slides were stained with hematoxylin and eosin and reviewed by a board-certified neuropathologist (B.A. Orr).
Somatic Variant Data, Mutation Rate, and Mutational Signature Analysis
Somatic SNVs and indels were analyzed using paired tumor–normal WGS or WES analysis as described previously (21, 70). Somatic copy number variations (CNV) were computed using the CONSERTING algorithm (71) followed by manual review of coverage and B-allele fraction. The somatic SNVs/indels and CNVs were lifted over to GRCh38 and uploaded to St. Jude Cloud as VCF and CNV files.
Mutation rate and signature analysis was performed using all patient tumor sample VCF files from PCGP, Clinical Pilot, and G4K studies. Variants were required to be confirmed valid by capture validation or determined to be of high confidence based on an internal postprocessing pipeline (70). The dataset includes 10 subtypes (i.e., AMKL, non-Hodgkin lymphoma, kidney cancer, germ cell tumor, thyroid cancer, nonrhabdomyosarcoma soft-tissue sarcoma, craniopharyngioma, low-grade glioma, melanoma, and choroid plexus carcinoma) that were not included in previous pancancer studies (15, 16). When a patient tumor sample VCF file was available in both PCGP and Clinical Pilot studies, we considered only the Clinical Pilot data. The mutation rate was calculated for each subgroup and defined as the number of somatic SNVs per Mb. For this purpose, we included only WGS samples and used somatic SNVs in exonic as well as nonexonic, nonrepetitive regions (i.e., regions not covered by RepeatMasker tracks, the sum of these two regions totaling 1,445 Mb).
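As a minimal illustration of the mutation rate definition, the Python sketch below divides a sample's somatic SNV count by the 1,445 Mb region size given in the text. The function name is hypothetical, and the way per-sample rates are summarized within a subgroup is not specified here.

```python
ANALYZED_REGION_MB = 1445.0  # exonic plus nonexonic, nonrepetitive regions (from the text)


def mutation_rate_per_mb(n_somatic_snvs: int, region_mb: float = ANALYZED_REGION_MB) -> float:
    """Somatic SNVs per Mb over the analyzed region of a WGS sample."""
    if region_mb <= 0:
        raise ValueError("region_mb must be positive")
    return n_somatic_snvs / region_mb
```

A hypothetical sample with 2,890 qualifying SNVs would thus have a rate of 2 SNVs per Mb.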
To identify mutational signatures in these WGS samples, we first determined the trinucleotide context of each somatic SNV using an in-house script, and each sample was summarized based on the number of mutations in each of the 96 possible mutation types (mutation plus trinucleotide context; ref. 48). The presence and strength of 65 COSMIC signatures (33, 72) and two therapy-induced mutational signatures which we discovered previously (20) were then analyzed using SigProfilerSingleSample (73) version 1.3 using the default parameters. We selected SigProfilerSingleSample, as it requires greater stringency to prevent overfitting which can lead to spurious signatures. This was accomplished by requiring a cosine increase of 0.05 or above to include a signature, and to include ubiquitous signatures 1 and 5 preferentially prior to detecting additional signatures. Samples explained by signatures with a cosine similarity of less than 0.85 were excluded. The proportion of samples (range, 0–1) within each cancer subtype category was then displayed in a heatmap showing patterns in different cancer subtypes. Mutational signatures within a subtype were only displayed where prevalence exceeds 1%. For the detection of signature 22 in SJST030137, we assigned mutations into three clusters—diagnosis-specific (present in SJST030137_D1 sample), relapse-specific (present in SJST030137_R1 sample), and shared (present in both samples)—and then performed signature analysis with SigProfilerSingleSample on each mutation cluster. The final diagnosis signature spectrum was achieved by summing the signatures in the diagnosis-specific and shared mutation clusters, whereas the relapse spectrum was the sum of the relapse-specific and shared clusters. This increased sensitivity of detection of signature 22, which was otherwise obscured in the relapse sample due to an increased mutation burden associated with the cisplatin signature.
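The 96-channel summary and the cosine similarity criterion can be sketched in Python. This is not the in-house script or SigProfilerSingleSample. It follows the common convention of reporting each SNV relative to the pyrimidine (C or T) of the reference base pair, and all function names are hypothetical.

```python
from math import sqrt
from typing import Sequence

_COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}
MIN_COSINE_SIMILARITY = 0.85  # sample exclusion threshold cited in the text


def reverse_complement(seq: str) -> str:
    """Reverse complement of an uppercase DNA string made of A, C, G, T."""
    return "".join(_COMPLEMENT[base] for base in reversed(seq))


def mutation_channel(trinucleotide: str, alt: str) -> str:
    """Map an SNV to one of the 96 pyrimidine-centered channels, e.g. 'A[C>T]G'.

    trinucleotide is the reference context (5' base, mutated base, 3' base).
    """
    tri, alt = trinucleotide.upper(), alt.upper()
    if len(tri) != 3 or any(base not in _COMPLEMENT for base in tri):
        raise ValueError("context must be three bases from A, C, G, T")
    if alt not in _COMPLEMENT or alt == tri[1]:
        raise ValueError("alt must be a base that differs from the reference")
    if tri[1] in "AG":  # purine reference: report the change on the opposite strand
        tri, alt = reverse_complement(tri), _COMPLEMENT[alt]
    return f"{tri[0]}[{tri[1]}>{alt}]{tri[2]}"


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length spectra; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sqrt(sum(x * x for x in a))
    norm_b = sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def is_well_explained(observed: Sequence[float], reconstructed: Sequence[float]) -> bool:
    """True when the reconstructed spectrum reaches the 0.85 similarity threshold."""
    return cosine_similarity(observed, reconstructed) >= MIN_COSINE_SIMILARITY
```

Because a G>A change in the context CGT and a C>T change in the context ACG describe the same event on opposite strands, both map to the channel A[C>T]G. This collapses the 192 possible stranded changes into 96 channels.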
For mutational signature analysis, samples were segmented by mutation burden. Samples with 400 or more mutations (485 samples) were analyzed for the full set of COSMIC signatures as these samples have sufficient number of somatic mutations to ensure a robust analysis. Samples with fewer than 400 mutations (583 samples) were analyzed for a core set of 13 signatures (1, 2, 3, 5, 7a, 7b, 7c, 7d, 8, 13, 18, 36, and 40) which can be reliably detected in low mutation burden samples and are common in pediatric cancers.
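The burden-based choice of signature set can be written as a small Python helper. The function name is hypothetical, and the full COSMIC signature list is passed in by the caller rather than reproduced here.

```python
from typing import Sequence, Tuple

FULL_SET_MIN_MUTATIONS = 400  # burden cutoff cited in the text

# Core set of 13 signatures listed in the text for low-burden samples.
CORE_SIGNATURES: Tuple[str, ...] = (
    "1", "2", "3", "5", "7a", "7b", "7c", "7d", "8", "13", "18", "36", "40",
)


def signatures_to_test(n_mutations: int, full_set: Sequence[str]) -> Tuple[str, ...]:
    """Return the full signature set for samples with 400 or more mutations,
    and the core set otherwise."""
    if n_mutations >= FULL_SET_MIN_MUTATIONS:
        return tuple(full_set)
    return CORE_SIGNATURES
```

A sample with exactly 400 mutations falls in the high-burden group, matching the "400 or more" wording.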
Data and Code Availability
All data are available on St. Jude Cloud (https://www.stjude.cloud). We have created a permalink (https://pecan.stjude.cloud/permalink/stjudecloud-paper) within St. Jude Cloud that contains updated links to all of the below information, should the location of any of these resources be updated after this article's publication date. Interactive t-SNE RNA-seq expression maps are available as a collection within the St. Jude Cloud Visualization Community at https://viz.stjude.cloud/stjudecloud/collection/stjudecloud-paper. RNA-seq–derived HTSeq count data for samples considered in Use Case 1: Expression landscape of pediatric cancers, and somatic VCF files used for mutation burden and mutational signatures analysis in Use Case 2: Mutation rates and signatures across pediatric blood, solid and brain cancers can be accessed through the St. Jude Cloud platform data browser at https://platform.stjude.cloud/data/publications?publication_accession=SJC-PB-1020. The pipeline used to generate the RNA-seq expression counts is documented in the “RNA-Seq v2” pipeline RFC (https://stjudecloud.github.io/rfcs), which also allows users to provide feedback. The workflow definition is available in our workflows repository (http://github.com/stjudecloud/workflows). The code for generating the t-SNE plot given a set of samples from St. Jude Cloud and a set of zero or more user query samples is defined in the “expression-classification” repository (https://github.com/stjudecloud/expression-classification). The code for generating the mutational signatures plot with zero or more user query samples is available in the “mtsg” repository (https://github.com/stjudecloud/mtsg).
Authors' Disclosures
T. Nguyen reports other support outside the submitted work. A.S. Pappo reports personal fees from Merck, Loxo, Bayer, and Debbio outside the submitted work. M.J. Weiss reports personal fees from Novartis and Beam Therapeutics outside the submitted work. G.T. Armstrong reports grants from NIH during the conduct of the study. C.G. Mullighan reports personal fees from Illumina during the conduct of the study; and grants and personal fees from Pfizer, grants from AbbVie and Loxo Oncology, and personal fees from Amgen outside the submitted work. G. Miller reports St. Jude is a recipient of a Microsoft AI for Health philanthropic grant for hosting of the St. Jude Cloud data and infrastructure. R. Daly reports other support outside the submitted work. No disclosures were reported by the other authors.
Authors' Contributions
C. McLeod: Conceptualization, resources, data curation, software, supervision, methodology, writing–original draft, project administration, writing–review and editing. A.M. Gout: Conceptualization, data curation, software, formal analysis, supervision, validation, investigation, visualization, methodology, writing–original draft, writing–review and editing. X. Zhou: Resources, software, supervision, visualization. A. Thrasher: Resources, data curation, software, validation, investigation, visualization, methodology, writing–review and editing. D. Rahbarinia: Resources, data curation, software, validation, investigation, visualization, methodology, writing–review and editing. S.W. Brady: Conceptualization, data curation, software, formal analysis, supervision, validation, investigation, visualization, methodology, writing–original draft, writing–review and editing. M. Macias: Conceptualization, resources, data curation, software, investigation, visualization, methodology. K. Birch: Conceptualization, resources, data curation, software, investigation, visualization, methodology. D. Finkelstein: Resources, data curation, formal analysis, investigation, visualization, writing–review and editing. J. Sunny: Resources, data curation, software, methodology. R. Mudunuri: Resources, data curation. B.A. Orr: Investigation, writing–review and editing. M. Treadway: Resources, data curation. B. Davidson: Resources, software, supervision. T.K. Ard: Resources, software. A. Chiao: Resources, software, validation. A. Swistak: Resources, software, visualization. S. Wiggins: Resources, software, investigation, visualization. S. Foy: Resources, software, formal analysis, validation, investigation, visualization, methodology, writing–original draft, writing–review and editing. J. Wang: Resources, data curation, software, visualization, methodology. E. Sioson: Resources, data curation, software, visualization, methodology. S. Wang: Resources, data curation, software, visualization, methodology. J.R. Michael: Resources, software, methodology. Y. Liu: Software, methodology, writing–review and editing. X. Ma: Software, investigation, methodology, writing–review and editing. A. Patel: Resources, data curation, software. M.N. Edmonson: Software, writing–review and editing. M.R. Wilkinson: Resources, data curation, software, investigation. A.M. Frantz: Data curation, software, investigation, methodology, writing–review and editing. T.-C. Chang: Resources, software, methodology. L. Tian: Software, investigation, methodology. S. Lei: Software, investigation. S.M.A. Islam: Resources, software. C. Meyer: Resources, software. N. Thangaraj: Resources, software. P. Tater: Resources, software. V. Kandali: Resources, software. S. Ma: Resources, software, supervision. T. Nguyen: Resources, software, supervision. O. Serang: Resources, software, supervision, visualization. I. McGuire: Resources, software, visualization. N. Robison: Resources, software. D. Gentry: Resources, software, methodology. X. Tang: Resources, software, methodology. L.E. Palmer: Resources, data curation, software, supervision, methodology. G. Wu: Resources, data curation, software, supervision, methodology. E. Suh: Resources, software. L. Tanner: Resources, software. J. McMurry: Resources, software, investigation. M. Lear: Resources, data curation, investigation, methodology. A.S. Pappo: Resources, data curation, investigation, methodology. Z. Wang: Resources, data curation, software, investigation, methodology. C.L. Wilson: Resources, data curation, supervision, investigation, methodology. Y. Cheng: Resources, data curation, software, investigation. S. Meshinchi: Data curation, supervision, investigation, methodology. L.B. Alexandrov: Data curation, software, investigation, methodology. M.J. Weiss: Supervision, investigation, methodology. G.T. Armstrong: Resources, data curation, investigation. L.L. Robison: Resources, data curation, software, supervision, investigation, methodology. Y. Yasui: Resources, data curation, investigation, methodology. K.E. Nichols: Data curation, investigation. D.W. Ellison: Resources, data curation, investigation. C. Bangur: Resources, software, supervision, methodology. C.G. Mullighan: Conceptualization, resources, data curation, software, supervision, investigation, methodology. S.J. Baker: Resources, data curation, software, funding acquisition, investigation, methodology. M.A. Dyer: Conceptualization, resources, data curation, software, supervision, funding acquisition, methodology. G. Miller: Conceptualization, resources, software, supervision, funding acquisition, methodology. S. Newman: Conceptualization, data curation, software, formal analysis, supervision, funding acquisition, investigation, visualization, methodology, writing–original draft, project administration, writing–review and editing. M. Rusch: Conceptualization, resources, data curation, software, supervision, validation, methodology. R. Daly: Resources, software, funding acquisition, methodology. K. Perry: Conceptualization, resources, software, supervision, funding acquisition, investigation, methodology. J.R. Downing: Conceptualization, resources, supervision, funding acquisition, investigation. J. Zhang: Conceptualization, data curation, software, formal analysis, supervision, funding acquisition, investigation, visualization, methodology, writing–original draft, project administration, writing–review and editing.
Acknowledgments
We wish to thank all St. Jude patients and their families for making this endeavor possible by contributing their data toward the advancement of cures for pediatric catastrophic disease. We would like to thank the generous support from the Microsoft AI for Good program for providing free storage for the data in St. Jude Cloud through Microsoft Azure and the Microsoft Genomics program for supplying free Microsoft Genomics Service runs for WGS and WES harmonization. We would like to thank Kevin Rodell and Judson Althoff of Microsoft for initiating the St. Jude/Microsoft Collaboration and Michael Gagne for his tireless support of their continued collaboration. We would like to thank the generous support of DNAnexus in their combined efforts to create a secure cloud platform on top of their existing platform. We would like to acknowledge the contribution by members of the St. Jude Biorepository and Clinical Genomics teams for their assistance in developing the RTCG pipeline. We would like to thank: Katherine Steuer, John Bailey, and the broader St. Jude legal department for their assistance in developing the legal components needed to sustain this effort; Elroy Fernandes, Kathy Price, and the St. Jude IRB for their assistance in verifying patient consent for deposition of data into St. Jude Cloud; Dr. Tom Merchant for consultation on treatment protocols for pediatric patients with ACPG; Drs. David Wheeler, Jennifer Neary, Tim Shaw, and Antonina Silkov for their help in curating the sample information of RTCG and the analysis of ACPG samples; Dr. Diane Flasch for critical review of the manuscript; and Drs. Tanja Gruber and Anna Hagstrom for assistance with the clarification of the lineages of MLL-rearranged infant ALL. We thank all the users who have provided critical feedback, in particular Drs. Jackie Norrie, Lawryn Kasper, and Laura Hover. This work is funded as a St. Jude Blue Sky initiative and is supported in part by the National Cancer Institute of the National Institutes of Health under Award Number R01CA216391 to J. Zhang. The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health.