Abstract
Childhood cancer survivorship studies generate comprehensive datasets comprising demographic, diagnosis, treatment, outcome, and genomic data from survivors. To broadly share this data, we created the St. Jude Survivorship Portal (https://survivorship.stjude.cloud), the first data portal for sharing, analyzing, and visualizing pediatric cancer survivorship data. More than 1,600 phenotypic variables and 400 million genetic variants from more than 7,700 childhood cancer survivors can be explored on this free, open-access portal. Summary statistics of variables are computed on-the-fly and visualized through interactive and customizable charts. Survivor cohorts can be customized and/or divided into groups for comparative analysis. Users can also seamlessly perform cumulative incidence and regression analyses on the stored survivorship data. Using the portal, we explored the ototoxic effects of platinum-based chemotherapy, uncovered a novel association between mental health, age, and limb amputation, and discovered a novel haplotype in MAGI3 strongly associated with cardiomyopathy specifically in survivors of African ancestry.
Significance: The St. Jude Survivorship Portal is the first data portal designed to share and explore clinical and genetic data from childhood cancer survivors. The portal provides both open- and controlled-access features and will fulfill a wide range of data sharing needs of the survivorship research community and beyond.
Introduction
The 5-year survival probability of childhood cancer has increased dramatically from approximately 60% in the 1970s to more than 85% today in the United States (https://seer.cancer.gov, RRID:SCR_006902; refs. 1, 2). This increased survival probability has led to a growing survivor population at risk of developing a wide range of adverse health outcomes that are attributable to cancer and/or its treatment. Such adverse outcomes include premature mortality, organ dysfunction, subsequent neoplasms, adverse socioeconomic outcomes, psychosocial challenges, and an overall reduction in quality-of-life (3–8). To mitigate these outcomes, major research efforts have focused on identifying their underlying causes, associated risks, and the patient subgroups most at risk. Large-scale, longitudinal studies of childhood cancer survivors, such as the St. Jude Lifetime Cohort (SJLIFE; refs. 9, 10) and the Childhood Cancer Survivor Study (CCSS; refs. 11, 12), have generated comprehensive datasets on survivors, including demographic, diagnosis, treatment, clinical assessment, chronic health condition, self-reported, and whole-genome sequencing (WGS) data (10, 12). These datasets are invaluable resources for the survivorship research community and have been used in hundreds of publications for survivor-based research over the past 25 years (see https://sjlife.stjude.org/about/publications.html and https://ccss.stjude.org/published-research/publications.html).
An interactive data portal that builds on the achievements of the SJLIFE and CCSS studies by publicly sharing their data would further advance the field by addressing unmet data access needs. Currently, the only open-access data available for these cohorts are static tables of summary statistics that provide limited information for project conception. To access and analyze the protected individual-level data, investigators must undergo a cycle of application, review, and approval by SJLIFE and CCSS committees for each new research proposal. Additionally, accessing the WGS-based genetic data, which are available as raw sequencing and variant calling files exceeding hundreds of terabytes in size, requires investigators to dedicate significant time and resources to acquiring the necessary computational infrastructure and bioinformatics expertise for analysis.
To improve the accessibility of cancer survivorship datasets, we developed the St. Jude Survivorship Portal (https://survivorship.stjude.cloud), the first data portal for sharing and exploring data from childhood cancer survivors. The portal hosts extensive clinical data and harmonized germline WGS variant genotype data collected from more than 7,700 survivors of childhood cancer from the SJLIFE and CCSS cohorts. The portal provides open access to clinical variables and genetic variants through an interactive data dictionary and a genome browser specialized for population genetics analysis. Summary statistics of variables are computed dynamically on the portal based on the custom parameters specified by the user, such as cohort selection/filtering and variable customizations. The portal provides a range of features including interactive charts, genome exploration, statistical testing, cumulative incidence, and regression analysis, as well as an access-controlled interface for downloading individual survivor data for offline analysis. These features will enable on-demand access and open-ended exploration, allowing the St. Jude Survivorship Portal to serve as a valuable resource for the survivorship research community and beyond.
Results
Overview of the Portal
The St. Jude Survivorship Portal is a free, open-access data portal that hosts data from two childhood cancer survivorship cohorts: the SJLIFE and the CCSS (Fig. 1A; Supplementary Fig. S1; Supplementary Table S1). A total of 5,053 SJLIFE survivors and 2,688 CCSS survivors were included in the portal based on inclusion criteria described in “Methods.” Cohort phenotypic data consist of variables spanning demographics, cancer diagnoses, treatments, clinical assessments, graded chronic health conditions, and self-reported symptoms of survivors, with up to several decades of follow-up (Supplementary Table S2). We have organized these data into a hierarchical data dictionary that users can easily navigate to identify variables of interest. Cohort genetic data consist of germline genotypes of single nucleotide variants (SNVs) and short insertions and deletions (INDELs) detected by WGS. Users can browse genetic variants and perform genetic association analysis on-the-fly through the portal’s open-access genome browser. We also used these genetic data to precompute additional genetic information, including genetically defined ancestries, ancestry-specific principal components (PCs), linkage disequilibrium (LD) maps, and polygenic risk scores (PRSs; see “Methods”), which are available through either the data dictionary or genome browser on the portal.
Overview of the St. Jude Survivorship Portal. A, Survivorship cohorts and data stored on the portal. B, Overview of portal features. Navigation tabs of the portal are shown in the middle. The “COHORT” tab is used to select a cohort. The “FILTER” tab is used to refine the selected cohort by specifying variables. The “GROUPS” tab is used to define custom groups for comparative analyses. The “CHARTS” tab is used to launch features for analyzing, visualizing, and exporting the cohort data. These features include the data dictionary, summary plots, cumulative incidence analysis, regression analysis, genome browser, and data download. All features are open access, except for the data download feature, which is under access control. LD, linkage disequilibrium; PC, principal component.
Navigation through the St. Jude Survivorship Portal is guided by four main tabs: “COHORT,” “CHARTS,” “GROUPS,” and “FILTER” (Fig. 1B; Supplementary Tutorial). The portal begins in the “COHORT” tab where users can select either the SJLIFE cohort, CCSS cohort, or both. The “FILTER” tab can then be used to refine the cohort by one or more variables in AND/OR combinations. For example, the cohort can be restricted to those survivors who were diagnosed with cancer before age 5 years and who were exposed to either anthracycline chemotherapy or chest radiotherapy (Fig. 1B). In the “GROUPS” tab, users can divide the cohort into custom groups for comparative analysis (e.g., exposed group vs. nonexposed group). In the “CHARTS” tab, users can launch a set of features to explore, analyze, visualize, or export the data of the cohort/groups defined by the COHORT, FILTER, and GROUPS tabs. The “data dictionary” feature organizes the data into a hierarchical structure that users can browse or search through to identify variables of interest. The “summary plots” feature computes summary statistics of variables and displays these statistics as bar charts, violin plots, and scatter plots. Statistical analysis tools, such as “cumulative incidence analysis” and “regression analysis”, enable users to perform real-time analysis on the stored survivorship data. The “genome browser” feature allows users to explore and analyze the genetic data of survivors. These open-access features allow users to explore and analyze survivorship data within the portal environment. Alternatively, if users prefer to use other analytical or visualization tools, they may use the “data download” feature to download individual-level data from the portal. This feature is under controlled access and users must request access approval from the data access committee through St. Jude Cloud (13) to download data from the portal (Supplementary Tutorial).
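The AND/OR filter logic described above can be illustrated with a minimal sketch. It is not the portal's implementation: the `Term` and `Group` classes, the `matches` function, and the field names (`age_at_diagnosis`, `anthracycline`, `chest_radiation`) are hypothetical, and the four survivor records are invented. The sketch encodes the example from the text (diagnosed before age 5 AND exposed to anthracyclines OR chest radiotherapy) as a nested filter tree.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Union

Survivor = Dict[str, object]


@dataclass
class Term:
    """A single condition on one variable (hypothetical structure)."""
    test: Callable[[Survivor], bool]


@dataclass
class Group:
    """Terms or nested groups joined by "and" or "or"."""
    join: str
    items: List[Union["Term", "Group"]]


def matches(node: Union[Term, Group], survivor: Survivor) -> bool:
    """Recursively evaluate a filter tree against one survivor record."""
    if isinstance(node, Term):
        return node.test(survivor)
    results = (matches(item, survivor) for item in node.items)
    return all(results) if node.join == "and" else any(results)


# Filter from the text: age at diagnosis < 5 AND (anthracycline OR chest RT).
# Field names are assumptions made for this sketch.
cohort_filter = Group("and", [
    Term(lambda s: s["age_at_diagnosis"] < 5),
    Group("or", [
        Term(lambda s: bool(s["anthracycline"])),
        Term(lambda s: bool(s["chest_radiation"])),
    ]),
])

# Invented records, for illustration only.
survivors = [
    {"id": 1, "age_at_diagnosis": 3, "anthracycline": True, "chest_radiation": False},
    {"id": 2, "age_at_diagnosis": 3, "anthracycline": False, "chest_radiation": False},
    {"id": 3, "age_at_diagnosis": 8, "anthracycline": True, "chest_radiation": True},
    {"id": 4, "age_at_diagnosis": 4, "anthracycline": False, "chest_radiation": True},
]

selected = [s["id"] for s in survivors if matches(cohort_filter, s)]
print(selected)  # [1, 4]
```

Representing the filter as a tree rather than a flat list is one way to let AND and OR groups nest to arbitrary depth.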
In the subsequent sections, we demonstrate the utility and power of the portal’s features through multiple use-cases of survivorship research. All use-case analyses were conducted, and all figures were generated, using open-access information on the portal.
Exploring the Phenotypic and Genetic Data using the Data Dictionary and Genome Browser
Survivor cohort data can be explored on the portal using the data dictionary and genome browser (Fig. 2A). The data dictionary organizes phenotypic and precomputed genetic variables hierarchically, allowing users to easily explore the data by browsing through branches and sub-branches of the dictionary (Fig. 2B). When a variable is selected from the dictionary, its summary statistics are automatically computed and displayed as a chart. The type of chart displayed depends on the type of variable selected. A categorical variable, such as diagnosis group, will be displayed as a bar chart that summarizes the cancer diagnosis distribution across the cohort (Fig. 2B). This chart can then be overlaid by another variable, such as genetically defined race, to determine the ancestry breakdown of each diagnosis group (Fig. 2B). A continuous variable, such as treatment dosage or polygenic risk score, can be viewed either as a histogram using a bar chart or as a continuous distribution using a violin plot (Supplementary Fig. S2). A chronic health condition variable, such as subsequent neoplasms, can be viewed either as a bar chart or a cumulative incidence plot (Supplementary Fig. S2). Users can customize these charts in various ways, including grouping categories, customizing bins, overlaying variables, converting chart types, and more (Supplementary Fig. S2).
Exploration of phenotypic and genetic data on the portal. A, Navigation tabs with the “CHARTS” tab selected. The data dictionary and genome browser buttons are highlighted. B, Hierarchical organization of the data dictionary, which contains branches (+/− gray buttons) and variables (blue buttons). Selection of two variables (“Diagnosis Group” and “Genetically defined race”) creates a stratified bar chart. C, Genome browser with tracks for genetic variants, gene models, and LD values at the ARID5B locus. In the variants track, variants are color-coded according to their LD values relative to the selected variant (red box). Variants are also plotted according to their −log10(P value) computed by the group comparison shown at the top. In this comparison, allele frequencies of variants are compared between group 1 (survivors of European ancestry diagnosed with ALL) and group 2 (gnomAD noncancer cohort) using the Fisher’s Exact Test. Note that group 1 is specified using dictionary variables shown in B. AFR, African ancestry; EAS, Asian ancestry; EUR, European ancestry.
The second navigation tool is the genome browser, which allows users to navigate to genomic loci and examine the germline genetic variants that exist in the cohort. Variants can be analyzed on-the-fly within the genome browser to determine their associations with a user-defined phenotype. LD maps for European and African ancestries can be viewed alongside genetic variants to aid the discovery of causal variants. Figure 2C shows an example of a genetic association analysis performed on variants at the ARID5B locus using the genome browser. It was previously shown that variants within intron 3 of ARID5B are strongly associated with pediatric acute lymphoblastic leukemia (ALL; refs. 14, 15). To determine whether this association also occurs in the SJLIFE and CCSS cohorts, we compared ARID5B variants between two groups: (i) SJLIFE and CCSS ALL survivors of European ancestry and (ii) the gnomAD noncancer population (16). The average admixture coefficients of group 1 shown on the portal allow adjustment for ancestry composition when comparing allele frequencies between groups (Fig. 2C; Supplementary Notes). Fisher’s exact test was used to test whether variants were statistically significantly associated with one of the two groups. This analysis showed that variants within intron 3 of ARID5B were in strong LD and strongly associated with ALL in SJLIFE and CCSS survivors of European ancestry. To determine whether the same association can be observed in survivors of African ancestry, we performed the same analysis using African ancestry ALL survivors (Supplementary Fig. S3A and S3B). As sample sizes were very different between African ancestry (n = 162) and European ancestry (n = 1,474) ALL survivors, the statistical evidence for the association of intron 3 could not be directly compared between the two ethnic groups. Instead, we compared the strength of association of intron 3 to those of its neighboring genomic regions. For European ancestry, ALL association was highly specific to intron 3 of ARID5B relative to its neighboring regions. In contrast, no such specificity was observed in African ancestry (Supplementary Fig. S3A and S3B).
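The per-variant comparison of allele counts between two groups can be sketched as a two-sided Fisher’s exact test on a 2 × 2 table, followed by the −log10(P value) transform used for plotting. The function below is a plain standard-library sketch, not the portal’s code. The allele counts are invented, and the ancestry adjustment based on admixture coefficients is not modeled here.

```python
from math import comb, log10


def fisher_exact_two_sided(a: int, b: int, c: int, d: int) -> float:
    """Two-sided Fisher's exact P value for the table [[a, b], [c, d]].

    Rows are the two groups; columns are alternative and reference allele
    counts. The P value sums the hypergeometric probabilities of all tables
    with the same margins that are no more likely than the observed table.
    """
    row1, row2, col1 = a + b, c + d, a + c
    denom = comb(row1 + row2, col1)

    def prob(x: int) -> float:
        # Probability that the first row holds x of the col1 alt alleles.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    p_obs = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # Small relative tolerance guards against floating-point ties.
    p = sum(px for px in map(prob, range(lo, hi + 1)) if px <= p_obs * (1 + 1e-7))
    return min(p, 1.0)


# Invented allele counts for one variant (not taken from the portal):
# group 1 carries 30 alt / 70 ref alleles, group 2 carries 10 alt / 90 ref.
p_toy = fisher_exact_two_sided(30, 70, 10, 90)
neg_log10_p = -log10(p_toy)
print(p_toy, neg_log10_p)
```

Because the statistical evidence from such a test depends on sample size, P values from groups of very different sizes are not directly comparable, which is why the text compares intron 3 with its neighboring regions within each ancestry instead.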
Analyzing the Ototoxicity of Platinum-Based Chemotherapy Using Groups and Summary Plots
Platinum chemotherapy agents, such as cisplatin and carboplatin, are commonly used to treat solid tumors in children, yet these agents can lead to severe ototoxic effects, such as tinnitus, hearing loss, and ear pain (17–20). We used the portal to explore the usage and ototoxic effects of cisplatin and carboplatin in the SJLIFE and CCSS cohorts. As hearing loss was clinically assessed in SJLIFE and self-reported in CCSS, we could not combine the cohorts for this analysis and instead analyzed each cohort separately. First, we divided the SJLIFE cohort into four groups based on platinum agent exposure: exposure to cisplatin only, exposure to carboplatin only, exposure to both agents, and exposure to neither agent (Fig. 3A). These four groups were combined into a custom variable, which could be accessed by any portal feature for comparative analysis of the groups. We overlaid this variable onto a bar chart of cancer diagnosis groups, which revealed a spectrum of platinum agent exposures across diagnosis groups (Fig. 3B). A relatively high percentage of survivors (>35%) in seven diagnosis groups were treated with platinum-based chemotherapy [Fig. 3B (black dots)]. Four of these diagnoses were treated primarily with cisplatin (central nervous system, germ cell tumor, nasopharyngeal carcinoma, and liver malignancies), while only retinoblastoma survivors were treated primarily with carboplatin. Survivors were rarely exposed to both agents, except for neuroblastoma survivors. To determine the ototoxicity associated with these exposures, the SJLIFE cohort was filtered for survivors in these seven diagnosis groups (Fig. 3C), and the custom variable was then overlaid onto a bar chart of hearing loss grades in the subcohort (Fig. 3D). The resulting chart shows that higher percentages of survivors with more severe hearing loss grades were exposed to platinum-based chemotherapy, showing that platinum agents have an ototoxic effect in the SJLIFE cohort (21). This trend was primarily explained by cisplatin exposure rather than carboplatin exposure, suggesting that cisplatin was significantly more ototoxic than carboplatin in the cohort. Similar results were obtained when we performed the analysis in the CCSS cohort (Supplementary Fig. S4A–S4D). This use-case demonstrates how the data visualization and cohort customization functionalities of the portal can be effectively utilized to explore the relationship between diagnosis, treatment, and outcome variables in survivor populations.
Analyzing the ototoxicity of platinum-based chemotherapy. A, “GROUPS” tab containing four custom groups defined by exposure to cisplatin and carboplatin. These groups were used to create a custom variable that can be accessed by other features of the portal to conduct comparative analysis between the groups. B, Bar chart of diagnosis groups overlaid with the custom variable from A. Diagnosis groups with a relatively high percentage (>35%) of survivors exposed to platinum-based chemotherapy are indicated by black dots. C, “FILTER” tab showing that the cohort was filtered for survivors from the seven diagnosis groups indicated in B. D, Bar chart of maximum hearing loss grades in the filtered cohort from C, overlaid with the custom variable from A.
Regression Analysis Reveals a Novel Interaction Between Mental Health, Age, and Limb Amputation
Survivors of childhood cancer are at greater risk of adverse mental health outcomes (8, 22). Limb amputation as part of cancer therapy can create physical and emotional challenges for children that may lead to adverse psychological and social outcomes (23, 24). We used the summary plots and regression analysis features of the portal to determine whether there was a significant association between limb amputation and the long-term mental health of SJLIFE survivors. First, we compared the SF36 mental health summary scores (25) of survivors with and without amputations using a violin plot (Fig. 4A). Surprisingly, survivors with amputations had slightly higher mental health scores compared to survivors without amputations. We then investigated whether age at cancer diagnosis influenced the mental health outcomes of survivors with amputations. Using the scatter plot feature, we found that SF36 scores and age at cancer diagnosis were positively correlated among survivors with amputations (Supplementary Fig. S5A). Using the violin plot feature, we showed that among survivors with amputations, those diagnosed before the age of 10 had significantly lower mental health scores than those diagnosed at or after the age of 10 (Fig. 4B). This exploratory analysis revealed a potential interaction between limb amputation, age, and mental health in childhood cancer survivors.
Analyzing the association between mental health, amputation, and age. A, Violin plot of SF36 mental health summary scores of SJLIFE survivors stratified by their amputation status. Circles indicate individual data points. Red lines indicate median values. P value was computed on the portal using the Wilcoxon rank sum test. B, Violin plot of SF36 scores of survivors with amputation (red box), stratified by their age at cancer diagnosis. C, Logistic regression analysis of mental health, amputation, and age. Outcome and independent variables are indicated. The age and amputation variables were specified to form an interaction term (dashed line). D, Results of the analysis in C. Odds ratios were used to compute the odds of poor mental health in survivors who received an amputation at age 10 or older (blue arrow) and in survivors who received an amputation under the age of 10 (pink arrow) relative to survivors who did not receive an amputation.
To evaluate this potential interaction, we carried out a confirmatory analysis by using the regression analysis feature of the portal (Fig. 4C). We set up a logistic regression analysis, in which the outcome variable was the SF36 scores (binned at 40) and the independent variables were sex, age at cancer diagnosis (binned at 10), and amputation status. To test the interaction observed above, we also specified that the age and amputation variables should form an interaction in the analysis. The results of the analysis demonstrated that age at cancer diagnosis and amputation status exhibited a statistically significant interaction (Fig. 4D; Supplementary Fig. S5B). Specifically, survivors who were diagnosed at age 10 or older and who received an amputation had 56% lower odds of reporting a low SF36 score compared to survivors with no amputation. Conversely, survivors who were diagnosed under the age of 10 and who received an amputation had 78% higher odds of reporting a low SF36 score compared to survivors with no amputation. These results confirm the interaction between amputation, age, and mental health that we observed above: receiving an amputation at a younger age (<10) is associated with increased risk of poor mental health, whereas receiving an amputation at an older age (≥10) is associated with reduced risk of poor mental health, possibly by increasing the resilience of survivors.
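The way stratum-specific odds ratios arise from a model with an interaction term can be shown with a short worked sketch. It assumes the ≥10 age group is the reference level and that each amputation group is compared with non-amputees in the same age stratum. Neither assumption is stated in the text. The coefficients are back-calculated from the reported percentages purely for illustration and are not the portal’s regression output.

```python
from math import exp, log

# Assumed model (reference levels are an assumption of this sketch):
#   logit P(low SF36) = b0 + b_sex*sex + b_age*young + b_amp*amp + b_int*young*amp
# where young = 1 if diagnosed before age 10 and amp = 1 if amputated.


def pct_change_in_odds(odds_ratio: float) -> float:
    """Convert an odds ratio to a percent change in odds."""
    return (odds_ratio - 1.0) * 100.0


# Illustrative coefficients, back-derived from the reported 56% reduction
# and 78% increase in odds (not taken from the portal).
b_amp = log(0.44)         # amputation effect in the reference (>=10) stratum
b_int = log(1.78 / 0.44)  # extra amputation effect in the <10 stratum

# In the reference stratum the interaction term is zero, so OR = exp(b_amp).
or_older = exp(b_amp)
# In the <10 stratum main effect and interaction add on the log-odds scale.
or_younger = exp(b_amp + b_int)

print(round(pct_change_in_odds(or_older)))    # -56
print(round(pct_change_in_odds(or_younger)))  # 78
```

Under these assumptions, the interaction coefficient alone is not the effect in the younger stratum. It must be added to the main amputation coefficient before exponentiating.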
In summary, the regression analysis feature of the portal gives users an accessible and powerful tool for analyzing associations between variables in pediatric cancer survivor cohorts.
Cumulative Incidence and Genetic Association Analysis of Cardiomyopathy in Different Ancestries
Cardiomyopathy is a common late effect experienced by survivors of childhood cancer. Previous studies reported that survivors of African ancestry have a significantly higher risk of cardiomyopathy compared to survivors of European ancestry (26, 27). We used various tools in the portal to investigate the ancestry-specific incidence of cardiomyopathy in the SJLIFE cohort. First, we used the cumulative incidence analysis feature to compare the cumulative incidence of cardiomyopathy (grades 3–5) between survivors of African and European ancestries (Fig. 5A). Consistent with previous findings, the cumulative incidence of cardiomyopathy was higher in SJLIFE survivors of African ancestry compared to those of European ancestry (P value = 9.7 × 10−6). We then evaluated whether this association between cardiomyopathy and ancestry was influenced by sex. To do so, we divided the cohort into four groups based on ancestry and sex and compared the cumulative incidence of cardiomyopathy between the groups (Fig. 5B; Supplementary Fig. S6). We found that male survivors of African ancestry showed a significantly higher cumulative incidence of cardiomyopathy compared to the other three groups, suggesting that the higher cumulative incidence of cardiomyopathy in survivors of African ancestry is driven primarily by males.
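A cumulative incidence curve of the kind compared above can be estimated nonparametrically, as in the minimal sketch below. It assumes a competing-risks setting, with the event of interest coded 1, a competing event such as death coded 2, and censoring coded 0. That setup is our assumption, suggested by the use of Gray’s test, and the records are invented. The group comparison itself (Gray’s test) is not implemented here.

```python
from typing import List, Tuple


def cumulative_incidence(
    records: List[Tuple[float, int]], cause: int = 1
) -> List[Tuple[float, float]]:
    """Nonparametric cumulative incidence estimate for one event cause.

    records: (time, status) pairs; status 0 = censored, any other value is
    an event cause. Returns (time, cumulative incidence) at each event time.
    """
    event_times = sorted({t for t, status in records if status != 0})
    event_free = 1.0  # probability of no event of any cause just before t
    cif = 0.0
    curve = []
    for t in event_times:
        at_risk = sum(1 for ti, _ in records if ti >= t)
        d_cause = sum(1 for ti, s in records if ti == t and s == cause)
        d_any = sum(1 for ti, s in records if ti == t and s != 0)
        # Increment: still event-free just before t, then this cause at t.
        cif += event_free * d_cause / at_risk
        event_free *= 1.0 - d_any / at_risk
        curve.append((t, cif))
    return curve


# Invented follow-up data: years of follow-up and status codes as above.
group_a = [(1, 1), (2, 2), (3, 1), (4, 0)]
group_b = [(1, 0), (2, 0), (3, 1), (4, 0)]

curve_a = cumulative_incidence(group_a)
curve_b = cumulative_incidence(group_b)
print(curve_a)  # [(1, 0.25), (2, 0.25), (3, 0.5)]
print(curve_b)  # [(3, 0.5)]
```

Treating competing events separately, rather than censoring them as in a one-minus-Kaplan-Meier estimate, avoids overstating the incidence of the event of interest.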
Comparative analysis of cardiomyopathy between African and European ancestries. A, Cumulative incidence analysis of cardiomyopathy (grades 3–5) in SJLIFE survivors stratified by their genetically defined ancestry. Survivors of Asian ancestry and Multi-Ancestry-Admixed were excluded due to absence of cardiomyopathy events. P value was computed using Gray’s test. B, Same analysis as in A, except that survivors were divided into four groups defined by two variables: genetically defined ancestry and sex. C, Genetic association analysis of cardiomyopathy. A logistic regression analysis was set up on the portal with the same outcome variable, independent variables, and inclusion criteria that were used by ref. 28. For the genetic variable (“Variants in a locus”), its genomic region was restricted to that of the rs6689879 variant. D, Genetic association analysis results. Top, results for survivors of African ancestry. Genome browser view of the rs6689879 variant, which was plotted according to its −log10(P value) computed by the regression analysis. Hovering over the variant on the portal displays a panel of regression statistics for the variant. Bottom, results for survivors of European ancestry. E, Zoomed out view of the region from D, showing the entire MAGI3 locus. Top, Zoomed out view for survivors of African ancestry. The same regression analysis was performed separately for each variant within the region. Variants are color-coded according to their LD values relative to the rs6689879 variant (red box). Circle-shaped variants are common variants (effect allele frequency ≥5%) analyzed by standard regression model-fitting. Triangle-shaped variants are rare variants analyzed by Fisher’s exact test. Labeled variants are those that were selected for haplotype analysis. Bottom, Zoomed out view for survivors of European ancestry. Variants are color-coded in the same way as for survivors of African ancestry. AFR, African ancestry; EUR, European ancestry.
Next, we explored the genetic basis underlying the increased risk of cardiomyopathy in survivors of African ancestry. A recent genome-wide association study (GWAS) identified variants that were strongly associated with increased risk of cardiomyopathy in survivors of African ancestry (28). Here, we used the regression analysis feature of the portal to reproduce the same genetic association analysis from the study. We set up a logistic regression analysis that used the same cohort (SJLIFE), outcome variable (cardiomyopathy), and independent variables (demographic, treatment, and genetic variables) that were used in the original study (Fig. 5C; Supplementary Fig. S7A; Supplementary Notes). The genetic variable was the GWAS hit on chromosome 1 from the original study (rs6689879). We also used the filter feature to select survivors exposed to either anthracycline chemotherapy or heart radiotherapy (Fig. 5C), consistent with the inclusion criteria used in the original study. In this way, we set up a locus-specific genetic association analysis on the portal that mimics the prior analysis.
The results of the analysis were displayed in a genome browser view centered on rs6689879, which is located in intron 2 of MAGI3 (Fig. 5D). Genetic variants were plotted in the browser according to their −log10(P value) computed by the regression analysis. These results showed that rs6689879 was strongly associated with cardiomyopathy in survivors of African ancestry (odds ratio = 4.772; P value = 2.1 × 10−5), consistent with the original report [Fig. 5D (top); Supplementary Fig. S7B and S7C]. To determine whether this variant was also associated with cardiomyopathy in European ancestry, we re-ran the analysis among survivors of European ancestry. In contrast to survivors of African ancestry, this variant was not associated with cardiomyopathy risk in survivors of European ancestry (odds ratio = 1.09; P value = 0.575), which is also consistent with the original study [Fig. 5D (bottom)]. Thus, the regression analysis feature of the portal enabled us to reproduce the GWAS hit from the original study.
A unique advantage of performing a genetic association analysis on the portal is that the same analysis can be applied dynamically to any genomic region. As shown in Fig. 5E, zooming out the genome browser view to the entire MAGI3 locus triggered the same association analysis on all variants within the 295 kb region. When rs6689879 was selected in the browser, all variants were color-coded according to their LD values with rs6689879. In survivors of African ancestry, rs6689879 was in strong LD (r2 > 0.8) with other variants that were also strongly associated with the occurrence of cardiomyopathy [Fig. 5E (top)]; whereas in survivors of European ancestry, the entire LD block of rs6689879 was not associated with cardiomyopathy [Fig. 5E (bottom)]. Interestingly, five variants (rs7554019, rs10858003, rs7518766, rs1343630, and rs10858006) in moderate LD (0.3 < r2 < 0.5) with rs6689879 had even stronger associations (P value range: 3.75 × 10−6–1.67 × 10−5) than rs6689879 (P value = 2.0 × 10−5; Supplementary Notes). Further analysis showed that these six variants formed seven haplotypes in the study cohort (frequency ≥ 0.1%; Supplementary Notes; Supplementary Fig. S8; Supplementary Tables S3 and S4). Genetic association analysis of these haplotypes revealed that the haplotype containing the alternative alleles of all six variants (h111111) was most strongly associated with cardiomyopathy risk in survivors of African ancestry (OR = 7.449; P = 1.9 × 10−5; Supplementary Table S4). Importantly, the second most strongly associated haplotype was that containing the reference allele of rs6689879 and the alternative alleles of the other five variants (h111011; OR = 4.857; P = 0.006). These results suggest that a novel haplotype (h111*11) may be driving the association between the MAGI3 locus and cardiomyopathy risk in African ancestry survivors.
This use-case demonstrates how the regression analysis feature of the portal can be used to rapidly analyze the associations of genetic variants with an outcome of interest and to also explore the surrounding genetic architecture within the genome browser. Hypotheses generated through these analyses can then be further tested outside the portal environment.
Discussion
Large-scale survivorship cohort studies, such as SJLIFE and CCSS, collect diverse phenotypic and genetic datasets to provide a comprehensive profile of the diagnosis, treatment, health status, and genetic backgrounds of cancer survivors. The breadth and diversity of these datasets make them a critical resource for the survivorship research community. Nevertheless, usage of these datasets will be strongly enhanced by a centralized cancer survivorship data portal that supports on-demand data access and analysis in both open and controlled tiers. The St. Jude Survivorship Portal is the first data portal designed to share and explore pediatric cancer survivorship datasets. The portal hosts extensive clinical and genetic datasets collected from two major survivorship studies, with advanced engineering that enables real-time data exploration and analysis at no cost to the users.
The ability to explore, analyze, and visualize data on-the-fly on the portal provides a significantly improved user experience for on-demand exploration and analysis of survivorship data. Users can easily browse clinical and genetic data using the data dictionary and genome browser. When a variable or genetic variant is selected, its summary statistics are computed on-the-fly and displayed in an interactive chart or browser view that updates dynamically upon customization. Furthermore, users can seamlessly perform cumulative incidence or regression analyses within the portal environment, allowing for rapid data analysis.
These features of the portal allow users to generate new hypotheses, as well as replicate/validate existing ones. For example, the association of ARID5B intron 3 variants with pediatric ALL was initially reported in other cohorts (14, 15) and we used the portal to validate this association in the SJLIFE and CCSS cohorts (Fig. 2). We also used the portal to reproduce a SJLIFE study that reported an association between cardiomyopathy and variants in the MAGI3 locus (Fig. 5). Importantly, reproducing this analysis on the portal revealed other strongly associating variants at the same locus (Fig. 5; Supplementary Notes), leading to the discovery of a novel haplotype that was strongly associated with cardiomyopathy risk specifically in survivors of African ancestry (Supplementary Fig. S8; Supplementary Table S4). We believe that by making data and analysis methods publicly available, the portal will allow the research community to perform reanalysis and cross-validation, ultimately leading to increased accountability and transparency of research findings.
Through active engagement with the SJLIFE and CCSS study committees, we will continuously expand the St. Jude Survivorship Portal in the following areas. First, novel data types will be supported, including longitudinal data collected after baseline assessments (besides chronic health conditions that are already supported), multiomics profiling results such as blood DNA methylome, and imaging data. Next, new functionalities will be supported, such as expanding the genetic association analysis beyond a single locus and integrating functional genomic and fine-mapping tools (29–32) into the analysis to facilitate variant prioritization and causal variant discovery. While new and expanded functionalities will require additional computational power, the downward trend in computational costs will ensure the sustainability of our portal project and allow it to remain a free resource. Finally, cohorts will be expanded within and beyond the SJLIFE and CCSS studies. Through this cohort expansion, recently enrolled survivors will be included, allowing for incorporation of data elements that better reflect current standards of diagnosis and treatment. These elements include exposures to novel therapeutics such as targeted agents, immunotherapy, and proton radiation, as well as somatic alterations present at diagnosis. Currently, only germline variants are available on the portal because survivors were diagnosed with cancer before the advent of next-generation sequencing (NGS). Since 2015, NGS assays have seen widespread use in childhood cancer clinical testing (33–35), making somatic alteration data available for many recently diagnosed survivors. This will enable a future expansion of the “Cancer-related variables” branch of the data dictionary to include somatic alterations, which can further enhance our understanding of the molecular basis for therapy-related long-term health issues in survivors. These future expansions will establish the St. Jude Survivorship Portal as a central resource for storing, analyzing, and visualizing pediatric cancer survivorship data.
Methods
Data Sources
The portal hosts data generated by the SJLIFE (December 2018 data freeze; refs. 9, 10) and the CCSS (January 2020 data freeze; refs. 11, 12). Both cohorts are retrospective cohorts with prospective follow-up of childhood cancer survivors who have survived at least 5 years following their diagnosis. Survivors in the SJLIFE cohort were diagnosed between 1962 and 2012 and treated at St. Jude Children’s Research Hospital. Survivors in the CCSS cohort were diagnosed between 1970 and 1999 and treated at one of 31 pediatric oncology institutions in the US and Canada (Supplementary Table S1).
Inclusion in the portal was determined as follows. For the SJLIFE cohort, all survivors with WGS data were included in the portal. Additionally, survivors without WGS data who visited the St. Jude campus for clinical assessments were also included. For the CCSS cohort, survivors with WGS data who were not SJLIFE survivors were included in the portal. WGS at >30× coverage was performed on germline DNA isolated from blood samples for the SJLIFE cohort (36) and from buccal or saliva samples for the CCSS cohort (37).
All data on the portal are baseline data collected from the baseline visits and/or surveys of survivors. The only exception is the data on chronic health conditions (i.e., graded adverse events). Chronic health condition data are longitudinal data collected through clinical assessments (SJLIFE) or surveys (CCSS) of survivors. For subsequent neoplasms, CCSS followed the self-reported occurrences with pathology-report request and confirmation. Chronic health conditions were graded using the Common Terminology Criteria for Adverse Events (CTCAE) scoring system (38) and the modified CTCAE scoring system (39) in the CCSS and SJLIFE cohorts, respectively.
Data Processing
Phenotypic and genetic data of survivors were processed through separate data processing pipelines in preparation for storage on the portal. Phenotypic data (e.g., demographic, clinical, and patient-reported data) were available as tabular text files. A data dictionary defining the hierarchy (i.e., parent–child relationships) of phenotypic variables was prepared using Microsoft Excel (RRID:SCR_016137). The phenotypic text files and data dictionary were then loaded into an SQL database using SQLite (https://sqlite.org/). For the genetic data, variants were called individually from germline WGS results mapped to the hg38/GRCh38 reference genome using the GATK HaplotypeCaller algorithm (RRID:SCR_001876) to generate one GVCF file per survivor. In a second step, variants were called from the GVCF files through a joint genotyping analysis (40). All variants were annotated with features including consequences, classification categories, reference SNP identification numbers, as well as population allele frequencies from curated resources such as gnomAD (16), TOPMed (41), and the SJLIFE community-control cohort (9).
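The parent–child structure of the data dictionary can be illustrated with a minimal SQLite sketch. The table name, column names, and example terms below are hypothetical and are not taken from the portal's actual schema; the sketch only shows how a hierarchy stored as (term, parent) rows could be traversed with a recursive query.

```python
import sqlite3

def build_dictionary(rows):
    """Load (term_id, parent_id, name) rows into an in-memory SQLite table.

    The schema is a hypothetical illustration, not the portal's schema.
    """
    con = sqlite3.connect(":memory:")
    con.execute(
        "CREATE TABLE terms (id TEXT PRIMARY KEY, parent_id TEXT, name TEXT)"
    )
    con.executemany("INSERT INTO terms VALUES (?, ?, ?)", rows)
    return con

def descendants(con, root_id):
    """Return the ids of all terms below root_id using a recursive CTE."""
    sql = """
    WITH RECURSIVE sub(id) AS (
        SELECT id FROM terms WHERE parent_id = ?
        UNION ALL
        SELECT t.id FROM terms t JOIN sub s ON t.parent_id = s.id
    )
    SELECT id FROM sub ORDER BY id
    """
    return [row[0] for row in con.execute(sql, (root_id,))]

# Hypothetical dictionary with two branches under a single root.
rows = [
    ("root", None, "All variables"),
    ("demo", "root", "Demographics"),
    ("sex", "demo", "Sex"),
    ("tx", "root", "Treatment"),
    ("anthra", "tx", "Anthracycline dose"),
]
con = build_dictionary(rows)
```

Calling `descendants(con, "demo")` on this toy dictionary returns only the variables nested under the demographics branch.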
Web Service Implementation
The portal runs on physical servers located within St. Jude Children’s Research Hospital. Server setup has an upfront hardware purchase fee and ongoing maintenance cost, which are both paid for by St. Jude. IT support is managed internally by St. Jude Information Services. User support and access are managed internally by St. Jude Cloud (Supplementary Fig. S1; ref. 13). Summary-level data are open-access on the portal, whereas individual-level data are protected and under access control. To access individual-level data, users must request access approval from the data access committee through St. Jude Cloud (Supplementary Tutorial; ref. 13).
Data visualization in the St. Jude Survivorship Portal is carried out by an extended ProteinPaint framework, which is written in JavaScript and whose server-side component runs in the Node.js environment (https://nodejs.org; ref. 42). All phenotypic data were queried from an SQL database using SQLite (https://sqlite.org). For genetic data, variant data were queried from BCF files using BCFtools (43), polygenic risk score data and genetic ancestry data were queried from the SQL database, LD data were queried from indexed tabular text files using tabix (44), and ancestry PCs were queried from unindexed tabular text files. The retrieved data were visualized on the portal using the D3.js package (http://d3js.org).
Quality Control and Classification of Genetic Variants
1. Genotype call rate (CR): The proportion of samples in which a genotype call could be made. High-quality variants were considered to have a CR ≥ 95%.
2. Coverage (COV) distribution: High-quality variants are assumed not to have unusual read coverage. The distribution of the normalized read coverage at each base (including the variant sites) in the population approximately follows a Gaussian distribution. To score each variant’s coverage distribution, we first found the Gaussian distribution that best fit the normalized coverage of the variant across all the samples. Next, we ran the Kolmogorov–Smirnov (KS) test to compare the data with the corresponding distribution, yielding a P value (pKS). High-quality variants were considered to have pKS > 1 × 10−20.
3. VAF distribution: The genotype of a germline variant can be in one of three states: homozygous for the alternate allele (HO, VAF = 1), heterozygous (HE, VAF = 0.5), or homozygous for the reference allele (HO, VAF = 0). Measured values deviate from these model values mostly due to sampling variation (in the case of HE) and sequencing errors (in the case of either HO). We modeled the VAF-COV distribution of those variants, f(c), as the convolution of a Gaussian representing coverage with a sum of three distributions: a Gaussian for the HE part and exponential distributions for the two HO parts.
where mv and sv are the mean and standard deviation, respectively, for the Gaussian distribution of VAF, and mc and sc are the mean and standard deviation, respectively, for the distribution of COV. φerr is the sampling error for HE and φHO is the sequencing error for HO. For each variant, we estimated the model parameters that best fit the distribution. A chi-squared (χ2) based score was used to test the goodness-of-fit. High-quality variants were considered to have a 13 < mc < 59, with χ2 < 22 for SNVs and χ2 < 67 for INDELs.
The above approach applies only to variants with 40 or more HE samples (NHE ≥ 40). Because the model parameters are very consistent across variants, it is possible to use averaged parameters of the model to score the remaining variants. We created a composite score, Zo, to model the distance of the variant from the mean of both COV and VAF distributions. The P value was calculated using a survival function. Fisher’s method was used to generate the combined probability (pz) for HE samples.
where and represent the VAF value and the COV of HE samples, respectively. High-quality variants were considered to have a 13 < mc < 59, with pz > 0.15 for SNVs and pz > 0.1 for INDELs.
4. HWE: Deviation from the HWE may be a sign of genotyping error. We used the HWE test to assess the deviation of the genotype frequencies of a population from the expected frequencies under the assumption of random mating and no selection. High-quality variants were considered to have an HWE P value (pHWE) > 1 × 10−6.
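As a minimal sketch of how the thresholds listed above could be applied, the following code combines the CR, pKS, and pHWE cutoffs into a single filter. The HWE P value is computed here with a one-degree-of-freedom chi-squared test, which is an assumption made for illustration; the text does not state which form of the HWE test was used. The function names are hypothetical.

```python
import math

def hwe_chi2_pvalue(n_ref_hom, n_het, n_alt_hom):
    """Chi-squared (1 df) test of Hardy-Weinberg equilibrium.

    Assumption: a chi-squared test stands in for the HWE test, whose exact
    form is not specified in the text.
    """
    n = n_ref_hom + n_het + n_alt_hom
    if n == 0:
        return 1.0
    p = (2 * n_ref_hom + n_het) / (2 * n)  # reference allele frequency
    q = 1.0 - p
    expected = (p * p * n, 2 * p * q * n, q * q * n)
    observed = (n_ref_hom, n_het, n_alt_hom)
    if min(expected) == 0:
        return 1.0  # monomorphic site: no deviation can be tested
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    # Survival function of the chi-squared distribution with 1 df.
    return math.erfc(math.sqrt(chi2 / 2.0))

def passes_basic_qc(call_rate, p_ks, p_hwe):
    """Apply the CR, pKS, and pHWE cutoffs stated in the list above."""
    return call_rate >= 0.95 and p_ks > 1e-20 and p_hwe > 1e-6
```

The VAF-based criteria (χ2 and pz) are omitted from this sketch because they depend on the fitted model parameters described in item 3.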
The pKS, χ2, and pz cutoffs used above were determined by selecting high-confidence SNVs and INDELs from our call set. SNVs were designated high-confidence if they overlapped with both high-confidence SNVs from the Genome in a Bottle (GIAB) consortium (45) and highly validated SNVs from the HapMap database (RRID:SCR_002846; ref. 46). INDELs were considered high-confidence if they overlapped with both GIAB high-confidence INDELs and best known INDELs from 1000G (47). We restricted GIAB high-confidence variants to those from high-confidence regions of a pilot genome (NA12878/HG001) that were also called from long read sequencing. GIAB variants that failed the filtering process and were not detected by long read sequencing were used as true negatives. The optimal cutoffs were then selected based on ROC analysis.
LD Analysis
Variants were selected for LD analysis if they were determined to be of high quality (according to classification scheme above). Any variants with a minor allele frequency below 5% were excluded from the analysis. The LD coefficient r2 was calculated for all variant pairs within 200 kb of each other using PLINK (RRID:SCR_001757; ref. 48). LD computations were performed separately for survivors of European ancestry and for survivors of African ancestry.
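The selection criteria and the r2 coefficient can be sketched as follows. This is an illustration only: r2 is computed here as the squared Pearson correlation of alternate-allele counts (0, 1, or 2, with None for missing calls), which may differ from the estimator PLINK applies under a given set of options. The function names are hypothetical.

```python
def minor_allele_frequency(genotypes):
    """MAF from alternate-allele counts (0/1/2); None marks a missing call."""
    called = [g for g in genotypes if g is not None]
    alt_freq = sum(called) / (2 * len(called))
    return min(alt_freq, 1.0 - alt_freq)

def eligible_pair(pos1, pos2, maf1, maf2, max_dist=200_000, min_maf=0.05):
    """Both variants have MAF >= 5% and lie within 200 kb of each other."""
    return maf1 >= min_maf and maf2 >= min_maf and abs(pos1 - pos2) <= max_dist

def r_squared(g1, g2):
    """Squared Pearson correlation of allele counts across samples."""
    pairs = [(a, b) for a, b in zip(g1, g2) if a is not None and b is not None]
    n = len(pairs)
    if n < 2:
        return float("nan")
    mean_a = sum(a for a, _ in pairs) / n
    mean_b = sum(b for _, b in pairs) / n
    var_a = sum((a - mean_a) ** 2 for a, _ in pairs)
    var_b = sum((b - mean_b) ** 2 for _, b in pairs)
    cov = sum((a - mean_a) * (b - mean_b) for a, b in pairs)
    if var_a == 0 or var_b == 0:
        return float("nan")  # a monomorphic variant has no defined r2
    return cov * cov / (var_a * var_b)
```

Because the computations were run separately by ancestry, such a function would be applied to the genotype vectors of one ancestry group at a time.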
Genetic Ancestry
Genotypes for 12,000 loci selected with low LD across three continental reference populations (49) were extracted from the whole-genome sequencing data. Similarly, the genotype data were extracted from the 1000 Genomes Project phase 3 version 5 data for the CEU (Utah residents with Northern and Western European ancestry, n = 99), JPT+CHB (Japanese and Chinese, n = 207), and YRI (Yoruba, Nigeria, n = 108) populations (RRID:SCR_008801). These were subsequently used as reference populations for the STRUCTURE analysis to infer the admixture coefficients for each survivor (RRID:SCR_017637, https://code.google.com/archive/p/glu-genetics/). Survivors with ≥80% CEU ancestry were genetically imputed as individuals of European ancestry. Survivors with <80% CEU and <10% JPT+CHB ancestry were imputed as individuals of African ancestry. Survivors with <80% CEU and <10% YRI ancestry were imputed as individuals of Asian ancestry. All other survivors were imputed as individuals of Multi-Ancestry-Admixed ancestry.
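The assignment rules can be written as a short decision function. The sketch below applies the thresholds exactly as worded above and assumes that the three admixture coefficients of a survivor sum to 1, in which case the rules are mutually exclusive. The function name is hypothetical.

```python
def classify_ancestry(ceu, yri, jpt_chb):
    """Assign an ancestry label from admixture coefficients (fractions).

    Assumption: ceu + yri + jpt_chb is approximately 1 for each survivor.
    """
    if ceu >= 0.80:
        return "European"
    if jpt_chb < 0.10:
        return "African"
    if yri < 0.10:
        return "Asian"
    return "Multi-Ancestry-Admixed"
```

For example, a survivor with coefficients (0.40, 0.30, 0.30) meets none of the first three rules and is labeled Multi-Ancestry-Admixed.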
PC Analysis
We performed PC analysis (PCA) on four sample groups: all survivors of European ancestry (SJLIFE + CCSS), all survivors of African ancestry (SJLIFE + CCSS), SJLIFE survivors of European ancestry, and CCSS survivors of European ancestry. SJLIFE-specific and CCSS-specific PCA of survivors of African ancestry was not performed due to limited sample size. PCA was performed with 12,000 informative SNPs across the genome using smartpca from the software package EIGENSTRAT (50). Related samples were removed before the analysis. The top 10 PCs were used to adjust for population stratification in analyses on the portal.
Polygenic Risk Scores
The PRSs on our portal were computed using published datasets from the PGS catalog (https://www.pgscatalog.org; ref. 51). We downloaded all 3,271 PGS datasets available in the catalog (accessed April 2023). We used the variants and effect weights from each dataset to compute novel PRSs for survivors on the portal. Duplicated, ambiguous, and non-autosomal variants were discarded from the dataset prior to PRS computation. The population of survivors used for PRS computation was determined by the distribution of ancestries within a given PGS dataset. If the PGS dataset was developed from a specific ancestry (e.g., European ancestry), two populations were constructed: one for survivors of the specific ancestry and a second for all survivors. If the PGS dataset was developed from multiple ancestries (e.g., European and African ancestries), then a single population encompassing all survivors was constructed. The PRS was then computed twice for each of the constructed populations: once excluding variants whose minor allele frequency (MAF) in the given population was less than 1%, and once using all variants regardless of their MAF. Accordingly, for a PGS dataset developed from a specific ancestry, four PRSs were computed: a PRS for survivors of the specific ancestry (MAF ≥ 1%), a PRS for survivors of the specific ancestry (no MAF cutoff), a PRS for all survivors (MAF ≥ 1%), and a PRS for all survivors (no MAF cutoff). For a PGS dataset developed from multiple ancestries, two PRSs were computed: a PRS for all survivors (MAF ≥ 1%) and a PRS for all survivors (no MAF cutoff).
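The two PRS variants computed per population can be sketched as below. The sketch assumes the common additive form, in which a score is the sum of effect weight × effect-allele count over the variants of a PGS dataset; the text does not spell out the formula, so this form, the function name, and the example values are assumptions.

```python
def compute_prs(weights, dosages, maf=None, maf_cutoff=None):
    """Additive PRS for one survivor.

    weights: {variant_id: effect weight from the PGS dataset}
    dosages: {variant_id: effect-allele count (0, 1, or 2) for the survivor}
    maf:     {variant_id: MAF in the constructed population}
    maf_cutoff: variants with MAF below this value are skipped; None keeps all.
    """
    score = 0.0
    for variant_id, weight in weights.items():
        if variant_id not in dosages:
            continue  # variant not genotyped for this survivor
        if maf_cutoff is not None and maf.get(variant_id, 0.0) < maf_cutoff:
            continue
        score += weight * dosages[variant_id]
    return score

# Hypothetical three-variant PGS dataset and one survivor's genotypes.
weights = {"v1": 0.5, "v2": -0.2, "v3": 1.0}
dosages = {"v1": 2, "v2": 1, "v3": 1}
maf = {"v1": 0.30, "v2": 0.20, "v3": 0.005}

prs_no_cutoff = compute_prs(weights, dosages)
prs_maf_filtered = compute_prs(weights, dosages, maf=maf, maf_cutoff=0.01)
```

In this toy example, the rare variant `v3` contributes to the score without a MAF cutoff but is dropped from the MAF-filtered score.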
Cumulative Incidence Analysis
Cumulative incidence analyses on the portal were performed on longitudinal chronic health condition data (see “Data Sources”). For each analysis, a specific chronic health condition (e.g., cardiomyopathy) and a specific range of grades (e.g., grades 3–5) are defined by the user. We developed a workflow to assign each survivor an event status and a time-to-event value for a given condition at a given range of grades. First, we discarded any occurrences of the condition that occurred prior to cancer diagnosis, as these occurrences are not late effects of cancer or its therapy. Second, we computed the proportion of survivors who had the condition prior to the start of cohort follow-up (i.e., 5 years post cancer diagnosis). Note that the cumulative incidence curve starts at this value to indicate the proportion of survivors who have had the condition by the start of follow-up. Finally, event status codes were assigned to each survivor as follows: censored individuals were assigned a code of 0, individuals with the condition were assigned a code of 1, and individuals with a competing risk event (death that is not a grade-5 condition) were assigned a code of 2. Time-to-event values were assigned as follows: for censored individuals, years from the start of follow-up to the last follow-up; for individuals with the condition, years from the start of follow-up to the first occurrence of the condition; for individuals with a competing risk event, years from the start of follow-up to death.
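The assignment of event status codes and time-to-event values can be sketched as a single function. All times are expressed in years since cancer diagnosis, and the occurrences passed in are assumed to be already restricted to the user-defined grade range. Assigning a time of 0 to survivors who had the condition before the start of follow-up is an assumption of this sketch, chosen so that a curve would start at the baseline proportion; the function name is hypothetical.

```python
def assign_event(occurrences, death, last_followup, start=5.0):
    """Return (event status code, years from start of follow-up).

    occurrences:   years from diagnosis to each occurrence of the condition
                   (negative values precede the cancer diagnosis)
    death:         years from diagnosis to a death that is not a grade-5
                   condition, or None if no such death occurred
    last_followup: years from diagnosis to the last follow-up
    start:         start of cohort follow-up (5 years post diagnosis)
    """
    # Occurrences before cancer diagnosis are not late effects; discard them.
    valid = sorted(t for t in occurrences if t >= 0)
    if valid:
        first = valid[0]
        if first < start:
            return (1, 0.0)  # assumption: counted at the start of follow-up
        return (1, first - start)
    if death is not None:
        return (2, death - start)  # competing risk event
    return (0, last_followup - start)  # censored at last follow-up
```

For example, a survivor whose first qualifying occurrence is 12 years after diagnosis receives status 1 and a time-to-event value of 7 years.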
Survivors’ event status codes and time-to-event values were then used to compute the cumulative incidence of the condition using the R package cmprsk. When survivors were stratified into multiple series, multiple cumulative incidence curves were estimated, and these curves were compared in pairwise fashion using Gray’s test (52). Permutation of Gray’s test statistic was performed if the sample size of the series comparison was below a certain cutoff. This cutoff was determined by setting up a 2 × 2 contingency table between event status code (0 or 1) and series (series 1 or series 2). If the expected count of any cell in the contingency table was less than 5, then permutation of Gray’s test statistic was performed. For each permutation, the assignments of survivors to series were shuffled, the cumulative incidence of each series was recomputed, and the resulting cumulative incidence curves were compared using Gray’s test. Initially, 100 permutations were conducted to generate 100 permuted test statistics that were used to compute a two-sided P value. We then used a series of checks to determine if additional permutations were necessary. If the computed P value was greater than 0.2, then no additional permutations were deemed necessary since it was unlikely that additional permutations would yield a statistically significant P value. If the P value was less than or equal to 0.2, then an additional 100 permutations were performed (200 in total). If the new P value was less than or equal to 0.1, then an additional 300 permutations were performed (500 in total). If the new P value was less than or equal to 0.05, then an additional 500 permutations were performed, making a total of 1,000 permutations. The P value obtained from these checks was used as the reported P value for the permuted Gray’s test.
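The expected-count check and the staged permutation scheme can be sketched as follows. Gray’s test itself is not implemented here: `statistic` is a hypothetical placeholder for any function that recomputes the test statistic from a shuffled assignment of survivors to series. The permutation P value is taken as the proportion of permuted statistics at least as extreme as the observed one, which is an assumption because the exact formula is not given in the text.

```python
import random

def needs_permutation(table):
    """True if any expected cell count of a 2 x 2 table is below 5.

    table: [[a, b], [c, d]] with event status (0 or 1) by series (1 or 2).
    """
    total = sum(sum(row) for row in table)
    row_sums = [sum(row) for row in table]
    col_sums = [table[0][j] + table[1][j] for j in range(2)]
    return any(
        row_sums[i] * col_sums[j] / total < 5
        for i in range(2)
        for j in range(2)
    )

def adaptive_permutation_pvalue(statistic, labels, observed, rng=None):
    """Staged permutations: 100 first, then +100, +300, +500 as needed."""
    rng = rng or random.Random(0)
    permuted = []

    def run(n):
        for _ in range(n):
            shuffled = list(labels)
            rng.shuffle(shuffled)  # reassign survivors to series
            permuted.append(statistic(shuffled))

    def pvalue():
        hits = sum(1 for s in permuted if abs(s) >= abs(observed))
        return hits / len(permuted)

    run(100)
    p = pvalue()
    for threshold, extra in ((0.2, 100), (0.1, 300), (0.05, 500)):
        if p > threshold:
            break  # further permutations are unlikely to change the outcome
        run(extra)
        p = pvalue()
    return p, len(permuted)
```

Stopping early when the P value is clearly non-significant keeps the cost low for most comparisons, while comparisons near the significance threshold receive up to 1,000 permutations.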
Regression Analysis
Three types of regression analysis are supported on the portal: linear, logistic, and Cox. These analysis types are distinguished by the type of outcome variable used in the analysis. In linear regression, the outcome variable is a continuous variable. In logistic regression, the outcome variable is a binary variable, which can be either a categorical variable with two categories or a binned continuous variable with two bins. In Cox regression, the outcome variable is a time-to-event variable that has both a time component and an event component. The time-to-event variables that are available on the portal are the CTCAE-graded chronic health conditions. The time and event components for these conditions were computed using a workflow similar to that described above for cumulative incidence analysis. The only differences are that for the Cox regression workflow, (i) survivors with an event before the beginning of follow-up (i.e., 5 years post cancer diagnosis) were discarded, and (ii) survivors with competing risk events (event status code = 2) were re-coded as censored (event status code = 0). Note that this does not mean that competing risk events are ignored. Cox regression estimates the (hazard) rate of an event of interest, which quantifies the “force” of the event occurrence while alive, and it is therefore appropriate to terminate the at-risk time at occurrences of either censoring or a competing-risk event (death).
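The two differences in the Cox regression workflow amount to a small recoding step, sketched below. The record layout (dictionaries with `status`, `time`, and `event_before_followup` keys) is a hypothetical representation chosen for illustration.

```python
def recode_for_cox(records):
    """Prepare cumulative-incidence records for Cox regression.

    (i)  Survivors with an event before the start of follow-up are dropped.
    (ii) Competing risk events (status 2) are re-coded as censored (status 0),
         so at-risk time ends at death without counting as an event.
    """
    recoded = []
    for record in records:
        if record["event_before_followup"]:
            continue
        status = 0 if record["status"] == 2 else record["status"]
        recoded.append({"status": status, "time": record["time"]})
    return recoded
```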
All regression analyses were implemented in the R statistical environment (https://www.r-project.org, RRID:SCR_001905). In R, the lm() function was used for linear regression, the glm() function was used for logistic regression, and the coxph() function from the survival package was used for Cox regression. When the analysis contained a genetic locus variable as an independent variable, the regression analysis was performed as follows: First, to control for ancestry-specific genetic architecture and population stratification, the analysis was restricted to survivors of a user-defined ancestry (i.e., European ancestry or African ancestry). To further control for stratification, the top 10 precomputed ancestry PCs were automatically added to the regression analysis as independent variables. Second, a separate regression model was fit for each genetic variant within the locus represented by the genetic variable (while keeping all other independent variables constant). Doing so allowed each variant to be analyzed separately; otherwise, the analysis would have considered all of the variants as independent variables within the same model. Third, rare variants (i.e., variants below a user-specified allele frequency cutoff) were handled differently in the analysis depending on the type of regression analysis. For linear regression, rare variants were analyzed by using the Wilcoxon rank sum test to compare the distribution of the outcome variable between survivors who either carried or did not carry the effect allele. For logistic regression, rare variants were analyzed by using the Fisher’s Exact Test to determine the association between the outcome variable categories and the presence/absence of the effect allele. For Cox regression, rare variants were analyzed by using the cumulative incidence analysis to estimate the cumulative incidence of the outcome variable for survivors who either carried or did not carry the effect allele. 
These curves were then compared to one another using Gray’s test. The Wilcoxon rank sum test was implemented in R, the Fisher’s Exact Test was implemented using the “fishers_exact” crate in Rust (https://docs.rs/fishers_exact/latest/fishers_exact/), and the cumulative incidence analysis was implemented as described above.
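For the logistic regression case, the rare-variant test can be illustrated with a small two-sided Fisher’s exact test on a 2 × 2 table of outcome category by carrier status. This is a plain-Python sketch for illustration and is not the Rust crate used on the portal; the two-sided P value is computed by summing the probabilities of all tables that are no more likely than the observed one, which is one common convention.

```python
from math import comb

def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test for the table [[a, b], [c, d]].

    Rows: outcome present / absent. Columns: carrier / non-carrier of the
    effect allele (the orientation does not affect the P value).
    """
    row1, row2, col1 = a + b, c + d, a + c
    n = row1 + row2

    def prob(x):
        # Hypergeometric probability of x in the top-left cell.
        return comb(row1, x) * comb(row2, col1 - x) / comb(n, col1)

    low = max(0, col1 - row2)
    high = min(row1, col1)
    p_observed = prob(a)
    # Small relative tolerance guards against floating-point ties.
    total = sum(
        prob(x) for x in range(low, high + 1)
        if prob(x) <= p_observed * (1 + 1e-7)
    )
    return min(1.0, total)
```

For example, `fisher_exact_two_sided(3, 0, 0, 3)` evaluates the most extreme table possible with six individuals split evenly by outcome and carrier status.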
Haplotype Analysis
SJLIFE and CCSS variants within ±1 Mb of the rs6689879 SNP (the target region) were phased using SHAPEIT5 (53). The phasing process involved two steps. First, common variants (MAF ≥ 0.01) were phased in the extended region (i.e., ±4 Mb of rs6689879) using the module “SHAPEIT5_phase_common” to generate the haplotype scaffold. Second, all variants (including those with MAF < 0.01) were phased in the target region using the module “SHAPEIT5_phase_rare” and the haplotype scaffold from the first step. Haplotypes were constructed using the six selected risk variants (Fig. 5; Supplementary Fig. S8). Common haplotypes were then selected for genetic association analysis by determining their frequency within the cohort of 2,589 survivors (2,189 of European ancestry and 400 of African ancestry) used in the genetic association analysis in Fig. 5. Those haplotypes that existed in at least 5 out of the 5,178 chromosomes in the cohort were selected for genetic association analysis. Genetic association analysis was performed in the same way as described in Fig. 5, except that the constructed haplotypes were added as covariates to the regression model. To test whether the effect of each haplotype was independent of rs6689879, we also conducted a conditional analysis by adding rs6689879 as a covariate to the regression model.
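The haplotype selection step can be sketched as a simple count over phased chromosomes. Each survivor is represented by a pair of haplotype strings over the six variants (1 for the alternative allele, 0 for the reference allele, matching labels such as h111111). Coding each haplotype covariate as the number of copies carried (0, 1, or 2) is an assumption of this sketch, and the function names are hypothetical.

```python
from collections import Counter

def common_haplotypes(phased, min_count=5):
    """Return {haplotype: frequency} for haplotypes seen on >= min_count
    chromosomes. phased is a list of (hap1, hap2) pairs, one per survivor.
    """
    counts = Counter(hap for pair in phased for hap in pair)
    n_chromosomes = 2 * len(phased)
    return {
        hap: count / n_chromosomes
        for hap, count in counts.items()
        if count >= min_count
    }

def haplotype_dosage(pair, hap):
    """Number of copies (0, 1, or 2) of a haplotype carried by a survivor."""
    return sum(1 for h in pair if h == hap)
```

With 2,589 survivors, the minimum count of 5 corresponds to a frequency of about 0.1% of the 5,178 chromosomes.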
Data Availability
All data are available on the St. Jude Survivorship Portal (https://survivorship.stjude.cloud). The code used to develop the portal is available on Github (https://github.com/stjude/proteinpaint). Hyperlinks to portal sessions and video tutorials for every use-case can be found in Supplementary Table S5 and at https://proteinpaint.stjude.org/survivorship/paper2024/.
Authors’ Disclosures
K. Shelton reports grants from NIH and other support from American Lebanese Syrian Associated Charities during the conduct of the study. C. Im reports grants from US National Cancer Institute outside the submitted work. K.K. Ness reports grants from National Institutes of Health during the conduct of the study. G.T. Armstrong reports grants from NIH during the conduct of the study. M.M. Hudson reports grants from National Cancer Institute during the conduct of the study. J. Zhang reports grants from National Cancer Institute during the conduct of the study. No disclosures were reported by the other authors.
Authors’ Contributions
G.Y. Matt: Conceptualization, resources, data curation, software, formal analysis, investigation, visualization, methodology, writing–original draft, writing–review and editing. E. Sioson: Conceptualization, resources, data curation, software, visualization, methodology. K. Shelton: Conceptualization, resources, data curation, software, visualization, methodology, writing–review and editing. J. Wang: Conceptualization, resources, data curation, software, visualization, methodology, writing–review and editing. C. Lu: Resources, software, visualization, methodology, writing–review and editing. A. Zaldivar Peraza: Resources, software, visualization, methodology, writing–review and editing. K. Gangwani: Resources, software, visualization, methodology, writing–review and editing. R. Paul: Resources, software, visualization, methodology. C. Reilly: Resources, software, visualization, methodology. A. Acić: Conceptualization, resources, data curation, software, visualization, methodology, writing–review and editing. Q. Liu: Resources, software, methodology. S.R. Sandor: Resources, software, methodology. C. McLeod: Resources, software, methodology. J. Patel: Resources, software, visualization, methodology. F. Wang: Resources, formal analysis, investigation, methodology, writing–review and editing. C. Im: Conceptualization, resources, methodology, writing–review and editing. Z. Wang: Conceptualization, resources, methodology, writing–review and editing. Y. Sapkota: Resources, formal analysis, validation, methodology, writing–review and editing. C.L. Wilson: Conceptualization, resources, methodology, writing–review and editing. N. Bhakta: Conceptualization, resources, methodology. K.K. Ness: Conceptualization, resources, supervision, funding acquisition, methodology, project administration, writing–review and editing. G.T. Armstrong: Conceptualization, resources, supervision, funding acquisition, methodology, project administration, writing–review and editing. M.M. Hudson: Conceptualization, resources, supervision, funding acquisition, methodology, project administration, writing–review and editing. L.L. Robison: Conceptualization, resources, supervision, funding acquisition, methodology, project administration, writing–review and editing. J. Zhang: Conceptualization, resources, formal analysis, supervision, funding acquisition, validation, investigation, methodology, writing–original draft, project administration, writing–review and editing. Y. Yasui: Conceptualization, resources, formal analysis, supervision, funding acquisition, validation, investigation, methodology, writing–original draft, project administration, writing–review and editing. X. Zhou: Conceptualization, resources, data curation, software, formal analysis, supervision, funding acquisition, validation, investigation, visualization, methodology, writing–original draft, project administration, writing–review and editing.
Acknowledgments
We thank Sarah August for helpful comments on the manuscript. This work was supported by: NIH U01 CA195547, M.M. Hudson/K.K. Ness, principal investigators; NIH R01 CA216354, J. Zhang/Y. Yasui, principal investigators; NIH R01 CA270157, N. Bhakta/Y. Yasui, principal investigators; NIH U24 CA55727, G.T. Armstrong, principal investigator; NIH/NCI R01 CA261898, Y. Sapkota, principal investigator; Cancer Center Support [CORE] Grant CA21765, C. Roberts, principal investigator; and the American Lebanese Syrian Associated Charities. We also thank St. Jude Information Services for providing hardware support, including systems administration and maintenance of the servers used by the portal.
Note: Supplementary data for this article are available at Cancer Discovery Online (http://cancerdiscovery.aacrjournals.org/).