Abstract
Background: A key strength of population sciences and epidemiology is the breadth and depth of the data that studies, especially prospective cohorts, collect over time. However, the publications and websites of these studies tend to present collapsed or aggregate data in traditional tables, which often understates the true complexity of the information in those data. One goal of the NIH Strategic Plan for Data Science is to “support the development and dissemination of advanced data management, analytics, and visualization tools.” We recently developed a robust, innovative, and scalable data visualization environment for the California Teachers Study (CTS), a prospective cancer epidemiology cohort.
Methods: To end the practice of storing CTS data in a collection of disparate data silos, we constructed a data warehouse that compiles two decades of study data and eliminates additional harmonization steps. We then developed a data commons, a remotely accessible environment where cohort data can be used by our researchers and study collaborators. This allows questionnaire-based data, cancer registry information, and hospitalization and mortality data from our study participants to be easily accessed in a secure environment. The data warehouse serves as a single unified source of information. Coupled with the data visualization tool Tableau, it was configured to support the following key epidemiologic cohort activities: reports, cohort selection, calculated values, and exploring and transforming data variables for analysis.
Results: By sorting and filtering on specific covariates, we can define subsets of the population for cohort selection and endpoint-specific analysis. Drag-and-drop interactive reports and calculated variables created within Tableau let us visualize time-between-event data, such as time from study entry to a cancer diagnosis, multiple cancer diagnoses, hospitalizations, or mortality, without requiring programmatic back-end coding.
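For readers who prefer to see the logic spelled out, the two operations described above (filtering on a covariate to select a cohort, then deriving a time-between-event variable) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the records, field names, and the age cutoff are hypothetical, and the sketch does not reproduce the CTS data warehouse or the Tableau calculations used in the study.

```python
from datetime import date

# Hypothetical participant records; field names and values are illustrative only.
participants = [
    {"id": 1, "entry": date(1995, 10, 1), "dx": date(2003, 4, 15), "age_at_entry": 52},
    {"id": 2, "entry": date(1996, 1, 20), "dx": None, "age_at_entry": 38},
    {"id": 3, "entry": date(1995, 11, 5), "dx": date(1999, 11, 5), "age_at_entry": 61},
]


def select_cohort(records, min_age):
    """Cohort selection: keep participants at or above an age cutoff at entry."""
    return [r for r in records if r["age_at_entry"] >= min_age]


def years_to_diagnosis(record):
    """Calculated variable: years from study entry to first cancer diagnosis.

    Returns None when the participant has no recorded diagnosis.
    """
    if record["dx"] is None:
        return None
    return round((record["dx"] - record["entry"]).days / 365.25, 2)


cohort = select_cohort(participants, min_age=50)
times = {r["id"]: years_to_diagnosis(r) for r in cohort}
```

In a visualization tool the same two steps would typically be expressed as a filter and a calculated field, so no code of this kind needs to be written by the analyst.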
Conclusion: The creation of a data warehouse and data commons has given the CTS the ability to visualize information about the demographics and health-related patterns of our study participants, leading to faster answers to complex questions about our data. The interactive data visualizations that we have created allow users to identify new patterns easily and to drill down into the data for a more detailed look. Defining analytic populations through cohort selection and adding geospatial data to our data warehouse has already revealed previously unidentified patterns, simply by presenting the same information in a more accessible and innovative way. As epidemiology uses and shares more of its data, it is essential to help the widest possible community understand those data in detail. Data visualization can provide specific benefits to essential population sciences research activities.
Citation Format: Nadia T. Chung, James V. Lacey Jr., Emma Spielfogel, Paul Hughes, Sandeep Chandra, Elena Martinez, Jennifer Benbow. Opportunities for and benefits of incorporating data visualization in population sciences [abstract]. In: Proceedings of the AACR Special Conference on Modernizing Population Sciences in the Digital Age; 2019 Feb 19-22; San Diego, CA. Philadelphia (PA): AACR; Cancer Epidemiol Biomarkers Prev 2020;29(9 Suppl):Abstract nr A01.