Ongoing epidemiologic studies represent national (and international) treasures of data that are nearly impossible to replicate, with information spanning behavior, environment, genetics, genomics, and health. Many existing cancer cohorts uniquely enable a full understanding of cancer etiology and prognosis by integrating multiple data types and providing data across years of follow-up, often both before and after cancer diagnosis. This combination of many data types and many years of follow-up is critical to effective research and discovery, since cancers develop and progress over years, if not decades, and result from the combined influence of behavior, environment, and genomics. However, making complex epidemiologic data simple to use in full for an array of interdisciplinary scientists is a large endeavor. In particular, many cohorts still operate with data management, access, and analytic systems that date to their initiation, many years before modern data science and computing approaches existed. Systems for leveraging cohorts need updating on many fronts; as a case study, the Nurses’ Health Study has been working on the following set of projects to create new, cloud-based platforms for research:

1. Data Management: (i) organize all data into a small number of files (e.g., one for questionnaire data, one for disease data, one for biomarker data) that are easy to find and use; (ii) harmonize data across all time periods of follow-up; and (iii) clean all data to remove stray codes and notations.
2. Data Visualization: develop software that generates visualizations of participant profiles, spanning behavior, environment, genomics, and diagnoses/health outcomes.
3. Data Analysis: create a suite of interfaces to open-source tools enabling researchers to: (i) create phenotypes and behavioral variables; (ii) access, filter, and merge biologic and genomic data; (iii) integrate cancer data; and (iv) conduct statistical analyses.
4. Training/Education: develop documentation and educational materials for all data, tools, and software, designed to empower investigators and reduce extraneous effort in scientific exploration of epidemiologic data.

There have been many initial successes (examples will be presented), heavily supported by teaming with data science experts from prominent institutions and industry, who bring modern tools, software, and knowledge to epidemiology and computing. Ongoing challenges include identifying funding sources for such large infrastructure projects, especially data management tasks, which are not highly marketable. Psychological barriers also exist in convincing multiple generations of investigators to learn new systems and skills. Overall, modernizing large cancer cohorts is an intimidating but crucial venture that will allow existing cancer epidemiology cohorts to be fully leveraged to accelerate novel research into cancer prevention and treatment.

Citation Format: Francine Grodstein. Leveraging modern data science to optimize discovery within epidemiologic treasures: Case study—the Nurses’ Health Study [abstract]. In: Proceedings of the AACR Special Conference on Modernizing Population Sciences in the Digital Age; 2019 Feb 19-22; San Diego, CA. Philadelphia (PA): AACR; Cancer Epidemiol Biomarkers Prev 2020;29(9 Suppl):Abstract nr IA21.