Background: Integrating electronic health records (EHRs) with population-based cancer registry data offers tremendous potential for studying the cancer continuum from screening to diagnosis and treatment. Registries provide detailed diagnosis and tumor characteristics and initial treatment summaries (even across health care systems), while EHRs contain rich clinical detail such as medical histories and preventive screenings. Moreover, to the extent that EHR systems are comparable across health care systems, pooling EHRs from multiple systems can enhance the generalizability of findings. There is also the potential to integrate neighborhood or environmental data by geocoding patient addresses, supporting research on multilevel exposures. With patient populations frequently in the millions, the opportunities for highly powered scientific discovery are nearly unparalleled. However, the logistical complexity, the cost, and the potential for systematic error to hamper findings may be considerable. Recently, we linked the EHRs of a large, multispecialty health care delivery system with the California Cancer Registry (CCR) and established two retrospective cohort studies. We describe the process undertaken, some challenges and lessons learned, and the resulting cohort populations.

Methods: Using a probabilistic linkage process, we combined Sutter Health EHR data with the CCR to identify incident cancer cases occurring among the EHR population. We then established two retrospective cohorts, each with 13 years of follow-up time and distinct scientific objectives. The mammogram cohort is designed to study patterns of screening utilization and breast cancer-related outcomes. The lung cancer cohort is designed to elucidate risk factors for lung cancer in never-smokers, an uncommon diagnosis that has not been successfully studied in other data sources because of missing smoking status, small sample sizes, or both.
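
The abstract does not specify the linkage software, matching fields, or thresholds used. As a rough illustration of how probabilistic linkage generally works, the following minimal sketch scores a candidate record pair with Fellegi-Sunter style agreement weights. The field names, the m- and u-probabilities, and the decision thresholds are all hypothetical values chosen for illustration; they are not taken from the study.

```python
import math

# Hypothetical per-field parameters (NOT from the study):
#   m = P(field agrees | records are a true match)
#   u = P(field agrees | records are not a match)
FIELD_PARAMS = {
    "last_name": (0.95, 0.01),
    "birth_date": (0.97, 0.003),
    "zip_code": (0.90, 0.05),
}


def match_weight(record_a, record_b, params=FIELD_PARAMS):
    """Sum log2 likelihood ratios over the compared fields."""
    total = 0.0
    for field, (m, u) in params.items():
        a, b = record_a.get(field), record_b.get(field)
        if a is None or b is None:
            # Missing values contribute no evidence either way.
            continue
        if a == b:
            total += math.log2(m / u)  # agreement weight (positive)
        else:
            total += math.log2((1 - m) / (1 - u))  # disagreement weight (negative)
    return total


def classify(weight, upper=8.0, lower=0.0):
    """Apply two illustrative thresholds; the middle band goes to manual review."""
    if weight >= upper:
        return "match"
    if weight <= lower:
        return "non-match"
    return "possible match"


# Illustrative records only; no real patient data.
ehr = {"last_name": "GARCIA", "birth_date": "1960-04-12", "zip_code": "94301"}
ccr_same = {"last_name": "GARCIA", "birth_date": "1960-04-12", "zip_code": "94301"}
ccr_dob_diff = {"last_name": "GARCIA", "birth_date": "1960-04-21", "zip_code": "94301"}
ccr_other = {"last_name": "NGUYEN", "birth_date": "1975-09-02", "zip_code": "95814"}

for candidate in (ccr_same, ccr_dob_diff, ccr_other):
    w = match_weight(ehr, candidate)
    print(f"{w:7.2f}  {classify(w)}")
```

In a sketch like this, exact equality stands in for the approximate string comparators, blocking, and estimated parameters that production linkage tools typically use, and pairs falling between the two thresholds would normally be sent to clerical review.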

Results: From an adult EHR population of 4.5 million, we identified 306,554 unique cancers diagnosed between 2000 and 2013. Linkage efforts, including multiple institutional approvals, took 18 months to complete. To evaluate external validity, we compared the identified cases with the rest of the CCR patients in the Sutter Health catchment region. Inclusion/exclusion criteria resulted in a mammogram cohort of 516,419 screened women, 11,222 of whom were diagnosed with breast cancer, and a lung cancer cohort of 1.3 million male and female never-smokers, 1,143 of whom were diagnosed with lung cancer. For the latter, we also undertook patient address geocoding and linkage with neighborhood and environmental data. Multiple imputation and sensitivity analysis will be used to explore the impact of missing EHR data, a common issue in such studies.

Conclusions: Our experience lays a foundation to encourage and inform researchers interested in working with EHRs for cancer research, and provides context for improvement and scalability.

This abstract is also being presented as Poster A27.

Citation Format: Caroline A. Thompson, Mindy DeRouen, Su-Ying Liang, Anqi Jin, Harold S. Luft, Daphne Lichtensztajn, Salma Shariff-Marco, Iona Cheng, Scarlett L. Gomez. Linking electronic health records with cancer registry data for epidemiology and health services research: Challenges and opportunities [abstract]. In: Proceedings of the AACR Special Conference on Modernizing Population Sciences in the Digital Age; 2019 Feb 19-22; San Diego, CA. Philadelphia (PA): AACR; Cancer Epidemiol Biomarkers Prev 2020;29(9 Suppl):Abstract nr PR08.