Abstract
Introduction: The need to rapidly collect, integrate, and share data on COVID-19 patients with cancer at scale has given rise to multiple internal and cross-institutional research registries. These registries support use cases that require data at different levels of granularity and are built using mixed standards. Ensuring the semantic interoperability and quality of these data is critical for generating reliable and reproducible evidence. At Memorial Sloan Kettering Cancer Center (MSK), we created a framework that enabled the rapid development of semantically compatible COVID and cancer registries and data exchange between them.
Background: Handling and harmonizing real-world data for COVID and cancer research presented typical challenges: maintaining complex patient cohorts; reconciling different levels of temporal and semantic granularity; supporting crosswalks between different representations without information loss; and sharing the data internally and with research consortia. Solving these challenges for COVID and cancer studies necessitated advanced infrastructure and harmonization solutions.
Methods: We used MSK Extract, our research platform, to create an integrated COVID and cancer data research framework. It included a library of reusable, standardized REDCap instruments used in multiple REDCap instances supporting individual research studies; a PostgreSQL database containing patient cohorts and data from electronic health records (EHR) standardized to OMOP; and ETL pipelines. Our approach to REDCap design and data management allowed for combined sets of detailed, atomic, and aggregate-level data through a combination of abstraction, curation, and extraction of data from different sources. We developed a methodology for reconciling the initial curation, the available raw data, and the subsequent abstraction. We enforced consistent temporal constraints on data extraction and curation. We used the OMOP vocabulary for semantic harmonization, mapping metadata from internal and external registries to OMOP concepts. We linked procedure and medication codes to high-level treatment groups, leveraging classifications available in the OMOP vocabulary.
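As an illustration of the last step, the sketch below shows one way medication concepts could be rolled up to high-level treatment groups through the OMOP vocabulary's concept hierarchy. It is not the pipeline used in this work. The table and column names (`concept`, `concept_ancestor`, `concept_class_id`) follow the OMOP vocabulary tables, but the concept IDs, the drug names, the in-memory SQLite database (used here instead of PostgreSQL so that the example is self-contained), and the helper names `build_demo_vocabulary` and `treatment_groups` are assumptions made for the example.

```python
import sqlite3


def build_demo_vocabulary() -> sqlite3.Connection:
    """Create a tiny in-memory stand-in for two OMOP vocabulary tables.

    All concept IDs and drug names are placeholders, not real OMOP content.
    """
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE concept (
            concept_id       INTEGER PRIMARY KEY,
            concept_name     TEXT,
            vocabulary_id    TEXT,
            concept_class_id TEXT
        );
        CREATE TABLE concept_ancestor (
            ancestor_concept_id   INTEGER,
            descendant_concept_id INTEGER
        );
        """
    )
    conn.executemany(
        "INSERT INTO concept VALUES (?, ?, ?, ?)",
        [
            (1, "drug A 10 mg tablet", "RxNorm", "Clinical Drug"),
            (2, "drug B 5 mg/mL injection", "RxNorm", "Clinical Drug"),
            (3, "drug C 20 mg capsule", "RxNorm", "Clinical Drug"),
            (100, "ANTINEOPLASTIC AGENTS", "ATC", "ATC 2nd"),
            (200, "IMMUNOSUPPRESSANTS", "ATC", "ATC 2nd"),
        ],
    )
    # Each row states that the descendant concept falls under the ancestor.
    conn.executemany(
        "INSERT INTO concept_ancestor VALUES (?, ?)",
        [(100, 1), (100, 2), (200, 3)],
    )
    return conn


def treatment_groups(conn, concept_ids, class_id="ATC 2nd"):
    """Map each concept ID to the names of its ancestors of the given class.

    Concepts with no ancestor of that class map to an empty list, which
    makes unmapped codes easy to spot during review.
    """
    concept_ids = list(concept_ids)
    groups = {cid: [] for cid in concept_ids}
    if not concept_ids:
        return groups
    placeholders = ",".join("?" for _ in concept_ids)
    rows = conn.execute(
        f"""
        SELECT ca.descendant_concept_id, grp.concept_name
        FROM concept_ancestor AS ca
        JOIN concept AS grp ON grp.concept_id = ca.ancestor_concept_id
        WHERE grp.concept_class_id = ?
          AND ca.descendant_concept_id IN ({placeholders})
        ORDER BY ca.descendant_concept_id, grp.concept_name
        """,
        [class_id, *concept_ids],
    ).fetchall()
    for cid, name in rows:
        groups[cid].append(name)
    return groups


conn = build_demo_vocabulary()
for cid, names in treatment_groups(conn, [1, 3, 999]).items():
    print(cid, names)
```

Returning an empty list for codes without a matching ancestor, rather than dropping them, keeps gaps in the classification visible to curators; whether a registry treats such codes as "other" or flags them for manual review is a design decision this sketch leaves open.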
Results: Our approach to REDCap design supported various analytical use cases and enabled data sharing between different investigators and registries. Reusing previously abstracted data, complemented with data extracted from the EHR, allowed investigators and their teams to quickly review, validate, and update the prior curation. Explicit temporal constraints supported alignment between different registries. Using the OMOP standards and high-level treatment classifications supported data conversion between registries and integration of the data collected via REDCap with the data sourced from the EHR.
Conclusion: Using real-world data for observational COVID and cancer research presented us with opportunities to improve and mature our evolving research infrastructure and better support internal and distributed research, and highlighted the need for uniform data standards in the cancer domain.
Citation Format: Rimma Belenkaya, Adam Watson, Shantha Bethusamy, Meera Patel, Tatyana Sandler, Julian Schwartz, James Park, Maggie Dobbins, Molly Maloy, Michael Lam, Nadia Bahadur, John Philip. Data harmonization for COVID-19 and cancer research registries [abstract]. In: Proceedings of the AACR Virtual Meeting: COVID-19 and Cancer; 2020 Jul 20-22. Philadelphia (PA): AACR; Clin Cancer Res 2020;26(18_Suppl):Abstract nr PO-061.