The vision of precision medicine relies on the linkage of large-scale clinical, molecular, environmental, and patient-generated datasets. Traditionally, diverse data streams, including the wealth of information captured in electronic health records (EHRs), have been analyzed independently. However, leveraging the volume of data available in health care requires cross-modality integration. We have developed a clinical data warehouse for prostate cancer that integrates multiple data streams: structured EHR data, imaging, state registry data, and patient-generated data, as well as the granular information contained in unstructured clinical narrative text. This rich, longitudinal dataset facilitates secondary data use and enhances observational research in oncology. We have developed advanced machine learning approaches to analyze these data. Our methods can accurately classify patients into clinical and pathologic stage groups and prognostic risk groups. These classifications can be used at the point of care to guide treatment pathways in line with evidence-based guidelines (e.g., to identify high-risk patients for whom a radionuclide bone scan is recommended). Furthermore, linking routinely collected patient surveys to EHRs reveals important differences in global physical and mental health between demographic and clinical subgroups. Giving clinicians visibility into these patient-reported outcomes can help personalize treatment pathways and may inform population health initiatives to support vulnerable groups. The granular health data we collect and link also provide population-based views of changes in treatment patterns and of the effects of policy changes, such as revisions to prostate-specific antigen (PSA) screening guidelines.
The integration of diverse data streams presents unique technical, semantic, and ethical challenges; however, our work suggests that multimodal clinical data can significantly improve the performance of prediction algorithms and guide treatment decisions, powering knowledge discovery at both the patient and population levels.

Citation Format: Tina Hernandez-Boussard. Linking heterogeneous data to enable knowledge discovery in health care [abstract]. In: Proceedings of the AACR Special Conference on Modernizing Population Sciences in the Digital Age; 2019 Feb 19-22; San Diego, CA. Philadelphia (PA): AACR; Cancer Epidemiol Biomarkers Prev 2020;29(9 Suppl):Abstract nr IA14.