Myelofibrosis (MF) is a devastating myeloproliferative neoplasm (MPN) hallmarked by marrow fibrosis, extramedullary hematopoiesis, vascular thromboembolism, and a ~50% incidence of JAK2V617F. MF is difficult to study in large EHR datasets because of its clinical heterogeneity and unreliable ICD coding. The Synthetic Derivative is a cloned and de-identified research EHR with 2.9 million unique patients linked to BioVU, a DNA biorepository. To develop phenotype-genotype associations, we created an algorithm to classify MF that combines NLP with negation detection of MF keywords, medications, and ICD coding. To enrich our cohort, we developed JAKextractor, an algorithm to identify patients tested clinically for JAK2V617F across all 248,000 BioVU patients.

For MF identification, we trained a supervised learning algorithm to learn decision rules from counts of MF-specific ICD codes, medications, and text mentions, as well as the assertion status of MF and JAK2 mentions in patient notes. Experiments were evaluated using 10-fold cross-validation. JAKextractor used pattern matching to extract the status (WT vs MUT) of each JAK2 text mention; a machine learning model then predicted whether a patient carried JAK2V617F from the mention statuses extracted from that patient's notes. To validate JAKextractor, we subsequently genotyped banked DNA from an enriched subset of MF cases using an Illumina® TruSight myeloid NGS panel.
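
As an illustration only, the pattern-matching step could be sketched as below. The regular expressions, cue words, context window, and the function name `jak2_mention_statuses` are assumptions made for this example; the actual JAKextractor rules are not described here. Wild-type cues are checked first so that phrases such as "not detected" are not misread as a positive result.

```python
import re

# Hypothetical patterns for illustration; not the actual JAKextractor rules.
JAK2_MENTION = re.compile(r"JAK[\s-]?2(?:\s*V617F)?", re.IGNORECASE)
WT_CUES = re.compile(
    r"\b(?:negative|not\s+detected|wild[\s-]?type|absent)\b", re.IGNORECASE
)
MUT_CUES = re.compile(r"\b(?:positive|detected|mutated|present)\b", re.IGNORECASE)


def jak2_mention_statuses(note, window=40):
    """Return one status ("WT", "MUT", or "UNKNOWN") per JAK2 mention in a note."""
    statuses = []
    for match in JAK2_MENTION.finditer(note):
        # Look only at the text following the mention, up to the end of the clause.
        context = note[match.end(): match.end() + window]
        context = re.split(r"[.;\n]", context, maxsplit=1)[0]
        if WT_CUES.search(context):  # checked first: "not detected" contains "detected"
            statuses.append("WT")
        elif MUT_CUES.search(context):
            statuses.append("MUT")
        else:
            statuses.append("UNKNOWN")
    return statuses


if __name__ == "__main__":
    print(jak2_mention_statuses("JAK2 positive in 2015; repeat JAK2 V617F negative."))
```

Per-mention statuses of this kind could then be counted per patient and passed as features to the patient-level classifier.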

The top-performing MF algorithm combined all sources of clinical information, achieved an F1-measure (F1) of 96%, and identified 309 MF patients in BioVU. The extracted decision rule for predicting an MF patient was [JAK2V617F ^ ICD>1] v [JAK2WT ^ ICD>1 ^ TXT>3]. ICD coding was thus necessary but not sufficient for MF identification: using ICD counts alone yielded a significantly lower F1 of 88% (P<0.001). In our MF cohort, the mean age at onset was 60.3±12.6, the mean age at last visit was 63.1±12.1, and the JAK2V617F frequency was 46.1%. The mean age of MF onset was higher with JAK2V617F (64) than with JAK2WT (57) (P<0.001). Survival did not differ between JAK2V617F and JAK2WT MF cases by log-rank test (P=0.11), with a median survival of 108 months. Of the 131 MF cases genotyped, NGS detected JAK2V617F in 71/131 (54.2%), compared with 66/131 (50.4%) identified by JAKextractor. Mean JAK2V617F allelic frequency was 0.569, with detection ranging from 0.069 to 0.976. Ten cases displayed disagreement between JAKextractor and NGS: 2 FP and 4 FN JAKextractor predictions (6/131, 4.6%), 2 true NGS failures, 1 incomplete chart, and 1 loss of JAK2V617F over time. NGS also detected JAK2V617F in 7/131 (5.3%) MF patients who had not been previously tested.
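
For readers who prefer code to logical notation, the decision rule above can be written as a small function. This is a minimal sketch under stated assumptions: we read ICD as the count of MF-specific ICD codes and TXT as the count of MF text mentions, and we assume a patient with no known JAK2 status satisfies neither branch. The function name `predict_mf` and the status labels are hypothetical.

```python
def predict_mf(jak2_status, icd_count, txt_count):
    """Apply the rule [JAK2V617F ^ ICD>1] v [JAK2WT ^ ICD>1 ^ TXT>3].

    jak2_status: "MUT" for JAK2V617F, "WT" for wild type; any other value is
                 treated as unknown (an assumption of this sketch).
    icd_count:   assumed to be the number of MF-specific ICD codes.
    txt_count:   assumed to be the number of MF text mentions.
    """
    if icd_count <= 1:
        # Both branches require ICD>1, so ICD evidence is necessary.
        return False
    if jak2_status == "MUT":
        return True
    if jak2_status == "WT":
        # ICD alone is not sufficient for wild-type patients.
        return txt_count > 3
    return False


if __name__ == "__main__":
    print(predict_mf("MUT", icd_count=2, txt_count=0))
    print(predict_mf("WT", icd_count=2, txt_count=3))
```

Written this way, the rule makes explicit why ICD counts are necessary but not sufficient: every positive prediction requires ICD>1, yet a JAK2WT patient also needs more than three text mentions.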

Our results demonstrated successful identification of MF and JAK2V617F within an EHR. We established the feasibility of creating an MPN database with retrospective genotyping of biobanked DNA. We plan a scaled implementation of similar algorithms across all myeloid diseases within BioVU, with the ability to retrospectively genotype each case.

Citation Format: Cosmin A. Bejan, Andrew Sochacki, Shilin Zhao, Yaomin Xu, Michael Savona. Identification of myelofibrosis from electronic health records with novel algorithms and JAKextractor [abstract]. In: Proceedings of the American Association for Cancer Research Annual Meeting 2018; 2018 Apr 14-18; Chicago, IL. Philadelphia (PA): AACR; Cancer Res 2018;78(13 Suppl):Abstract nr 5303.