Background: HER2 status is important in breast cancer prognosis but was not well collected in NCI-SEER affiliated cancer registries until recent years. For example, in a cohort of 2,846 women diagnosed with breast cancer from 1997 to 2011 in a large health plan, only 49% had known HER2 status in the cancer registry. However, for many of the patients with missing registry values, HER2 status was documented in clinical notes. This study used Natural Language Processing (NLP) technology to identify and extract HER2 status from those notes. NLP results were validated by comparison with cancer registry data, and any discrepancies were manually reviewed.

Method: We assembled a cohort of 2,846 breast cancer patients from the membership of a health plan based in southern California (Kaiser Permanente). All free-text notes, including pathology reports, progress notes, and discharge summaries, were extracted from our electronic medical record (EMR) system. A window of 9 months before and after the diagnosis date was applied to limit the number of notes to be processed. Overall, 513,903 clinical notes were processed and indexed, an average of about 180 notes per patient. Separate ontologies were created for the HER2 terminology and for HER2 status. HER2 status values included positive, negative, borderline, test performed, and test not performed. The NLP system employed additional components such as spelling correction, acronym recognition, and negation identification. The output from the NLP system was further processed in three steps: identification of the HER2 concept, extraction of the most likely HER2 value, and a decision module that selected the most likely HER2 value when the extracted values conflicted.
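
The date window and the three post-processing steps described above can be sketched in Python as follows. This is a minimal illustration, not the system used in the study: the regular expressions are hypothetical, deliberately tiny stand-ins for the two ontologies, the 9-month window is approximated as 274 days, and the majority vote in `decide` is an assumed rule, since the abstract does not state how conflicting values were resolved. All function and variable names are illustrative.

```python
import re
from collections import Counter
from datetime import date, timedelta

# Assumption: "9 months" is approximated as 274 days.
WINDOW = timedelta(days=274)

# Hypothetical stand-in for the HER2 terminology ontology.
HER2_CONCEPT = re.compile(r"\bher[\s-]?2\b|\bc-?erbb-?2\b", re.IGNORECASE)

# Hypothetical stand-in for the HER2 status ontology. Order matters:
# negated phrases are checked before their positive forms.
STATUS_PATTERNS = [
    ("negative", re.compile(
        r"\bnegative\b|\bnot\s+(?:amplified|overexpressed)\b|\bnon-?amplified\b",
        re.IGNORECASE)),
    ("borderline", re.compile(r"\bborderline\b|\bequivocal\b", re.IGNORECASE)),
    ("positive", re.compile(r"\bpositive\b|\bamplified\b|\boverexpressed\b",
                            re.IGNORECASE)),
]


def in_window(note_date, diagnosis_date, window=WINDOW):
    """Keep only notes filed within the window around the diagnosis date."""
    return abs(note_date - diagnosis_date) <= window


def extract_value(text):
    """Steps 1 and 2: find the HER2 concept, then the first matching status."""
    if not HER2_CONCEPT.search(text):
        return None
    for status, pattern in STATUS_PATTERNS:
        if pattern.search(text):
            return status
    return None


def decide(values):
    """Step 3 (assumed rule): majority vote; a tie yields no value."""
    counts = Counter(v for v in values if v is not None)
    if not counts:
        return None
    ranked = counts.most_common()
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return None
    return ranked[0][0]


def her2_status(notes, diagnosis_date):
    """Run the window filter and the three steps over (date, text) pairs."""
    values = [extract_value(text) for note_date, text in notes
              if in_window(note_date, diagnosis_date)]
    return decide(values)


# Invented example notes, for illustration only.
example_notes = [
    (date(2005, 3, 1), "HER2/neu is negative by IHC."),
    (date(2005, 3, 20), "Her-2 not amplified by FISH."),
    (date(2005, 4, 2), "HER2 positive on outside report."),
    (date(2009, 1, 1), "HER2 positive."),  # outside the window, ignored
]
print(her2_status(example_notes, date(2005, 3, 10)))  # negative
```

A production system would also need the spelling correction, acronym recognition, and sentence-level negation handling mentioned above, which this sketch omits.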

Results: Using the cancer registry data as the gold standard for positive and negative HER2 values, the sensitivity and specificity of the NLP algorithm were 94.7% and 93.3%, respectively; these figures include the cases where NLP did not select either a positive or a negative value. A manual chart review was performed on the discrepant cases. We found that NLP was correct for many of these cases. For example, of the 39 cases that were NLP positive but registry negative, 4 were false positives and 35 were true positives. Compared to the cancer registry data, NLP increased capture of positive and negative HER2 cases from 49% to 73% of the cohort population.
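
As an illustration of how sensitivity and specificity can be computed against a registry gold standard, the sketch below counts an NLP result as correct only when it matches the registry value, so cases where NLP selected neither positive nor negative count against the algorithm. That treatment is one reading of the sentence above, not a confirmed description of the study's calculation, and the function name and toy data are hypothetical.

```python
def sensitivity_specificity(pairs):
    """Compute (sensitivity, specificity) from (registry, nlp) value pairs.

    Only registry-positive and registry-negative cases are scored.
    Assumption: any NLP output that does not match the registry value,
    including None (no value selected), counts as an error.
    """
    pos_total = pos_correct = neg_total = neg_correct = 0
    for registry, nlp in pairs:
        if registry == "positive":
            pos_total += 1
            pos_correct += nlp == "positive"
        elif registry == "negative":
            neg_total += 1
            neg_correct += nlp == "negative"
    if pos_total == 0 or neg_total == 0:
        raise ValueError("need at least one positive and one negative case")
    return pos_correct / pos_total, neg_correct / neg_total


# Invented toy data, not the study's results.
toy_pairs = [
    ("positive", "positive"), ("positive", "positive"),
    ("positive", "positive"), ("positive", None),
    ("negative", "negative"), ("negative", "negative"),
    ("negative", "negative"), ("negative", "negative"),
    ("negative", "positive"),
]
sens, spec = sensitivity_specificity(toy_pairs)
print(f"sensitivity={sens:.2f}, specificity={spec:.2f}")
# sensitivity=0.75, specificity=0.80
```

As the chart review shows, disagreement with the registry is not always an NLP error, so figures computed this way are relative to the registry rather than to the true HER2 status.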

Discussion: NLP provides the opportunity to process clinical notes that are added to the EMR after the cancer registry has completed documenting the patient's initial course of cancer treatment. On the other hand, NLP was not able to access some of the HER2-related clinical notes. For example, according to our chart review, some notes were stored as scanned images and were not retrievable. In addition, clinical notes with incorrect filing dates, and notes for patients without any diagnosis date, were not processed. We also identified key challenges in determining HER2 results using NLP in this study. First, non-standard terminology for the HER2 receptor in the clinical notes hampered NLP effectiveness. Second, clinicians described HER2 results in many different ways.

Citation Information: Cancer Res 2012;72(24 Suppl):Abstract nr P1-07-12.