Abstract
Current models for correlating electronic medical records with -omics data largely ignore clinical text, which is an important source of phenotype information for patients with cancer. This data convergence has the potential to reveal new insights about cancer initiation, progression, metastasis, and response to treatment. Insights from these real-world data will catalyze clinical care, research, and regulatory activities. Natural language processing (NLP) methods are needed to extract these rich cancer phenotypes from clinical text. Here, we review advances in NLP and information extraction methods relevant to oncology, based on publications from PubMed as well as NLP and machine learning conference proceedings in the last 3 years. Given the interdisciplinary nature of the fields of oncology and information extraction, this analysis serves as a critical trail marker on the path to higher fidelity oncology phenotypes from real-world data.
Introduction
Data produced during the processes of clinical care and research in oncology are proliferating at an exponential rate. In the past decade, use of electronic medical records (EMR) has increased significantly in the United States (1), driven at least in part by incentivization from the Health Information Technology for Economic and Clinical Health (HITECH) Act of 2009 (2). In parallel, large databases such as the NCI's Surveillance, Epidemiology, and End Results program (SEER; ref. 3), the National Cancer Database (NCDB; ref. 4), The Cancer Genome Atlas (TCGA; ref. 5), and the Human Tumor Atlas Network (HTAN; ref. 6) are increasingly important avenues for clinical and translational oncology research. However, significant nuanced phenotype data are locked in clinical free-text, which remains the primary form of documenting and communicating clinical presentations, provider impressions, procedural details, and management decision-making (7). Despite the proliferation of EMR and -omics data, critical and precise phenotype information is often detailed only in these clinical texts. Natural language processing (NLP), broadly defined as the transformation of language into computable representations, is key to large-scale extraction of nuanced data from clinical texts. As a subfield of artificial intelligence, clinical NLP (cNLP), which refers to the analysis of clinical or health care texts (as opposed to clinical application per se), has been around for decades. However, only in recent years have compute power and algorithms advanced sufficiently to demonstrate its potential for broadening oncologic investigation.
There are excellent prior review articles on cNLP. Spyns (8) covers the period before 1995. Meystre and colleagues (9) survey developments from 1998 to 2008. Yim and colleagues (10) provide an overview with a special emphasis on oncology for the period of 2008 to 2016. Neveol and colleagues (11) offer the first broad overview of cNLP for languages other than English. These surveys capture three distinct methodology phases in NLP, from exclusively rule-based systems through the shift toward probabilistic methods to the dominance of machine learning. Kreimeyer and colleagues (12) review existing cNLP systems. Some popular cNLP systems are MetaMap (concept mapping; refs. 13, 14), Apache cTAKES (classic NLP components, concept mapping, entities and attributes, relations, temporality; refs. 15, 16), YTex (entity and attributes; ref. 17), OBO annotator (concept mapping; ref. 18), TIES (linking of pathology reports to tissue bank data; ref. 19), MedLEE (entities and attributes, relations; ref. 20), CLAMP (entities and attributes; ref. 21), and NOBLE (entities and attributes; ref. 22).
The mid-2010s mark a transformational milestone for the field, when plentiful digitized textual data and hardware advances met powerful mathematical abstractions in a hyperconnected world, leading to explosive interest in artificial intelligence generally (e.g., autonomous cars) and NLP in particular (e.g., Google Translate, Apple Inc.'s Siri, movie recommenders). Herein, we review major recent developments in cNLP methods for cancer since that watershed point. We discuss their applications for translational investigation and future directions. We cover publications since the 2016 review by Yim and colleagues (10) that are: (i) focused on cNLP of EMR text related to cancer; (ii) peer-reviewed; (iii) published in English and using English EMR text; and (iv) sourced from MEDLINE and major computational linguistics and machine learning venues: the annual conferences of the Association for Computational Linguistics, the North American and European Chapters of the Association for Computational Linguistics, Empirical Methods in Natural Language Processing, the International Conference on Machine Learning, the Neural Information Processing Systems conference, Machine Learning for Healthcare, SemEval, the International Conference for High Performance Computing, and the IEEE International Conference on Biomedical and Health Informatics. Our goal is to highlight recent exceptional articles with implications for the broader cancer research community; thus, this survey is not a systematic meta-review. We acknowledge that much work is taking place outside traditional academic environments (i.e., industry), and we attempt to include it to the extent that it meets this survey's inclusion criteria. For ease of reading, terms and definitions are presented in Table 1.
Table 1. Terms and definitions

| Term | Definition |
|---|---|
| Accuracy | $\frac{TP + TN}{TP + FP + FN + TN}$, where TP is true positive, TN is true negative, FP is false positive, and FN is false negative. |
| Artificial intelligence | A process through which machines mimic "cognitive" functions that humans associate with other human minds, such as language comprehension. |
| Area under the curve (AUC) | A metric for binary classification; ranges from 0 to 1, with 0 being always wrong, 0.5 representing random chance, and 1 the perfect score. |
| Artificial neural network | Computing systems that are inspired by, but not necessarily identical to, the biological neural networks that constitute the human brain. |
| Attribute | Facts, details, or characteristics of an entity. |
| Autoencoder | A class of artificial neural networks trained to reconstruct their input, thereby learning compressed latent representations of the data. |
| Concept mapping | Mapping of text mentions to standardized concepts in a controlled vocabulary or ontology. |
| Convolutional neural network | A class of artificial neural networks that applies convolutional filters to capture local patterns, well suited to data with spatial structure. |
| Decision tree | A tree-like graph or model of decisions and their possible consequences, including chance event outcomes, resource costs, and utility. |
| Deep learning | A subclass of a broader family of machine learning methods based on artificial neural networks. The designation "deep" signifies multiple layers of the neural network. |
| Entities | A person, place, thing, or concept about which data can be collected. Examples in the clinical domain include diseases/disorders, signs/symptoms, procedures, medications, and anatomical sites. |
| F1 score | $\frac{2 \times Recall \times Precision}{Recall + Precision}$; values range from 0 to 1 (perfect score). |
| Graphics processing unit | A specialized electronic circuit designed to perform the very fast, highly parallel calculations needed for training artificial neural networks. |
| K-nearest neighbors | A nonparametric method used for classification and regression in pattern recognition. |
| Latent representation | Representations (e.g., of words) that are not directly observed but are inferred through a mathematical model. |
| Machine learning | The scientific study of algorithms and probabilistic models that computer systems use to perform a specific task effectively without explicit instructions, relying on patterns and inference instead. |
| Precision | $\frac{TP}{TP + FP}$, where TP is true positive and FP is false positive. |
| Probabilistic methods | Methods that model language and predictions with probability distributions estimated from data, in contrast to deterministic rule-based systems. |
| Recall | $\frac{TP}{TP + FN}$, where TP is true positive and FN is false negative. |
| Recurrent neural network | A class of artificial neural networks with recurrent connections, well suited to sequentially ordered data. |
| Rule-based system | Systems involving human-crafted or curated rule sets. |
| Semantic representation | Ways in which the meaning of a word or sentence is interpreted. |
| Supervised learning | Machine learning method that infers a function from labeled training data consisting of a set of training examples. |
| Support vector machine | Supervised learning models with associated learning algorithms that analyze data for classification and regression analysis. |
| Tensor | A mathematical object analogous to, but more general than, a vector, represented by an array of components that are functions of the coordinates of a space. |
| Transfer learning | A machine learning technique where a model trained on one task is repurposed on a second, related task. |
| Unsupervised learning | Machine learning that finds previously unknown patterns in a dataset without pre-existing labels. |
| Word embedding | The collective name for a set of language modeling and feature learning techniques in NLP in which words or phrases from the vocabulary are mapped to vectors of real numbers. |
We highlight results measured in either accuracy, harmonic mean of recall/sensitivity and precision/positive predictive value (F1 score), or AUC (trade-off between true positive and false positive rates). These performance metrics reflect a comparison against human-generated data (referred to as gold-standard annotations); thus, they capture agreement between NLP systems and humans. Gold-standard annotations are also used for training algorithms (supervised learning). The interannotator agreement (IAA) measures human performance and serves as a system performance target.
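The performance metrics above follow directly from confusion-matrix counts. As a minimal illustration (with hypothetical counts, not results from any surveyed system), they can be computed as follows:

```python
def classification_metrics(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Compute the metrics defined in Table 1 from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp)   # positive predictive value
    recall = tp / (tp + fn)      # sensitivity
    f1 = 2 * recall * precision / (recall + precision)
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

# Hypothetical counts for an NLP system scored against gold-standard annotations
print(classification_metrics(tp=80, fp=10, fn=20, tn=90))
```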
Major NLP Algorithmic Advances
The past 3 years have seen the development of a variety of methodologies for NLP, with a general shift toward a particular machine learning category: deep learning (DL; ref. 23). DL techniques were initially conceived in the 1980s but not operationalized until the convergence of three critical elements: massive digital text corpora, novel but compute- and data-intensive algorithms, and powerful, massively parallel computing architectures currently built on graphics processing units (GPU; ref. 24). For many tasks, DL is considered state-of-the-art in artificial intelligence (25–27). The key differentiator between DL and feature-rich machine learners is the concept of representation learning (28). Feature-rich algorithms require expert knowledge (linguistic, semantic, biomedical, or world knowledge) to determine the information of interest. Some examples of feature-rich learners are support vector machines (SVM) and random forests (RF; ref. 29). In the clinical domain, the engineered features are often guided by biomedical dictionaries, clinical ontologies, or biomedical knowledge from domain experts. In contrast, DL models automatically discover, from raw data, the mathematically and computationally convenient abstractions needed for classification, without explicitly defined features (23, 25). These representations can range from simple word representations and word embeddings (30) to complex hierarchies that capture contextual meaning and relationships between words, phrases, and other compositional derivatives. This capability of DL algorithms can potentially unmask unknown relationships buried within large quantities of data, which can be particularly advantageous in cancer research and practice (25). Furthermore, DL algorithms can uniquely take advantage of transfer learning (26): the ability to learn from data not in the target domain and then apply this knowledge to other domains. For example, one DL model may be trained on large, openly available nonmedical text data (e.g., Wikipedia), and that model's knowledge can then be applied effectively in cNLP tasks by fine-tuning its parameters on smaller but directly relevant clinical text corpora.
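As a hedged sketch of the transfer-learning workflow just described (not a method from the surveyed literature), a general-domain pretrained transformer can be fine-tuned on a small labeled clinical corpus. The example below assumes the Hugging Face `transformers` and `datasets` libraries and a hypothetical file `labeled_reports.csv` with `text` and integer `label` columns.

```python
# Minimal transfer-learning sketch: fine-tune a general-domain pretrained
# transformer on a small, labeled clinical text classification task.
from datasets import load_dataset
from transformers import (AutoTokenizer, AutoModelForSequenceClassification,
                          Trainer, TrainingArguments)

model_name = "bert-base-uncased"  # pretrained on general-domain (nonmedical) text
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)

# Hypothetical labeled clinical corpus: columns "text" and integer "label"
data = load_dataset("csv", data_files="labeled_reports.csv")["train"]
data = data.map(lambda x: tokenizer(x["text"], truncation=True, padding="max_length"),
                batched=True)
data = data.train_test_split(test_size=0.2)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="out", num_train_epochs=3),
    train_dataset=data["train"],
    eval_dataset=data["test"],
)
trainer.train()  # fine-tunes the pretrained weights on the clinical labels
```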
Most DL architectures are built on the artificial neural network, with interconnected nodes (neurons) arranged in layers (23). Variations in the arrangement and interconnections of these layers result in elaborate networks, or architectures, suited to a variety of tasks. The most popular among these include: convolutional neural networks (CNN), optimal for data where spatial relationships encode critical information; recurrent neural networks (RNN), advantageous for sequentially ordered data (e.g., time-series data); and autoencoders, suitable for learning from noisy data or data for which prior information is partially or entirely unknown (23). There is a substantial amount of research on the general (as opposed to clinical) application of DL, demonstrating its potential in NLP (31).
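To make the architectural distinctions concrete, the following is a generic sketch (illustrative only, with arbitrary layer sizes, not a system from the surveyed work) of a 1D CNN over word embeddings for sentence classification, where convolutional filters capture local, n-gram-like patterns in text:

```python
import torch
import torch.nn as nn

class TextCNN(nn.Module):
    """Generic 1D CNN over word embeddings for sentence-level classification."""
    def __init__(self, vocab_size=20000, embed_dim=100, num_classes=2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        # Filters slide over adjacent word embeddings, capturing local patterns.
        self.conv = nn.Conv1d(embed_dim, 128, kernel_size=3, padding=1)
        self.pool = nn.AdaptiveMaxPool1d(1)   # keep the strongest filter response
        self.fc = nn.Linear(128, num_classes)

    def forward(self, token_ids):                     # token_ids: (batch, seq_len)
        x = self.embed(token_ids).transpose(1, 2)     # (batch, embed_dim, seq_len)
        x = torch.relu(self.conv(x))
        x = self.pool(x).squeeze(-1)                  # (batch, 128)
        return self.fc(x)

logits = TextCNN()(torch.randint(0, 20000, (4, 50)))  # 4 hypothetical sentences
print(logits.shape)  # torch.Size([4, 2])
```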
Linguistic variability, combined with the abundance of medical terminology, abbreviations, synonyms, jargon, and spelling inconsistencies prevalent in clinical text, makes cNLP a particularly challenging problem. DL has shown remarkable results in extracting low- and high-level abstractions from raw text data with semantic and syntactic capabilities. This ability is often accompanied by excellent performance across translational science applications (25, 32), as highlighted below.
Latest cNLP Application Developments
Task: extracting temporality and timelines
Longitudinal representations of patients' cancer journeys are a cornerstone of translational research, enabling rich studies across variables (e.g., tumor molecular profile) and outcomes (e.g., treatment efficacy). Extracting timelines from EMR free-text has become a line of cNLP research in its own right. Since 2016, under the auspices of SemEval, the Clinical TempEval shared tasks have challenged the NLP research community to establish state-of-the-art methods and results for temporal relation extraction with a focus on oncology. The dataset for these shared tasks consists of 400 patients with cancer, distributed evenly between colon and brain cancers, each represented by pathology, radiology, and clinical notes (the THYME corpus described in ref. 33 and available from ref. 34). The tasks consisted of identifying event expressions, time expressions, and temporal relations (see Fig. 1 for an example). The relation between an event and the document creation time is called DocTimeRel, with values of BEFORE, OVERLAP, BEFORE-OVERLAP, and AFTER, which provide coarse-level temporal positioning on a timeline.
Figure 1. Clinical TempEval example: two events, one time expression, two temporal relations, two relations to the document creation time (DocTimeRel).
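As a schematic sketch of these annotation targets (not the THYME corpus file format itself, and with hypothetical example text and offsets), events can be represented with character spans and a DocTimeRel value:

```python
from dataclasses import dataclass
from enum import Enum

class DocTimeRel(Enum):
    """Relation of an event to the document creation time (coarse positioning)."""
    BEFORE = "BEFORE"
    OVERLAP = "OVERLAP"
    BEFORE_OVERLAP = "BEFORE-OVERLAP"
    AFTER = "AFTER"

@dataclass
class Event:
    text: str                 # the event expression as it appears in the note
    span: tuple               # character offsets within the note
    doc_time_rel: DocTimeRel

# Hypothetical annotations for the sentence:
# "The patient underwent a colonoscopy; chemotherapy will begin next week."
events = [
    Event("colonoscopy", (24, 35), DocTimeRel.BEFORE),
    Event("chemotherapy", (37, 49), DocTimeRel.AFTER),
]
for e in events:
    print(f"{e.text}: DocTimeRel={e.doc_time_rel.value}")
```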
Clinical TempEval 2016 (35) focused on developing methods from colon cancer EMR data and testing on colon cancer data (within-domain evaluation). The results suggest that state-of-the-art systems perform extremely well on most event- and time expression-related tasks, with a gap between system performance and IAA (human performance) of less than 0.05 F1. However, the temporal relation tasks remained a challenge. Systems predicting the DocTimeRel relation lagged about 0.09 F1 behind IAA; for other types of temporal relations, systems lagged about 0.25 F1 behind IAA.
Clinical TempEval 2017 (36) addressed the question of how well systems trained on one cancer medical domain (colon cancer) perform in predicting timelines in another cancer medical domain (brain cancer). The results showed that this remains an open research question, with a drop of more than 0.20 F1 across domains. Providing a small amount of target-domain training data improved performance.
Methods employed by the Clinical TempEval participants range from classic methods (logistic regression, conditional random fields, SVMs, pattern matching) to various architectures from the latest DL techniques (RNNs and CNNs with inputs of word and character embeddings). Clinical TempEval 2017 showed that no single method provided the best results, although combining various approaches appeared to be a promising path.
Outside of Clinical TempEval, experimentation with advanced DL architectures and various data streams for timeline extraction from cancer patient EMRs has intensified. Tourille and colleagues explored neural networks and domain adaptation strategies (37). Chen and colleagues (38) and Dligach and colleagues (39) dealt with simplifications of time expression representations in a neural approach. Recent trends include DL models that combine a small portion of labeled data with unlabeled publicly available data [Google News (30) and social media] to achieve results about 0.02 F1 below IAA (40). The current best reported result is 0.684 F1 (41).
Open source systems for timeline extraction include the Apache cTAKES temporal module (42), HeidelTime (for temporal expressions and their normalization; ref. 43), and rule-based extensions of Stanford CoreNLP (44).
The task of extracting temporality from EMR clinical narrative has advanced dramatically since 2016. In the last 3 years, performance on the Clinical TempEval test set moved from 0.573 to 0.684 F1 for finer-grained temporal relations and reached 0.835 F1 for DocTimeRel. This last result enables exploring select temporally sensitive applications, such as outcomes extraction, which was pointed out as one of the most challenging yet-to-be-addressed use cases in the 2016 survey article.
Application: extracting tumor and cancer characteristics
Information extraction from pathology reports, which have a more consistent structure than other free-text EMR documents, presents a tractable challenge to the field of cNLP (45). Since the 2016 survey, the oncology NLP field has moved beyond cancer stage and TNM extraction into the extraction of more comprehensive cancer and tumor attributes. Qiu and colleagues (46) presented a CNN for information abstraction of primary cancer site topography from breast and lung cancer pathology reports from the Louisiana Cancer Registry, reporting 0.72 F1. Using the same corpus, Gao and colleagues (47) boosted performance using a more elaborate DL architecture (a hierarchical attention neural network). The authors reported 0.80 F1 for cancer site topography and 0.90 F1 for histologic grade. However, the authors noted the significant computational demands of their DL solution.
Alawad and colleagues (48) showed that for extraction of cancer primary site, histologic grade, and laterality, training a CNN to make multiple predictions simultaneously (multi-task learning) outperformed single-task models. In a later study, the authors explored the computational demands of CNN cNLP models and the role of high-performance computing in population-level automated coding of pathology documents toward near real-time cancer surveillance for cancer registry development (49). Using a corpus of 23,000 pathology reports, they reported 0.84 F1 for primary cancer site extraction across 64 cancer sites using their CNN model, significantly outperforming a random forest classifier at 0.76 F1.
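A minimal sketch of the multi-task idea (not the authors' implementation; names and layer sizes are illustrative): a single shared text encoder feeds one output head per pathology attribute, so the tasks are learned jointly and regularize one another.

```python
import torch
import torch.nn as nn

class MultiTaskPathologyClassifier(nn.Module):
    """Shared text encoder with one output head per pathology attribute."""
    def __init__(self, vocab_size=20000, embed_dim=100,
                 n_sites=64, n_grades=4, n_lateralities=3):
        super().__init__()
        # Shared representation of the report (simple averaged embeddings here).
        self.encoder = nn.EmbeddingBag(vocab_size, embed_dim, mode="mean")
        # Task-specific heads trained jointly on the shared representation.
        self.site_head = nn.Linear(embed_dim, n_sites)
        self.grade_head = nn.Linear(embed_dim, n_grades)
        self.laterality_head = nn.Linear(embed_dim, n_lateralities)

    def forward(self, token_ids):             # token_ids: (batch, seq_len)
        shared = self.encoder(token_ids)      # (batch, embed_dim)
        return {
            "site": self.site_head(shared),
            "grade": self.grade_head(shared),
            "laterality": self.laterality_head(shared),
        }

# During training, the per-task losses are summed so all heads update the encoder.
model = MultiTaskPathologyClassifier()
outputs = model(torch.randint(0, 20000, (8, 200)))  # 8 hypothetical reports
print({k: v.shape for k, v in outputs.items()})
```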
Yala and colleagues (50) used boosting (51) to extract tumor information from breast pathology reports and achieved 90% accuracy for extracting carcinoma and atypia categories. Because gold-standard datasets are a necessary but resource-intensive requirement of machine learning algorithms, this study also investigated the minimum number of annotations needed to maintain at least 0.9 F1 without the system being pretrained; they reported this to be approximately 400. Using similar methods, Acevedo and colleagues (52) found the rate of abnormal findings in asymptomatic patients to be 7% and to increase with age. These rates are higher than previously reported, suggesting the clinical value of these algorithms over current epidemiologic methods for measuring cancer incidence and prevalence. In a study of multiple diseases, Gehrmann and colleagues (25) reported an improvement in F1 score and AUC for advanced cancer using CNNs over rule-based systems.
The open source DeepPhe platform (53, 54) is a hybrid system for extracting a number of tumor and cancer attributes. It implements a variety of artificial intelligence approaches (rules, domain knowledge bases, and both feature-rich and DL machine learning) to crawl the entire cancer patient chart (not restricted to pathology notes) and to extract and summarize information related to tumors, cancers, and their characteristics. The IAA ranged from 0.46 to 1.00 F1, and system agreement with humans ranged from 0.32 to 0.96 F1. The system's highest result was for primary site extraction (0.96 F1); its lowest was for PR method extraction (0.32 F1).
Castro and colleagues (55) developed an NLP system to annotate and classify all BI-RADS mentions present in a single radiology report, which can serve as the foundation for future studies that will leverage automated BI-RADS annotation, providing feedback to radiologists as part of a learning health system loop (56).
Application: clinical trials matching
Clinical trials determine the safety and effectiveness of new medical treatments; with the successes of recent years, including new classes of therapies (e.g., immunotherapy, CAR-T cells), the clinical trial landscape has exploded. Nevertheless, adult patient participation in clinical trials remains low, especially among underrepresented minorities. This limits trial completion, generalizability, and interpretation of trial findings. Thus, there is a great deal of interest in clinical trial matching. This is not a simple problem, given the need to extract information from trial protocols written in natural language and to match the findings with characteristics from individual EMRs.
Since the 2016 survey article (10), researchers have explored DL technology to identify relevant information in patients' EMRs to establish eligibility for clinical trials. Bustos and colleagues developed a CNN, leveraging its representation learning capability, to extract medical knowledge reflecting eligibility criteria from clinical trials (57). They reported promising results for CNNs compared with state-of-the-art classification algorithms including FastText (58), SVM, and k-nearest neighbors (kNN). Shivade and colleagues (59) and Zhang and colleagues (60) developed SVMs to automate the classification of eligibility criteria to facilitate trial matching for specific patient populations.
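For readers unfamiliar with these classical baselines, the following sketch (illustrative only, with hypothetical criteria and labels, not the cited authors' pipelines) classifies eligibility-criterion sentences using a TF-IDF representation and a linear SVM in scikit-learn:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Hypothetical eligibility-criterion sentences labeled by broad category.
criteria = [
    "Histologically confirmed stage III or IV non-small cell lung cancer",
    "No prior chemotherapy within 6 months of enrollment",
    "ECOG performance status 0-1",
    "Adequate renal function (creatinine clearance > 60 mL/min)",
]
labels = ["diagnosis", "prior_treatment", "performance_status", "organ_function"]

# TF-IDF unigrams and bigrams feed a linear SVM classifier.
clf = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LinearSVC())
clf.fit(criteria, labels)

print(clf.predict(["Measurable stage IV disease per RECIST 1.1"]))
```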
Yala and colleagues (50) and Osborne and colleagues (61) used Boostexter (62) and MetaMap (13, 14), respectively, together with rule-based regular expressions, to automatically extract relevant patient information from EMRs, predominantly free-text reports, to identify patient cohorts with characteristics of interest for clinical trials or other relevant reporting. A panoply of commercial solutions is also emerging in this space, but our search did not reveal any publications by these commercial entities.
Application: pharmacovigilance and pharmacoepidemiology
Pharmacovigilance, drug safety surveillance, and factors associated with nonadherence play an important role in improving patient outcomes by personalizing cancer treatments, monitoring and understanding adverse drug events (ADE), and minimizing the risks associated with different therapies. The 2016 survey article identified outcomes extraction as one of the challenges for cNLP, in large part because it depends on temporality extraction. With the advances in temporality extraction in the last 3 years (see section Extracting Temporality and Timelines), methods for outcomes extraction have also improved.
A variety of methods have been explored, including logistic regression, SVM, random forest, decision tree, and DL, to analyze EMR data to predict treatment prescription, quality of care, and health outcomes of patients with cancer. Using data from the SEER (3) cancer registry as the gold standard for cancer stage, and variables extracted from linked Medicare claims data, Bergquist and colleagues (63) classified patients with lung cancer receiving chemotherapy into different stages of severity with a hybrid method of rules and ensemble machine learning algorithms. This system achieved 93% accuracy, demonstrating its potential application to studying the quality of care and health outcomes for patients with lung cancer.
Survival analysis plays an important role in clinical decision support. In oncology care, the choice of treatment depends greatly on prognosis, which can be difficult for physicians to determine. Gensheimer and colleagues (64) proposed a hybrid pipeline that combines semantic data mining with neural embeddings of sequential clinical notes and outputs the probability of a life expectancy greater than 3 months.
Yang and colleagues (65) applied a tensorized RNN to sequential clinical records to extract a latent representation of the entire patient history and used it as the input to an Accelerated Failure Time (AFT) model to predict the survival time of patients with metastatic breast cancer. Yin and colleagues (66) applied word embeddings to discover topics in patient-provider communications associated with an increased likelihood of early treatment discontinuation in the adjuvant breast cancer setting. Overall, treatment toxicity extraction remains an open research area.
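To ground the survival-modeling step, the following is a minimal sketch of an Accelerated Failure Time fit using the `lifelines` library; it substitutes random stand-in features for the RNN-derived latent representations and uses hypothetical follow-up times, so it illustrates the model class rather than the cited method.

```python
import numpy as np
import pandas as pd
from lifelines import WeibullAFTFitter

rng = np.random.default_rng(0)
n = 200
# Stand-in for latent features extracted from sequential clinical notes.
df = pd.DataFrame(rng.normal(size=(n, 3)), columns=["latent_1", "latent_2", "latent_3"])
df["survival_months"] = rng.exponential(scale=24, size=n) + 1  # hypothetical follow-up times
df["event_observed"] = rng.integers(0, 2, size=n)              # 1 = death observed, 0 = censored

aft = WeibullAFTFitter()
aft.fit(df, duration_col="survival_months", event_col="event_observed")

# Predicted median survival time for the first few (hypothetical) patients.
print(aft.predict_median(df[["latent_1", "latent_2", "latent_3"]].head()))
```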
Shareable Resources for NLP in Oncology
Recent years have seen cancer cNLP tasks tackled occasionally at mainstream NLP conferences and affiliated workshops (in open-domain NLP, top research is preferentially presented at conferences). Although still relatively rare, this has the potential to greatly benefit cancer cNLP research, with a larger community of NLP researchers working directly on these problems in addition to the more specialized cNLP community. The prerequisite for this trend to continue is access to shareable data resources, as also pointed out in the 2016 survey article. The colon and brain cancer THYME corpus was used in several general-domain conference and workshop articles (37, 38, 40, 67–69), a radiology report dataset from a 2007 challenge (available from ref. 70) was used in another (71), and a SEER-provided corpus (unshared and thus not available for distribution) was used in yet another (72). Other work has used ad hoc resources for methods development, but this is a less sustainable model given the rarity of expertise in both cancer and NLP (73–75). A recently developed resource provides gold-standard annotations of the semantics of sentences in notes describing patients with cancer (76). More shared resources, community challenges, and publicity for both will likely lead to more focused development of new methods for cancer information extraction, a challenge that the community needs to address.
Application at the Point of Care
The focus of our survey is on NLP technologies for cancer translational studies. However, we briefly review the applications of these technologies for direct patient care, which has rightfully proceeded with caution given that even small system error rates could lead to harm. Lee and colleagues (77) studied the concordance of IBM Watson for Oncology, a commercial NLP-based treatment recommendation system, with the recommendations of local experts and found it to be 48.9%. Similar results are reported in refs. 78 and 79. Furthermore, such applications are treated as Software as a Medical Device (SaMD) by the FDA, which, justifiably, is a high bar to clear (80, 81). Some cautious use cases provide assistance to physicians (82, 83) in the form of question answering and summarization. Voice tools in health care, which represent a distinct subdomain of NLP, are primarily used for (i) documentation; (ii) commands; and (iii) interactive response and navigation for patients (84).
Implications and Future Directions
As discussed above, NLP technology for cancer has made strides since the 2016 survey article, which stated that at that time “oncology-specific NLP is still in its infancy.” Given the breadth and depth of the research we surveyed in the current article, we believe the field has expanded, enabled by state-of-the-art methods and abundant digital EMR data. We also observe more collaborations between NLP researchers and oncologists, which was one of the take-away lessons from Yim and colleagues.
State-of-the-art machine learning methods require significant amounts of human-labeled data to learn from, which is expensive and time-consuming to produce. This motivates a methodologic shift toward learning paradigms that exploit vast unlabeled datasets (lightly supervised or unsupervised methods). Another challenge lies in the portability of machine learners, as they represent the distributions of the data they learned from; if translated to a domain with a different distribution (e.g., colorectal to brain cancer), there is a substantial drop in performance (see section Extracting Temporality and Timelines). Thus, domain adaptation remains an unsolved and active scientific problem. Large-scale translational science is likely to cross country boundaries and harvest data from EMRs written in a variety of languages; therefore, the cNLP research community needs to think about multilingual machine learning to enable such bold studies. On the hardware side, DL methods require vast computational resources available to only a few and not necessarily attainable through cloud computing environments. Last but not least, the ethical considerations of applying these powerful technologies should be discussed, at a minimum whether the underlying data on which machine learners are trained represent the whole of human diversity.
In research, real-world big data have great potential to improve cancer care. Gregg and colleagues present risk stratification research for prostate cancer (85). The utilization of real-world big data is a key focus area of the NCI (86). SEER and NCDB, the two major cancer registry databases in the United States, have limitations in coverage, accuracy, and granularity that introduce bias (3, 4, 87–90). Currently, database building requires manual annotation of clinical free-text, which is resource intensive and prone to human error. cNLP can support more rapid, large-scale, and standardized database development. Automated, semiautomated, and accurate identification of cancer cases will be particularly helpful in studying underrepresented patient populations and rare cancers. In addition, cNLP can facilitate analysis of unstructured data that are poorly documented in databases but widely accepted to be critical for prognostication and management decision-making, most notably patient-reported outcomes (91). Our hope is that larger, more accurate, and more granular clinical databases can be integrated with -omics databases to enable translational research that better elucidates oncologic phenotype relationships. This data convergence has the potential to enable new insights about cancer initiation, progression, metastasis, and response to treatment.
Although NLP has yet to make major inroads in the clinical setting, some of the potential applications are clear. Direct extraction of cancer phenotypes from source data (pathology and radiology reports) could reduce redundancy and prevent ambiguity within a patient's chart, minimizing confusion and medical errors. Summarization and information retrieval applications can reduce search burden and enable clinicians to spend more time with their patients. Clinical decision support tools could help reduce the increasingly burdensome cognitive load placed on clinicians, although the results reported thus far by efforts such as IBM Watson for Oncology raise serious concerns about what the bar for accuracy of clinical recommendations should be for routine use. In fact, these results are a cautionary tale of the challenges of domain adaptation; the software was widely reported to have been trained on hypothetical cases at a highly specialized cancer center, leading to incorrect and possibly unsafe recommendations (92). At this time, NLP technology is not yet ripe for direct patient care except in carefully observed scenarios.
Conclusion
cNLP has the potential to affect almost all aspects of the cancer care continuum, and multidisciplinary collaboration is necessary to ensure optimal advancement of the field. As there are few individuals with expertise in both oncology and NLP, clinical oncologists, basic and translational scientists, bioinformaticians, and epidemiologists should work with computer scientists to identify and prioritize the most important clinical questions and tasks that can be addressed with this technology. Furthermore, oncology subject matter experts will be needed to create gold-standard datasets. Once an NLP technology is developed, oncologists and cancer researchers should take a primary role in evaluating it to determine its utility for research and its clinical value. Although standards for the clinical evaluation of software, including artificial intelligence systems, are evolving (93), NLP tools that directly affect management decisions should be considered for evaluation in a trial setting by clinical investigators familiar with the technology and FDA guidelines (80). In partnership, computer scientists, oncology researchers, and clinicians can take full advantage of recent advances in NLP technology to fully leverage the wealth of data stored and rapidly accumulating in our EMRs.
Disclosure of Potential Conflicts of Interest
No potential conflicts of interest were disclosed.
Acknowledgments
The work was supported by funding from U24CA184407 (NCI), U01CA231840 (NCI), R01 LM 10090 (LM), and R01GM114355 (NIGMS). This work has been supported in part by the Joint Design of Advanced Computing Solutions for Cancer (JDACS4C) program established by the U.S. Department of Energy (DOE) and the National Cancer Institute (NCI) of National Institutes of Health. This work was performed under the auspices of the U.S. Department of Energy by Argonne National Laboratory under Contract DE-AC02-06-CH11357, Lawrence Livermore National Laboratory under Contract DE-AC52-07NA27344, Los Alamos National Laboratory under Contract DE-AC5206NA25396, and Oak Ridge National Laboratory under Contract DE-AC05-00OR22725. This manuscript has been authored by UT-Battelle, LLC under Contract No. DE-AC05-00OR22725 with the U.S. Department of Energy. The United States Government retains and the publisher, by accepting the article for publication, acknowledges that the United States Government retains a non-exclusive, paid-up, irrevocable, world-wide license to publish or reproduce the published form of the manuscript, or allow others to do so, for United States Government purposes. The Department of Energy will provide public access to these results of federally sponsored research in accordance with the DOE Public Access Plan (http://energy.gov/downloads/doe-public-access-plan).