Real-world evidence (RWE), conclusions derived from the analysis of patients treated outside of clinical trials, is increasingly recognized as an opportunity for discovery, for reducing disparities, and for contributing to regulatory approval. The value of RWE may be maximized through machine-learning techniques that integrate and interrogate large, otherwise underutilized datasets. In cancer research, an ongoing challenge for RWE is the lack of reliable, reproducible, and scalable assessment of treatment-specific outcomes. We hypothesized that a deep-learning model could be trained to use radiology text reports to estimate gold-standard, RECIST-defined outcomes. Using text reports from patients with non–small cell lung cancer treated with PD-1 blockade in a training cohort and two test cohorts, we developed a deep-learning model that accurately estimates best overall response and progression-free survival. Our model may serve as a tool to determine outcomes at scale, enabling analyses of large clinical databases.

Significance:

We developed and validated a deep-learning model trained on radiology text reports to estimate gold-standard objective response categories used in clinical trial assessments. This tool may facilitate analysis of large real-world oncology datasets using objective outcome metrics determined more reliably and at greater scale than currently possible.


Most therapeutic advances in oncology result from large, prospective clinical trials. However, fewer than 10% of adults with cancer participate in clinical trials (1, 2), and the vast majority of potential clinical evidence generated outside of trials remains unexamined or underexamined; as a result, mechanisms for generating new data and insight are inefficient. Real-world evidence (RWE), which leverages data from the tremendous scale of patients treated outside of clinical trials, is an important, emerging opportunity for discovery (3).

Large-scale RWE reports thus far have largely focused on variables such as demographic features, clinicopathologic details, treatment records, and overall survival using structured data points that can be readily retrieved from the medical record (4). However, a particular challenge in maximizing RWE in translational and clinical cancer research is obtaining reliable assessments of treatment-specific outcomes (e.g., objective response, progression-free survival). In clinical practice, radiology reports typically provide a qualitative narrative but do not make standardized, quantified assessments of response or progression. For this reason, prospective clinical trials employ standardized assessments such as RECIST (5). Although RECIST can sometimes be applied to retrospective datasets (6, 7), the substantial time and effort required of trained radiologists to perform these reads retrospectively limits the scale of analyses that are feasible. Alternatives include proxy endpoints (such as time to treatment failure; ref. 8) or manual review of reports by trained data abstractors (9, 10), but these approaches are still limited by imprecision, interobserver variability, and scalability.

To address the need for scalable methods to retrospectively assess objective, treatment-specific outcomes, we hypothesized that a machine-learning model could be developed to estimate RECIST outcomes from clinical radiology text reports, trained on gold-standard RECIST reads performed by trained radiologists.

To explore this hypothesis, 2,977 text reports from 453 patients with non–small cell lung cancer (NSCLC) treated with PD-1/PD-L1 blockade were separated into independent training (n = 361, 80%) and internal test (n = 92, 20%) cohorts from Memorial Sloan Kettering Cancer Center (MSK; Fig. 1A; Supplementary Table S1). RECIST reads, performed on all patients by trained thoracic radiologists through primary review of the cross-sectional images using a standardized RECIST form (Supplementary Fig. S1), served as ground truth for best overall response (BOR), occurrence of progression, and date of progression.

Figure 1.

DL model methodologic approach and performance in training set. A, Patients with metastatic NSCLC who received immune checkpoint inhibitors (MSK cohort) were identified and divided into training (n = 361, 80%) and test (n = 92, 20%) cohorts. Clinical text reports from imaging studies performed at baseline and on treatment were used as input for training the supervised learning model (DL model). Ground truth for each patient was defined by RECIST assessments performed by thoracic radiologists reviewing the scans. The trained model was evaluated in the internal test cohort, and again in an external test cohort (MGH; n = 97). B, Three different methods of inputting the text reports were assessed (Methods A, B, and C). Briefly, Method A compared the baseline scan to an amalgam of all follow-up scan reports, and Methods B and C compared paired scan reports. Please see the Methods section for more details. A fully connected neural network was employed to develop the deep natural language processing model and test the performance of each method. The outputs were best overall response (BOR), progression during treatment (Yes/No), and specific date of progression (the latter two were used to define progression-free survival [PFS]). tanh, hyperbolic tangent activation function; CR, complete response; PR, partial response; SD, stable disease; POD, progression of disease; PFS, progression-free survival. C, Confusion matrices of the three methods of scan review explored in development of the DL model, color coded by predicted label/true label for that category. Sensitivity of each category is labeled along the diagonals. D, ROCs for the DL model using Methods A (AUC, 0.90), B (AUC, 0.84), and C (AUC, 0.81) to determine BOR (CR/PR vs. SD/POD). E, ROCs for the DL model using Methods A (AUC, 0.79), B (AUC, 0.69), and C (AUC, 0.68) to determine occurrence of progression (Yes/No). F, Confusion matrix of progression occurring (Yes/No) using Method A. G, Concordance of time to progression using Method A as input for the DL model compared with gold-standard RECIST (Spearman r coefficient: 0.84, P < 0.001). Dots represent individual patients.


We implemented a fully connected deep natural language processing model (DL model) for response classification (Fig. 1B). The model input consisted only of the clinical text reports describing the findings of the cross-sectional imaging. The model was presented with three distinct approaches for organizing the input text report data. Using an attention map as a visualization method for a representative example, we confirmed that expected words and phrases were used by the model (Supplementary Fig. S2A), and the loss function quickly stabilized during training for both tasks (Supplementary Fig. S2B and S2C). During training, Method A, comparing baseline scan reports to an amalgam of all follow-up scan reports, demonstrated the highest sensitivity and specificity (Fig. 1C; Supplementary Fig. S3A). When BOR was considered as two categories [complete response or partial response (CR/PR) vs. stable disease or progression of disease (SD/POD)], Method A also had the best performance (AUC, 0.90; Fig. 1D) and accurately determined CR/PR in 84% of patients and SD/POD in 96% of patients. The most common errors occurred between neighboring response groups (e.g., PR incorrectly predicted to be SD, or SD incorrectly predicted to be POD), whereas very few patients were misclassified across extreme response groups (e.g., PR incorrectly predicted to be POD; Fig. 1C).

The DL model performed best when using only scan reports that had been used for RECIST reads, although accuracy remained high when using all radiology reports during the treatment period (Supplementary Fig. S3B).

We next evaluated the DL model in determining progression-free survival (PFS). Interpretation of PFS requires knowledge of both whether progression occurred (i.e., event vs. censor) and the time at which it occurred. For assessing whether progression occurred during the treatment period, Method A again performed best in determining progression event versus censor (AUC, 0.79; Fig. 1E and F). The model accurately categorized 93% of patients with a progression event by RECIST (n = 283, 78% of the training cohort) and 65% of patients without a progression event by RECIST (n = 78, 22%). To assess the date of progression, Method A could not be applied because it considered on-treatment scans in aggregate rather than as discrete events; therefore, only Methods B and C, which considered each on-treatment scan report separately, were evaluated. The date of progression used to calculate PFS was determined by binarizing each follow-up scan report and taking the date of the scan with the highest probability of progression greater than 0.5. Method B was more accurate, with the exact date of progression matching RECIST in 65% of patients (185/283). Ultimately, to optimize the model's prediction of PFS in a way that reflects RECIST progression determination using multiple scans, an ensemble method combining Methods A and B was developed to determine PFS (Supplementary Table S2). With this approach, the exact date of progression matching RECIST was determined in 70% of patients (246/351), and the predicted date fell within 2 months of the RECIST date of progression in 79% (overall Spearman correlation for our model vs. RECIST-determined PFS, 0.83; P < 0.001, two-tailed; Fig. 1G; log-rank HR = 1.0; Supplementary Fig. S4A).

The performance of the model was next assessed in the held-out, internal test cohort (n = 92, 20% of the MSK cohort) using Method A for BOR and the ensemble Method A/B for PFS. Similar to the training cohort, the DL model correctly estimated RECIST CR/PR in 84%, SD in 80%, and POD in 89% of cases (specificity 96%, 88%, and 93%, respectively) and accurately differentiated between SD and POD (80% and 90%, respectively; Fig. 2A and B). The DL model correctly determined that progression had occurred in 85% of cases. The date of progression was determined exactly in 73% of cases (log-rank HR = 1.0; Supplementary Fig. S4B) and within 2 months of the RECIST date in 80% (Fig. 2B).

Figure 2.

Performance of the DL model in the internal test cohort and correlation with clinical outcomes. A, Confusion matrix of CR/PR vs. SD vs. POD for the internal test cohort color-coded by predicted label/true label for that category. Sensitivity of each category is labeled along the diagonals. B, Relative accuracy of the DL model in the training cohort compared to the internal test cohort (CR/PR: 84% vs. 84%; SD/POD: 96% vs. 96%; Progression Y/N: 86% vs. 85%; exact progression date: 70% vs. 73%; progression date within 2 months 79% vs. 80%, respectively). C, For the training cohort, overall survival of the DL model predicted responders and nonresponders was similar to RECIST estimated responders and nonresponders [DL model: responders vs. nonresponders, median OS: not reached (NR) vs. 11.7 months, HR = 0.22, P < 0.001; RECIST: median OS: NR vs. 11.7 months, HR = 0.24, P < 0.001]. D, In the internal test cohort, overall survival of the DL model predicted responders and nonresponders was similar to RECIST estimated responders and nonresponders (DL model: responders vs. nonresponders HR = 0.20, P < 0.001; RECIST: HR = 0.28, P < 0.001).


In both the training cohort and internal test cohort, the impact of response categories determined by the DL model versus RECIST on overall survival (OS) was nearly identical. In the training cohort, the HR for improved OS in patients with CR/PR versus SD/POD by the DL model was 0.19 [median not reached (NR) vs. 11.6 months, P < 0.001] versus 0.24 by RECIST (median NR vs. 11.7 months, P < 0.001; Fig. 2C). In the test cohort, the HR was 0.20 by the DL model (median 53 vs. 11 months, P < 0.001) and 0.28 by RECIST (median 53 vs. 12 months, P < 0.001; Fig. 2D).

To evaluate the generalizability of the DL model, we next assessed an independent, external test cohort from Massachusetts General Hospital (MGH) consisting of 1,234 scan reports from 97 patients with lung cancer treated with PD-1/PD-L1 blockade (Supplementary Table S1). The DL model was applied without further retraining (sensitivity/specificity: CR/PR 69%/94%, SD 43%/72%, POD 70%/75%; Fig. 3A). The accuracy of determining responders (RECIST-defined CR/PR) was 69% (20/29) versus 94% (64/68) for nonresponders (RECIST-defined SD/POD; Fig. 3B). Similar to the performance in the MSK cohorts, sensitivity and specificity were lowest for the SD category (43%, 12/28), whereas POD was accurately determined in 70% (28/40). The model accurately determined whether a progression event had occurred in 82% of cases. The exact PFS date was correctly identified in 59% of cases (log-rank HR = 1.0; Supplementary Fig. S4C), and the predicted RECIST PFS date fell within 2 months in a total of 82% of cases. The overall survival impact of the DL model–determined response categories again closely mirrored RECIST (OS of the DL model PR/CR vs. SD/POD: HR 0.23, median NR vs. 9.9 months, P < 0.001; OS of RECIST PR/CR vs. SD/POD: HR 0.19, median NR vs. 9.3 months, P < 0.001; Fig. 3C).

Figure 3.

Performance of the DL model in the external test cohort (n = 97). A, Confusion matrix of CR/PR versus SD versus POD for the external test cohort color coded by predicted label/true label for that category. Sensitivity of each category is labeled along the diagonals. B, Relative accuracy of the DL model in the external test cohort (CR/PR: 69%; SD/POD: 94%; Progression Y/N: 83%; exact progression date: 62%; PFS date within 2 months: 83%) compared with the training and internal test cohort. C, Overall survival in the DL model predicted responders and nonresponders in external test cohort is similar to RECIST responders versus nonresponders (RECIST: HR = 0.20, P < 0.001 vs. DL model: HR = 0.3, P < 0.001).


We also evaluated the performance of the DL model against surrogate endpoints currently common in RWE: time to treatment discontinuation (TTD) and manual review of radiology reports by a trained clinician. Among gold-standard responders (RECIST CR/PR), TTD was significantly shorter than PFS by RECIST (HR 0.6, median 18.3 vs. 11.3 months, P = 0.003), whereas PFS by the DL model closely mirrored PFS by RECIST (HR 1.2, P = 0.5). Among nonresponders (RECIST SD/POD), RECIST-defined PFS, DL model–defined PFS, and TTD were similar (Fig. 4A). The same conclusions were drawn in the MGH external test cohort (Fig. 4B).

Figure 4.

Comparison of RECIST with most commonly reported real-world data endpoints. A, Curves depict comparison of outcomes defined by time to treatment discontinuation (TTD), DL model-determined PFS, or RECIST-defined PFS, categorized on the basis of response or nonresponse to treatment in the training cohort. Among responders, TTD was different from RECIST-defined PFS (P = 0.003) and DL model–defined PFS (P < 0.001), whereas PFS determined by the DL model versus RECIST was similar (P = 0.6). Among nonresponders, TTD, the DL model, and RECIST all performed similarly. B, In the external test cohort, outcomes defined by treatment discontinuation and progression-free survival (DL model PFS or RECIST PFS). Among responders, TTD was different from RECIST-defined PFS (P < 0.001) and DL model PFS (P = 0.001), whereas PFS determined by the DL model versus RECIST was similar (P = 0.8). In nonresponders, TTD, the DL model, and RECIST performed similarly. C, Manual review of 92 patients was performed by a trained medical oncologist (K.C. Arbour) and the DL model was compared with RECIST. Each individual patient is represented as a column and assessed across the three outcomes of interest (BOR, progression at any time [Y/N], PFS). For each outcome, the DL model and manual review were compared to RECIST.


We compared the performance of the DL model to manual review of the reports, which is often done for real-world retrospective studies. Text reports that served as the input of the model from the internal test MSK cohort (n = 92) were manually reviewed by a trained medical oncologist. The average time spent per case was 2.1 minutes (standard deviation, 1.3 minutes). Similar to the DL model performance, manual review correctly predicted RECIST BOR in 85/92 cases (92%, 95% CI, 85%–97%; vs. DL model 93%, 95% CI, 86%–97%), determination of progression was correct in 83/92 cases (90%, 95% CI, 82%–95%; vs. DL model 85%, 95% CI, 76%–91%), and exact date of RECIST progression was correct in 73/92 cases (79%, 95% CI, 70%–87%; vs. DL model 86%, 95% CI, 76%–91%) (Fig. 4C).

In conclusion, we have developed a deep-learning model to estimate RECIST response assessments using the text of clinical radiology reports in patients with advanced NSCLC treated with PD-1/PD-L1 blockade. In addition to determining best overall response, the model accurately evaluated the occurrence and date of progression, enabling assessment of RECIST PFS. Response assessments predicted by the DL model showed close similarity to RECIST response categorization with respect to long-term impact on overall survival. The model's accuracy in the external test cohort suggests that performance is not dependent on institution-specific radiology reporting style, although further work is needed to establish generalizability to other diseases and other treatment regimens.

Deep learning has demonstrated great promise in medical image classification and natural language processing tasks, including in radiology (11, 12). Prior studies at the intersection of natural language processing and radiology have used models to extract data related to RECIST criteria from clinical text reports, such as identifying (13) and pairing (14) words associated with measurements. To our knowledge, the DL model approach we report here is the first effort to use machine learning to estimate RECIST response from clinical text reports. Such a model has the potential to address one of the key rate-limiting steps in developing large-scale, treatment-specific effectiveness data for analysis of RWE. Previous efforts have relied on manual data abstraction, an approach that can be resource- and time-intensive. For example, the AACR Project GENIE (15) database, a genomic database created by a multi-institution, international data-sharing consortium, contains 12,525 cases of NSCLC; it would take hundreds or thousands of hours of manual curators' time to determine response to a single line of therapy for each patient. A deep-learning model such as ours could substantially improve the efficiency of evaluating response in a cohort of this size, making similar large-scale efforts, which are likely to become routine in the future, feasible. Separately, a machine-learning model was developed to predict data abstractors' assessments of clinical reports from a single institution (16), although it was not benchmarked against the gold-standard RECIST outcomes used in clinical trials and for regulatory approval.

A limitation of this approach is our inability to determine at scale, in real time, which words or phrases the model focused on. However, we provide an example attention map confirming that words and phrases we expect to be important to the classification task are indeed those that drew the model's attention (Supplementary Fig. S2A). The model performed well on both the internal held-out and independent external test cohorts, further highlighting its potential usefulness; nonetheless, our approach will need further external validation prior to application to additional cohorts. Of note, this model is intended to be applied exclusively to estimate summary efficacy outputs by RECIST using radiology text reports for the purposes of retrospective research; it is not intended to mirror the specific methodology of RECIST nor to replace any prospective use of RECIST in clinical trials. We also acknowledge that opportunities for further model refinement remain, including improvements in the accuracy of predicting the specific date of progression, which we anticipate will be possible with larger cohorts of patients. Because of the low rate of pseudoprogression in NSCLC (including no cases in our analysis) and the routine use of standard RECIST in the pivotal trials of PD-1/PD-L1 blockade in lung cancer, our model did not evaluate alternative response criteria (17); in particular, it is not suited for assessment of pseudoprogression. Further study, refinement, and validation are needed to investigate the role of natural language processing models in guiding real-time RECIST response determination.

These initial results offer a promising proof of principle that machine-learning methods can harness the vast amounts of clinical data that remain a largely untapped resource outside of clinical trials. Our DL model demonstrates that this approach is both feasible and accurate, and it offers a novel way to analyze treatment outcomes in the real world.

Patients

Retrospective review was performed under an institutional review board (IRB)–approved clinical research protocol at MSK and an IRB-approved clinical research protocol at MGH and was conducted in accordance with the U.S. Common Rule and the Federal-wide Assurance for the Protection of Human Subjects (FW00004998). Patients included in this analysis were treated at MSK or MGH. In the MSK cohort, consecutive imaging reports from a total of 453 patients with advanced NSCLC treated with a PD-1 or PD-L1 inhibitor as first- or second-line therapy from July 2012 to November 2016 were included in the analysis. Baseline characteristics of the included patients are outlined in Supplementary Table S1. Database lock for PFS and OS analysis was performed in October 2018. In the MGH cohort, imaging reports from a total of 97 patients with advanced NSCLC treated with a PD-1 or PD-L1 inhibitor between June 2013 and September 2019 were included. Database lock for PFS and OS analysis was performed in November 2019.

Radiology Reports

In the MSK cohorts, a total of 2,977 clinical imaging reports from the 453 patients were included in the analysis. The clinical text reports were used as inputs for our model. In addition to the clinical reports, RECIST reads were performed by trained thoracic radiologists (A.J. Plodkowski, M.S. Ginsberg, and J. Girshman) on all patients by primary review of the cross-sectional imaging. For each patient, a thoracic radiologist (blinded to the outcome for retrospective reads) compared the baseline pretreatment cross-sectional scan with each subsequent on-treatment scan using RECIST v1.1 (5). Supplementary Figure S1 shows an example of the standardized form used by radiologists to record RECIST results. RECIST-determined BOR, occurrence of progression, and date of progression served as the ground truth during training and as the gold standard in the test sets. Non–cross-sectional imaging reports (e.g., chest X-ray) were not included for review. RECIST reads were performed prospectively as part of clinical trials in 199 patients (43%) and retrospectively in 254 patients (56%) treated with standard-of-care therapy. Of the text radiology reports, 2,279 (77%) were used in determining RECIST reads, and 698 (23%) were not considered in determining RECIST BOR and progression-free survival. The median interval between scans in this cohort was 53 days [interquartile range (IQR), 41–67 days].

In the MGH cohort, a total of 1,238 clinical imaging reports from the 97 patients were included in the analysis. As with the MSK cohort, RECIST reads were performed by trained thoracic radiologists (including S.R. Digumarthy), prospectively in patients treated on clinical trials (58%, 56/97) and retrospectively in patients treated with standard-of-care therapy (42%, 31/97). The median interval between scans in this cohort was 42 days (IQR, 33–56 days).

DL Methodology

Patients in the MSK cohort were separated into independent training (n = 361, 80%) and internal test (n = 92, 20%) cohorts. The test cohort was held out, blinded, and not analyzed until the model derived from the training cohort was finalized. The algorithm was tasked with the following goals: determining the objective response category [categorized as two groups (CR/PR vs. SD/POD) or three groups (CR/PR, SD, or POD)], determining whether there was progression of disease at any point (binary, event vs. censor), and, if progression occurred, identifying the exact date of progression in order to determine progression-free survival (continuous time in days from the start of therapy to the date of scan-documented progression or to the last scan if censored). Because the task was focused on determining response from radiology text reports, the model did not take mortality into account, and patients who died without radiologic progression were considered censored for the purposes of training and validating the progression-free survival analyses.

Clinical text reports describing the findings of the cross-sectional imaging in the training cohort were classified as baseline, on-treatment, or progression. In the test cohorts, clinical text reports were classified only as baseline or on-treatment (as the model was tasked with identifying progression). Three different methods for organizing the input were explored in the training cohort to identify the optimal way to organize paired longitudinal data (Fig. 1B). All three methods take the same total information as input but in different combinations, mirroring how radiologists examine scans to determine response and progression. Method A took a "big picture" approach by aggregating text from all follow-up scans. Method B paired each scan after the baseline scan with the one immediately preceding it. Method C paired each scan with the original baseline scan. The three methods were then compared against RECIST response and progression assessments to determine their accuracy.
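To make the three input-organization schemes concrete, the following is a minimal illustrative sketch (not the authors' code) of how the report pairs could be constructed; the function name, the string inputs, and the assumption that follow-up reports are ordered chronologically are ours.

from typing import List, Tuple

def build_pairs(baseline: str, followups: List[str], method: str) -> List[Tuple[str, str]]:
    """Return (reference report, comparison report) text pairs for one patient.

    followups is assumed to be ordered chronologically by scan date.
    """
    if method == "A":
        # Method A: baseline report vs. an amalgam of all follow-up reports
        return [(baseline, " ".join(followups))]
    if method == "B":
        # Method B: each report vs. the report immediately preceding it
        previous = [baseline] + followups[:-1]
        return list(zip(previous, followups))
    if method == "C":
        # Method C: each follow-up report vs. the original baseline report
        return [(baseline, f) for f in followups]
    raise ValueError(f"Unknown method: {method}")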

For the model, we implemented a fully connected DL model in Python v3.7.1 using TensorFlow (18). The network's architecture consisted of an encoding layer, an interaction layer, and two fully connected layers with a hyperbolic tangent (tanh) activation function, followed by an output layer with softmax activation function (Fig. 1B).

Encoding Layer.

This layer accepted a pair of tokenized inputs for comparison. In Method A, all follow-up scans were combined and compared to baseline scans. In Method B, the first follow-up scan was compared with the baseline scan and each follow-up scan was compared with the scan immediately preceding it. In Method C, each subsequent follow-up scan was compared separately to the baseline scan. Each clinical imaging report was treated as a sequence of tokens and embedded into the vector space using Global Vectors for Word Representation (GloVe) pretrained embeddings (19).
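As a minimal sketch of this step, assuming GloVe vectors are available as a plain-text file (one token and its vector per line), the tokenization and embedding could look like the following; how token vectors are pooled into a single report representation is not specified in the text, so the mean pooling used here is an assumption.

import numpy as np
from nltk.tokenize import word_tokenize  # NLTK tokenizer, as used in the paper

def load_glove(path: str) -> dict:
    """Read pretrained GloVe embeddings into a {token: vector} dictionary."""
    embeddings = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.rstrip().split(" ")
            embeddings[parts[0]] = np.asarray(parts[1:], dtype=np.float32)
    return embeddings

def encode_report(text: str, glove: dict, dim: int = 300) -> np.ndarray:
    """Lower-case and tokenize one report, then pool its GloVe vectors (assumed mean pooling)."""
    tokens = word_tokenize(text.lower())
    vectors = [glove[t] for t in tokens if t in glove]
    return np.mean(vectors, axis=0) if vectors else np.zeros(dim, dtype=np.float32)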

Interaction Layer.

In this layer, two vector representations were concatenated (to capture the information in each vector), added (to capture the supplementary relationship between two vectors), subtracted (to capture the difference between two vectors), and multiplied (to capture the similarity between two vectors). All five output vectors were then combined into a single vector, which fully represented the interaction between the two inputs.
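Interpreting the "five output vectors" as the two encoded reports themselves plus their sum, difference, and elementwise product, a sketch of the interaction layer (our interpretation, not the authors' code) could be:

import tensorflow as tf

def interaction(v1: tf.Tensor, v2: tf.Tensor) -> tf.Tensor:
    """Combine two report encodings into a single interaction vector:
    the two vectors themselves (concatenation), their sum, their difference,
    and their elementwise product."""
    return tf.concat([v1, v2, v1 + v2, v1 - v2, v1 * v2], axis=-1)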

Fully Connected Layers.

These layers used the interaction layer's output to learn which features correlated with a particular output class: the output of the interaction layer was passed through two fully connected layers, each with a tanh activation function.

Output Layer.

The output of the fully connected layers was passed through a softmax activation function to produce a score representing the probability of each class of interest. For the BOR task, there were three output classes: CR/PR, SD, and POD. For the progression and progression-date tasks, there were two output classes: Yes and No. Each task was predicted using a separate network with its own output.
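A minimal sketch of one such task-specific network follows, assuming a hidden-layer width of 300 nodes (one of the sizes explored during tuning), an interaction-vector input, and that the 0.8 dropout value described later is a drop rate; these specifics are assumptions for illustration, not the authors' implementation.

import tensorflow as tf

def build_classifier(interaction_dim: int, n_classes: int,
                     hidden: int = 300, dropout: float = 0.8) -> tf.keras.Model:
    """Two fully connected tanh layers followed by a softmax output layer,
    matching the architecture described in the text."""
    inputs = tf.keras.Input(shape=(interaction_dim,))
    x = tf.keras.layers.Dense(hidden, activation="tanh")(inputs)
    x = tf.keras.layers.Dropout(dropout)(x)
    x = tf.keras.layers.Dense(hidden, activation="tanh")(x)
    x = tf.keras.layers.Dropout(dropout)(x)
    outputs = tf.keras.layers.Dense(n_classes, activation="softmax")(x)
    return tf.keras.Model(inputs, outputs)

# Separate networks per task, as described in the text:
# bor_model = build_classifier(interaction_dim=1500, n_classes=3)          # CR/PR, SD, POD
# progression_model = build_classifier(interaction_dim=1500, n_classes=2)  # Yes, No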

For Method A, given that each patient had only one input pair, the final prediction was the class with the highest probability. For Methods B and C, each patient had multiple input scan pairs, each yielding an output probability; the final prediction for each task was the class with the highest combined probability across all of the prediction outputs. For predicting progression, the final predicted PFS date was the date of the follow-up scan with the highest "Yes"-class probability greater than 0.5 among all of the patient's scan pairs. If none of the probabilities was greater than 0.5, the patient was considered not to have progressed.
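A sketch of these per-patient aggregation rules follows; "highest combined probability" is interpreted here as summing class probabilities across scan pairs, and the data structures are assumptions for illustration.

from typing import List, Optional
import numpy as np

def combine_class_probabilities(per_pair_probs: List[np.ndarray]) -> int:
    """Methods B/C: sum class probabilities across all scan pairs and return
    the index of the class with the highest combined probability."""
    return int(np.sum(np.asarray(per_pair_probs), axis=0).argmax())

def predict_progression_date(scan_dates: List[str], prob_yes: List[float]) -> Optional[str]:
    """Return the date of the follow-up scan with the highest "Yes" probability
    above 0.5; return None (censored, no progression) if no probability exceeds 0.5."""
    probs = np.asarray(prob_yes, dtype=float)
    if probs.size == 0 or probs.max() <= 0.5:
        return None
    return scan_dates[int(probs.argmax())]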

Implementation Details.

The text in each report was lower-cased and tokenized using the Natural Language Toolkit (NLTK). We applied dropout regularization (20) of 0.8 to avoid overfitting and trained all parameters using the Adam optimizer (21). The training cohort was used to tune the performance of the model. We experimented with different learning rates (0.001, 0.005, 0.01), batch sizes (1, 2, 4, 8), and nodes per layer (200, 300, 500) and chose the configuration that achieved the highest accuracy using a cross-validation approach in which 20% of the training set was randomly held out and used for repeated runs.
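A sketch of this tuning loop under stated assumptions: X_train and y_train are assumed preprocessed interaction vectors and BOR labels (not shown), the epoch count of 60 follows the training-curve stabilization noted under Statistical Analysis, and the 0.8 dropout value is treated as a drop rate.

import itertools
import tensorflow as tf
from sklearn.model_selection import train_test_split

learning_rates = [0.001, 0.005, 0.01]
batch_sizes = [1, 2, 4, 8]
hidden_sizes = [200, 300, 500]

# Hold out a random 20% of the training cohort for scoring each configuration.
X_tr, X_val, y_tr, y_val = train_test_split(X_train, y_train, test_size=0.2)

best_config, best_acc = None, -1.0
for lr, bs, hidden in itertools.product(learning_rates, batch_sizes, hidden_sizes):
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(hidden, activation="tanh", input_shape=(X_tr.shape[1],)),
        tf.keras.layers.Dropout(0.8),  # dropout regularization as described
        tf.keras.layers.Dense(hidden, activation="tanh"),
        tf.keras.layers.Dropout(0.8),
        tf.keras.layers.Dense(3, activation="softmax"),  # CR/PR, SD, POD
    ])
    model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=lr),
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    model.fit(X_tr, y_tr, batch_size=bs, epochs=60, verbose=0)
    acc = model.evaluate(X_val, y_val, verbose=0)[1]
    if acc > best_acc:
        best_config, best_acc = (lr, bs, hidden), acc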

Manual Review

As an additional test of performance, we compared the results of our model to manual review. Real-world retrospective studies often rely on manual review of reports by trained data abstractors, which our manual review aimed to model. In the internal test cohort (n = 92), manual review of the text reports of imaging studies was conducted by a trained medical oncologist and clinical investigator (K.C. Arbour). As with the DL model, only radiology text reports were considered and compared with the RECIST determination of BOR and progression as the gold standard. No other parts of the medical record were reviewed to supplement the determination of outcomes. The investigator was blinded to the DL model results and to patient outcomes.

Statistical Analysis

No formal power calculations were performed to determine sample sizes. However, our training curves showed that the training loss for predicting BOR stabilized after epoch 60 and the training loss for predicting PFS stabilized after epoch 50 (Supplementary Fig. S2B and S2C). Kaplan–Meier methodology from the time of start of therapy was used to estimate overall survival and progression-free survival. Comparisons of survival curves were made using log-rank tests. Receiver operating characteristic (ROC) curves were used to visualize the trade-off between sensitivity and specificity. Statistical analyses were performed using Prism v8.0 (GraphPad Software) and Python v3.7.1 (Python Software Foundation, https://www.python.org/) with matplotlib. Scikit-learn was used to plot ROC curves and calculate the area under the ROC curve (AUC). Code and a reduced dataset are available at https://github.com/luoj2/mlRECIST.
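As a minimal sketch of the ROC/AUC computation with scikit-learn and matplotlib, where y_true holds RECIST-defined responder labels and y_score the model's predicted probability of response (both assumed arrays, not shown here):

import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve, roc_auc_score

fpr, tpr, _ = roc_curve(y_true, y_score)
auc = roc_auc_score(y_true, y_score)

plt.plot(fpr, tpr, label=f"DL model (AUC = {auc:.2f})")
plt.plot([0, 1], [0, 1], linestyle="--", color="grey")  # chance line
plt.xlabel("1 - Specificity")
plt.ylabel("Sensitivity")
plt.legend()
plt.show()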

K.C. Arbour reports personal fees from AstraZeneca and nonfinancial support from Takeda and Novartis outside the submitted work. J. Luo reports grants from NIH (T32-CA009207 and K30-UL1TR00457), Conquer Cancer Foundation of the American Society of Clinical Oncology (young investigator award), and personal fees from Targeted Oncology during the conduct of the study. K.B. Huang reports personal fees from Thrive Earlier Detection Corp. and Nextech Invest Ltd. outside the submitted work. S.R. Digumarthy reports other from Merck (independent image analyst for clinical trial through hospital), other from Pfizer (independent image analyst for clinical trial through hospital), Bristol Myers Squibb (independent image analyst for clinical trial through hospital), Roche (independent image analyst for clinical trial through hospital), Polaris (independent image analyst for clinical trial through hospital), Cascadian (independent image analyst for clinical trial through hospital), AbbVie (independent image analyst for clinical trial through hospital), Gradalis (independent image analyst for clinical trial through hospital), Clinical Bayer (independent image analyst for clinical trial through hospital), Zai Laboratories (independent image analyst for clinical trial through hospital), Care Jensen (independent image analyst for clinical trial through hospital); personal fees from Siemens (honorarium); and grants from Lunit Inc. (research support) outside the submitted work. M.G. Kris reports personal fees and other from AstraZeneca (travel) and Pfizer (travel); nonfinancial support and other from Genentech (editorial support); personal fees from Regeneron, and personal fees and other from Daiichi-Sankyo (travel) outside the submitted work. G.J. Riely reports grants from Roche/Chugai (institutional research award); grants and nonfinancial support from Pfizer (institutional research award as well as editorial support for report of the results of research studies), Novartis (institutional research award as well as editorial support for report of the results of research studies), Takeda (institutional research award as well as editorial support for report of the results of research studies), and Mirati (institutional research award as well as editorial support for report of the results of research studies) outside the submitted work. A. Yala reports personal fees from Janssen R&D outside the submitted work. J.F. Gainor reports grants and personal fees from Novartis; personal fees from Bristol Myers Squibb, Genentech, Takeda, Loxo/Lilly, Blueprint, Oncorus, Regeneron, Gilead, AstraZeneca, EMD Serono, Pfizer, Incyte, Merck, Agios, Amgen, Array, Mirati; and other from Ironwood (spouse - employee with equity) outside the submitted work. M.D. Hellmann reports grants, personal fees, and nonfinancial support from Bristol Myers Squibb; personal fees from Genentech/Roche, Nektar, Syndax and Mirati; personal fees and nonfinancial support from Merck, AstraZeneca, Shattuck Labs, Immunai; personal fees from Blueprint Medicines and Achilles; personal fees and nonfinancial support from Arcus; and nonfinancial support from Eli Lilly during the conduct of the study; in addition, M.D. Hellmann has a patent for PCT/US2015/062208 pending and licensed to PGDx. No potential conflicts of interest were disclosed by the other authors.

K.C. Arbour: Conceptualization, data curation, formal analysis, validation, investigation, visualization, methodology, writing-original draft, writing-review and editing. A.T. Luu: Conceptualization, data curation, software, formal analysis, validation, investigation, visualization, methodology, writing-original draft, writing-review and editing. J. Luo: Conceptualization, data curation, software, formal analysis, validation, investigation, visualization, methodology, writing-original draft, writing-review and editing. H. Rizvi: Conceptualization, data curation, investigation, visualization, writing-review and editing. A.J. Plodkowski: Data curation, investigation, writing-review and editing. M. Sakhi: Data curation, investigation, writing-review and editing. K.B. Huang: Data curation, investigation, writing-review and editing. S.R. Digumarthy: Data curation, investigation, writing-review and editing. M.S. Ginsberg: Data curation, investigation, writing-review and editing. J. Girshman: Data curation, investigation, writing-review and editing. M.G. Kris: Supervision, investigation, writing-review and editing. G.J. Riely: Supervision, investigation, writing-review and editing. A. Yala: Software, supervision, investigation, writing-review and editing. J.F. Gainor: Resources, formal analysis, supervision, validation, investigation, methodology, project administration, writing-review and editing. R. Barzilay: Conceptualization, resources, software, formal analysis, supervision, funding acquisition, validation, investigation, visualization, methodology, writing-original draft, project administration, writing-review and editing. M.D. Hellmann: Conceptualization, resources, formal analysis, supervision, funding acquisition, validation, investigation, visualization, methodology, writing-original draft, project administration, writing-review and editing.

This work was supported by Memorial Sloan Kettering Cancer Center Support Grant/Core grant no. P30-CA008748 and the Druckenmiller Center for Lung Cancer Research at Memorial Sloan Kettering Cancer Center. J. Luo is supported in part by T32-CA009207 and K30-UL1TR00457 grants from the NIH and the Conquer Cancer Foundation of the American Society of Clinical Oncology. M.D. Hellmann and R. Barzilay are supported under a collaboration by a Stand Up To Cancer Convergence Award, a program of the Entertainment Industry Foundation. M.D. Hellmann is a Damon Runyon Clinical Investigator supported in part by the Damon Runyon Cancer Research Foundation Grant no. CI-98-18 and is a member of the Parker Institute for Cancer Immunotherapy.

The costs of publication of this article were defrayed in part by the payment of page charges. This article must therefore be hereby marked advertisement in accordance with 18 U.S.C. Section 1734 solely to indicate this fact.

1. Murthy VH, Krumholz HM, Gross CP. Participation in cancer clinical trials: race-, sex-, and age-based disparities. JAMA 2004;291:2720–6.

2. Unger JM, Vaidya R, Hershman DL, Minasian LM, Fleury ME. Systematic review and meta-analysis of the magnitude of structural, clinical, and physician and patient barriers to cancer clinical trial participation. J Natl Cancer Inst 2019;111:245–55.

3. Singal G, Miller PG, Agarwala V, Li G, Kaushik G, Backenroth D, et al. Association of patient characteristics and tumor genomics with clinical outcomes among patients with non–small cell lung cancer using a clinicogenomic database. JAMA 2019;321:1391–9.

4. Khozin S, Abernethy AP, Nussbaum NC, Zhi J, Curtis MD, Tucker M, et al. Characteristics of real-world metastatic non-small cell lung cancer patients treated with nivolumab and pembrolizumab during the year following approval. Oncologist 2018;23:328–36.

5. Eisenhauer EA, Therasse P, Bogaerts J, Schwartz LH, Sargent D, Ford R, et al. New response evaluation criteria in solid tumours: revised RECIST guideline (version 1.1). Eur J Cancer 2009;45:228–47.

6. Rizvi H, Sanchez-Vega F, La K, Chatila W, Jonsson P, Halpenny D, et al. Molecular determinants of response to anti–programmed cell death (PD)-1 and anti–programmed death-ligand 1 (PD-L1) blockade in patients with non–small-cell lung cancer profiled with targeted next-generation sequencing. J Clin Oncol 2018;36:633–41.

7. Arbour KC, Mezquita L, Long N, Rizvi H, Auclin E, Ni A, et al. Impact of baseline steroids on efficacy of programmed cell death-1 and programmed death-ligand 1 blockade in patients with non-small-cell lung cancer. J Clin Oncol 2018;36:2872–8.

8. Gong Y, Kehl KL, Oxnard GR, Khozin S, Mishra-Kalyani PS, Blumenthal GM. Time to treatment discontinuation (TTD) as a pragmatic endpoint in metastatic non-small cell lung cancer (mNSCLC): a pooled analysis of 8 trials. J Clin Oncol 2018;36:9064.

9. Griffith SD, Miksad RA, Calkins G, You P, Lipitz NG, Bourla AB, et al. Characterizing the feasibility and performance of real-world tumor progression end points and their association with overall survival in a large advanced non–small-cell lung cancer data set. JCO Clin Cancer Inform 2019;3:1–13.

10. Real-world data (RWD) on tumor response (rwTR) in advanced non-small cell lung cancer (aNSCLC) patients receiving cancer immunotherapy and targeted therapies. J Clin Oncol 2018;36:15s (suppl; abstr 6578).

11. Chen MC, Ball RL, Yang L, Moradzadeh N, Chapman BE, Larson DB, et al. Deep learning to classify radiology free-text reports. Radiology 2018;286:845–52.

12. Lee K, Qadir A, Hasan SA, Datla V, Prakash A, Liu J, et al. Adverse drug event detection in tweets with semi-supervised convolutional neural networks. Republic and Canton of Geneva, Switzerland: International World Wide Web Conferences Steering Committee; 2017. p. 705–14.

13. Bozkurt S, Alkim E, Banerjee I, Rubin DL. Automated detection of measurements and their descriptors in radiology reports using a hybrid natural language processing algorithm. J Digit Imaging 2019;32:544–53.

14. Sevenster M, Bozeman J, Cowhy A, Trost W. A natural language processing pipeline for pairing measurements uniquely across free-text CT reports. J Biomed Inform 2015;53:36–48.

15. AACR Project GENIE Consortium. AACR Project GENIE: powering precision medicine through an international consortium. Cancer Discov 2017;7:818–31.

16. Kehl KL, Elmarakeby H, Nishino M, Allen EMV, Lepisto EM, Hassett MJ, et al. Assessment of deep natural language processing in ascertaining oncologic outcomes from radiology reports. JAMA Oncol 2019;5:1421–9.

17. Fujimoto D, Yoshioka H, Kataoka Y, Morimoto T, Hata T, Kim YH, et al. Pseudoprogression in previously treated patients with non-small cell lung cancer who received nivolumab monotherapy. J Thorac Oncol 2019;14:468–74.

18. Abadi M, Barham P, Chen J, Chen Z, Davis A, Dean J, et al. TensorFlow: a system for large-scale machine learning. 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI '16); 2016. p. 265–83.

19. Pennington J, Socher R, Manning CD. GloVe: global vectors for word representation. Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP); 2014. p. 1532–43.

20. Srivastava N, Hinton G, Krizhevsky A, Sutskever I, Salakhutdinov R. Dropout: a simple way to prevent neural networks from overfitting. J Mach Learn Res 2014;15:1929–58.

21. Kingma DP, Ba J. Adam: a method for stochastic optimization. arXiv 2017.