Abstract
To expand nasopharyngeal carcinoma (NPC) screening to larger populations, more practical NPC risk prediction models that do not depend on Epstein–Barr virus (EBV) testing or other lab tests are needed.
Patient data before diagnosis of NPC were collected from hospital electronic medical records (EMR) and used to develop machine learning (ML) models for NPC risk prediction using XGBoost. NPC risk factor distributions were generated through connection delta ratio (CDR) analysis of patient graphs. By combining EMR-wide ML with patient graph analysis, the number of variables in these risk models was reduced, allowing for more practical NPC risk prediction ML models.
Using data collected from 1,357 patients with NPC and 1,448 control patients, an optimal set of 100 variables (ov100) was determined for building NPC risk prediction ML models that had the following performance metrics: 0.93–0.96 recall, 0.80–0.92 precision, and 0.83–0.94 AUC. Aided by the analysis of top CDR-ranked risk factors, the models were further refined to contain only 20 practical variables (pv20), excluding EBV. The pv20 NPC risk XGBoost model achieved 0.79 recall, 0.94 precision, 0.96 specificity, and 0.87 AUC.
This study demonstrated the feasibility of developing practical NPC risk prediction models using EMR-wide ML and patient graph CDR analysis, without requiring EBV data. These models could enable broader implementation of NPC risk evaluation and screening recommendations for larger populations in urban community health centers and rural clinics.
These more practical NPC risk models could help increase NPC screening rates and identify more patients with early-stage NPC.
Introduction
Nasopharyngeal carcinoma (NPC) is a rare disease worldwide, but it is endemic in southern China, Southeast Asia, North Africa, and the Arctic (1). With the incidence rate of NPC reaching 20 to 40 per 100,000 person-years in southern China, screening for NPC early detection is becoming an important public health issue (2). To achieve effective and efficient NPC screening, identification of NPC risk factors and development of practical predictive models are crucial. Epstein–Barr virus (EBV) infection is a recognized risk factor but is not specific to NPC. Research has identified three EBV subtypes that are associated with high NPC risk (3). By combining the three EBV genetic variants with two human leukocyte antigen SNPs and four epidemiologic risk factors (smoking status, salted-fish consumption, educational level, and family history of NPC) in a logistic regression model, NPC risk prediction performance can be increased significantly (4).
Machine learning (ML) has also been applied to study a small number of biological markers in blood as risk factors affecting the overall survival of patients with NPC (5, 6). Hospital patient electronic medical records (EMR) contain thousands of health-related factors. By using new approaches such as ML and knowledge graphs (KG), EMR data can be effectively analyzed to build more practical risk prediction models for NPC early detection, as well as for precision medicine more generally.
Unlike traditional statistical risk model approaches, EMR-wide ML represents patients with all variables available in the EMR instead of only a limited number of preselected variables. Although the EMR-wide ML approach has not yet been demonstrated for modeling NPC risk, it has been used to model risk in other cancers. For example, one study used over 30,000 features in EMR data to build an XGBoost risk prediction model to identify patients at high risk of lung cancer (7). Another study used deep learning to represent 1.6 million patients with over 57,000 clinical concepts from EMR, which effectively enabled patient stratification at scale (8). Because these ML models require a large number of variables, they may be applicable in inpatient settings where the required patient data are available. However, for outpatient visits in rural or community clinics, fewer patient data are available, so a small number of variables for risk evaluation would be more practical.
Studies have shown that high-quality patient KGs can be constructed from EMR data using rudimentary concept extraction and that KGs can be used to predict diagnoses based on symptoms (9). Even though graphical representation of patient data shows promise for risk prediction, few studies have used KGs to model individual patients' disease diagnosis and treatment journeys (10). Furthermore, we found no published KG research using NPC patient data. Open-source graph databases and tools have facilitated KG development (11). For instance, a large-scale clinical KG was constructed from biomedical data to interpret proteomics data (12).
Most of the patients with NPC in this study already had late-stage disease upon initial presentation. Early-stage NPC detection is crucial for improving patient outcomes because the 5-year survival rate is high (>80%) for early-stage patients but low (<50%) for late-stage patients. One major challenge for NPC screening is that many individuals at high risk of NPC live in rural areas with limited access to healthcare. As a result, a low-cost risk prediction program that could be easily implemented in rural clinics at scale would facilitate NPC screening and the identification of NPC at earlier stages.
The current retrospective study aimed to build practical ML models from EMR data for NPC risk prediction in different clinical settings. We designed a new approach that combines EMR-wide ML with patient risk factor graph analysis to build more practical NPC risk prediction models without requiring EBV and other lab tests. These models could facilitate the implementation of large-scale risk-based NPC screening programs for early detection in both urban and rural areas.
Materials and Methods
EMR data de-identification
This retrospective study of EMR patient data was approved by the Institutional Review Board of the Guilin Medical University Affiliated Hospital (QTLL202139) and was conducted in accordance with the Declaration of Helsinki. Patient records in the hospital EMR from January 2018 to June 2021 were de-identified and provided on a secured data server managed by the hospital's informatics department. The dataset contained about 1 million patients and 7 million encounters (both inpatient and outpatient), from which private patient information such as name, birthday, contact details, and address had been removed. The original identifiers of patients and encounters were replaced with random numbers unrelated to the patients' identities. Before using the data, our research team members were trained on the hospital's patient data security and privacy policy.
Patient selection
Because the EMR data had no usable codes associated with the diagnoses, NPC synonyms were used to search for patients with NPC. A total of 1,357 patients with NPC aged ≥30 years were included in the target dataset, while 1,448 patients without NPC, also aged ≥30 years, were randomly selected as background (control) patients. Our researchers manually checked each patient's medical records and confirmed that all 1,357 patients with NPC had a final NPC diagnosis made by physicians.
Local standardized data collection
De-identified records of outpatient and inpatient visits, diagnoses, lab tests, and procedures were imported into a custom data collection tool on a secured data server. The tool automatically extracted lab test data and saved them to a database. Researchers manually selected data from text records and entered them into the database. Data were grouped into nine categories: disease and condition, symptom, medical history, observation, lab test, procedure, medication, treatment, and any other risk factor. Because the records were not coded and lacked standardization, practical rules were developed to improve consistency in the data collection process. Synonyms were automatically converted to local standard terms. For each patient with NPC, only data recorded before the final diagnosis of NPC were collected for studying NPC risk prediction, and a patient diagnosis journey (PDJ) object was created in the data tool containing one or more encounters leading to the final diagnosis. For each background patient, all encounters within the 3.5-year period were included. When exporting PDJ data to a CSV file for analysis, only the latest value of each health factor in the PDJ was selected.
Value conversion for patient graph
To simplify the patient health factor graph, continuous numeric data were converted to categorical data. For example, single-year ages were converted to categories: 30–50, 50–70, > 70 years old; alcohol: 0–2, > 2 drinks per day; smoking: 0, 1–20, > 20 cigarettes per day. Lab test results from the EMR were already in categories such as normal/abnormal, true/false, positive/negative, high/medium/low, up/down/normal, etc. After conversion, standardized data were saved in two CSV files for import: a patient import file and a factor import file. The patient import file contained one patient's data per line with the following format: virtual-id, NPC-label (1 for NPC, 0 for background), and factor-count. The number of background patients was reduced to the same number of patients with NPC (1,357). The factor import file contained data associated with one factor per line with the following format: virtual-id, category, code, term, value, unit, converted-value, and date. Every health factor observed among at least 10 patients with NPC was selected for graph analysis, resulting in a total of 254 health factors.
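As an illustration, this conversion step can be sketched with pandas. The column names below are hypothetical stand-ins for the local standard terms, and the right-closed bins resolve the overlapping boundary values in the text (e.g., age 50) in one consistent way.

```python
# A minimal sketch of the value-conversion step, assuming pandas and
# hypothetical column names (age, alcohol_drinks_per_day, cigarettes_per_day).
import pandas as pd

def convert_to_categories(df: pd.DataFrame) -> pd.DataFrame:
    """Convert continuous numeric fields to the categorical bins described in the text."""
    out = df.copy()
    # Age: 30-50, 50-70, >70 (right-closed intervals; include_lowest keeps age 30)
    out["age_cat"] = pd.cut(
        df["age"], bins=[30, 50, 70, float("inf")],
        labels=["30-50", "50-70", ">70"], include_lowest=True,
    )
    # Alcohol: 0-2, >2 drinks per day
    out["alcohol_cat"] = pd.cut(
        df["alcohol_drinks_per_day"], bins=[0, 2, float("inf")],
        labels=["0-2", ">2"], include_lowest=True,
    )
    # Smoking: 0, 1-20, >20 cigarettes per day
    out["smoking_cat"] = pd.cut(
        df["cigarettes_per_day"], bins=[-1, 0, 20, float("inf")],
        labels=["0", "1-20", ">20"],
    )
    return out
```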
Patient graph database
The Neo4j Desktop tool (v4.4), available freely from Neo4j Inc. (San Mateo, California), was used to perform patient graph analysis. In our patient-centric graph model, each patient was represented by a “Patient” node, while health factors of different categories were represented by nodes with different labels including Condition, Symptom, Observation, History, RiskFactor, Labtest, Procedure, Medication, and Treatment. Each label had specific constraints to ensure uniqueness. For example, the patient node required virtual-id while the factor node required a combination of category, code and converted-value as the node key. Data in the patient and factor import files were loaded into Neo4j graph database to construct patient factor graphs, which had 2,714 patient nodes (1,357 patients with NPC, 1,357 background patients), 334 factor nodes, and over 52,000 relations.
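A minimal sketch of this loading step, using the neo4j Python driver (v5 API), is shown below. The URI, credentials, and file names are placeholders; the import files are assumed to have no header row; and for brevity a single Factor label with a HAS_FACTOR relationship stands in for the per-category labels (Condition, Symptom, etc.) used in the study.

```python
# Load the patient and factor import files into Neo4j via MERGE, so repeated
# loads do not duplicate nodes. Connection details and file names are placeholders.
import csv
from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

def load_patients(tx, path):
    with open(path, newline="") as f:
        for virtual_id, npc_label, factor_count in csv.reader(f):
            tx.run(
                "MERGE (p:Patient {virtualId: $vid}) "
                "SET p.npcLabel = toInteger($label), p.factorCount = toInteger($count)",
                vid=virtual_id, label=npc_label, count=factor_count,
            )

def load_factors(tx, path):
    with open(path, newline="") as f:
        for vid, category, code, term, value, unit, converted, date in csv.reader(f):
            # Node key: (category, code, converted-value), as in the graph model above.
            tx.run(
                "MATCH (p:Patient {virtualId: $vid}) "
                "MERGE (f:Factor {category: $cat, code: $code, convertedValue: $cv}) "
                "SET f.term = $term "
                "MERGE (p)-[r:HAS_FACTOR]->(f) SET r.date = $date",
                vid=vid, cat=category, code=code, cv=converted, term=term, date=date,
            )

with driver.session() as session:
    session.execute_write(load_patients, "patients.csv")
    session.execute_write(load_factors, "factors.csv")
```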
Generation of NPC health factor distribution
We applied a new patient graph method for risk factor analysis, which was first developed with synthetic patient data and found to be capable of generating a distribution of disease health factors directly from EMR data. We developed a Python script to automatically query the patient factor graph with each of the NPC health factors. The connections from each factor to patients with NPC and to background patients in the search result graph were counted. For each factor, the delta of connection counts was calculated by subtracting the background patient connection count (BPC) from the target patient connection count (TPC). We defined the "connection delta ratio" (CDR) as (TPC − BPC)/(TPC + BPC), which represents the relative strength of connections between any given factor and the target NPC disease.
Sorting factors by CDR in descending order produced a distribution of NPC health factors from high to low relative strength. Factors with CDR greater than 0.3 and connected to more than 10 patients were selected for literature verification. Each factor was tagged as "confirmed" (a confirmed NPC risk factor), "correlated" (some correlation with NPC), "unrelated" (no correlation), or "unsure" (inconclusive correlation). If a factor was neither confirmed nor correlated with NPC, despite being associated with enough (>100) patients with NPC, we tagged it as "cdr-suggested," meaning the factor was suggested by CDR analysis for further study.
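To make the CDR calculation and the selection thresholds concrete, a minimal sketch follows. The factor names and connection counts are illustrative only; in practice they would come from the graph queries described above.

```python
# Illustrative per-factor connection counts: (TPC, BPC) = connections to NPC
# patients and to background patients. Not actual study data.
counts = {
    "ebv_positive": (420, 15),
    "neck_mass": (310, 22),
    "hypertension": (180, 195),
}

def cdr(tpc: int, bpc: int) -> float:
    """Connection delta ratio: (TPC - BPC) / (TPC + BPC)."""
    return (tpc - bpc) / (tpc + bpc)

# Rank factors by CDR, then apply the thresholds from the text:
# CDR > 0.3 and more than 10 connected patients (total connections used here).
ranked = sorted(counts.items(), key=lambda kv: cdr(*kv[1]), reverse=True)
selected = [(name, round(cdr(t, b), 3))
            for name, (t, b) in ranked
            if cdr(t, b) > 0.3 and (t + b) > 10]
print(selected)  # e.g., [('ebv_positive', 0.931), ('neck_mass', 0.867)]
```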
ML for NPC risk prediction models
The ML data table had 2,805 patients and 3,244 variables. To select features for ML, all variables were sorted by the number of patients with NPC in which they appeared. Codes associated with at least 20 patients with NPC were pooled, and from this pool different features were selected to form variable sets for different ML experiments. For studying NPC risk prediction, all cancer-related disease, procedure, medication, and treatment data were excluded. All diagnostic imaging procedures commonly performed on patients with cancer but not on background patients were also excluded. Similarly, all lab tests for cancer biomarkers as well as EBV were excluded because they were commonly done in patients with NPC but not in background patients, which would otherwise introduce bias into the data.
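A minimal sketch of this pooling and exclusion step is shown below, assuming a pandas table in which absent factors are NaN; the column names and exclusion prefixes are hypothetical.

```python
# Keep variables observed in at least 20 NPC patients, after excluding
# cancer-related and potentially biased columns. Names are illustrative.
import pandas as pd

df = pd.read_csv("ml_table.csv")              # hypothetical: 2,805 patients x 3,244 variables
npc = df[df["npc_label"] == 1]

excluded_prefixes = ("cancer_", "ebv_", "tumor_marker_")  # hypothetical naming scheme
candidates = [
    c for c in df.columns
    if c not in ("virtual_id", "npc_label")
    and not c.startswith(excluded_prefixes)
    and npc[c].notna().sum() >= 20            # present in >=20 NPC patients
]
```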
In developing ML models, we used the XGBoost Python library freely available from the xgboost.readthedocs.io website. XGBoost uses parallel tree boosting and handles missing data well (13). The Python library scikit-learn from scikit-learn.org was used to perform all other ML tasks. The free Jupyter Notebook tool was used to conduct ML experiments. The Pandas library was used to read/write CSV files and manipulate data tables. Using scikit-learn's train_test_split() function, the dataset was split into training (60%), tuning (20%), and test (20%) subsets in two steps. The test set was used only to evaluate model performance independently of the training and tuning sets. With default hyperparameters, the XGBoost classifier was fitted with the training and tuning sets and then tested independently on the test set. In making binary classifications, the scikit-learn prediction function converted the predicted probability to a binary label (0 or 1) via the default classification threshold of 0.5. Model performance was measured via recall (or sensitivity), precision, ROC-AUC, and accuracy. ROC and reliability curves were drawn by calling the corresponding scikit-learn functions.
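The two-step split and default-parameter training described here can be sketched as follows; the file name, stratification, and random seed are illustrative choices not specified in the text.

```python
# A minimal sketch of the 60/20/20 split in two steps, default XGBoost training,
# and the reported metrics, all under the default 0.5 classification threshold.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.metrics import recall_score, precision_score, roc_auc_score, accuracy_score
from xgboost import XGBClassifier

df = pd.read_csv("ml_table.csv")              # hypothetical consolidated data table
X = df.drop(columns=["virtual_id", "npc_label"])
y = df["npc_label"]

# Step 1: hold out 20% as the independent test set.
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=0.20, stratify=y, random_state=42)
# Step 2: split the remainder into training (60% overall) and tuning (20% overall).
X_train, X_tune, y_train, y_tune = train_test_split(
    X_rest, y_rest, test_size=0.25, stratify=y_rest, random_state=42)

model = XGBClassifier()                       # default hyperparameters, as in the study
model.fit(X_train, y_train, eval_set=[(X_tune, y_tune)], verbose=False)

y_pred = model.predict(X_test)                # default 0.5 threshold
y_prob = model.predict_proba(X_test)[:, 1]
print("recall   :", recall_score(y_test, y_pred))
print("precision:", precision_score(y_test, y_pred))
print("ROC-AUC  :", roc_auc_score(y_test, y_prob))
print("accuracy :", accuracy_score(y_test, y_pred))
```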
After determining an optimal set and a practical set of variables, XGBoost was compared with three other common ML algorithms: Random Forest (RF), Support Vector Machines (SVM), and K-Nearest Neighbors (KNN). These algorithms were run using scikit-learn classifiers with default parameters.
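A minimal sketch of this comparison, reusing the split from the previous sketch, is shown below; note that probability=True is enabled on SVC (a departure from pure defaults) so that ROC-AUC can be computed.

```python
# Compare XGBoost against RF, SVM, and KNN with scikit-learn default parameters.
# X_train, y_train, X_test, y_test come from the training sketch above.
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import recall_score, precision_score, roc_auc_score, accuracy_score
from xgboost import XGBClassifier

classifiers = {
    "XGBoost": XGBClassifier(),
    "RF": RandomForestClassifier(),
    "SVM": SVC(probability=True),   # non-default; enables predict_proba for ROC-AUC
    "KNN": KNeighborsClassifier(),
}
for name, clf in classifiers.items():
    clf.fit(X_train, y_train)
    pred = clf.predict(X_test)
    prob = clf.predict_proba(X_test)[:, 1]
    print(f"{name}: recall={recall_score(y_test, pred):.3f} "
          f"precision={precision_score(y_test, pred):.3f} "
          f"auc={roc_auc_score(y_test, prob):.3f} "
          f"acc={accuracy_score(y_test, pred):.3f}")
```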
To develop more practical models, smaller numbers of variables were selected and consolidated according to the NPC health factor distribution created above. The ML data table was also consolidated to build practical models.
For the final XGBoost models, additional performance metrics were calculated via confusion matrix: sensitivity, specificity, false positive rate, and false negative rate. The 95% confidence interval (CI) was determined via bootstrap with 1,000 resamples for each of the metrics.
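A minimal sketch of the bootstrap procedure for one metric follows; it resamples the test set with replacement 1,000 times and takes the 2.5th and 97.5th percentiles.

```python
# 95% bootstrap CI for a metric over the test set. y_test and y_pred come
# from the training sketch above.
import numpy as np
from sklearn.metrics import recall_score

def bootstrap_ci(y_true, y_pred, metric, n_resamples=1000, seed=0):
    """95% CI by resampling prediction/label pairs with replacement."""
    rng = np.random.default_rng(seed)
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    stats = []
    for _ in range(n_resamples):
        idx = rng.integers(0, len(y_true), len(y_true))
        stats.append(metric(y_true[idx], y_pred[idx]))
    return np.percentile(stats, [2.5, 97.5])

lo, hi = bootstrap_ci(y_test, y_pred, recall_score)
print(f"recall (sensitivity) 95% CI: ({lo:.3f}, {hi:.3f})")
```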
Data availability
The patient datasets used and/or analyzed during the current study are not available due to patient data privacy considerations. Other data without privacy concerns are available from the corresponding author upon reasonable request.
Results
Construction and search of NPC patient health factor graph
The basic characteristics of the patients with NPC were compared with those of the background patients (Supplementary Table S1). There were many more males (71.6%) than females (28.4%) in the NPC patient population. Many NPC-related symptoms occurred in patients with NPC but not in background patients.
To more systematically study all possible risk factors relevant to NPC, we applied a new patient graph analysis to NPC. The patient graph model (Supplementary Fig. S1) was designed to focus on the different categories of health factors associated with the patients. The Cypher query language was used to search the patient graph for patients with NPC in combination with any number of health factors. Some example graph search queries are listed in Supplementary Table S2. One example topology of patients with NPC with five co-occurring diseases is shown in Fig. 1. Example topologies resulting from searches on five non-lab factors and on five lab factors are shown in Supplementary Figs. S2 and S3, respectively. The patient graph analysis below used this type of topology in calculating patient–factor relationship strength. In the patient graph, connections between patients and factors were represented by lines, with a higher concentration of lines denoting stronger patient–factor relationships.
Example topology of an NPC patient graph with patient nodes (blue) and 5 co-occurring disease factor nodes (red). Each line connects an NPC patient to a patient's co-occurring disease. This topology shows how patients were distributed among a variety of diseases that might coexist in patients. The patient graph was generated by Cypher query #2 listed in Supplementary Table S2.
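For illustration, a search of this kind can be issued programmatically; the Cypher below is a hypothetical analogue of query #2, not the exact query from Supplementary Table S2, and the label, relationship, and property names follow the sketch of the graph model given in Materials and Methods.

```python
# Query NPC patients connected to a set of co-occurring diseases, via the
# neo4j Python driver (v5 API). Connection details are placeholders.
from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

query = """
MATCH (p:Patient {npcLabel: 1})-[:HAS_FACTOR]->(c:Condition)
WHERE c.term IN $diseases
RETURN p.virtualId AS patient, c.term AS disease
"""
diseases = ["Sinusitis", "Rhinitis", "Nasopharyngitis", "Pharyngitis", "Middle ear diseases"]

with driver.session() as session:
    for record in session.run(query, diseases=diseases):
        print(record["patient"], record["disease"])
```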
Distribution of NPC health factors from patient graph analysis
Using patient graphs containing equal numbers of patients with NPC and background patients, as well as factor-to-patient relations, we calculated the CDR to represent the relative connection strength of any given factor for patients with NPC versus background patients. Supplementary Table S3 shows the top CDR-ranked health factors by category, used to further understand known risk factors and identify potential new risk factors. The distribution curve of health factors by CDR is shown in Supplementary Fig. S4.
Because we were interested in NPC risk factors, the distribution excluded all diagnostic procedures and medications. Most of the top factors on this distribution list were verified by literature review to be either confirmed NPC risk factors or factors correlated with NPC (14, 15). They included positive lab results for several tumor biomarkers and EBV (16–20), commonly known NPC symptoms and observations, and nose- or throat-related diseases (21–23). However, the relationships of two factors with NPC were determined to be unsure: the alpha-2 antiplasmin functional assay and maximum systolic blood pressure. Another two factors, non–small cell lung cancer-associated antigen and lung nodules, were tagged "cdr-suggested" for future NPC–factor association study.
EMR-wide ML for NPC risk prediction models
Following the identification of NPC health factors from EMR patient graph analysis, we performed ML on a total of 75,000 data points collected from 1,357 patients with NPC and 1,448 background patients, using the XGBoost algorithm with default settings. We first built XGBoost NPC risk prediction models from varying numbers of non-lab health factors drawn from the pool of over 3,000 factors. The recall for NPC increased from 0.67 to 0.80 as the number of variables increased from 10 to 53 (Supplementary Table S4 and Supplementary Fig. S5).
We then added other health factors, including lab tests and observations from diagnostic procedures, to the variable sets. As the total number of variables reached 100 or more, the recall for NPC increased to over 0.93, while ROC-AUC and accuracy plateaued at about 0.93 (Table 1; Fig. 2). Based on our definition of an optimal variable set, i.e., one with fewer variables that still delivers high recall and precision, the set of 100 variables was determined to be optimal (ov100; see Supplementary Table S5).
Performance metrics of XGBoost models for NPC risk prediction with different numbers of variables.
| Number of variables^a | 10 | 20 | 30 | 50 | 100 | 163 |
| --- | --- | --- | --- | --- | --- | --- |
| Recall^b | 0.674 | 0.862 | 0.858 | 0.881 | 0.939 | 0.966 |
| Precision | 0.951 | 0.830 | 0.892 | 0.924 | 0.921 | 0.894 |
| ROC-AUC | 0.822 | 0.854 | 0.884 | 0.909 | 0.934 | 0.933 |
| Accuracy | 0.832 | 0.854 | 0.886 | 0.911 | 0.934 | 0.930 |
^a Variable categories: disease, symptom, medical history, risk factor, lab test, and observation.
^b Models were trained using default settings. Prediction for evaluating model performance used the default classification threshold of 0.5 for converting probability to binary labels.
Trends of NPC risk prediction XGBoost model performance as the number of variables in all categories increases. The underlying data are shown in Table 1. Key performance metrics: recall, precision, and ROC-AUC. Models were trained with default settings. Prediction for evaluating model performance used the default classification threshold of 0.5 for converting probability to binary labels.
Using the ov100 variable set, the XGBoost algorithm was compared with other common algorithms, including RF, SVM, and KNN. As shown in Table 2, RF had the highest recall (0.969), while XGBoost had the highest precision (0.921). SVM performed similarly to RF and XGBoost; KNN showed the poorest performance. Thus, the XGBoost, RF, and SVM ov100 models all demonstrated high performance for NPC risk prediction, with a recall of 0.93–0.96, precision of 0.88–0.92, ROC-AUC of 0.92–0.94, and accuracy of 0.91–0.93 (Supplementary Fig. S6). The ROC curve and the reliability curve of the ov100 XGBoost model are shown in Fig. 3A and B, respectively. With bootstrapped CIs, the performance metrics for the ov100 XGBoost model were: 0.955 (0.923, 0.973) recall or sensitivity, 0.895 (0.871, 0.935) precision, 0.928 (0.918, 0.939) ROC-AUC, 0.927 (0.916, 0.939) accuracy, and 0.902 (0.877, 0.943) specificity (Supplementary Table S6).
Performance comparison for NPC risk prediction ML models of different algorithms using the optimal or practical set of variables.
| Algorithm^a | XGBoost | RF | SVM | KNN |
| --- | --- | --- | --- | --- |
| Using the optimal set of 100 variables (ov100): | | | | |
| Recall^b | 0.939 | 0.969 | 0.954 | 0.900 |
| Precision | 0.921 | 0.907 | 0.880 | 0.768 |
| ROC-AUC | 0.934 | 0.941 | 0.920 | 0.832 |
| Accuracy | 0.934 | 0.939 | 0.918 | 0.827 |
| Using the practical set of 20 variables (pv20): | | | | |
| Recall | 0.797 | 0.789 | 0.816 | 0.774 |
| Precision | 0.945 | 0.941 | 0.934 | 0.944 |
| ROC-AUC | 0.878 | 0.873 | 0.883 | 0.867 |
| Accuracy | 0.884 | 0.879 | 0.888 | 0.873 |
^a Algorithms: XGBoost, RF, SVM, and KNN.
^b Models were trained using default settings. Prediction for evaluating model performance used the default classification threshold of 0.5 for converting probability to binary labels.
ROC and reliability curves of XGBoost models for NPC risk prediction. Models were trained using default settings. A, ROC curve using the optimal 100 variables (ov100). B, Reliability curve using the ov100 variables. C, ROC curve using the practical 20 variables (pv20). D, Reliability curve using the pv20 variables.
Practical NPC risk prediction models
Because the ov100 model requires lab tests and diagnostic procedural observations, it may not be practical for implementation in outpatient settings such as community or rural clinics. We therefore selected 35 non-lab variables, according to the health factor distribution created above, to build a smaller XGBoost model, which still had acceptable performance: recall of 0.78, precision of 0.95, ROC-AUC of 0.86, and accuracy of 0.88. In particular, the EBV lab test was excluded to reduce dependency on lab tests. Further consolidation of related factors resulted in a practical set of 20 variables (pv20), as listed in Table 3.
Variables in the practical variable set (pv20) for the practical NPC risk prediction ML model.
| Category | Variable local code | Variable term |
| --- | --- | --- |
| Observation | C-nasal-sinus-cyst | Nasal sinus cyst |
| Disease | C-sinusitis | Sinusitis |
| Disease | C-rhinitis | Rhinitis |
| Disease | C-nasopharyngitis | Nasopharyngitis |
| Disease | C-pharyngitis | Pharyngitis |
| Disease | C-middle-ear-dac | Middle ear diseases |
| History | C-middle-ear-repair | History of middle ear repair |
| Observation | C-lung-mass | Lung mass |
| Symptom | C-nose-smp | Nose symptoms |
| Symptom | C-nose-bleed | Nose bleeding |
| Observation | C-nose-mass | Nose mass |
| Symptom | C-ear-smp | Ear symptoms |
| Symptom | C-eye-smp | Eye symptoms |
| Symptom | C-cough-blood | Coughing blood |
| Symptom | C-neck-mass | Neck mass |
| Symptom | C-neck-pain | Neck pain |
| Symptom | C-voice-smp | Voice symptoms |
| Symptom | C-face-smp | Face symptoms |
| Symptom | C-head-smp | Headache |
| Symptom | C-throat-smp | Throat symptoms |
For comparison, pv20 models were built using the XGBoost, RF, SVM, and KNN algorithms. As shown in Table 2, the pv20 XGBoost, RF, and SVM models performed similarly (recall 0.79–0.82, precision 0.93–0.95, ROC-AUC 0.87–0.88, accuracy 0.88–0.89). The ROC curve and the reliability curve of the XGBoost prediction model with pv20 are shown in Fig. 3C and D. The pv20 XGBoost model's complete metrics with 95% CIs generated via bootstrap were 0.789 (0.766, 0.805) recall or sensitivity, 0.941 (0.929, 0.950) precision, 0.873 (0.863, 0.881) ROC-AUC, 0.879 (0.870, 0.886) accuracy, and 0.957 (0.947, 0.963) specificity (Supplementary Table S6).
Discussion
By using a new strategy of combining EMR-wide ML with patient graph analysis, this retrospective study was able to develop an optimal set of 100 variables for NPC risk prediction ML models suitable for inpatient settings, and more importantly, a more practical set of 20 variables, for which data could be readily obtained in outpatient settings. The NPC pv20 risk prediction XGBoost model had key metrics of 0.789 recall, 0.941 precision, 0.873 ROC-AUC, 0.879 accuracy, and 0.957 specificity. These risk models could enable significant improvements in NPC risk evaluation, screening and early detection.
Current NPC population screening programs, such as those used in China, usually involve an EBV blood test and genetic tests (4). To improve the accessibility of NPC risk prediction, the pv20 model intentionally excludes EBV evaluation and only includes factors that can be easily obtained through a patient questionnaire. This simplicity makes it possible for the pv20 model to be deployed for NPC screening and early detection in larger populations.
The data-driven analysis of patient graphs revealed two health factors that have not yet been confirmed as NPC risk factors in the literature. The factors "non–small cell lung cancer-associated antigen" and "lung nodules" had >100 connections with NPC patient nodes on the graph and were assigned "cdr-suggested" status for future study. The selection of "lung nodules" as a variable for the pv20 model was unexpected. Because lung nodules are common, physicians at community or rural clinics may not pay enough attention to their link to NPC. Another surprising variable in the pv20 risk model was middle ear issues. Including lung nodules and middle ear issues in the practical NPC risk model will help alert physicians to their association with NPC risk. All other variables in the pv20 model were unsurprising because they were associated with diseases or symptoms of the nose, sinus, nasopharynx, throat, eyes, face, head, or neck.
In addition to its relevance to traditional population-based screening operations, the practical NPC pv20 model could be embedded into the routine clinical workflow of a learning health system (LHS) to identify at-risk patients and provide recommendations for NPC clinical screening. The LHS is a concept created by the US National Academy of Medicine for healthcare system transformation (24, 25). An LHS for screening could continuously learn from new screening data and improve the pv20 model's risk prediction over time. Furthermore, hospitals and clinics could collaborate within a clinical research network (CRN), further strengthening new screening program efforts.
Our previous research has established the feasibility of developing a scalable CRN with rural clinics participating on a digital platform (26). The current study has developed another key component: an NPC risk prediction model whose required input data can be obtained from rural clinics for immediate risk prediction and screening recommendations. In future studies of a CRN-LHS for NPC early detection, leading hospitals should provide not only a risk prediction tool based on the pv20 model but also training to primary care clinics in the CRN. Such an LHS would enable rural clinicians to predict NPC risk for their patients and significantly improve the early detection of NPC.
Collecting standardized data from EMRs is key to conducting ML research. However, most EMRs, particularly in developing countries, may lack standard codes and structured data. Given the variations in patient data due to differences in clinical services, hospitals, and EMR systems, special attention should be given to data biases and missing data. EMR-wide ML is a data-centric approach with the advantage of being able to build high-performance predictive models from "messy" EMR data. However, this method also has certain limitations. First, it requires collecting data on all health factors existing in the EMR, which can be time-consuming if data collection is not fully automated. Second, if data bias and missing data issues are not handled correctly, the resulting models may be unreliable. Third, using EMR-wide data may introduce location-specific bias, limiting the models' applicability to different hospitals.
This study handled the problem of missing data using the following techniques (a sketch of the reliability check follows this list):

(1) An empirical feature selection process to find the optimal number of variables, i.e., models that achieve higher performance metrics with fewer features, by testing different models with different sets of variables.

(2) An algorithm, XGBoost, that tolerates extensive missing data, compared against other algorithms such as RF and SVM to select the best one.

(3) A reliability curve generated to check for and help prevent overfitting.
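A minimal sketch of the reliability check in item (3), assuming the fitted model and test split from the earlier training sketch:

```python
# Draw a reliability (calibration) curve with scikit-learn: predicted
# probability vs. observed fraction of positives, against the diagonal.
import matplotlib.pyplot as plt
from sklearn.calibration import calibration_curve

prob = model.predict_proba(X_test)[:, 1]
frac_pos, mean_pred = calibration_curve(y_test, prob, n_bins=10)

plt.plot(mean_pred, frac_pos, marker="o", label="XGBoost model")
plt.plot([0, 1], [0, 1], linestyle="--", label="perfectly calibrated")
plt.xlabel("Mean predicted probability")
plt.ylabel("Fraction of positives")
plt.legend()
plt.show()
```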
As demonstrated in this study, combining EMR-wide ML with patient graph CDR analysis is an effective new approach, potentially applicable to other diseases as well, for developing practical ML models without preselecting a small number of known risk factors. This approach has two steps: (1) build a large model with as many variables as possible from EMR-wide data, and (2) use a distribution of risk factors generated from EMR patient graph CDR analysis to trim the number of variables to a small set practical for implementation in lower-resource settings. However, caution should be used when a factor has a high CDR but only a small number of factor-to-patient connections: the smaller the number of connections, the less reliable the factor.
In conclusion, the new approach of EMR-wide ML in combination with EMR patient risk factor graph analysis was effective in developing NPC risk prediction models. While the resulting NPC ov100 XGBoost model would be suited for identifying high-risk patient populations in hospitals, the NPC pv20 XGBoost model would be more practical for implementing large-scale screening programs for NPC early detection than existing EBV-required risk models.
Authors' Disclosures
No disclosures were reported.
Authors' Contributions
A. Chen: Conceptualization, methodology, writing–original draft. R. Lu: Data curation. R. Han: Data curation. R. Huang: Software, formal analysis. G. Qin: Data curation. J. Wen: Supervision, project administration. Q. Li: Supervision, funding acquisition. Z. Zhang: Supervision, funding acquisition. W. Jiang: Supervision, funding acquisition, investigation.
Acknowledgments
The authors would like to thank Xiaowang Chen of the Department of Medical Information at the Guilin Medical University Affiliated Hospital for supporting the EMR data server and providing privacy training.
This work was supported by Guilin Municipal Science and Technology Bureau, China (grant number 20190219–2) to Q. Li, Sichuan Provincial Science and Technology Program, China (grant number 2020YFQ0019) to A. Chen, the Natural Science Foundation Key Projects of Guangxi (grant number 2018GXNSFDA050021) to W. Jiang, and the National Natural Science Foundation of China (grant number 82160479) to W. Jiang.
The publication costs of this article were defrayed in part by the payment of publication fees. Therefore, and solely to indicate this fact, this article is hereby marked “advertisement” in accordance with 18 USC section 1734.
Note: Supplementary data for this article are available at Cancer Epidemiology, Biomarkers & Prevention Online (http://cebp.aacrjournals.org/).