Background:

There is tremendous potential to leverage the value gained from integrating electronic health records (EHR) and population-based cancer registry data for research. Registries provide diagnosis details, tumor characteristics, and treatment summaries, while EHRs contain rich clinical detail. A carefully conducted cancer registry linkage may also be used to improve the internal and external validity of inferences made from EHR-based studies.

Methods:

We linked the EHRs of a large, multispecialty, mixed-payer health care system with the statewide cancer registry and assessed the validity of our linked population. For internal validity, we identified patients who might be “missed” in a linkage, threatening the internal validity of an EHR study population. For generalizability, we compared linked cases with all other cancer patients in the 22-county EHR catchment region.

Results:

From an EHR population of 4.5 million, we identified 306,554 patients with cancer, 26% of the catchment region patients with cancer; 22.7% of linked patients were diagnosed with cancer after they migrated away from our health care system, highlighting an advantage of system-wide linkage. We observed demographic differences between EHR patients and non-EHR patients in the surrounding region and demonstrated use of selection probabilities with model-based standardization to improve generalizability.

Conclusions:

Our experiences set the foundation to encourage and inform researchers interested in working with EHRs for cancer research as well as provide context for leveraging linkages to assess and improve validity and generalizability.

Impact:

Researchers conducting linkages may benefit from considering one or more of these approaches to establish and evaluate the validity of their EHR-based populations.

See all articles in this CEBP Focus section, “Modernizing Population Science.”

Cancer research using electronic health records (EHR) may provide important advantages over the use of cancer registries alone for cancer population health research (1–4). EHRs collect longitudinal data related to health behaviors (e.g., BMI, current smoking status), and preventive care services (5). Appropriate use of EHRs for research can facilitate development of longitudinal studies of environmental or behavioral risk factors, or cancer outcomes after routine screening (6–12). However, EHRs are often hampered by the lack of definitive cancer details in coded fields (13, 14). Using only EHRs, accurate identification of cancer cases can be difficult, and important characteristics needed to describe a cancer population are often in scanned documents or free-text notes (15, 16). A cancer registry linkage, which uses identifying information to match EHR patients with the registry, is a solution for researchers to identify patients with cancer and obtain their definitive tumor characteristics (17). Population-based cancer registries are mandated by federal and state law (18), and collect data uniformly on a defined catchment population, while EHRs only collect data reflecting patient care and billing. Registries consolidate information for a given case from multiple sources, and the important data items are also frequently cleaned, adjudicated, and carefully prepared for surveillance activities (19, 20).

Registry linkages can be costly and time consuming, and the mechanics of the linkage are often designed with both in mind. “Targeted” linkages begin with an EHR data mining step, designed to identify any potential cancer patients by searching for relevant codes (i.e., cancer-specific ICD or CPT codes) and creating a list of suspected patients with cancer for linkage (21). Such a targeted search strategy is sufficient for many types of research, and it may be especially reliable for identifying cancers in a single-payer health care provider or when patient migration for cancer treatment is unlikely. In a provider environment in which patients are free to seek care across multiple health systems, a targeted linkage may be insufficient. For example, in a study designed to evaluate the relationship between routine cancer screening and downstream cancer outcomes, a patient migrating to a new system between the screening and the cancer diagnosis could be overlooked by a targeted linkage; this patient would be misclassified as cancer-free in an analysis. An alternative to a targeted linkage is a system-wide linkage, in which all EHR patients are matched with the registry, regardless of evidence of cancer in their EHRs, which should result in the ascertainment of most, if not all, cancer cases for an entire health care system. The system-wide approach additionally serves as a rigorous method for identifying confirmed noncancer cases, that is, for the selection of controls in case–control studies.
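The difference between the two linkage designs can be illustrated with a minimal sketch. The patient identifiers and set names below are hypothetical and are not drawn from the study; the point is only that a targeted linkage matches the code-flagged subset of EHR patients, whereas a system-wide linkage matches everyone.

```python
# Hypothetical patient identifiers; none of these values come from the study.
ehr_patients = {"p1", "p2", "p3", "p4"}   # every patient in the EHR
ehr_flagged = {"p1"}                      # EHR contains a cancer-related code
registry_cases = {"p1", "p3", "p9"}       # registry-confirmed cancer cases


def targeted_linkage(flagged, registry):
    """Match only patients whose EHR already suggests cancer."""
    return flagged & registry


def system_wide_linkage(all_patients, registry):
    """Match every EHR patient, regardless of EHR evidence of cancer."""
    return all_patients & registry


targeted = targeted_linkage(ehr_flagged, registry_cases)
system_wide = system_wide_linkage(ehr_patients, registry_cases)

# Cases a targeted linkage would leave misclassified as cancer-free.
missed_by_targeted = system_wide - targeted
# EHR patients with no registry record: candidate controls.
confirmed_cancer_free = ehr_patients - system_wide
```

In this toy example, patient `p3` (for instance, someone diagnosed after moving to another system) is found only by the system-wide approach.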

EHRs are an example of “real world data” (i.e., observational health care data initially collected for nonresearch purposes; refs. 22–24) and research use of such data risks numerous potential threats to validity (25–31). Two such threats are: (i) bias due to systematic exclusion of eligible subjects in an EHR population, and (ii) bias due to limited generalizability of the EHR to the source population. Broadly, both of these can be viewed as possible sources of selection bias (32), but a key distinction is one of internal versus external validity. Systematic exclusion of eligible patients in an EHR population is a threat to internal validity, that is, the inference made from the study results may deviate from the true relationship in the EHR population. Inability to generalize the study results to a target population due to nonrandom sampling is a threat to external validity, that is, even if the association derived from the EHR population is internally valid, the inference will not be generalizable to the source population because EHR-based populations in the U.S. health care system are usually a convenience sample of persons who happen to go to a specific place for health care. When known, sampling fractions are routinely used to improve the generalizability of study findings (33). For example, if researchers know that their study population differs from the source population by a specific demographic characteristic, they may standardize their results to the demographic distribution (32). If the selection factors are related to a vector of characteristics, model-based standardization may be used to reweight analyses to the multivariate covariate distribution of the source population (32, 34).
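As a simple illustration of standardizing to a single demographic characteristic, the sketch below reweights stratum-specific proportions from a study population to the stratum distribution of a source population (direct standardization). All numbers and names are hypothetical and chosen only to make the arithmetic easy to follow.

```python
def standardized_proportion(stratum_proportions, reference_counts):
    """Average stratum-specific proportions using a reference population's stratum shares."""
    total = sum(reference_counts.values())
    return sum(
        stratum_proportions[stratum] * reference_counts[stratum] / total
        for stratum in stratum_proportions
    )


# Hypothetical proportion of late-stage diagnoses by age group in an EHR sample.
late_stage_by_age = {"18-64": 0.30, "65+": 0.50}
ehr_counts = {"18-64": 700, "65+": 300}      # EHR sample skews younger
source_counts = {"18-64": 500, "65+": 500}   # source population is balanced

crude = standardized_proportion(late_stage_by_age, ehr_counts)            # about 0.36
standardized = standardized_proportion(late_stage_by_age, source_counts)  # about 0.40
```

Because the hypothetical EHR sample under-represents the older stratum, the crude proportion understates what would be expected in the source population.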

In this article, we describe a system-wide cancer registry linkage undertaken for the adult patient population of a large multispecialty, mixed-payer health care delivery system located in Northern California. Using a validation sample, we demonstrate what was gained/lost using the system-wide approach over a targeted linkage as a check of internal validity of this approach. We also evaluate the external validity of the linked cancer cohort by comparing it to the surrounding catchment region. Finally, we demonstrate the use of model-based standardization to improve generalizability.

EHR study population

Sutter Health is a large multispecialty health care system serving 22 northern California counties, with more than 4 million patients and 10 million outpatient visits per year. The patient population is diverse: 10% Hispanic, 19% Asian American, Native Hawaiian and Pacific Islander (AANHPI), 21% Black, and a payer mix of 42% PPO, 30% HMO, 23% Medicare/Medicaid, 3% self-payers, and 2% other payers. The EpicCare EHR system (Epic Systems Corporation) is used to collect details of all patient encounters, including laboratory results, procedures, medication orders, diagnoses, immunizations, radiologic reports, and routine testing, as well as demographics, medical and surgical history, and additional transactional detail about care utilization (dates and times, providers seen, etc.).

California Cancer Registry

The California Cancer Registry (CCR) is the National Cancer Institute (NCI)–funded Surveillance, Epidemiology, and End Results (SEER) Program statewide cancer registry. The CCR monitors the occurrence of all types of cancer (excluding non-melanoma skin cancers) in California, including both new diagnoses and deaths. The CCR includes detailed demographic information, tumor characteristics, and specific details of the first course of treatment for all individual cancer cases occurring in California since 1988. The CCR also geocodes all patients with cancer based on home address at time of cancer diagnosis and ascertains follow-up information for long-term survival. The CCR includes data on 5.8 million cancer cases overall, adding 190,000 new cases per year.

System-wide registry linkage

For the system-wide EHR-CCR linkage, we compiled two “finder files” for comparison. From the EHR, we extracted identifying information for all unique adult (≥18 years old) patients in the EHR (regardless of cancer history). From the CCR, we included all unique individuals diagnosed with cancer in the state of California between 1988 and 2013 (based on the availability of data in the CCR at the time of the study activities).

We used LinkPlus software to conduct a probabilistic linkage of the finder files based on patient name, last 4 digits of the social security number, birthdate, and sex. LinkPlus returns a score, based on the probability of the linkage being a match. On the basis of a test run of 15,700 randomly sampled matches, we selected a cut-off score of 21.5 and above as a match, which corresponded with a probability of 99.6%. We manually reviewed matches to determine match status for scores between 21.1 and 21.4 (true match % between 65.1% and 80.4%). Scores below 21.1 were not considered as matches. After linkage, we retained three files for further analyses: (i) the EHR cancer patients, that is, the linked patients and their tumor details, (ii) the EHR cancer-free patients, and (iii) the non-EHR cancer patients, that is, the demographic and tumor details for all nonlinked patients with cancer who were residing in the 22-county Sutter Health catchment region of Northern California at the time of their diagnosis; the third group was explicitly obtained for establishing the external validity of our EHR cancer population (Fig. 1).
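The score-based decision rule described above can be sketched as follows. This does not reproduce LinkPlus or its scoring; it only applies the reported thresholds to a score that is assumed to have been computed already. The function and constant names are hypothetical, and scores are assumed to be reported to one decimal place, so the gap between 21.4 and 21.5 is treated as part of the manual-review band.

```python
MATCH_CUTOFF = 21.5   # scores at or above this were accepted as matches
REVIEW_FLOOR = 21.1   # scores from here up to the cutoff were manually reviewed


def classify_score(score):
    """Assign a linkage score to one of the three handling categories."""
    if score >= MATCH_CUTOFF:
        return "match"
    if score >= REVIEW_FLOOR:
        return "manual_review"
    return "non_match"


# Hypothetical scores for three candidate record pairs.
decisions = [classify_score(s) for s in (25.0, 21.3, 18.7)]
```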

Validation study

A validation study was undertaken to assess: (i) the added benefit of the system-wide (vs. a targeted) linkage approach for identifying patients with cancer (“internal validity”) and (ii) the generalizability of the EHR cancer cohort established with the linkage (“external validity”). For the validation study, we included only first tumors (excluding subsequent primary tumors) of the lung/bronchus, colon/rectum, female breast, and prostate in the catchment region diagnosed in 2012 or 2013 (Fig. 2), with the sites chosen based on cancers of higher prevalence and years chosen based on availability of EHR data at all five medical foundations.

For the internal validity study, we used the EHR-matched portion of the validation study and searched all available date ranges in the following EHR tables: medical history, problem list, charges, encounters, medication orders, and laboratory orders for evidence of ICD9 codes that pertain to malignant tumors of the lung/bronchus (162.0–162.9), colon/rectum (153.0–153.9, 154.0–154.1), female breast (174.0–174.9), prostate (185), or unspecified site (198.81–198.82, 198.89, 199.0–199.1), carcinoma in situ (230.3–230.4, 233.0–233.4, 234.9), or history of cancer (V10.00, V10.06, V10.09, V10.11, V10.3, V10.6). This allowed us to identify two subsets of the EHR cancer patients: (i) those with evidence of cancer in their EHR, and (ii) those with no evidence of cancer in their EHR. The latter group we presume would not have been identified in a targeted linkage. We compared these groups' demographic and tumor characteristics. For EHR cancer patients who did not have evidence of cancer in their EHRs, we further stratified the population by the timing of their EHR encounters in relationship to their CCR-provided cancer diagnosis date, and evidence of a primary care provider (PCP) assignment. We used the following CCR-provided variables for comparing between the three groups: patient sex, age group (18–54, 55–64, 65–74, 75–84, 85+), race/ethnicity (non-Hispanic White, non-Hispanic Black, Hispanic, Asian American/Pacific Islander, Native American), quintiles of neighborhood socioeconomic status (nSES) derived at the block-group level (35), cancer stage (in situ, localized, regional or remote), primary patient insurance (private, Medicare, any public/Medicaid/military), and marital status (married/registered, single/divorced/widowed).
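The code search described above can be approximated with a small lookup. The sketch below is illustrative only: the ranges and history codes are copied from the text, but the function names, the treatment of range endpoints as inclusive, and the assumption that codes arrive as dotted strings (e.g., `"162.9"`) are ours.

```python
from decimal import Decimal, InvalidOperation

# ICD-9 ranges as listed in the text; endpoints are treated as inclusive.
CANCER_RANGES = {
    "lung/bronchus": [("162.0", "162.9")],
    "colon/rectum": [("153.0", "153.9"), ("154.0", "154.1")],
    "female breast": [("174.0", "174.9")],
    "prostate": [("185", "185")],
    "unspecified site": [("198.81", "198.82"), ("198.89", "198.89"), ("199.0", "199.1")],
    "carcinoma in situ": [("230.3", "230.4"), ("233.0", "233.4"), ("234.9", "234.9")],
}
HISTORY_CODES = {"V10.00", "V10.06", "V10.09", "V10.11", "V10.3", "V10.6"}


def cancer_evidence(code):
    """Return the matching category for one ICD-9 code, or None if there is none."""
    code = code.strip().upper()
    if code in HISTORY_CODES:
        return "history of cancer"
    try:
        value = Decimal(code)
    except InvalidOperation:
        return None  # other V/E codes and malformed strings
    if not value.is_finite():
        return None
    for category, ranges in CANCER_RANGES.items():
        for low, high in ranges:
            if Decimal(low) <= value <= Decimal(high):
                return category
    return None


def has_cancer_evidence(codes):
    """True if any code in a patient's record falls in a listed range."""
    return any(cancer_evidence(code) is not None for code in codes)
```

A linked patient for whom `has_cancer_evidence` is false across all searched tables would fall into the "no evidence of cancer in their EHR" subset.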

For the external validity study, we redefined the EHR cancer patient population to include only patients who had EHR evidence of care during their cancer episode (defined as 90 days prior to and up to 365 days after the CCR-provided date of cancer diagnosis). EHR cancer patients who did not have any care during this timeframe were reclassified as non-EHR cancer patients (Fig. 2). By additionally including all nonlinked catchment region patients with cancer in the non-EHR cancer patients, the total external validation sample is thus equivalent to the underlying source population (patients with cancer diagnosed with first primary sites of interest, in 2012–2013, living in 22 Northern CA counties). We compared the two populations (EHR and non-EHR) by county, and according to the characteristics described above, and we calculated ratios of proportion of EHR to non-EHR patients by category, with 95% confidence intervals.
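A minimal sketch of the two calculations described above follows. The episode window uses the stated 90 and 365 days, with boundaries assumed to be inclusive. The interval for the ratio of proportions uses a log-scale Wald approximation, which is an assumption on our part since the text does not state how the 95% confidence intervals were obtained. All function names and inputs are hypothetical.

```python
import math
from datetime import date, timedelta


def in_cancer_episode(encounter, diagnosis, days_before=90, days_after=365):
    """True if an encounter falls within the episode window around diagnosis."""
    start = diagnosis - timedelta(days=days_before)
    end = diagnosis + timedelta(days=days_after)
    return start <= encounter <= end


def classify_patient(encounters, diagnosis):
    """Linked patients with no care during the episode are treated as non-EHR."""
    if any(in_cancer_episode(e, diagnosis) for e in encounters):
        return "EHR"
    return "non-EHR"


def proportion_ratio(a, n_ehr, b, n_non_ehr, z=1.96):
    """Ratio of proportions (a/n_ehr) / (b/n_non_ehr) with a log-scale Wald interval."""
    ratio = (a / n_ehr) / (b / n_non_ehr)
    se_log = math.sqrt(1 / a - 1 / n_ehr + 1 / b - 1 / n_non_ehr)
    return ratio, ratio * math.exp(-z * se_log), ratio * math.exp(z * se_log)


# Hypothetical patient: one visit 60 days before diagnosis.
status = classify_patient([date(2012, 1, 1)], date(2012, 3, 1))
# Hypothetical counts: 50 of 100 EHR patients vs. 25 of 100 non-EHR patients.
pr, lower, upper = proportion_ratio(50, 100, 25, 100)
```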

Model-based standardization

To demonstrate the use of model-based standardization to reweight the EHR population based on the covariate distribution of the source population, we used inverse probability of selection weighting (IPSW) to adjust model-based estimates for the relationship between later stage at diagnosis (defined as regional/remote vs. in situ/localized) and four exposure variables: nSES quintile, patient race/ethnicity, marital status (married vs. not married) and any public insurance type (vs. private insurance or Medicare). For this demonstration, we excluded unknown values for covariates and patients with very rare covariate values. The assumed data generating mechanism for the relationships modeled, including selection into the EHR population, is depicted in a directed acyclic graph (DAG; Fig. 3). DAGs are nonparametric probabilistic diagrams that depict presumed causal relationships and can be used to identify “biasing pathways” that inhibit valid causal inference and to select variables for confounding control (36). In Fig. 3, bias occurs because the selection node, which is predicted by the exposure, outcome, and other covariates in the causal model, is a “collider variable,” and conditioning on a collider creates “collider stratification bias” (37). This type of bias can be mitigated by adjusting for any covariates that also predict selection, but if the exposure and outcome also predict selection, adjustment may not be sufficient to control all the bias (38). If it is possible to quantify the selection mechanism, that is, through a validation study, the biasing pathways may be blocked by reweighting the outcome model in a procedure called inverse probability of selection weighting (IPSW; refs. 39, 40).

The proof and assumptions required for IPSW have been described elsewhere (34, 39). Briefly, let X be our exposure of interest, Y our binary outcome, S a binary selection node (where S = 1 for the EHR population and S = 0 for the non-EHR population), Z a vector of confounding variables that are common causes of X, Y, and S, and C any additional variables that are useful for predicting S but do not confound the XY relationship. In our study, C was chosen as the county-specific prevalence of EHR patients (see Supplementary Table S1). We began by modeling the conditional probability of selection $P(S = 1 \mid y, x, \mathbf{z}, c)$ as a logistic regression model including all two-way product terms:
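A form consistent with this description is sketched below. The coefficient symbols and the stacked predictor vector $\mathbf{v}$ are notation assumed here for illustration; they are not taken from the original display.

```latex
\operatorname{logit} P(S = 1 \mid y, x, \mathbf{z}, c)
  = \beta_0 + \sum_{j} \beta_j v_j + \sum_{j < k} \beta_{jk}\, v_j v_k ,
\qquad \mathbf{v} = (y,\; x,\; \mathbf{z},\; c)
```

Here the first sum collects the main effects of the outcome, exposure, confounders, and county-level predictor, and the second sum collects every two-way product among them.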

This model was run for all study participants and the output predicted probabilities were used to calculate individual weights with the marginal probability of the exposure of interest used as the numerator of the stabilized weight (sw), where:
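Written out from the description above, and assuming the usual inverse-probability form in which the fitted selection probability is the denominator, the stabilized weight for a selected patient would be:

```latex
sw = \frac{P(X = x)}{\hat{P}(S = 1 \mid y, x, \mathbf{z}, c)}
```

where $\hat{P}$ denotes the predicted probability from the selection model and $P(X = x)$ is the marginal probability of the patient's observed exposure level.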

For each exposure of interest, we then compared the outcome model parameter estimates calculated three ways: in (i) the entire sample, (ii) the EHR sample only, and (iii) the EHR sample reweighted by sw. All outcome models were adjusted for sex and continuous age. Additional covariates were included to close all biasing pathways, while avoiding adjustment for intermediate variables (32). For example, nSES was included as a covariate in the model for the relationship between marital status and late stage at diagnosis, but not in the model for race/ethnicity and stage, because nSES is an intermediate on the pathway from race/ethnicity to stage. All descriptive and analytic statistics were generated with SAS Enterprise Guide version 10.1 (SAS Institute).
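The three-way comparison can be illustrated with a deliberately simplified sketch. It is not the analysis performed here: it replaces the logistic selection model with selection probabilities estimated directly from stratum counts, uses one binary exposure and one binary outcome with no other covariates, and all counts are hypothetical. Under those assumptions, reweighting the selected sample by the stabilized weight recovers the full-sample odds ratio.

```python
def odds_ratio(table):
    """Odds ratio from counts keyed by (exposure, outcome), each coded 0/1."""
    return (table[(1, 1)] * table[(0, 0)]) / (table[(1, 0)] * table[(0, 1)])


def stabilized_weights(full, selected):
    """sw = P(X = x) / P(S = 1 | x, y), both estimated from stratum counts."""
    total = sum(full.values())
    weights = {}
    for (x, y), n_full in full.items():
        p_selected = selected[(x, y)] / n_full
        p_exposure = (full[(x, 0)] + full[(x, 1)]) / total
        weights[(x, y)] = p_exposure / p_selected
    return weights


# Hypothetical counts by (exposure, outcome); not study data.
full_sample = {(1, 1): 300, (1, 0): 700, (0, 1): 200, (0, 0): 800}
ehr_sample = {(1, 1): 150, (1, 0): 140, (0, 1): 80, (0, 0): 400}

sw = stabilized_weights(full_sample, ehr_sample)
reweighted = {cell: ehr_sample[cell] * sw[cell] for cell in ehr_sample}

or_full = odds_ratio(full_sample)   # (i) entire sample
or_ehr = odds_ratio(ehr_sample)     # (ii) EHR sample only
or_ipsw = odds_ratio(reweighted)    # (iii) EHR sample reweighted by sw
```

In this toy example selection depends on both exposure and outcome, so the unweighted EHR-only odds ratio is biased, while the reweighted estimate matches the full sample. With a fitted parametric selection model, as in the analysis described above, agreement would be approximate rather than exact.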

System-wide linkage

Our linkage “finder files” included N = 4,816,898 unique adult EHR patients and N = 3,350,288 unique CCR patients. The linkage identified a total of N = 306,554 Sutter patient matches (group 1), with N = 169 likely being duplicates (because multiple EHR patients matched the same CCR patient ID). There were N = 840,974 CCR patients residing in the catchment region who were not in the Sutter population (group 2), and N = 4,510,344 Sutter patients who did not match to the CCR (group 3; Fig. 1). After the linkage, tumor and demographic characteristics for groups 1 and 2 were obtained for a total of 1,338,114 tumors for 1,147,528 unique patients.

The validation study sample (Fig. 2) included N = 41,165 patients who were diagnosed with first tumors of the lung/bronchus (N = 7,743), colon/rectum (N = 6,781), female breast (N = 15,953), or prostate (N = 10,688) in 2012 or 2013 and who were residing in the 22-county catchment region. Initial linkage identified 16,257 as members of the EHR population; this number was reduced after we reclassified the EHR population to include only patients with EHR evidence during their cancer episode (N = 10,659), and we compared them with the non-EHR population (N = 30,506).

Internal validity

We found that 12,280 patients (75.5%) had evidence of a past cancer diagnosis in their EHRs, or were being treated at Sutter concurrent with their cancer episode. The remaining 3,977 patients did not have evidence of cancer in their EHRs. Of these patients, 1,355 (34.1%) were never assigned a PCP, and 3,684 (22.7% of all linked patients) had a cancer diagnosis date that was more than 365 days after their last EHR visit. We presume these to be former EHR patients who migrated to another health care facility before their cancer diagnosis, or one-time/occasional EHR patients who visited a Sutter hospital or specialist for an orthopedic surgery or delivery but never elected a PCP. Former/occasional patients were more evenly distributed across SES quintiles than current patients. A further 293 patients (1.8% of all linked patients) had no evidence in their EHRs of a past cancer diagnosis (>365 days before their Sutter encounters). These patients were more likely to be higher SES and to have been diagnosed with cancers of lower stage (Table 1).

External validity and reweighting

The EHR population comprised 25.9% of all first cancers diagnosed in this time period and geographic region (Table 2). This proportion ranged by cancer site, from 21.8% for prostate to 29.7% for female breast, and by county, from 5.3% in Napa to 64.0% in Yuba (Supplementary Table S1). The wide range across counties reflects differences in the location of Sutter facilities. Compared with the non-EHR patients in the catchment region, the EHR population was younger [18–54 PR: 1.12; 95% confidence interval (CI): 1.01–1.25]; less likely to be male (PR: 0.82; 95% CI: 0.71–0.96), non-Hispanic Black (PR: 0.78; 95% CI: 0.75–0.82), Hispanic (PR: 0.74; 95% CI: 0.72–0.77), or Asian American/Pacific Islander (PR: 0.78; 95% CI: 0.72–0.86); and more likely to be higher SES (highest nSES PR: 1.17; 95% CI: 1.03–1.33; lowest nSES PR: 0.93; 95% CI: 0.91–0.96). For insurance type, we observed that a larger proportion of EHR patients claimed Medicare as their primary payer (PR: 1.55; 95% CI: 1.37–1.74), and fewer used public insurance (PR: 0.71; 95% CI: 0.69–0.73). These patterns varied little by tumor site, with a few exceptions. Among EHR lung cancer patients, non-Hispanic Blacks were slightly overrepresented (PR: 1.08; 95% CI: 1.02–1.14) and the SES distribution was more similar to that of the underlying population. For tumor stage, EHR patients had more in situ cancers overall (PR: 1.22; 95% CI: 1.15–1.28), but these differences disappeared when stratified by tumor site (Supplementary Table S2).

For the IPSW demonstration, we excluded 6,849 patients (16.6%) with unknown and very rare covariate values, resulting in a final analytic sample of 34,316. For model 1, we observed a positive relationship between lower SES and later stage, which appeared to follow a linear trend (OR for lowest nSES: 1.958; 95% CI: 1.801–2.129). The unweighted EHR-only models slightly exaggerated these relationships (OR for lowest nSES: 2.022; 95% CI: 1.719–2.378). For model 2, being unmarried was associated with increased odds of later stage at diagnosis in the full sample (OR: 1.394; 95% CI: 1.329–1.462), and this relationship was slightly attenuated in the unweighted EHR population. For model 3, in the full sample, having public insurance was strongly associated with later stage at diagnosis (OR: 1.901; 95% CI: 1.757–2.055), and this relationship was also slightly attenuated in the EHR-only population (OR: 1.844; 95% CI: 1.553–2.190). Finally, compared with non-Hispanic Whites, non-Hispanic Blacks (OR: 1.135; 95% CI: 1.043–1.234) and Hispanics (OR: 1.184; 95% CI: 1.104–1.269) had higher adjusted odds of later stage at diagnosis, and these relationships were slightly exaggerated in the EHR-only analysis. The unweighted EHR OR for Asian Americans and Pacific Islanders (AAPI) was higher, but not significantly different from the null; in the catchment region and in the reweighted EHR population, however, AAPIs were more likely to be diagnosed at later stage compared with non-Hispanic Whites (OR: 1.164; 95% CI: 1.028–1.307). For all models, the IPSW procedure was effective at adjusting the ORs in the EHR population so that they more closely resembled those of the full catchment region (Table 3).

We were able to link a large health care system with a statewide population-based cancer registry to identify patients with cancer. Our validation study was designed to demonstrate improved cancer case ascertainment with a system-wide linkage approach and to evaluate the representativeness of our resulting EHR-based cancer cohort, by comparing it to the underlying catchment region.

For our internal validity study, we found that 22.7% of all linkage-identified EHR cancer patients were diagnosed with cancer after they migrated away from our health care system, and that 1.8% had a history of cancer that was not recorded in their EHRs. Differences between patients with and without evidence of cancer in their EHRs may reflect an “inverse survivorship bias,” that is, patients with more treatable cancers (e.g., colon/rectum) survive longer and hence have more chance of system migration, while patients with less treatable cancers (e.g., lung) have shorter survival and thus less time for migration. Although the pattern did not hold for breast cancers (most of which are treatable), this is a type of selection bias (38) and should be considered when relying on EHRs to study cancers of differing prognoses.

The proportion of registry-identified cancers that were not represented in the EHRs points to weaknesses in the targeted linkage approach; however, the implications of these findings depend on the study design. A cohort study of risk factors for cancer based on longitudinal follow-up would suffer from substantial bias if the patients who were diagnosed with cancer after migration remained classified as cancer-free, given that 23% of patients were subsequently identified with cancer via linkage to a population-based cancer registry. In contrast, a study of patients treated for their cancer at the health care system would likely be unharmed by this omission, given that only 2% of cases omitted a history of cancer in their EHR. We also found that 34% of Sutter cancer patients who were not represented in the EHR were never assigned a PCP, so the nature of their affiliation to the EHR system was questionable to begin with, and a study with a narrower focus on just those patients who also had a PCP would be quite robust. A third approach to generating finder files for linkage might be to start with a subset of the patients of interest, for whom key variables might be expected to be collected, for example, the primary care base or a subset of patients selected based on information density, which has been identified as an important potential indicator of EHR data quality (41).

Self-selection of patients to a particular health care system is a complex multifactorial mechanism. With some notable exceptions, the EHR population in our study was generally representative of the underlying catchment region. We observed some demographic differences between EHR patients and non-EHR patients in the surrounding region, which may be partially explained by characteristics of the region or health care system. For example, EHR patients were more often female, and older EHR patients claimed Medicare coverage more often. Some possible explanations for these findings include the availability of Sutter breast cancer specialists in some regions, and the presence of large competitor HMO systems. These findings highlight that knowledge of provider availability and market characteristics of the catchment region is important for interpretation of these results, and generally for research use of EHR data.

We demonstrated the use of selection probabilities with model-based standardization to improve generalizability of our EHR population to the underlying catchment region. We did not observe substantial differences in the conditional odds of our outcome (late stage at diagnosis) between the EHR population and the full catchment region. Indeed, the reweighting procedure resulted in only one OR that would have changed our statistical inference compared to the unweighted results. This could be an additional indication of the generalizability of our EHR cancer patient population. Alternatively, the observed differences in demographic distributions may not be important for the modeled outcome across strata of the selected and nonselected populations. Either way, undertaking a simple comparison of modeled results between the EHR and the catchment region (even in the absence of implementing IPSW) serves to strengthen an EHR study's external validity.

In order for IPSW to be valid for identifying the causal effect, some strong assumptions are required: there must be absence of other systematic error (confounding, measurement error), and the specified model for probability of selection must be sufficient for rendering the nonselected missing at random. Our IPSW demonstration was fairly ideal in that we had access to all covariates in both the selected and nonselected populations. In a more common scenario, one or more of the variables (e.g., EHR-derived) may predict selection and be unavailable for non-EHR patients. These important unmeasured predictors of selection, such as availability of employer-based health care coverage, or health literacy, cannot be overcome by IPSW, and the credibility of this approach relies on a realistic scenario and robust causal diagram. A less model-driven approach is also possible with knowledge of some but not all selection probabilities, which can be used to adjust descriptive statistics derived from an EHR population, such as incidence or prevalence rates, as is done in weighted survey design.

We found just a few examples of studies that were similar to ours in objective and scope. Related to our findings from the internal validity study, Clarke and colleagues demonstrated the added value of EHRs to identify patients with a history of cancer who might not appear in the statewide tumor registry (15). Relevant to our external validation study, there have been two studies from Kaiser that aimed to characterize the generalizability of their EHR populations, in patients with breast (42) and lung (43) cancer. Selection bias is a known concern for EHR-based research. Haneuse and Daniels developed a framework for evaluating such bias by comparing subsets of a patient population with(out) EHR-derived covariate information (44). Their framework and other similar studies (refs. 45–48) have emphasized the importance of data provenance (i.e., understanding the technology- and provider-related factors that impact how and why EHR data are generated) when considering bias in EHR research. On the basis of our findings, geography and regional context could be additional candidates for important data provenance considerations (49). IPSW has been used for the purposes of generalizing randomized controlled trial data (50) and generalizing autopsy data to a live source population (51). We also found one instance of its use with EHRs in a study of childhood obesity (52).

EHR systems are vast databases that do not have easily accessible research-ready data tables. Research use of EHRs requires a knowledgeable support team experienced in interpreting researcher questions, and extracting the necessary data. Large-scale data initiatives like the one we have described also rely on sharing protected health information (PHI) across institutions to improve scale and validity, but maintaining patient privacy is a key challenge. This requires both secure processes and multiple organizational agreements. We obtained all necessary privacy and legal approvals (from all organizations involved) and extracted names, birthdates, sex, zip codes, and last 4 digits of social security number for 4.5 million adult members of an EHR population. Because of the size of the populations studied, consent was not sought for participation in this study, but we instead obtained authorization for a waiver of consent. Upon return of the cancer details for successfully linked patients (with personal identifiers removed), we additionally extracted “limited” (excluding direct personal identifiers) EHR data pertaining to select patients' cancer care for the validation study. New approaches may allow health care provider entities covered by the Health Insurance Portability and Accountability Act (HIPAA) to share data for patient identification and linkage across data sources and greatly reduce the time and effort required to accomplish a study like ours (53).

Limitations

Linkage with the statewide registry only ascertains reportable cases, i.e., those who lived in the catchment area at the time of diagnosis. Out-of-state residents who sought care at a Sutter facility would not be captured. False positives are possible with probabilistic linkage; however, it has been shown to be valid and, compared with deterministic linkage processes, it is better suited for large data (54–56). Our choice of cancers in the validation sample may impact the results of our internal validity checks. For example, some system providers may be well respected in the medical community and be regularly sought out only for second opinions, which would increase the number of cancer patients who link to the statewide registry, but who are receiving the majority of their care elsewhere.

EHR-based study populations are a convenience sample, but the efficiency of such studies often outweighs the drawbacks. The representativeness of any research database has important implications for the generalizability of findings and recommendations for policy or evidence-based treatment strategies. Our experiences help encourage and inform researchers interested in working with EHRs for cancer research as well as provide context for leveraging linkages to assess and improve study validity and generalizability.

No potential conflicts of interest were disclosed.

The ideas and opinions expressed herein are those of the author(s) and do not necessarily reflect the opinions of the State of California, Department of Public Health, the National Cancer Institute, and the Centers for Disease Control and Prevention or their Contractors and Subcontractors.

Conception and design: C.A. Thompson, S.L. Gomez

Development of methodology: C.A. Thompson, A. Jin

Acquisition of data (provided animals, acquired and managed patients, provided facilities, etc.): C.A. Thompson, H.S. Luft, S.-Y. Liang, S.L. Gomez

Analysis and interpretation of data (e.g., statistical analysis, biostatistics, computational analysis): C.A. Thompson, A. Jin, D.Y. Lichtensztajn, S.-Y. Liang, B.T. Schumacher

Writing, review, and/or revision of the manuscript: C.A. Thompson, H.S. Luft, D.Y. Lichtensztajn, S.-Y. Liang, B.T. Schumacher, S.L. Gomez

Administrative, technical, or material support (i.e., reporting or organizing data, constructing databases): C.A. Thompson, A. Jin, L. Allen, B.T. Schumacher

Study supervision: C.A. Thompson, L. Allen, S.L. Gomez

The authors would like to extend their thanks and gratitude to the research staff who supported this effort, including Rita Leung, Sarah Knowles, Pragati Kenkare, and Mai Vu. Funding for the linkage was provided by the NCI as part of a Surveillance, Epidemiology, and End Results (SEER) Rapid Response Surveillance Study on Patient Generated Health Data (HHSN261201300005I). C.A. Thompson was funded by a career development award from the National Center for Advancing Translational Sciences (KL2TR001444). The collection of cancer incidence data used in this study was supported by the California Department of Public Health pursuant to California Health and Safety Code Section 103885; Centers for Disease Control and Prevention's (CDC) National Program of Cancer Registries under cooperative agreement 5NU58DP006344; the NCI's SEER Program under contract HHSN261201800032I awarded to the University of California San Francisco, contract HHSN261201800015I awarded to the University of Southern California, and contract HHSN261201800009I awarded to the Public Health Institute, Cancer Registry of Greater California.

1. Yu P, Artz D, Warner J. Electronic health records (EHRs): supporting ASCO's vision of cancer care. Am Soc Clin Oncol Educ Book 2014:225–31.
2. Yu PP. The evolution of oncology electronic health records. Cancer J 2011;17:197–202.
3. Miriovsky BJ, Shulman LN, Abernethy AP. Importance of health information technology, electronic health records, and continuously aggregating data to comparative effectiveness research and learning health care. J Clin Oncol 2012;30:4243–8.
4. Warner J, Hochberg E. Where is the EHR in oncology? J Natl Compr Canc Netw 2012;10:584–8.
5. Weiner MG, Lyman JA, Murphy S, Weiner M. Electronic health records: high-quality electronic data for higher-quality clinical research. Inform Prim Care 2007;15:121–7.
6. Hughes AE, Tiro JA, Balasubramanian BA, Skinner CS, Pruitt SL. Social disadvantage, healthcare utilization, and colorectal cancer screening: leveraging longitudinal patient address and health records data. Cancer Epidemiol Biomarkers Prev 2018;27:1424–32.
7. Thompson CA, Gomez SL, Chan A, Chan JK, McClellan SR, Chung S, et al. Patient and provider characteristics associated with colorectal, breast, and cervical cancer screening among Asian Americans. Cancer Epidemiol Biomarkers Prev 2014;23:2208–17.
8. Mayer MA, Gutierrez-Sacristan A, Leis A, De La Pena S, Sanz F, Furlong LI. Using electronic health records to assess depression and cancer comorbidities. Stud Health Technol Inform 2017;235:236–40.
9. Young-Wolff KC, Klebaner D, Folck B, Tan ASL, Fogelberg R, Sarovar V, et al. Documentation of e-cigarette use and associations with smoking from 2012 to 2015 in an integrated healthcare delivery system. Prev Med 2018;109:113–8.
10. Huo J, Yang M, Tina Shih Y-C. Sensitivity of claims-based algorithms to ascertain smoking status more than doubled with meaningful use. Value in Health 2018;21:334–40.
11. Schinasi LH, Auchincloss AH, Forrest CB, Diez Roux AV. Using electronic health record data for environmental and place based population health research: a systematic review. Ann Epidemiol 2018;28:493–502.
12. Cole AM, Pflugeisen B, Schwartz MR, Miller SC. Cross sectional study to assess the accuracy of electronic health record data to identify patients in need of lung cancer screening. BMC Research Notes 2018;11:14.
13. Häyrinen K, Saranto K, Nykänen P. Definition, structure, content, use and impacts of electronic health records: a review of the research literature. Int J Med Inf 2008;77:291–304.
14. Vuokko R, Mäkelä-Bengs P, Hyppönen H, Lindqvist M, Doupi P. Impacts of structuring the electronic health record: Results of a systematic literature review from the perspective of secondary use of patient data. Int J Med Inf 2017;97:293–303.
15. Clarke CL, Feigelson HS. Developing an algorithm to identify history of cancer using electronic medical records. EGEMS (Wash DC) 2016;4:1209.
16. Sheikhalishahi S, Miotto R, Dudley JT, Lavelli A, Rinaldi F, Osmani V. Natural language processing of clinical notes on chronic diseases: systematic review. JMIR Med Inform 2019;7:e12239.
17. Jacobs EJ, Briggs PJ, Deka A, Newton CC, Ward KC, Kohler BA, et al. Follow-up of a large prospective cohort in the United States using linkage with multiple state cancer registries. Am J Epidemiol 2017;186:876–84.
18. Cancer Registries Amendment Act, S.3312, 1992, 102nd Cong., 2nd Sess. 1991–1992.
19. National Program of Cancer Registries Program Standards 2012–2017 (updated January 2013). Atlanta (GA): Centers for Disease Control and Prevention; 2013. Available from: https://www.cdc.gov/cancer/npcr/pdf/npcr_standards.pdf.
20. Thoburn KK, German RR, Lewis M, Nichols PJ, Ahmed F, Jackson-Thompson J. Case completeness and data accuracy in the Centers for Disease Control and Prevention's National Program of Cancer Registries. Cancer 2007;109:1607–16.
21. Kurian AW, Mitani A, Desai M, Yu PP, Seto T, Weber SC, et al. Breast cancer treatment across health care systems: linking electronic medical records and state registry data to enable outcomes research. Cancer 2014;120:103–11.
22. Sherman RE, Anderson SA, Dal Pan GJ, Gray GW, Gross T, Hunter NL, et al. Real-world evidence - what is it and what can it tell us? N Engl J Med 2016;375:2293–7.
23. Mahajan R. Real world data: additional source for making clinical decisions. Int J Appl Basic Med Res 2015;5:82.
24. Khozin S, Blumenthal GM, Pazdur R. Real-world data for clinical evidence generation in oncology. J Natl Cancer Inst 2017;109. doi: 10.1093/jnci/djx187.
25. Rusanov A, Weiskopf NG, Wang S, Weng C. Hidden in plain sight: bias towards sick patients when sampling patients with sufficient electronic health record data for research. BMC Med Inform Decis Mak 2014;14:51.
26. Bower JK, Patel S, Rudy JE, Felix AS. Addressing bias in electronic health record-based surveillance of cardiovascular disease risk: finding the signal through the noise. Curr Epidemiol Rep 2017;4:346–52.
27. Verheij RA, Curcin V, Delaney BC, McGilchrist MM. Possible sources of bias in primary care electronic health record data use and reuse. J Med Internet Res 2018;20:e185.
28. Goldstein BA, Bhavsar NA, Phelan M, Pencina MJ. Controlling for informed presence bias due to the number of health encounters in an electronic health record. Am J Epidemiol 2016;184:847–55.
29. Weber GM, Adams WG, Bernstam EV, Bickel JP, Fox KP, Marsolo K, et al. Biases introduced by filtering electronic health records for patients with "complete data". J Am Med Inform Assoc 2017;24:1134–41.
30. Desai JR, Wu P, Nichols GA, Lieu TA, O'Connor PJ. Diabetes and asthma case identification, validation, and representativeness when using electronic health data to construct registries for comparative effectiveness and epidemiologic research. Med Care 2012;50 Suppl:S30–5.
31. Stuart EA, DuGoff E, Abrams M, Salkever D, Steinwachs D. Estimating causal effects in observational studies using electronic health data: challenges and (some) solutions. EGEMS (Wash DC) 2013;1. doi: 10.13063/2327-9214.1038.
32. Rothman K, Greenland S, Lash T. Modern epidemiology. 3rd ed. Philadelphia: Lippincott Williams & Wilkins; 2008.
33. Kalton G, Flores-Cervantes I. Weighting methods. J Off Stat 2003;19:81–97.
34. Thompson CA, Arah OA. Selection bias modeling using observed data augmented with imputed record-level probabilities. Ann Epidemiol 2014;24:747–53.
35. Yang J, Schupp C, Harrati A, Clarke C, Keegan T, Gomez S. Developing an area-based socioeconomic measure from American Community Survey data. Fremont (CA): Cancer Prevention Institute of California; 2014.
36. Greenland S, Pearl J, Robins JM. Causal diagrams for epidemiologic research. Epidemiology 1999;10:37–48.
37. Greenland S. Quantifying biases in causal models: classical confounding vs. collider-stratification bias. Epidemiology 2003;14:300–6.
38. Hernan MA, Hernandez-Diaz S, Robins JM. A structural approach to selection bias. Epidemiology 2004;15:615–25.
39. Robins JM, Hernan MA, Brumback B. Marginal structural models and causal inference in epidemiology. Epidemiology 2000;11:550–60.
40. Mansournia MA, Altman DG. Inverse probability weighting. BMJ 2016;352:i189.
41. Reimer AP, Milinovich A, Madigan EA. Data quality assessment framework to assess electronic medical record data for use in research. Int J Med Inf 2016;90:40–7.
42. Gomez SL, Shariff-Marco S, Von Behren J, Kwan ML, Kroenke CH, Keegan TH, et al. Representativeness of breast cancer cases in an integrated health care delivery system. BMC Cancer 2015;15:688.
43. Check DK, Albers KB, Uppal KM, Suga JM, Adams AS, Habel LA, et al. Examining the role of access to care: Racial/ethnic differences in receipt of resection for early-stage non-small cell lung cancer among integrated system members and non-members. Lung Cancer 2018;125:51–6.
44. Haneuse S, Daniels M. A general framework for considering selection bias in EHR-based studies: what data are observed and why? EGEMS (Wash DC) 2016;4:1203.
45. Johnson KE, Kamineni A, Fuller S, Olmstead D, Wernli KJ. How the provenance of electronic health record data matters for research: a case example using system mapping. EGEMS (Wash DC) 2014;2:1058.
46. Thompson CA, Kurian AW, Luft HS. Linking electronic health records to better understand breast cancer patient pathways within and between two health systems. EGEMS (Wash DC) 2015;3:1127.
47. Hersh WR, Cimino J, Payne PR, Embi P, Logan J, Weiner M, et al. Recommendations for the use of operational electronic health record data in comparative effectiveness research. EGEMS (Wash DC) 2013;1:1018.
48. Hersh WR, Weiner MG, Embi PJ, Logan JR, Payne PR, Bernstam EV, et al. Caveats for the use of operational electronic health record data in comparative effectiveness research. Med Care 2013;51(8 Suppl 3):S30–7.
49. Kroneman M, Verheij R, Tacken M, van der Zee J. Urban-rural health differences: primary care data and self reported data render different results. Health Place 2010;16:893–902.
50. Buchanan AL, Hudgens MG, Cole SR, Mollan KR, Sax PE, Daar ES, et al. Generalizing evidence from randomized trials using inverse probability of sampling weights. J Roy Stat Soc Ser A 2018;181:1193–209.
51. Haneuse S, Schildcrout J, Crane P, Sonnen J, Breitner J, Larson E. Adjustment for selection bias in observational studies with application to the analysis of autopsy data. Neuroepidemiology 2009;32:229–39.
52. Flood TL, Zhao YQ, Tomayko EJ, Tandias A, Carrel AL, Hanrahan LP. Electronic health records and community health surveillance of childhood obesity. Am J Prev Med 2015;48:234–40.
53. Datavant partners with the People-Centered Research Foundation to de-identify and link data across national clinical research network. Datavant. 2019 Aug 13. Available from: https://www.prnewswire.com/news-releases/datavant-partners-with-the-people-centered-research-foundation-to-de-identify-and-link-data-across-national-clinical-research-network-300900349.html.
54. Clark DE, Hahn DR. Comparison of probabilistic and deterministic record linkage in the development of a statewide trauma registry. Proc Annu Symp Comput Appl Med Care 1995:397–401.
55. Tromp M, Ravelli AC, Bonsel GJ, Hasman A, Reitsma JB. Results from simulated data sets: probabilistic record linkage outperforms deterministic record linkage. J Clin Epidemiol 2011;64:565–72.
56. Garvin JH, Herget KA, Hashibe M, Kirchhoff AC, Hawley CW, Bolton D, et al. Linkage between Utah all payers claims database and central cancer registry. Health Serv Res 2019;54:707–13.