Abstract
Although many diseases are associated with cancer, the full spectrum of temporal disease correlations across cancer types has not yet been characterized. A population-wide study of longitudinal disease trajectories is needed to interrogate the general medical histories of patients with cancer. Here we performed a retrospective study covering a 20-year period, using 6.9 million patients from the Danish National Patient Registry linked to 0.7 million patients with cancer from the Danish Cancer Registry. Statistical analysis identified all significant disease associations occurring prior to cancer diagnoses. These associations were used to build frequently occurring, longitudinal disease trajectories. Across 17 cancer types, a total of 648 significant diagnoses correlated directly with a cancer, while 168 diagnosis trajectories of time-ordered steps were identified for seven cancer types. The most common diseases across cancer types were cardiovascular, obesity-related, and genitourinary diseases. A comprehensive, publicly available web tool of interactive illustrations for all cancer disease associations is provided. By exploring the precancer landscape using this large dataset, we identify disease associations that can be used to derive mechanistic hypotheses for future cancer research.
This study offers an innovative approach to examine prediagnostic disease and cancer development in a large national population-based setting and provides a publicly available tool to foster additional cancer surveillance research.
Introduction
Patients with cancer often have diverse histories of hospital encounters before they acquire their first cancer diagnosis. Longitudinal health care information can be used to reveal causes of cancer at the level of individuals such as repeated or prolonged infections with oncogenic pathogens due to impaired immune surveillance (1). A given cancer type may be caused by different processes, which might be reflected in different prehistories and multimorbidities associated with the ultimate development of one cancer.
It is clinically apparent that there is substantial variation in precancer medical histories among patients ultimately diagnosed with the same cancer type. One example is head and neck cancer, where evidence shows that it can arise from two distinct etiologies (one human papillomavirus–related and another alcohol and smoking-related), which should be treated as two separate entities (2). Another example is chronic infectious diseases associated with increased risk of lymphoma, which could be due to sources such as chronic inflammation or immunodeficiency caused by, for example, immunosuppressive therapy.
Healthcare data are increasingly being applied in biomedical research to investigate patient stratification, comorbidities, and clinical outcomes (3). A patient can traverse multiple diseases during a lifetime driven by risk factors such as lifestyle, genetics, or treatment-provoked disease associations (4). Some studies have already performed disease trajectory analysis on healthcare data, for example, studies that either use population-wide registry data (6–7 million patients) from a single country (5) or small sets of electronic patient records (6). Diverse cancer routes converging toward sepsis have also been analyzed previously using such a longitudinal approach (7). However, no study has systematically mapped cancer disease associations and trajectory trends using large cohorts without selection bias (8) followed over several decades. Here we use a population-wide approach to delineate all precancer diagnosis pairs and diagnosis trajectories using the Danish Cancer Registry (9) and the Danish National Patient Registry (NPR; ref. 10) in the period 1994–2015. As the registry covers a nation-wide healthcare system, selection biases such as selective inclusion of specific patient groups/hospitals, income levels, age groups, and genders are eliminated (8).
When using diagnosis trajectories to infer causal relationships between a cancer and its disease cooccurrences, a key challenge is to distinguish diagnoses that drive cancer formation from those that do not. Diagnoses that arise just before the cancer diagnosis may derive from the cancer (e.g., pneumonia preceding lung cancer), whereas those emerging after the cancer diagnosis may be the result of complications to treatment. Hence, we designed this study to acquire the full temporal map of all diagnoses occurring years before the first cancer diagnosis. By using temporality as a tool, we assume that diagnoses occurring long before the cancer is diagnosed are more likely to be causally related to the development of the cancer than those appearing closer to the diagnostic timepoint, even though temporality in itself does not constitute proof.
The complex map of specific diagnoses upstream of cancers has been made available as an interactive web catalog.
Materials and Methods
Study design
The objective of this retrospective cohort study was to determine and characterize disease associations and trajectories prior to cancer diagnoses using population-wide electronic health registry data. The trajectories were derived by joining temporal diagnosis correlations using a previously published statistical approach (5).
The data used in the study are from the Danish NPR and the Danish Cancer Registry. Registry data comprise administrative data and primary and secondary diagnoses covering all hospital encounters in Denmark. All diagnoses are classified using the International Classification of Diseases, 10th revision (ICD-10), which has a hierarchical structure. ICD-10 codes can thus be rounded to a higher-level diagnosis code, block, or chapter. Here, we used level 3 codes (e.g., C50 for breast cancer).
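As an illustration of the rounding step, the following minimal sketch truncates a code to its three-character category. The function name `to_level3` and the assumption that every code starts with one letter followed by two digits are ours, not part of the original pipeline.

```python
import re

# Assumed shape of a level 3 ICD-10 category: one letter, two digits (e.g. C50).
_LEVEL3_PATTERN = re.compile(r"^[A-Z][0-9]{2}")


def to_level3(code: str) -> str:
    """Round an ICD-10 code such as 'C50.9' to its level 3 category 'C50'."""
    cleaned = code.strip().upper()
    match = _LEVEL3_PATTERN.match(cleaned)
    if match is None:
        raise ValueError(f"not a recognizable ICD-10 code: {code!r}")
    return match.group(0)


print(to_level3("C50.9"))  # C50
```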
Diagnoses occurring after the first cancer diagnosis were left out to reduce confounding from posttreatment-related effects; the analysis uses only the first cancer diagnosis for each patient. Correspondingly, we included only the first occurrence of each noncancer diagnosis prior to the cancer diagnosis. In this way, we remove many of the posttreatment effects and achieve balance between cancer and noncancer diagnoses.
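The filtering rule can be sketched as follows. This is a hypothetical illustration: the record layout `(patient_id, level3_code, date)`, the helper names, and the simplification that any code starting with "C" counts as a cancer diagnosis are assumptions made for the example.

```python
from datetime import date


def is_cancer(code: str) -> bool:
    # Simplifying assumption for this sketch: malignant neoplasms start with "C".
    return code.startswith("C")


def first_occurrences_before_cancer(records):
    """Keep, per patient, the first occurrence of each noncancer diagnosis
    dated before the first cancer diagnosis, plus that first cancer diagnosis.
    Patients without cancer keep the first occurrence of every diagnosis."""
    by_patient = {}
    for patient_id, code, day in records:
        by_patient.setdefault(patient_id, []).append((day, code))

    result = {}
    for patient_id, events in by_patient.items():
        events.sort()  # chronological order
        first_cancer = next(((d, c) for d, c in events if is_cancer(c)), None)
        kept, seen = [], set()
        for day, code in events:
            if is_cancer(code):
                continue
            if first_cancer is not None and day >= first_cancer[0]:
                continue  # drop anything at or after the first cancer diagnosis
            if code not in seen:
                seen.add(code)
                kept.append((day, code))
        if first_cancer is not None:
            kept.append(first_cancer)
        result[patient_id] = kept
    return result


records = [
    (1, "J44", date(2001, 3, 1)),
    (1, "J44", date(2003, 5, 1)),   # repeat diagnosis, dropped
    (1, "C34", date(2005, 1, 10)),  # first cancer diagnosis, kept
    (1, "J18", date(2005, 6, 1)),   # after cancer, dropped
    (2, "I20", date(2000, 1, 1)),
]
timelines = first_occurrences_before_cancer(records)
```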
Patient cohorts
NPR covers all encounters with Danish hospitals since 1977, including public and private hospital visits. In 1995, data were added for patients from emergency rooms and psychiatric wards (outpatients). The Cancer Registry contains information on cancer incidences for the Danish population since 1943. Reporting has been mandatory since 1987, and notifications have been electronic since 2004. In addition, the registry includes information on tumor characteristics such as tumor staging, topography and morphology, and treatment strategies. Our extracts from NPR and the Cancer Registry cover the period 1994–2015 and include 6.9 million patients and 0.8 million patients (0.7 million malignant cases), respectively. We did not include data from before 1994 from the two registries, because these were coded in the older version ICD-8. Patient-specific information could be linked between the Cancer Registry and NPR by using the nationwide Danish identification numbers. All patient identification numbers were deidentified during analysis. The median age and sex frequency for each cancer type are listed in Table 1.
Table 1. Characteristics and statistics for cancer types directly associated with a diagnosis
Cancer type | ICD-10 | Number of incidences in cohort | Median age at diagnosis | % Male | % Female | Directional (D→C) disease pairs: N patients | Directional (D→C) disease pairs: N pairs |
---|---|---|---|---|---|---|---|
Skin | C44 | 158,970 | 68 | 48 | 52 | 52,252 | 89 |
Lung | C34 | 70,875 | 69 | 55 | 45 | 27,931 | 288 |
Prostate | C61 | 57,840 | 72 | 100 | 0 | 23,177 | 73 |
Breast | C50 | 88,609 | 63 | 1 | 99 | 8,189 | 31 |
Stomach | C16 | 10,389 | 70 | 64 | 36 | 2,994 | 69 |
Corpus uteri | C54 | 12,281 | 67 | 0 | 100 | 1,869 | 3 |
Ovary | C56 | 10,868 | 65 | 0 | 100 | 1,325 | 20 |
Diffuse NHL | C83 | 3,372 | 67 | 58 | 42 | 998 | 32 |
Other NHL | C85 | 3,140 | 67 | 54 | 46 | 949 | 21 |
Thyroid | C73 | 3,780 | 51 | 27 | 73 | 638 | 1 |
Cervix uteri | C53 | 8,482 | 49 | 0 | 100 | 273 | 4 |
Follicular NHL | C82 | 1,344 | 62 | 49 | 51 | 100 | 3 |
Tonsil | C09 | 2,503 | 59 | 71 | 27 | 90 | 5 |
Oropharynx | C10 | 1,176 | 61 | 71 | 29 | 89 | 5 |
Anus and anal canal | C21 | 1,876 | 63 | 30 | 70 | 44 | 2 |
Vulva | C51 | 1,829 | 73 | 0 | 100 | 23 | 1 |
Tongue | C01 | 593 | 60 | 76 | 24 | 16 | 1 |
Abbreviations: C, cancer diagnosis; D, noncancer diagnosis.
To include the most reliable cancer hospital encounters, we excluded cancer diagnoses in the NPR not present in the curated Cancer Registry, because these can be false-positive cases. Furthermore, we included cancer cases from the Cancer Registry not registered in NPR for patients already in NPR.
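A minimal sketch of this reconciliation, under the assumption that cases are compared as `(patient_id, level3_code)` pairs; the matching key actually used is not specified here, and the function name is hypothetical.

```python
def reconcile_cancer_cases(npr_cancer, registry_cancer, npr_patients):
    """Combine cancer cases from the two sources.

    npr_cancer, registry_cancer: sets of (patient_id, code) pairs.
    npr_patients: set of patient ids present in NPR.
    """
    # Keep NPR cancer diagnoses only if the Cancer Registry confirms them.
    confirmed = npr_cancer & registry_cancer
    # Add registry-only cases, but only for patients already present in NPR.
    registry_only = {
        (pid, code)
        for pid, code in registry_cancer - npr_cancer
        if pid in npr_patients
    }
    return confirmed | registry_only


cases = reconcile_cancer_cases(
    npr_cancer={(1, "C34"), (2, "C50"), (3, "C61")},
    registry_cancer={(1, "C34"), (3, "C44"), (4, "C16")},
    npr_patients={1, 2, 3},
)
```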
Diagnosis pair correlations and construction of disease trajectories
We applied a previously published method (5) to calculate significant directional diagnosis pairs and join these into disease trajectories. The strength of correlation between a pair of diagnoses was estimated using RR scores (Eq. A). For a given pair of diagnoses, D1 followed by D2, an exposed group of patients was created by identifying all discharges with D1 assigned. Each exposed group was compared to N = 10,000 randomly sampled comparison groups matched by age, sex, type of hospital encounter (inpatient, outpatient, and emergency room), and calendar time (discharge week). Matching by discharge week was done to eliminate seasonal fluctuations in, for example, diagnosis patterns that could have confounding effects on the findings. Subsequently, the occurrence of D2 within a timeframe of 15 years from the first D1 discharge was counted for both the exposed and comparison groups. This count was denoted C_exposed for the exposed group and C_i for the ith comparison group. As the population size is n_exposed for the exposed group and for every comparison group, the RR is given by:

$$\mathrm{RR} = \frac{C_{\mathrm{exposed}}}{\frac{1}{N}\sum_{i=1}^{N} C_i} \qquad \text{(A)}$$
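Because all groups have the same size, the RR reduces to the exposed count divided by the mean count over the comparison groups. A minimal sketch of that ratio follows; the function name is ours, and the sampling of matched comparison groups is not shown.

```python
from statistics import mean


def relative_risk(c_exposed: int, comparison_counts) -> float:
    """RR = C_exposed / mean(C_i), assuming equally sized groups."""
    mean_comparison = mean(comparison_counts)
    if mean_comparison == 0:
        raise ValueError("D2 never occurs in the comparison groups; RR is undefined")
    return c_exposed / mean_comparison


# Toy numbers: 30 exposed patients develop D2, versus 15 on average
# in three (rather than 10,000) comparison groups.
rr = relative_risk(30, [10, 20, 15])
```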
P values for the RR were obtained using a binomial distribution, where Cexposed is compared with the average probability of sampling a control patient with the second disease within the timeframe (5). When testing hundreds of thousands of correlations in a batch manner, it is very difficult to ensure that the assumptions for advanced hierarchical models are met and that they will converge. The strength of the method here is that it does not rely on assumptions based on Poisson distributions, nontime-dependent proportional hazards, or similar concepts.
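The binomial P value can be sketched as an upper-tail probability: the chance of seeing at least C_exposed cases among n_exposed patients if each had the average control probability. The one-sided form and the function names are assumptions of this sketch; the published method (5) should be consulted for the exact definition.

```python
import math
from statistics import mean


def binom_log_pmf(k: int, n: int, p: float) -> float:
    """Log of the Binomial(n, p) probability mass at k."""
    if p <= 0.0:
        return 0.0 if k == 0 else -math.inf
    if p >= 1.0:
        return 0.0 if k == n else -math.inf
    log_choose = math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
    return log_choose + k * math.log(p) + (n - k) * math.log1p(-p)


def binom_upper_tail(k: int, n: int, p: float) -> float:
    """P(X >= k) for X ~ Binomial(n, p), summed in a numerically safe way."""
    total = sum(math.exp(binom_log_pmf(j, n, p)) for j in range(k, n + 1))
    return min(1.0, total)


def rr_p_value(c_exposed: int, comparison_counts, n_exposed: int) -> float:
    """One-sided P value for observing at least c_exposed cases."""
    # Average probability that a sampled control patient gets D2 in the window.
    p_control = mean(comparison_counts) / n_exposed
    return binom_upper_tail(c_exposed, n_exposed, p_control)
```

Working in log space avoids overflow in the binomial coefficient for large groups, at the cost of a plain loop over the tail.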
Next, the method tests for significant directionality (D1 → D2) between each significantly correlated diagnosis pair. This is done with a binomial test comparing the number of times the first diagnosis precedes the second to a probability of 50%. Finally, longitudinal disease trajectories were generated by joining significant directional diagnosis pairs into trajectories of various lengths. Only trajectories followed by at least 100 patients were included.
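The two steps can be sketched together as below. This is an illustration under stated assumptions rather than the published implementation: the directionality test is taken as one-sided, a patient is considered to follow a trajectory when its diagnoses appear in that order in the patient's timeline of first occurrences, and all function names are hypothetical.

```python
import math


def direction_p_value(n_d1_first: int, n_total: int) -> float:
    """One-sided exact binomial test against p = 0.5:
    P(X >= n_d1_first) for X ~ Binomial(n_total, 0.5)."""
    tail = sum(math.comb(n_total, j) for j in range(n_d1_first, n_total + 1))
    return tail / 2 ** n_total


def follows(timeline, path) -> bool:
    """True if the codes in `path` occur in `timeline` in the same order."""
    remaining = iter(timeline)
    return all(code in remaining for code in path)


def build_trajectories(pairs, timelines, min_patients=100, max_length=4):
    """Join directional pairs (a, b) and (b, c) into longer paths and keep
    those followed by at least `min_patients` patients."""
    successors = {}
    for first, second in pairs:
        successors.setdefault(first, []).append(second)

    kept = {}
    current = [tuple(pair) for pair in pairs]
    for _ in range(max_length - 2):
        extended = []
        for path in current:
            for nxt in successors.get(path[-1], []):
                if nxt in path:
                    continue  # avoid cycles
                candidate = path + (nxt,)
                count = sum(follows(t, candidate) for t in timelines.values())
                if count >= min_patients:
                    kept[candidate] = count
                    extended.append(candidate)
        current = extended
    return kept


toy_timelines = {
    1: ["I20", "I25", "C34"],
    2: ["I20", "J44", "I25", "C34"],
    3: ["I25", "I20", "C34"],  # wrong order, does not follow the path
}
toy = build_trajectories(
    [("I20", "I25"), ("I25", "C34")], toy_timelines, min_patients=2
)
```

The toy call lowers `min_patients` to 2 so that the three example timelines produce a result; the study threshold of 100 is the default.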
To address the issue of multiple testing, Bonferroni correction was applied to both the P values associated with RR and with directionality (D1 → D2). The P values for directionality were only corrected for those diagnosis pairs that passed the RR significance cut-off value of 1.
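A minimal sketch of the two-stage correction, assuming that a pair passes the first stage when its Bonferroni-corrected RR P value is at most 0.05 and its RR exceeds 1, and that directionality P values are then corrected only by the number of pairs that passed. The tuple layout and names are illustrative.

```python
def bonferroni(p_values):
    """Bonferroni-adjusted P values: multiply by the number of tests, cap at 1."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]


def two_stage_significant(pairs, alpha=0.05):
    """pairs: list of (name, rr, p_rr, p_direction) tuples.
    Returns the names of pairs significant for both RR and directionality."""
    # Stage 1: correct the RR P values over all tested pairs.
    adjusted_rr = bonferroni([p_rr for _, _, p_rr, _ in pairs])
    passed = [
        pair for pair, p_adj in zip(pairs, adjusted_rr)
        if p_adj <= alpha and pair[1] > 1.0
    ]
    # Stage 2: correct directionality P values only over pairs that passed.
    adjusted_dir = bonferroni([p_dir for _, _, _, p_dir in passed])
    return [pair[0] for pair, p_adj in zip(passed, adjusted_dir) if p_adj <= alpha]


significant = two_stage_significant([
    ("A->C", 2.0, 0.001, 0.01),
    ("B->C", 1.5, 0.01, 0.04),
    ("E->C", 0.8, 0.001, 0.001),  # RR below 1, excluded
    ("F->C", 1.2, 0.3, 0.001),    # RR not significant, excluded
])
```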
Temporal analysis of precancer diagnoses
We selected all significant directional diagnosis pairs with an RR > 1 that comprise a noncancer diagnosis leading to a cancer diagnosis (D1 → C1). For each diagnosis pair, the differential time (in years) was calculated between D1 and C1 for each patient. The average differential time across patients with the specific diagnosis pair was then used as an estimate of the time occurrence of D1 before C1. In this analysis, we have excluded inverse relationships (RR < 1) although they are of significant interest in relation to preventive approaches or improved understanding of cancer etiology (11). There is now an interesting literature on inverse relationships, for example psychiatric disease in relation to lower incidence levels of certain cancers (12, 13). However, the inverse relationships are more sensitive to confounding factors, for example, different smoking or dietary patterns in psychiatric patients that can alter cancer incidences. We do not in general have access to exposome data at the population-wide scale and therefore decided not to include inverse relationships to avoid presenting spurious associations of this kind. Underdiagnosis is another problem that may lead to the spurious identification of inverse comorbidities. One example is cancer and Alzheimer disease, where a comparison with vascular dementia may be needed to establish that the lower risk of cancer in patients with Alzheimer disease is likely correct and not caused by general underdiagnosis of dementia in a particular cohort (14). It has also been reported that underdiagnosis, when analyzing chronic obstructive pulmonary disease (COPD) and its comorbidities, including lung cancer, may contribute to explaining the varied incidence and prevalence of common comorbidities between studies (15).
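The differential-time estimate described at the start of this subsection amounts to a per-pair average of the gap between D1 and C1. A minimal sketch follows; converting days to years by dividing by 365.25 is our assumption, as is the function name.

```python
from datetime import date
from statistics import mean

DAYS_PER_YEAR = 365.25  # assumed conversion factor


def mean_years_before_cancer(date_pairs) -> float:
    """Average time in years from D1 to C1 over patients with the pair.

    date_pairs: iterable of (d1_date, c1_date) tuples, one per patient.
    """
    return mean((c1 - d1).days / DAYS_PER_YEAR for d1, c1 in date_pairs)


gap = mean_years_before_cancer([
    (date(2000, 1, 1), date(2002, 1, 1)),
    (date(2000, 1, 1), date(2004, 1, 1)),
])
```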
Construction of disease trajectories with aggregated diagnoses
Aggregated diagnoses were also created for both cancer and noncancer diagnoses and analyzed by chapter in the ICD-10 classification system. Cancer types that have previously been linked in the literature to the risk factors infectious agents, hormonal factors, or smoking (Supplementary Table S1) were grouped accordingly. Noncancer diseases related to immune dysfunction were grouped, as were diseases that could relate to aging. The same approach was applied to find disease correlations and trajectories as for the nongrouped analysis.
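Aggregation by chapter can be sketched as a range lookup on level 3 codes. The ranges below follow the standard WHO ICD-10 chapter blocks for chapters I to XXI; the lookup function is an illustrative assumption, and the risk-factor groupings from Supplementary Table S1 are not reproduced.

```python
# (first code, last code, chapter) for WHO ICD-10 chapters I to XXI.
ICD10_CHAPTERS = [
    ("A00", "B99", "I"), ("C00", "D48", "II"), ("D50", "D89", "III"),
    ("E00", "E90", "IV"), ("F00", "F99", "V"), ("G00", "G99", "VI"),
    ("H00", "H59", "VII"), ("H60", "H95", "VIII"), ("I00", "I99", "IX"),
    ("J00", "J99", "X"), ("K00", "K93", "XI"), ("L00", "L99", "XII"),
    ("M00", "M99", "XIII"), ("N00", "N99", "XIV"), ("O00", "O99", "XV"),
    ("P00", "P96", "XVI"), ("Q00", "Q99", "XVII"), ("R00", "R99", "XVIII"),
    ("S00", "T98", "XIX"), ("V01", "Y98", "XX"), ("Z00", "Z99", "XXI"),
]


def chapter_of(level3_code: str) -> str:
    """Return the ICD-10 chapter (roman numeral) for a level 3 code.

    Plain string comparison works because level 3 codes share the
    fixed letter-digit-digit shape.
    """
    for first, last, chapter in ICD10_CHAPTERS:
        if first <= level3_code <= last:
            return chapter
    raise ValueError(f"no chapter found for code {level3_code!r}")


print(chapter_of("N92"))  # XIV
```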
Interactive figures
All results are freely available as a web catalog of interactive figures at http://cprlinxweb05.sund.ku.dk/.
Data and materials approval
The NPR and Cancer Registry are protected by the Danish Act on Processing of Personal Data and can only be accessed through application. We obtained approval through the Danish Data Protection Agency, Copenhagen (ref. no. 2015–54–0939) and the National Board of Health, Copenhagen (ref. no. FSEID-00001627). Informed consent and assessment of the proposal in scientific ethical committees are not required for registry-based research in Denmark.
Results
Generating precancer diagnosis maps
Across all ICD-10 chapters, we carried out a population-wide statistical analysis to obtain directional disease cooccurrence pairs of significance for patients with cancer. We obtained 10,675 significant cancer-associated diagnosis pairs (Bonferroni-corrected P ≤ 0.05) with an RR higher than 1.0 (where the second diagnosis occurred within a 15-year time frame of the first disease; Supplementary Fig. S1). Of these, 3,526 pairs had a significant direction (Bonferroni-corrected P ≤ 0.05; Supplementary Table S2). To analyze single diseases directly associated with cancer, we selected the 648 pairs that comprised one cancer diagnosis. Overall, these pairs span 17 cancer types, and the number of diagnosis pairs varies from one to 288 between cancer types (Table 1). The cancer types with the most diagnosis pairs were lung (288 pairs), skin (89 pairs), and prostate (73 pairs), followed by stomach (69 pairs).
Figure 1 presents a comprehensive precancer map of diagnoses for the eight most prevalent cancer types based on data spanning 15 years prior to a cancer diagnosis (for remaining cancer types and specific diagnoses, see Supplementary Table S3 and Supplementary Fig. S2, respectively). The results reflect current knowledge for recognized disease associations including benign mammary dysplasia prior to breast cancer (5.3 years; RR, 2.1), hyperplasia of prostate prior to prostate cancer (2.0 years; RR, 2.0), and COPD prior to lung cancer (1.5 years; RR, 2.2; for all RR scores see Supplementary Table S2 and Supplementary Fig. S3). These cases serve as positive controls to demonstrate that the method reliably can detect known disease associations. Figure 1 furthermore shows that the precancer disease maps vary both in terms of time-wise accumulation of diagnoses and frequency of ICD-10 chapters. For example, breast cancer–related disease pairs occurred further from the cancer diagnosis than those developing prior to lung cancer. Also, diseases related to lung and stomach cancer span substantially more ICD-10 chapters than those of hormone-related cancers (breast and ovarian cancer).
A stacked density plot of precancer disease occurrences across 10 years. The density of noncancer diagnoses occurrences before the cancer types (with at least 20 significant directional diagnosis pairs); breast, prostate, ovary, lung, skin, stomach, diffuse NHL (diff. NHL), and other NHL (other NHL) are shown at the ICD-10 chapter level represented by the colors. The timeline spans 15 years before first cancer diagnosis, but we only detected diagnoses within 10 years. The time at each diagnosis is averaged across all patients with cancer with that specific disease pair. Only disease pairs with a significant directionality are included. Significant disease pairs without directionality can be found in Supplementary Fig. S1.
Recurrent diagnoses across cancer types
The most frequently shared diagnosis is excessive, frequent, and irregular menstruation from chapter XIV (Diseases of the Genitourinary System), which was observed across 11 cancer types (Fig. 2 and web catalog) with RR scores ranging from 4 to 8. In the same chapter, both male infertility (prostate and skin) and female infertility (breast, lung, and skin) were observed with extremely high RR scores, ranging from 12 to 32. Other common diagnoses from this chapter are hyperplasia of prostate (six cancer types), female genital prolapse (five cancer types), dysplasia of cervix uteri (five cancer types), and hydrocele and spermatocele (five cancer types).
Overlap of cancer-associated diagnoses across 17 cancer types. The occurrence of overlapping cancer-associated diagnoses across cancer types is shown, where each needle represents a significant cancer-associated diagnosis. Most of the overlapping diagnoses are in chapter 9 (Diseases of the Circulatory System) and chapter 13 (Diseases of Musculoskeletal System and Connective Tissue). Top overlapping diagnoses (n ≥ 5) are labeled with their corresponding diagnosis name. The colors represent the ICD-10 chapters (see Fig. 1 for color legend).
Interestingly, one of the most pervasive disease groups is chapter IX (Diseases of the Circulatory System) involving diagnoses such as angina pectoris (eight cancer types), chronic ischemic heart disease (five cancer types), and paroxysmal tachycardia (four cancer types). Lung cancer has up to 30 different disorders in this chapter with most accumulating at 1–2 years before the cancer diagnosis, followed by stomach cancer with 11 different disorders.
Diagnoses from chapter XI (Diseases of the Digestive System), such as hernia, cholelithiasis, and irritable bowel syndrome, were also prevalent among cancer types. A more cancer type–specific chapter was chapter IV (Endocrine, Nutritional and Metabolic Diseases), observed in diffuse non–Hodgkin lymphoma (NHL), lung, and stomach cancer, which includes the diagnoses hyperthyroidism, type 2 diabetes mellitus, and disorders of lipoprotein metabolism and other lipidemias.
Analyzing diseases by their temporal proximities to cancer
We searched specifically for diagnoses for which the pathophysiologic cause could be made clearer in light of their temporal proximity to cancer. Diagnoses in chapter I (Certain Infectious and Parasitic Diseases) reside both far and close in time to the cancer diagnosis in several cancer types. Examples of infectious “close-up” diagnoses (1–2 years before) are infectious diarrhea and gastroenteritis and viral infections of central nervous system. Infectious diagnoses more distant (3–8 years before) are spirochetal infections or sexually transmitted diseases. The years prior to the cancer diagnoses lack statistically significant diagnoses (see Supplementary Fig. S4 for nondirectional pairs). Other chapters such as chapter IV (Endocrine, Nutritional and Metabolic Diseases) have the diagnoses obesity, noninsulin-dependent diabetes, and disorders of lipoprotein metabolism in closer proximity (1–2 years) to the cancer types lung, stomach, and diffuse NHL.
Interestingly, chapter XIV (Diseases of the Genitourinary System), one of the chapters with highest prevalence among cancer types, had several diagnoses residing distant from the cancer, such as irregular menstruation (9 years), infertility (6–7 years), and endometriosis (6–7 years). On the other hand, diagnoses from the same chapter, but related to tumor symptoms such as benign mammary dysplasia or hyperplasia of prostate, were found to occur much closer to the time of cancer diagnosis.
Angina pectoris and chronic ischemic heart disease appear closer to the cancer diagnosis than varicose veins of lower extremities and hemorrhoids. From clinical evidence, these groups of disorders are generally treated very differently due to differing causes (16, 17), with the latter two being linked to obesity (18, 19). Additional obesity-linked diagnoses were identified for chapter XI (Diseases of the Digestive System), with hernia (six cancer types; ref. 20) and cholelithiasis (five cancer types; ref. 21) being the most frequently shared among cancer types.
Longitudinal disease trajectories
We generated longitudinal cancer trajectories of variable lengths with noncancer diagnoses occurring sequentially before the cancer diagnosis (Fig. 3). The trajectories were built by joining the directional diagnosis pairs and can therefore model the longitudinal order of multiple diagnoses. Each trajectory illustrates the flow of a patient group that traverses the entire path of diagnoses, where the width represents the relative size of the patient group. We obtained 162 significant trajectories of length three (three directionally joined diagnoses) for seven cancer types (lung, prostate, skin, stomach, breast, diffuse NHL, and other NHL) and six trajectories of length four for one cancer type (lung cancer). We plotted significant length three trajectories for six of the cancer types: breast, prostate, skin, stomach, diffuse NHL, and other NHL. For illustrative purposes, we plotted length four trajectories for lung cancer (length three lung cancer trajectories can be found in Supplementary Table S4).
Precancer disease trajectories. Significant disease trajectories for cancer types: lung, breast, prostate, skin, stomach, and NHL (top left to bottom right). The trajectory sequence (from left to right) represents the temporal order in which the diseases were diagnosed, and the width of the trajectories represents the relative number of patients that follow the entire trajectory. We included only trajectories with at least 100 patients. Patients can participate in multiple trajectories and the number represents the sum of all patient cases for all trajectories within a cancer type. The colors of the nodes in the trajectories represent the ICD-10 chapters (see Fig. 1 for color legend).
Figure 3 shows that the trajectories fall into different pathophysiologic classes, with four classes dominating overall. We found that several cardiovascular disease (CVD) diagnoses from chapter IX form one class of trajectories. This could, for instance, involve angina pectoris leading to chronic ischemic heart disease or tachycardia (lung, skin, prostate, and stomach). We furthermore observed acute myocardial infarction leading to chronic ischemic heart disease (lung, prostate, stomach, and diffuse and other NHL). A second class of trajectories is related to dysfunctions in the digestive system involving diagnoses such as gastritis and duodenitis, gastric ulcer, cholelithiasis, irritable bowel syndrome, and dyspepsia (lung, skin, and prostate). A third class, from chapter XIV, has irregularities in menstruation followed by endometriosis, or lump in breast followed by benign mammary dysplasia (skin and breast cancer). This class of trajectory relates to altered hormonal balances. Finally, the fourth class seems to involve diagnoses that are associated with symptoms of aging, even though age-related associations have been corrected for by our approach (Materials and Methods). For instance, this class could comprise internal derangement of knee followed by gonarthrosis or senile cataract followed by other disorders of lens.
To investigate chronic immunocompromised diseases that might have had subclinical manifestations years before the onset of the first cancer diagnosis, we generated diagnosis trajectories using an a priori hypothesis approach. We aggregated diagnoses into grouped immune-related diagnoses (diseases related to immune system defects and infections), grouped hormone cancers, grouped infectious cancers, and grouped smoking-related cancers (Supplementary Table S1). Even though few immune-defect trajectories were observed, we did identify trajectories that did not show up in the single-diagnosis analysis. For example, the grouped smoking-related cancers comprise two types of trajectories, one involving immune-defect diagnoses and the other CVDs. The patients following the CVD-related trajectory showed better survival rates than patients following the immune-defect trajectory (Supplementary Fig. S5).
Discussion
We conducted a systematic, retrospective population-wide analysis to map out all precancer disease histories, including disease pairs and trajectories. The method is based on traditional case–control analysis but compared with traditional epidemiologic studies, it can examine the global disease spectrum in a longitudinal manner beyond a limited set of hypotheses or diseases. The explorative method thus allows for the discovery of unexpected disease correlations. Temporality was used as a means to form hypotheses on cancer etiologies and suggest whether a diagnosis is more likely to be cancer-causing or indicative of reverse causation. The latter relies on the tenet that if a diagnosis resides temporally close to the cancer, it is more likely to reflect complications to an already established, but not-yet-diagnosed cancer.
From the Danish registry data spanning more than 20 years, we could uncover 3,526 significant directional precancer disease associations, of which 648 single diseases were directly associated with one (or more) of 17 cancer types. By joining the directional disease pairs, we furthermore obtained 168 disease trajectories (162 of length three and six of length four) covering seven cancer types. Overall, the data suggest that frequently observed diagnoses and trajectories across cancer types are related to obesity, CVDs, or genitourinary disorders. Evidence in the literature shows that inflammation has a key tumor-activating role in many types of disorders: dysbiosis (22), obesity (23), atherosclerosis (24, 25), or other inflammation-driven disorders (26, 27).
We identified a high prevalence of CVDs (1–5 years) prior to most cancer diagnoses. A recent study supports this finding by showing a higher prevalence of CVDs across different cancer types but for a selected cohort (patients with cancer often requiring cardiotoxic treatment) and without considering the longitudinal aspect (28). CVDs and cancer are known to share many risk factors (such as diet, obesity, and tobacco; ref. 29) and common pathways (30). We also observed obesity-related diagnoses, including hernia, diverticular disease, cholelithiasis, hemorrhoids, and varicose veins, in several cancer types that can be validated from previous studies (18, 20, 21, 31, 32). The actual obesity diagnosis is shared among two cancer types and type 2 diabetes mellitus is shared among three cancer types. These findings provide additional evidence in support of the contention that low-grade, systemic inflammation, linked to metabolic syndrome and obesity (33), could elevate the risk of certain cancer types. In the United States, approximately 20% of cancers are estimated to be related to weight gain and obesity (34).
We observe that genitourinary disorders associate with multiple types of cancers (not only hormone-regulated), manifest distant from the cancer diagnoses and have high RR scores. These observations suggest that an inflammatory state could elevate cancer risk. Endometriosis has been linked to abnormal high estrogen levels. Sparse evidence has associated it with higher risk of hormone-regulated cancer types (35, 36) without identifying an overall increase of cancer risk (37). In our study, we found that endometriosis is associated with breast, ovarian, skin, and lung cancer, with the latter two associations not previously reported. Recent findings show that endometriosis not only plays a role in hormone pathways but also immune-mediated pathways (38–41) and that anti-inflammatory treatment can inhibit endometriosis in mice (42, 43). We furthermore observed irregular menstruation and female/male infertility with extremely high RR scores. Irregular menstruation has previously been associated with an altered hormonal environment and polycystic ovarian syndrome (44), but limited evidence show that it can elevate cancer risk (45). Our findings rank irregular menstruation as the top shared diagnosis occurring in 11 cancer types.
A better understanding of the underlying pathophysiology linking diseases in various organs is important for building a biological basis for predicting disease onset. Patients identified as being at excessive risk could be candidates for more intense screening, modification of behavioral patterns (e.g., smoking cessation), and medical interventions that reduce the impact of the pathophysiologic trajectory they follow. One example is IL1β inhibition in relation to lung cancer, where more targeted patient selection would be beneficial because the intervention is expensive and carries a risk of fatal adverse consequences. Estimating the likelihood that a patient is on a given pathophysiologic trajectory can increase the number of patients benefitting from treatment. Classifying the risk phenotypes found here, such as CVDs, obesity, or genitourinary disorders, allows for the investigation of potential disease biomarkers that can be used to support early decision-making in the clinic. Moreover, exploring causes and mediators of inflammation can help identify novel cancer drugs through drug-repurposing strategies. Epidemiologic studies have linked the long-term use of nonsteroidal anti-inflammatory drugs, such as aspirin, to a lower incidence of certain cancer types due to their anti-inflammatory effect (46). A recent randomized study showed that inhibiting IL1β, a key mediator of inflammation, could reduce the incidence of lung cancer (24).
A limitation of the study is that the analyses are based on subjectively classified diagnosis codes and rely on the ICD-10 classification system. The strength of association thus depends on the accuracy of the given diagnoses, which can be influenced by the prevalence of the diagnoses and the accuracy of the ICD-10 classification system. However, the accuracy of the diagnoses has been estimated quantitatively and found to be quite high, as the registry covers a single-payer health care system with limited billing bias (47). Any misclassification arising from our approach should reduce the power to detect significant associations (i.e., produce false-negatives), thus biasing results toward the null. Also, the registry does not contain diagnoses from general practitioners, and correlations with these are therefore not included (although many of them are repeated in the hospital setting). We furthermore emphasize that the study relies on a discovery-based methodology to explore global disease patterns and may pinpoint overlooked disease associations, which need to be supported by follow-up studies to confirm causality.
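The attenuating effect of misclassification on a relative risk can be illustrated with a minimal numerical sketch. It is not part of the study's analysis pipeline: the function names, cohort sizes, risks, sensitivity, and specificity below are hypothetical values chosen for illustration, and the sketch assumes that errors in recording the prior diagnosis are unrelated to whether the patient later develops cancer (nondifferential misclassification).

```python
def relative_risk(cases_exp, n_exp, cases_unexp, n_unexp):
    """Relative risk: risk in the exposed group divided by risk in the unexposed group."""
    return (cases_exp / n_exp) / (cases_unexp / n_unexp)


def observed_rr(n_exposed, n_unexposed, risk_exposed, risk_unexposed,
                sensitivity, specificity):
    """Expected RR when the prior diagnosis (exposure) is recorded imperfectly.

    Assumption (hypothetical setup): recording errors do not depend on the
    later cancer outcome. Expected counts are used, so no randomness is involved.
    """
    # Truly exposed patients who are correctly recorded as exposed
    true_pos = sensitivity * n_exposed
    # Truly unexposed patients who are wrongly recorded as exposed
    false_pos = (1 - specificity) * n_unexposed
    # Truly exposed patients who are missed, and correctly recorded unexposed patients
    false_neg = n_exposed - true_pos
    true_neg = n_unexposed - false_pos

    # Each recorded group is a mixture of truly exposed and truly unexposed patients
    cases_rec_exp = true_pos * risk_exposed + false_pos * risk_unexposed
    cases_rec_unexp = false_neg * risk_exposed + true_neg * risk_unexposed
    return relative_risk(cases_rec_exp, true_pos + false_pos,
                         cases_rec_unexp, false_neg + true_neg)


# Hypothetical cohort: 1,000 patients with the diagnosis, 9,000 without,
# cancer risk 2% versus 1%, so the true RR is 2.
rr_perfect = observed_rr(1000, 9000, 0.02, 0.01, sensitivity=1.0, specificity=1.0)
rr_noisy = observed_rr(1000, 9000, 0.02, 0.01, sensitivity=0.8, specificity=0.95)
print(f"RR with perfect recording: {rr_perfect:.2f}")
print(f"RR with imperfect recording: {rr_noisy:.2f}")
```

Under these assumptions the RR computed from imperfectly recorded diagnoses lies between 1 and the true value, which is the sense in which misclassification pulls an association toward the null and makes it harder to detect.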
This is the first study to systematically map all precancer medical histories in a population-wide, structured cancer cohort. We suggest that the contribution of a broad span of diagnoses related to obesity, CVDs, or genitourinary disorders (endometriosis and irregular menstruation) could converge toward a common cancer risk factor, namely low-grade inflammation. Our results are made freely available and can be explored for future research or clinical purposes.
Disclosure of Potential Conflicts of Interest
S. Brunak has ownership interest (including stock, patents, etc.) in Intomics A/S, Hoba Therapeutics Aps, Novo Nordisk A/S, and Lundbeck A/S and is a consultant/advisory board member for Proscion A/S. No potential conflicts of interest were disclosed by the other authors.
Authors' Contributions
Conception and design: J.X. Hu, M. Helleberg, A.B. Jensen, S. Brunak, J. Lundgren
Development of methodology: J.X. Hu, A.B. Jensen, S. Brunak
Acquisition of data (provided animals, acquired and managed patients, provided facilities, etc.): J.X. Hu, M. Helleberg, A.B. Jensen, S. Brunak
Analysis and interpretation of data (e.g., statistical analysis, biostatistics, computational analysis): J.X. Hu, M. Helleberg, S. Brunak, J. Lundgren
Writing, review, and/or revision of the manuscript: J.X. Hu, M. Helleberg, A.B. Jensen, S. Brunak, J. Lundgren
Administrative, technical, or material support (i.e., reporting or organizing data, constructing databases): M. Helleberg, S. Brunak
Study supervision: M. Helleberg, S. Brunak, J. Lundgren
Acknowledgments
We would like to acknowledge the Novo Nordisk Foundation (grant agreement NNF14CC0001), as well as the Innovation Fund Denmark (grant agreement 5153-00002B) and Danish National Research Foundation (grant no. 126) for funding the research.
The costs of publication of this article were defrayed in part by the payment of page charges. This article must therefore be hereby marked advertisement in accordance with 18 U.S.C. Section 1734 solely to indicate this fact.