Abstract
Cancer antigen 125 (CA125) is the most promising ovarian cancer screening biomarker to date. Multiple studies reported CA125 levels vary by personal characteristics, which could inform personalized CA125 thresholds. However, this has not been well described in premenopausal women.
We evaluated predictors of CA125 levels among 815 premenopausal women from the New England Case Control Study (NEC). We developed linear and dichotomous (≥35 U/mL) CA125 prediction models and externally validated an abridged model restricting to available predictors among 473 premenopausal women in the European Prospective Investigation into Cancer and Nutrition Study (EPIC).
The final linear CA125 prediction model included age, race, tubal ligation, endometriosis, menstrual phase at blood draw, and fibroids, which explained 7% of the total variance of CA125. Observed and predicted CA125 levels based on the abridged model (including age, race, and menstrual phase at blood draw) were similarly correlated in NEC (r = 0.22) and in EPIC (r = 0.22). The dichotomous CA125 prediction model included age, tubal ligation, endometriosis, prior personal cancer diagnosis, family history of ovarian cancer, number of miscarriages, menstrual phase at blood draw, and smoking status, with an AUC of 0.83. The abridged dichotomous model (including age, number of miscarriages, menstrual phase at blood draw, and smoking status) showed similar AUCs in NEC (0.73) and in EPIC (0.78).
We identified a combination of factors associated with CA125 levels in premenopausal women.
Our model could be valuable in identifying healthy women likely to have elevated CA125 and consequently improve its specificity for ovarian cancer screening.
Introduction
Ovarian cancer was the eighth leading cause of cancer death in 2012, with 151,900 deaths worldwide, due in part to a lack of specific symptoms, which leads to diagnosis at a late stage when prognosis is poor (1, 2). More than 80% of ovarian cancer patients have elevated cancer antigen 125 (CA125), a membrane-bound glycosylated mucin (MUC16), which is used clinically as a prognostic biomarker and to monitor response to therapy (3, 4). However, results from two large randomized screening trials in primarily postmenopausal women using transvaginal ultrasound and CA125 [either using 35 U/mL as a cutoff, or the risk of ovarian cancer algorithm (ROCA)] showed no clinically significant benefit (5, 6). MUC16 is expressed on a variety of tissues, including the lung, pancreas, stomach, liver, endometrium, and breast, and levels vary between individuals based on demographic, reproductive, and lifestyle characteristics (7–11). Therefore, personal characteristics that are associated with CA125 levels could be used to create personalized thresholds for CA125 instead of a single 35 U/mL cutoff, thereby improving the interpretation of measured CA125 and its performance as a screening biomarker and ultimately leading to decreased ovarian cancer mortality.
However, prior studies examining factors associated with CA125 have focused on postmenopausal women (7, 8). Thus, we evaluated factors associated with CA125 in premenopausal women and developed and validated CA125 prediction models (linear and dichotomous) among premenopausal women without ovarian cancer from the New England Case Control Study (NEC) and The European Prospective Investigation into Cancer and Nutrition Study (EPIC).
Materials and Methods
Study population
The NEC is a population-based case–control study of ovarian cancer with 2,100 population-based controls enrolled from New Hampshire and eastern Massachusetts over the three study phases (1992–1997, 1998–2002, 2003–2008). Details on the study design have been described previously (12–14). Briefly, controls were identified using random-digit dialing, town book selection, and drivers’ license lists and frequency matched on age and state of residence. Approximately half (54%) of the eligible controls who were contacted agreed to participate. We restricted the study population to controls (n = 2,100) and excluded women without CA125 measurements (n = 96), women postmenopausal at time of blood draw (n = 1,130), women who had a hysterectomy, because their menopausal status was unknown (n = 30), women who were pregnant or breastfeeding at time of blood draw (n = 25), and women with extreme CA125 values ranging from 115.3 to 411.7 U/mL (n = 4) identified based on the generalized extreme studentized deviate many-outlier detection approach applied to log-transformed values (15). In sum, our analysis included 815 premenopausal NEC controls.
The EPIC is a multicenter prospective cohort including participants from 10 Western European countries developed to evaluate the association between nutrition and cancer. Briefly, 519,978 participants (366,521 women) were enrolled between 1991 and 1998 across 23 research centers. Details on the study design have been described previously (16). A nested case–control study of ovarian cancer was designed within the cohort (17). For each ovarian cancer case, up to four controls were randomly selected using incidence density sampling for a total of 1,939 controls (17). We excluded women without CA125 measurements (n = 12), women who were postmenopausal (n = 1,416), and women who had a hysterectomy or unknown menopausal status (n = 38). There were no outlying values in these EPIC controls. In sum, our analysis included a total of 473 premenopausal EPIC controls.
CA125 measurements
In NEC controls, we measured CA125 using the CA125II radioimmunoassay (Centocor) at the CER Lab at Boston Children's Hospital. We assessed the reproducibility of the assay by including five blinded aliquots of a uniform quality control pool in each of the 46 assay batches. The average of the coefficients of variation (CV) was 1%. In EPIC controls and in a subset of NEC controls, we previously measured CA125 using the volume-effective highly sensitive multiplex platform (Meso Scale Discovery, MSD) in the Genital Tract Biology Laboratory at Brigham and Women's Hospital (17). The average CV across the assay batches was 19%.
Candidate predictors
We selected factors that have been previously reported to be associated with CA125 in at least one prior study (7–9, 11), ovarian cancer risk factors (18), and several factors which were biologically plausible to be associated with CA125 (10). Those included age at blood draw, race, body mass index (BMI, kg/m2), smoking status (never, former, current), pack-years calculated by number of packs of cigarettes per day multiplied by the number of years a person had smoked, age at menarche, oral contraceptive use and its duration (months), parity, self-reported endometriosis, tubal ligation, family history of ovarian cancer, prior personal cancer diagnosis, caffeine intake (mg), genital powder use, infertility, number of miscarriages, ectopic pregnancy, ever use of intrauterine device (IUD), fibroids, menstrual cycle regularity and days between last menstrual period (LMP), and blood draw (7–11). Furthermore, we evaluated additional variables related to menstrual characteristics and pregnancy timing: cycle length, days with menstrual bleeding, dysmenorrhea, age at first live birth, age at last live birth, and years since last live birth.
Statistical analyses
We log-transformed CA125 values to approximate a normal distribution. After transformation, the distribution of CA125 was approximately normal, with skewness of 0.35, kurtosis of 0.34, and a bell-shaped histogram.
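The normality check described above can be illustrated with a minimal sketch. It assumes moment-based skewness and excess kurtosis (the text does not state which kurtosis convention was used), and the CA125 values are hypothetical, not study data.

```python
import math

def skewness_and_excess_kurtosis(values):
    """Return (skewness, excess kurtosis) from population central moments."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    m4 = sum((v - mean) ** 4 for v in values) / n
    skew = m3 / m2 ** 1.5
    excess_kurtosis = m4 / m2 ** 2 - 3.0  # 0 for a normal distribution
    return skew, excess_kurtosis

# Hypothetical right-skewed CA125 values (U/mL), for illustration only.
ca125 = [8.0, 10.0, 12.0, 14.0, 15.0, 17.0, 20.0, 26.0, 38.0, 70.0]
log_ca125 = [math.log(v) for v in ca125]

raw_skew, _ = skewness_and_excess_kurtosis(ca125)
log_skew, _ = skewness_and_excess_kurtosis(log_ca125)
print(f"skewness raw: {raw_skew:.2f}, log-transformed: {log_skew:.2f}")
```

In this toy sample the log transformation pulls the long right tail in, so the skewness moves closer to zero.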
Recalibration of CA125.
Since the EPIC samples had CA125 measured using an alternate assay (MSD assay) with a different scale, we used recalibration to rescale these measurement results to be comparable to the CA125II assay values. We recalibrated the EPIC CA125 values based on 187 NEC premenopausal controls with CA125 measurements on both CA125II and MSD assays using the drift correction method (19). We regressed the log-transformed MSD assay values to the log-transformed CA125II assay values and used the intercept and effect estimates of the model to calculate the recalibrated CA125II assay values based on the measured MSD assay values for all premenopausal controls in EPIC and used the recalibrated values in our analyses.
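The regression step of the recalibration can be sketched as follows. This is only an illustration: it assumes the CA125II values are the outcome and the MSD values the predictor (both log-transformed), it does not reproduce the cited drift correction method, and the paired values are synthetic.

```python
import math

def fit_simple_ols(x, y):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope

def recalibrate_msd(msd_values, intercept, slope):
    """Map MSD-scale values to the CA125II scale via the log-log fit."""
    return [math.exp(intercept + slope * math.log(v)) for v in msd_values]

# Synthetic paired measurements (hypothetical), built so that
# log(CA125II) = 0.5 + 0.9 * log(MSD) holds exactly.
paired_msd = [5.0, 10.0, 20.0, 40.0, 80.0]
paired_ca125ii = [math.exp(0.5 + 0.9 * math.log(m)) for m in paired_msd]

intercept, slope = fit_simple_ols(
    [math.log(m) for m in paired_msd],
    [math.log(c) for c in paired_ca125ii],
)

# Apply the fitted relationship to MSD values from a second cohort.
recalibrated = recalibrate_msd([12.0, 30.0], intercept, slope)
```

With real paired samples the fit would not be exact, and the residual scatter would determine how well the recalibrated values track the reference assay.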
Predictors of CA125 in premenopausal women.
First, we calculated the geometric means and 95% confidence intervals (CI) of CA125 levels across candidate predictors and evaluated the association between individual candidate predictors and CA125 using linear or logistic regression adjusted for continuous age. We used effect estimates of the linear regression for each predictor to calculate the percent change in CA125 levels, calculated as [exp(β) − 1] × 100 for a 1-unit change in the predictor. We determined the optimal modeling of continuous variables (age, BMI, age at menarche, duration of OC use, parity, and smoking pack-years) using restricted cubic splines to test for linearity (20). We used categorical variables for age, a dichotomous variable for age at menarche, and a piecewise linear spline with a single knot for BMI because these variables were nonlinearly associated with log-transformed CA125. We created composite categorical variables for OC use and duration and smoking status and pack-years, and compared nested models using the likelihood ratio test and non-nested models using the Akaike information criterion and Vuong test (21).
Based on these evaluations, candidate predictors were modeled as follows: age at blood draw (categorical, by 10 year intervals from age 30), race (white, non-white), BMI (piecewise linear spline model with single knot at 27), height (continuous, centered at 165), smoking status (categorical, never, former, current) and pack-years (continuous, never smokers, pack-years among former smokers, pack-years among current smokers), age at menarche (age 12 and under, above 12), duration of OC use (continuous, including never users), parity (continuous), endometriosis (no, yes), tubal ligation (no, yes), family history of ovarian cancer (no, yes), prior personal cancer diagnosis (no, yes), caffeine intake (quartiles), genital powder use (no use, body use, genital use), infertility (no, yes), number of miscarriages (0, 1, 2, 3 or more), ectopic pregnancy (no, yes), IUD use (never, ever), fibroids (no, yes), menstrual cycle regularity (regular, irregular) and predicted phase of the menstrual cycle (early follicular, late follicular, peri-ovulatory, luteal, long cycle, irregular, missing) based on the number of days between the last menstrual period and blood draw.
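The percent-change formula above, [exp(β) − 1] × 100, can be written as a short helper. The coefficient values below are hypothetical and chosen only to show how a coefficient on the log scale maps to a percent difference.

```python
import math

def percent_change(beta):
    """Percent change in CA125 for a 1-unit change in a predictor,
    given a coefficient from a model of log-transformed CA125."""
    return (math.exp(beta) - 1.0) * 100.0

def beta_from_percent(percent):
    """Inverse mapping: log-scale coefficient implied by a percent change."""
    return math.log(1.0 + percent / 100.0)

# Hypothetical coefficients, for illustration only.
for beta in (-0.22, 0.0, 0.19):
    print(f"beta = {beta:+.2f} -> {percent_change(beta):+.1f}%")
```

Because the outcome is on the log scale, equal positive and negative coefficients do not give symmetric percent changes.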
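As an illustration of the piecewise linear spline with a single knot used for BMI, the sketch below builds the two regression terms for a knot at 27. The slope values are hypothetical and are not estimates from this study.

```python
def bmi_spline_terms(bmi, knot=27.0):
    """Return the two terms of a linear spline with one knot:
    the BMI itself and the amount by which BMI exceeds the knot."""
    return bmi, max(bmi - knot, 0.0)

def bmi_contribution(bmi, slope_below, slope_change, knot=27.0):
    """Contribution of BMI to the linear predictor.
    The slope is slope_below under the knot and
    slope_below + slope_change above it."""
    linear, above_knot = bmi_spline_terms(bmi, knot)
    return slope_below * linear + slope_change * above_knot

# Hypothetical slopes on the log-CA125 scale, for illustration only.
SLOPE_BELOW = -0.01
SLOPE_CHANGE = 0.03

for bmi in (22.0, 27.0, 32.0):
    print(bmi, bmi_spline_terms(bmi), bmi_contribution(bmi, SLOPE_BELOW, SLOPE_CHANGE))
```

The fitted line stays continuous at the knot, and only its slope changes.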
Prediction modeling.
Overall, we developed CA125 prediction models (linear and dichotomous) in NEC and conducted external validation in EPIC (Fig. 1). We used cross-validation to conduct internal validation of the model developed in NEC. Because information on some of the predictors was not collected in EPIC, we developed an abridged model restricted to variables available in EPIC from the final model, and then validated the abridged model in EPIC.
Study design of the development and validation of the CA125 prediction model using the NEC and EPIC. We developed the CA125 prediction models (linear and dichotomous) using the NEC and conducted external validation using the EPIC.
Linear model.
First, we developed a linear CA125 prediction model of log-transformed CA125 in NEC. We used stepwise linear regression analysis with P < 0.15 as the significance level for entry and retention in the model. In our primary prediction modeling, we used missing indicators for menstrual phase at blood draw because a large proportion of values were missing. For variables with a limited number of missing values [age at menarche (n = 1), caffeine intake (n = 23), menstrual cycle length (n = 23)], women with missing values were excluded. Age was forced into the model and the r-squared was calculated for the final prediction model, adjusted for study phase (1992–1997, 1998–2002, 2003–2008) and center (Massachusetts, New Hampshire). In addition, we calculated a delta r-squared that excluded the variability explained by study phase and center, as these were matching factors in NEC and were forced into the model (22). The predicted log-transformed CA125 values in NEC were calculated using the effect estimates from the final prediction model. We evaluated the performance of the model by calculating the Pearson correlation coefficient (r) to assess how well the predicted and the observed CA125 values agreed (i.e., calibration). We used five-fold cross-validation to assess overfitting in NEC and calculated the average r-squared across all sampled datasets (23). To evaluate potential bias due to missing data of candidate predictors, we conducted a sensitivity analysis restricted to women who had no missing predictors. We also conducted multiple imputation by chained equations (MICE) to impute the missing variables conditional on all of the predictors and outcome (24). We allowed 100 iterations and generated 20 imputed datasets. We applied the final prediction model in the 20 imputed datasets using the methods described and pooled the results of the model estimates using Rubin's rules (25).
For external validation, we sought to validate our linear CA125 prediction model in EPIC. However, some of our key predictors (endometriosis and fibroids) were not collected in EPIC or were missing in the majority of women (tubal ligation); thus, we validated an abridged model restricted to variables available in EPIC. First, among the predictors selected in the final model developed in NEC, we identified predictors available both in NEC and EPIC. We next ran the abridged model in NEC restricting to those variables available in both NEC and EPIC. We used the effect estimates from this model to calculate the predicted value of log-transformed CA125 in the EPIC samples. We calculated the Pearson correlation coefficient between the predicted and the observed log-transformed CA125 to assess agreement and compare to that in the discovery dataset. We plotted the predicted versus the observed log-transformed CA125 for visual assessment.
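The pooling step across imputed datasets can be sketched with Rubin's rules for a single coefficient. This is a minimal illustration of the standard formulas (pooled estimate, within- and between-imputation variance), not the authors' code, and the estimates below are hypothetical.

```python
import math
from statistics import mean, variance

def pool_rubin(estimates, standard_errors):
    """Pool one coefficient across m imputed datasets with Rubin's rules.

    Returns (pooled estimate, pooled standard error)."""
    m = len(estimates)
    pooled = mean(estimates)
    within = mean([se ** 2 for se in standard_errors])  # average sampling variance
    between = variance(estimates)                       # sample variance across imputations
    total = within + (1.0 + 1.0 / m) * between
    return pooled, math.sqrt(total)

# Hypothetical coefficient estimates from three imputed datasets.
betas = [0.10, 0.12, 0.14]
ses = [0.05, 0.05, 0.05]

pooled_beta, pooled_se = pool_rubin(betas, ses)
print(f"pooled beta = {pooled_beta:.3f}, pooled SE = {pooled_se:.3f}")
```

The pooled standard error exceeds the per-dataset standard error because it also carries the uncertainty introduced by imputation.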
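The external validation steps can be sketched as: apply coefficients estimated in the development data to each woman in the validation data, then correlate predicted with observed log-transformed CA125. The coefficient names and values and the validation records below are hypothetical and are not the published estimates.

```python
import math

# Hypothetical coefficients on the log-CA125 scale (NOT the published model).
COEFFICIENTS = {
    "intercept": 2.75,
    "age_under_30": -0.20,
    "non_white": -0.10,
    "luteal_phase": -0.15,
}

def predict_log_ca125(woman):
    """Add the intercept and the coefficient of every indicator that is set."""
    total = COEFFICIENTS["intercept"]
    for name, beta in COEFFICIENTS.items():
        if name != "intercept" and woman.get(name, 0):
            total += beta
    return total

def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical validation records: (predictor indicators, observed CA125 in U/mL).
validation = [
    ({"age_under_30": 1}, 11.0),
    ({}, 18.0),
    ({"luteal_phase": 1}, 13.0),
    ({"non_white": 1, "luteal_phase": 1}, 15.0),
    ({}, 14.0),
]

predicted = [predict_log_ca125(w) for w, _ in validation]
observed = [math.log(ca125) for _, ca125 in validation]
r = pearson_r(predicted, observed)
```

No model is refit in the validation data here; only the development-set coefficients are applied.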
Dichotomous model.
Next, we developed and validated a dichotomous prediction model of elevated CA125 defined as having CA125 ≥35 U/mL following the same method used for developing the linear CA125 prediction model described above but using logistic stepwise regression analysis. We evaluated the performance of the model by calculating the area under the curve (AUC) in the NEC (discovery) and EPIC (validation).
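To make the AUC concrete, the sketch below dichotomizes CA125 at 35 U/mL and computes the AUC as the proportion of (elevated, non-elevated) pairs in which the elevated woman has the higher predicted probability, counting ties as one half. The CA125 values and predicted probabilities are hypothetical.

```python
def auc_from_scores(scores, labels):
    """AUC via pairwise comparison (equivalent to the Mann-Whitney statistic)."""
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative")
    wins = 0.0
    for p in positives:
        for q in negatives:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5
    return wins / (len(positives) * len(negatives))

# Hypothetical observed CA125 (U/mL) and model-predicted probabilities.
ca125 = [12.0, 40.0, 18.0, 36.0, 9.0, 22.0]
predicted_probability = [0.02, 0.30, 0.05, 0.04, 0.01, 0.10]

# Elevated CA125 defined as >= 35 U/mL, as in the text.
elevated = [1 if value >= 35.0 else 0 for value in ca125]
auc = auc_from_scores(predicted_probability, elevated)
```

An AUC of 0.5 corresponds to chance-level discrimination and 1.0 to perfect separation of elevated from non-elevated women.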
All statistical analyses were performed using SAS version 9.4 (SAS Institute Inc.), STATA version 12.1 (StataCorp.), and R version 3.4.3.
Results
Overall, the mean CA125 values were 17.3 U/mL in 815 NEC controls and 14.9 U/mL in 473 EPIC controls after recalibration. The baseline characteristics of NEC and EPIC premenopausal women were similar, except that age at blood draw, race, age at menarche, OC use, current hormone use, infertility, parity, and tubal ligation differed significantly (Supplementary Table S1).
Recalibration of CA125
We recalibrated the CA125 values in EPIC using the recalibration model based on 187 NEC premenopausal controls with both CA125II and MSD assay measurements. These two measurements were highly correlated (r = 0.96; 95% CI, 0.94–0.97). After recalibration, the measured and recalibrated CA125 values were also highly correlated (r = 0.95; 95% CI, 0.93–0.96). Overall, the recalibration model performed well.
Predictors of CA125 in premenopausal women
Age at blood draw was nonlinearly associated with CA125, with women younger than 30 or more than 50 years old having significantly lower CA125 than those ages 30 to 39 years (Table 1). In age-adjusted models, menstrual phase at blood draw was significantly associated with CA125 levels, with early follicular phase levels being 8% to 21% higher than in other menstrual phases. Endometriosis and fibroids were associated with significantly higher CA125 levels, with 21% and 13% difference, respectively, compared to those who did not have the condition. Current hormonal contraception use and tubal ligation were significantly associated with lower CA125 levels, with −16% and −11% difference, respectively. Cycle length, days with menstrual bleeding, dysmenorrhea, age at first live birth, age at last live birth, and years since last live birth were not significantly associated with CA125 levels in premenopausal women. Similar predictors were significantly associated with CA125 levels in the dichotomous model (Supplementary Table S2).
Association between predictors and CA125 levels in premenopausal women without ovarian cancer in the NEC (n = 815)
Variables | N (%) | Mean CA125 (95% CI)a | Differences in CA125 levelsb | P valueb |
---|---|---|---|---|
Age, years | ||||
<30 | 69 (8) | 13 (11–14) | −20% | 0.001 |
30–39 | 215 (26) | 16 (15–17) | ref | |
40–49 | 374 (46) | 15 (15–16) | −5% | 0.20 |
50+ | 157 (19) | 15 (13–16) | −10% | 0.05 |
Ptrend | 0.65 | |||
Race | ||||
White | 778 (95) | 15 (15–16) | ref | |
Non-white | 37 (5) | 14 (12–16) | −10% | 0.20 |
BMI, kg/m2 | ||||
<20 | 76 (9) | 15 (14–17) | 1% | 0.90 |
20–<25 | 414 (51) | 15 (14–16) | ref | |
25–<30 | 206 (25) | 15 (14–16) | −1% | 0.77 |
30–<35 | 84 (10) | 16 (14–17) | 3% | 0.6 |
35+ | 35 (4) | 17 (14–20) | 13% | 0.16 |
Ptrend | 0.30 | |||
Height, cm | ||||
<160 | 182 (22) | 16 (14–17) | ref | |
160–<165 | 218 (27) | 14 (13–15) | −8% | 0.08 |
165–<170 | 231 (28) | 16 (15–17) | 1% | 0.85 |
170–<175 | 128 (16) | 15 (14–17) | −1% | 0.84 |
175+ | 56 (7) | 14 (12–16) | −8% | 0.25 |
Ptrend | 0.88 | |||
Smoking | ||||
Never | 421 (52) | 15 (14–16) | ref | |
Former | 269 (33) | 16 (15–17) | 5% | 0.19 |
Current | 125 (15) | 14 (13–15) | −7% | 0.16 |
Smoking status and duration | ||||
Never smokers | 421 (52) | 15 (14–16) | ref | |
<5 pack-years among former | 140 (17) | 16 (15–17) | 5% | 0.30 |
5–<15 pack-years among former | 79 (10) | 16 (15–18) | 8% | 0.20 |
15+ pack-years among former | 50 (6) | 15 (13–17) | 1% | 0.92 |
<5 pack-years among current | 23 (3) | 15 (12–19) | 1% | 0.93 |
5–<15 pack-years among current | 46 (6) | 14 (12–16) | −6% | 0.47 |
15+ pack-years among current | 56 (7) | 13 (12–15) | −11% | 0.11 |
Age at menarche, years | ||||
≤12 | 386 (47) | 16 (15–16) | ref | |
13+ | 428 (53) | 15 (14–15) | −6% | 0.09 |
Oral contraceptive use | ||||
Never | 180 (22) | 15 (14–16) | ref | |
Ever | 635 (78) | 15 (15–16) | 5% | 0.25 |
Duration of oral contraceptive use among ever users, years | ||||
<2 | 140 (22) | 17 (16–19) | ref | |
2–3 | 133 (21) | 15 (14–17) | −10% | 0.07 |
4–5 | 103 (16) | 15 (14–17) | −11% | 0.06 |
6–9 | 125 (20) | 15 (13–16) | −14% | 0.01 |
10+ | 134 (21) | 14 (13–16) | −16% | 0.005 |
Ptrend | 0.01 | |||
Current hormonal contraception usec | ||||
No | 427 (82) | 15 (15–16) | ref | |
Yes | 94 (18) | 13 (12–14) | −16% | 0.003 |
Menstrual phase at time of blood draw (days) | ||||
Early follicular (0–7) | 126 (15) | 17 (15–18) | ref | |
Late follicular (8–11) | 56 (7) | 16 (14–18) | −8% | 0.31 |
Peri-ovulatory (12–16) | 73 (9) | 15 (13–16) | −13% | 0.06 |
Luteal (17–35) | 155 (19) | 14 (13–16) | −14% | 0.01 |
Long cycle (36+) | 63 (8) | 13 (12–15) | −21% | 0.002 |
Irregular | 65 (8) | 13 (12–15) | −21% | 0.002 |
Missing | 277 (34) | 16 (15–17) | −8% | 0.15 |
Cause of infertility | ||||
None | 649 (80) | 15 (15–16) | ref | |
Male factor | 20 (2) | 17 (14–21) | 12% | 0.32 |
Tubal | 13 (2) | 15 (11–20) | −1% | 0.92 |
Endometriosis | 13 (2) | 23 (17–30) | 50% | 0.004 |
Ovulatory | 14 (2) | 14 (11–18) | −7% | 0.59 |
Unknown | 106 (13) | 14 (13–15) | −9% | 0.08 |
Number of miscarriages | ||||
0 | 625 (77) | 15 (15–16) | ref | |
1 | 137 (17) | 15 (13–16) | −5% | 0.29 |
2+ | 53 (7) | 14 (13–17) | −6% | 0.41 |
Ectopic pregnancy | ||||
No | 795 (98) | 15 (15–16) | ref | |
Yes | 20 (2) | 16 (13–20) | 5% | 0.67 |
Parity | ||||
0 | 214 (26) | 15 (14–16) | ref | |
1 | 140 (17) | 15 (14–16) | 1% | 0.82 |
2 | 264 (32) | 15 (14–16) | 3% | 0.53 |
3+ | 197 (24) | 15 (14–16) | 2% | 0.66 |
Tubal ligation | ||||
No | 674 (83) | 15 (15–16) | ref | |
Yes | 141 (17) | 14 (13–15) | −11% | 0.01 |
IUD use | ||||
Never | 689 (85) | 15 (14–16) | ref | |
Ever | 126 (15) | 16 (15–17) | 6% | 0.26 |
Unilateral oophorectomy | ||||
No | 808 (99) | 15 (15–16) | ref | |
Yes | 7 (1) | 19 (13–27) | 23% | 0.28 |
Endometriosis | ||||
No | 764 (94) | 15 (14–15) | ref | |
Yes | 51 (6) | 18 (16–21) | 21% | 0.01 |
Fibroids | ||||
No | 737 (90) | 15 (14–16) | ref | |
Yes | 78 (10) | 17 (15–19) | 13% | 0.04 |
Prior personal cancer diagnosis | ||||
No | 778 (95) | 15 (15–16) | ref | |
Yes | 37 (4) | 17 (15–20) | 15% | 0.10 |
Family history of ovarian cancer | ||||
No | 795 (98) | 15 (15–16) | ref | |
Yes | 20 (2) | 17 (13–21) | 10% | 0.40 |
Genital powder use | ||||
No use | 484 (59) | 15 (15–16) | ref | |
Body use | 157 (19) | 15 (14–16) | −5% | 0.30 |
Genital use | 174 (21) | 15 (14–16) | −3% | 0.57 |
Caffeine intake, mg | ||||
<70.1 | 198 (25) | 16 (14–17) | ref | |
70.1–<169.6 | 198 (25) | 14 (13–15) | −9% | 0.07 |
169.6–<348.7 | 198 (25) | 15 (14–17) | 0% | 0.97 |
348.7+ | 198 (25) | 16 (15–17) | 1% | 0.90 |
Ptrend | 0.44 |
aGeometric mean adjusted for age.
bAge-adjusted.
cIncludes oral contraceptives and injections.
Linear CA125 prediction modeling
The final linear CA125 prediction model included age at blood draw, race, tubal ligation, endometriosis, menstrual phase at blood draw, and fibroids, with an r-squared of 0.07 (95% CI, 0.02–0.09; Table 2). The associations between individual predictors and CA125 were similar in univariate and multivariate adjusted models. The r-squared of this full linear model when conducting five-fold cross-validation was 0.02. When we restricted the analysis to the 498 controls with complete information on all predictors and applied the final linear CA125 prediction model, the r-squared was 0.12 (95% CI, 0.05–0.15; Supplementary Table S3). When restricting to women with complete information on all predictors, the same variables were retained in the final model. For all the models, the delta r-squared, which subtracts the variance attributable to study phase and center from the total variance, was similar to the r-squared reported above. When evaluating the final continuous model in multiple imputed datasets in NEC, the β coefficients, standard errors, and the r-squared were similar to the original model (Supplementary Table S3). The small differences in the measures of association when running the final model in the dataset using missing indicators, the dataset restricted to those with complete information on all potential predictors, and the multiple imputed datasets suggest that the missingness of menstrual phase at blood draw does not largely influence the results. We also observed similar performance of the model when including all significant predictors in the univariate analyses, suggesting that the final model included important key predictors. Predicted log-transformed CA125 calculated based on the final model and the observed log-transformed CA125 were weakly correlated, with a Pearson correlation coefficient of 0.26 (95% CI, 0.19–0.33; Fig. 2A).
Development and validation of linear CA125 prediction model in the NEC and EPIC. The predicted versus the observed log-transformed CA125 values were plotted, and Pearson correlation coefficient (r) was calculated to assess the performance of the linear CA125 prediction model in the NEC and EPIC. A, Linear CA125 prediction model performance in NEC. B, Abridged linear CA125 prediction model performance in NEC. C, Abridged linear CA125 prediction model performance in EPIC.
For external validation, we developed an abridged linear CA125 prediction model, which included age at blood draw, race, and menstrual phase at blood draw, with an r-squared of 0.05 (95% CI, 0.01–0.07) in NEC (Table 2). Using the beta coefficients from this abridged model, we calculated the predicted log-transformed CA125 values in EPIC. The predicted log-transformed CA125 values had a similar correlation with the observed log-transformed CA125 values in EPIC (r = 0.22; 95% CI, 0.13–0.31) as in the NEC abridged linear model (r = 0.22; 95% CI, 0.15–0.29; Fig. 2B and C). The spread of the predicted CA125 values in Fig. 2 is much smaller than the spread of the observed CA125 values because the linear prediction model explains only a small proportion of the total variance of the observed CA125 values.
Dichotomous CA125 prediction modeling
The final dichotomous prediction model to predict women with CA125 ≥35 U/mL included age at blood draw, tubal ligation, endometriosis, prior personal cancer diagnosis, family history of ovarian cancer, number of miscarriages, menstrual phase at blood draw, and smoking status and duration, with an AUC of 0.83 (95% CI, 0.77–0.89; Table 3; Fig. 3). For menstrual phase at blood draw, we collapsed the other phase and irregular menstruation categories because few individuals had CA125 ≥35 U/mL in these groups. The associations between individual predictors and CA125 were similar in univariate and multivariate adjusted models. The AUC of this full dichotomous model when conducting five-fold cross-validation was 0.67. When we restricted the analysis to the 498 controls with complete information on all predictors and applied the final dichotomous model, the AUC was 0.84 (95% CI, 0.76–0.93; Supplementary Table S4). When we conducted the variable selection process using stepwise regression among women with complete information on all predictors, similar predictors were retained except number of miscarriages and smoking status, resulting in an AUC of 0.79 (95% CI, 0.69–0.89). When evaluating the model in the multiple imputed datasets in NEC, the odds ratios and the AUC were largely similar to the primary analysis (Supplementary Table S4). We also observed similar performance of the model when including all significant predictors from the univariate analyses, suggesting that the final model included important key predictors. We also considered using a 65 U/mL cutoff, which has been proposed for premenopausal women (26), but only five controls had CA125 greater than 65 U/mL, so we were not able to investigate further.
Dichotomous CA125 prediction model of high CA125 (≥35 U/mL) in premenopausal women using stepwise regression in the NEC

Development and validation of dichotomous CA125 prediction model in the NEC and EPIC. Receiver operating characteristic (ROC) curves were plotted and the AUC was calculated to assess the discriminatory performance of the dichotomous CA125 prediction model in the NEC and EPIC. Dichotomous CA125 prediction model performance in NEC (solid line), abridged dichotomous CA125 prediction model performance in NEC (dashed line), abridged dichotomous CA125 prediction model performance in EPIC (dotted line).
For external validation, we developed an abridged model, which included age at blood draw, number of miscarriages, menstrual phase (collapsing those on hormones, blood draw at other phase, and irregular menstruation into one category because of limited power), and smoking status, with an AUC of 0.73 (95% CI, 0.65–0.81) in NEC (Table 3; Fig. 3). When we applied this model to EPIC using a recalibrated CA125 value of 35 U/mL as the cutoff, the AUC was 0.78 (95% CI, 0.67–0.89; Fig. 3).
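The AUCs reported above can be read as the probability that a randomly chosen woman with CA125 ≥35 U/mL receives a higher predicted score than a randomly chosen woman below the cutoff. The following is a minimal sketch of that pairwise definition, not the analysis code used in the study; the function name `auc` and the toy scores and labels are hypothetical and chosen only for illustration.

```python
def auc(scores, labels):
    """Return the AUC as the fraction of (elevated, non-elevated) pairs in
    which the elevated woman has the higher predicted score; ties count 0.5."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one woman in each outcome group")
    concordant = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                concordant += 1.0
            elif p == n:
                concordant += 0.5
    return concordant / (len(pos) * len(neg))


# Hypothetical predicted probabilities and outcomes (1 = CA125 >= 35 U/mL).
toy_scores = [0.10, 0.40, 0.35, 0.80]
toy_labels = [0, 0, 1, 1]
toy_auc = auc(toy_scores, toy_labels)  # 3 of the 4 pairs are concordant
```

Because the statistic depends only on the ranking of predicted scores, the same function applies unchanged whether the scores come from the development data, a held-out cross-validation fold, or an external cohort.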
Discussion
This is the largest population-based study to develop and validate CA125 prediction models among healthy premenopausal women, considering both continuous levels and levels above the current clinical threshold of 35 U/mL. Although the model predicting continuous CA125 explained only a small percentage of the total variability, it showed comparable correlations between predicted and observed levels in EPIC, supporting the validity of the model. In contrast, the AUC for predicting elevated CA125 (≥35 U/mL) was relatively high in NEC and was validated in EPIC.
Age was nonlinearly associated with CA125 in our study, which is consistent with our prior study in EPIC in which we observed an inverse U-shaped association between age and CA125 levels among premenopausal women (9). Similarly, non-white race was associated with significantly lower CA125, which was consistent with prior studies in postmenopausal women (7, 8), suggesting the need for different thresholds for minority populations. Unfortunately, we were underpowered to evaluate differences in prediction models between racial subgroups, although others have described differences in CA125 levels between Black and Asian women (7, 8).
Factors related to menstruation were strongly related to CA125. Specifically, an early follicular phase blood draw was significantly associated with higher CA125 levels and was a strong predictor of CA125 in our final model, consistent with previous reports (27). This association is likely driven by MUC16 expression on the endometrium and endometrial shedding during the early follicular phase, which may lead to higher circulating CA125 levels (10). This could also explain the increased CA125 levels in women with fibroids, because fibroids are known to increase menstrual bleeding (28). Conversely, MUC16 expression on the endometrium may explain the lower CA125 levels among women with a tubal ligation, as this procedure would prevent retrograde menstruation, which occurs in approximately 85% of women during menstruation (29) and leads to systemic exposure to the antigen. Factors related to infertility, particularly endometriosis, were also related to substantially higher CA125 levels, consistent with prior studies (30, 31). A similar mechanism is likely responsible, as endometriosis leads to ectopic endometrial tissue, usually in the peritoneal cavity.
Our linear CA125 prediction model explained 7% of the variability in CA125 but showed moderate validation in EPIC, whereas our dichotomous CA125 prediction model had better predictive ability with good validation. These results suggest that the variability of CA125 may be small in general but changes dramatically with certain factors such as menstrual phase and endometriosis, and therefore the dichotomous prediction model performed better. We used a standard log-linear model to develop the linear CA125 prediction model because log-transformed CA125 was approximately normally distributed, with low kurtosis and skewness. When we included all significant predictors from the univariate analyses, both linear and dichotomous models showed performance similar to that of our final models with fewer predictors, suggesting that some predictors were correlated.
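To make the link between "variance explained" and the observed-versus-predicted correlation concrete: for a least-squares model with an intercept, the in-sample R-squared equals the squared correlation between observed and predicted values, so an R-squared of 0.07 corresponds to a correlation of about 0.26. The sketch below illustrates this with a one-predictor log-linear fit. It is not the study's model; the helper names, the single binary predictor, and the CA125 values are all hypothetical.

```python
import math


def fit_simple_ols(x, y):
    """Closed-form least-squares fit of y = intercept + slope * x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return my - slope * mx, slope


def r_squared(y, y_hat):
    """Proportion of the variance of y explained by the predictions y_hat."""
    my = sum(y) / len(y)
    ss_res = sum((yi - hi) ** 2 for yi, hi in zip(y, y_hat))
    ss_tot = sum((yi - my) ** 2 for yi in y)
    return 1.0 - ss_res / ss_tot


def pearson_r(a, b):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((ai - ma) * (bi - mb) for ai, bi in zip(a, b))
    va = sum((ai - ma) ** 2 for ai in a)
    vb = sum((bi - mb) ** 2 for bi in b)
    return cov / math.sqrt(va * vb)


# Hypothetical data: 1 = blood drawn in early follicular phase, CA125 in U/mL.
early_follicular = [1, 1, 1, 0, 0, 0, 0, 0]
ca125 = [28.0, 19.0, 35.0, 12.0, 16.0, 9.0, 22.0, 14.0]
log_ca125 = [math.log(v) for v in ca125]  # model the log-transformed outcome

b0, b1 = fit_simple_ols(early_follicular, log_ca125)
predicted = [b0 + b1 * x for x in early_follicular]
r2 = r_squared(log_ca125, predicted)
r = pearson_r(log_ca125, predicted)  # in-sample, r squared equals r2
```

On the log scale, `math.exp(b1)` is the multiplicative difference in CA125 between the two groups, which is how coefficients from a log-linear model are usually interpreted.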
Interestingly, some factors, such as fibroids and race, were significantly associated only with continuous CA125, whereas others, such as prior personal cancer diagnosis and family history of ovarian cancer, were significantly associated only with elevated CA125 (above 35 U/mL). We suspect more predictors were selected in the final dichotomous CA125 prediction model because the associations between exposures and CA125 were nonlinear.
The major strength of our study is that we had two large independent population-based studies with detailed information on candidate predictors of CA125 to develop and validate CA125 prediction models among premenopausal women. However, there are several limitations to our study. First, we had missing data on several variables. Although we used missing indicators for our main analysis, our sensitivity analyses restricting to those with complete information on all predictors and using multiple imputation showed similar results, suggesting that the method for handling missing data did not influence overall results. In addition, we evaluated the performance of our prediction models using cross-validation and external validation in an independent dataset, in which all the results were similar, suggesting a parsimonious model. Second, we were not able to validate the full prediction models in the independent dataset. Although we were only able to validate an abridged model in EPIC, which lacked information on tubal ligation and endometriosis, we expect that model performance in EPIC would be better and closer to that observed in NEC if information on all predictors had been available. Third, our model could be missing unknown predictors of CA125 because we restricted the candidate predictors to those previously described, mostly in studies conducted among postmenopausal women. The relatively low r-squared of the final linear CA125 prediction model suggests that other candidate predictors may exist, such as genetic factors, common medications, and dietary factors, opening new opportunities for future studies. Although hysterectomy has been previously described as a predictor of CA125, only a few participants in NEC had a hysterectomy. Given their ambiguous menopausal status, we excluded them from the current analysis of premenopausal women. Finally, the model performance in EPIC could be underestimated because NEC and EPIC used different assays to measure CA125. However, the CA125 values of the two assays were highly correlated (r = 0.96) and the predicted CA125 values calculated using the recalibration model were also very highly correlated with the observed CA125II assay values (r = 0.95).
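The form of the recalibration model is not described in this section. As an illustration only, the sketch below assumes a simple least-squares fit on the log scale between paired measurements from two assays, then uses it to express a 35 U/mL cutoff on the other assay's scale. The function names, the paired values, and the log-linear form are all assumptions made for this example, not the study's procedure.

```python
import math


def fit_log_linear_recalibration(assay_a, assay_b):
    """Least-squares fit of log(assay_b) = intercept + slope * log(assay_a)
    from paired measurements of the same samples (assumed model form)."""
    xs = [math.log(v) for v in assay_a]
    ys = [math.log(v) for v in assay_b]
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return my - slope * mx, slope


def recalibrate(value, intercept, slope):
    """Map a value measured on assay A onto the scale of assay B."""
    return math.exp(intercept + slope * math.log(value))


# Hypothetical paired measurements (U/mL); assay B reads exactly 1.25 x assay A.
paired_a = [8.0, 12.0, 20.0, 35.0, 60.0]
paired_b = [10.0, 15.0, 25.0, 43.75, 75.0]

intercept, slope = fit_log_linear_recalibration(paired_a, paired_b)
cutoff_on_b = recalibrate(35.0, intercept, slope)  # 35 U/mL on assay A's scale
```

Any residual disagreement between assays after such a mapping adds measurement noise to the validation cohort, which is one way external performance can be underestimated.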
In summary, we developed and validated CA125 prediction models among premenopausal women in two independent studies. These models further our understanding of factors that influence CA125 levels and can therefore be used to optimize ovarian cancer screening with CA125. Although the performance of population-level screening for ovarian cancer in premenopausal women may be limited by the lower incidence of ovarian cancer in this age range, approximately 30% of ovarian cancers are diagnosed before age 55. Furthermore, ovarian cancer in younger women has potentially greater social, emotional, and economic impact. Further studies are needed to identify new predictors of CA125 to improve the model and to understand how changes in CA125 over time depend on personal characteristics.
Disclosure of Potential Conflicts of Interest
No potential conflicts of interest were disclosed.
Authors' Contributions
Conception and design: N. Sasamoto, A. Babic, A. Tjønneland, K. Overvad, H. Boeing, R. Tumino, E. Weiderpass, A. Barricarte, M. Dorronsoro, E. Lundin, R. Kaaks, S.S. Tworoger, K.L. Terry
Development of methodology: N. Sasamoto, R.N. Fichorova, A. Tjønneland, A. Barricarte, S.S. Tworoger, K.L. Terry
Acquisition of data (provided animals, acquired and managed patients, provided facilities, etc.): R.T. Fortner, A.F. Vitonis, H. Yamamoto, R.N. Fichorova, A. Tjønneland, H. Boeing, A. Trichopoulou, E. Peppa, A. Karakatsani, D. Palli, V. Pala, A. Mattiello, R. Tumino, C.C. Grasso, J. Ramón Quirós, M. Rodríguez-Barranco, S. Colorado-Yohar, A. Barricarte, A. Idahl, E. Lundin, K.-T. Khaw, T.J. Key, R. Kaaks, D.W. Cramer, K.L. Terry
Analysis and interpretation of data (e.g., statistical analysis, biostatistics, computational analysis): N. Sasamoto, A. Babic, B.A. Rosner, R.T. Fortner, H. Yamamoto, R.N. Fichorova, M. Kvaskoff, E. Lundin, S.S. Tworoger, K.L. Terry
Writing, review, and/or revision of the manuscript: N. Sasamoto, A. Babic, B.A. Rosner, R.T. Fortner, H. Yamamoto, R.N. Fichorova, A. Tjønneland, L. Hansen, M. Kvaskoff, A. Fournier, F.R. Mancini, H. Boeing, E. Peppa, A. Karakatsani, D. Palli, V. Pala, A. Mattiello, R. Tumino, C.C. Grasso, N.C. Onland-Moret, E. Weiderpass, J. Ramón Quirós, L. Lujan-Barroso, M. Rodríguez-Barranco, S. Colorado-Yohar, A. Barricarte, A. Idahl, E. Lundin, H. Sartor, K.-T. Khaw, T.J. Key, D. Muller, E. Riboli, M.J. Gunter, L. Dossus, D.W. Cramer, S.S. Tworoger, K.L. Terry
Administrative, technical, or material support (i.e., reporting or organizing data, constructing databases): N. Sasamoto, A.F. Vitonis, H. Boeing, J. Ramón Quirós, K.-T. Khaw, M.J. Gunter, R. Kaaks, K.L. Terry
Study supervision: R. Tumino, M.J. Gunter, K.L. Terry
Acknowledgments
The authors would like to acknowledge the participants and staff of the NEC and the EPIC. Research reported in this publication was specifically supported by the NCI of the NIH under the award number R01CA193965 (awarded to K.L. Terry), and R01CA158119 and R35CA197605 (awarded to D.W. Cramer). The coordination of EPIC is financially supported by the European Commission (DG-SANCO) and the International Agency for Research on Cancer. The national cohorts are supported by Danish Cancer Society (Denmark); Ligue Contre le Cancer, Institut Gustave Roussy, Mutuelle Générale de l’Education Nationale, Institut National de la Santé et de la Recherche Médicale (INSERM; France); German Cancer Aid, German Cancer Research Center (DKFZ), Federal Ministry of Education and Research (BMBF), Deutsche Krebshilfe, Deutsches Krebsforschungszentrum and Federal Ministry of Education and Research (Germany); the Hellenic Health Foundation (Greece); Associazione Italiana per la Ricerca sul Cancro-AIRC-Italy and National Research Council (Italy); Dutch Ministry of Public Health, Welfare and Sports (VWS), Netherlands Cancer Registry (NKR), LK Research Funds, Dutch Prevention Funds, Dutch ZON (Zorg Onderzoek Nederland), World Cancer Research Fund (WCRF), Statistics Netherlands (the Netherlands); ERC-2009-AdG 232997 and Nordforsk, Nordic Centre of Excellence programme on Food, Nutrition and Health (Norway); Health Research Fund (FIS), PI13/00061 to Granada; PI13/01162 to EPIC-Murcia, Regional Governments of Andalucía, Asturias, Basque Country, Murcia and Navarra, ISCIII RETIC (RD06/0020; Spain); Swedish Cancer Society, Swedish Research Council and County Councils of Skåne and Västerbotten, The Cancer Research Foundation of Northern Sweden (Sweden); Cancer Research UK (14136 to EPIC-Norfolk; C570/A16491 and C8221/A19170 to EPIC-Oxford), Medical Research Council (1000143 to EPIC-Norfolk, MR/M012190/1 to EPIC-Oxford, United Kingdom).
The costs of publication of this article were defrayed in part by the payment of page charges. This article must therefore be hereby marked advertisement in accordance with 18 U.S.C. Section 1734 solely to indicate this fact.