Abstract
Disease risk–associated single nucleotide polymorphisms (SNP) identified from genome-wide association studies have the potential to be used for disease risk prediction. An important feature of these risk-associated SNPs is their weak individual effect but stronger cumulative effect on disease risk. Several approaches are commonly used to model the combined effect in risk prediction, but their performance is unclear. We compared two methods to model the combined effect of 14 prostate cancer risk–associated SNPs and family history for the estimation of absolute risk for prostate cancer in a population-based case-control study in Sweden (2,899 cases and 1,722 controls). Method 1 weighs each risk allele equally using a simple method of counting the number of risk alleles, whereas method 2 weighs each risk SNP differently based on its odds ratio. We found considerable differences between the two methods. Absolute risk estimates from method 1 were generally higher than those of method 2, especially among men at higher risk. The difference in the overall discriminative performance, measured by area under the curve of the receiver operating characteristic, was small between method 1 (0.614) and method 2 (0.618), P = 0.20. However, the performance of these two methods in identifying high-risk individuals (2- or 3-fold higher than average risk), measured by positive predictive values, was higher for method 2 than for method 1. These results suggest that method 2 is superior to method 1 in estimating absolute risk if the purpose of risk prediction is to identify high-risk individuals. Cancer Epidemiol Biomarkers Prev; 19(4); 1083–8. ©2010 AACR.
Introduction
Genome-wide association studies (GWAS) have led to the discovery of more than two dozen genetic variants that are associated with prostate cancer risk (1-12). These genetic variants are common in the general population of European descent, and associations with prostate cancer risk are consistently observed in multiple studies. Although each of these variants is only moderately associated with prostate cancer risk, collectively they have a stronger, dose-dependent association with prostate cancer risk (13-15). The discovery of a large number of risk-associated genetic variants, compared with only three previously known risk factors for prostate cancer (age, race, and family history), represents a major breakthrough in risk profiling and may improve the ability to predict an individual's risk for prostate cancer. Such risk prediction is an important step towards the overall goals of personalized medicine and allows for the identification of high-risk individuals for prevention, screening, and early diagnosis.
Absolute risk is an informative measurement of the probability of developing a disease at a specific age and can be easily interpreted by physicians and patients. Two methods are commonly used to estimate absolute risk when genetic variants are included as predictors. One method treats each risk allele equally and uses a simple method of counting the number of risk alleles. Another method weighs each risk single nucleotide polymorphism (SNP) differently based on its individual odds ratio (OR). The relative performance of these two methods in estimating absolute risk of disease is unclear. Herein, we compare the absolute risk estimates of these two methods when the same genetic variants are used.
Subjects and Methods
Study population
A large population-based prostate cancer case-control study in Sweden named CAncer of the Prostate in Sweden (CAPS) was used to develop a risk prediction model. CAPS has been described in detail elsewhere (13). Briefly, prostate cancer patients in CAPS were identified and recruited from regional cancer registries in Sweden (all Caucasians). The inclusion criterion for case subjects was pathologically or cytologically verified adenocarcinoma of the prostate, diagnosed between July 2001 and October 2003. DNA samples from blood and tumor-node-metastasis stage, Gleason grade (biopsy), and prostate-specific antigen (PSA) levels at diagnosis were available for 2,899 patients. Control subjects were recruited concurrently with case subjects. They were randomly selected from the Swedish Population Registry, and matched according to the expected age distribution of cases (groups of 5-y intervals) and geographic region. DNA samples from blood were available for 1,722 control subjects. Positive family history was defined as any first- or second-degree relatives with a diagnosis of prostate cancer. The research ethics committees at Wake Forest University School of Medicine and the Karolinska Institute approved the study.
Selection of SNPs
We selected 14 SNPs discovered in four prostate cancer GWAS and follow-up fine mapping studies reported before June 2009 (Supplementary Table S1; refs. 1-8, 16-18). All of these SNPs were selected from GWAS that were based on Caucasian populations. These included three SNPs at 8q24 (16, 17), two at 17q12 (18), and one each at 3p12, 7p15, 7q21, 9q33, 10q11, 11q13, 17q24, 22q13, and Xp11 (1-8). The SNP rs2735839 in the KLK3 gene at 19q13 was not included because of a concern for possible PSA detection bias (19). These 14 SNPs were genotyped in CAPS using the MassARRAY QGE iPLEX system (Sequenom, Inc.). Two duplicate test samples and two blinded water samples were included in each 96-well plate. The average genotype call rate was 98.3% and the concordance rate was 99.8%. All SNPs were in Hardy-Weinberg equilibrium (P > 0.05).
Statistical analyses
Two methods were used to estimate absolute risk for prostate cancer for each individual based on that individual's genotype at these 14 risk-associated SNPs and family history. They differed primarily in how the combined effect of these 14 SNPs and family history was estimated. In the first method (method 1), the combined effect of these 14 SNPs was modeled cumulatively by treating each risk allele equally and simply counting the number of risk alleles (14). We first counted the number of risk alleles carried by each subject at these 14 SNPs and then classified subjects into eight approximately equally sized groups (≤7, 8, 9, 10, 11, 12, 13, and ≥14 risk alleles). Thus, only one variable with eight categories was created for the cumulative effect of the SNPs. ORs for the number of risk alleles (eight categories) and family history (yes or no) were estimated from a logistic regression model with men who had 11 risk alleles (mode) and negative family history of prostate cancer serving as the reference group. The OR measures the strength of association between two binary variables. Here it was used to explore the association between prostate cancer and number of risk alleles (each categorized group versus the reference group) and family history. The absolute risk of developing prostate cancer between ages 55 and 74 y was then estimated for each man based on the OR of his status of number of risk alleles and family history, calibrated incidence rate of prostate cancer, and mortality rate for all causes excluding prostate cancer in Sweden (20). The calibrated incidence rate was needed to infer the incidence rate estimate for men without a family history based on the population incidence rate in Sweden (21), which includes men with and without family history. It was calculated based on the attributable risk of family history that was estimated from the CAPS and population incidence rates in Sweden, using a method described by Chen et al. (22).
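The grouping step of method 1 can be illustrated with a minimal sketch. It assumes each genotype is coded as the number of risk alleles (0, 1, or 2) at each of the 14 SNPs; the function names and the example genotype are hypothetical and are not CAPS data. The logistic regression and absolute risk steps are not shown.

```python
# Sketch of the risk-allele counting used in method 1.
# Genotype coding (0/1/2 risk alleles per SNP) and all names are assumptions for illustration.

def count_risk_alleles(genotype):
    """Total number of risk alleles across all SNPs, each allele weighted equally."""
    if any(g not in (0, 1, 2) for g in genotype):
        raise ValueError("each genotype must be 0, 1, or 2 risk alleles")
    return sum(genotype)

def allele_group(n_alleles):
    """Map a risk-allele count to one of the eight categories described in the text."""
    if n_alleles <= 7:
        return "<=7"
    if n_alleles >= 14:
        return ">=14"
    return str(n_alleles)

def risk_group(genotype, family_history):
    """Return the (family history, allele category) pair, one of 16 possible groups."""
    fh = "FH+" if family_history else "FH-"
    return fh, allele_group(count_risk_alleles(genotype))

# Invented genotype at 14 SNPs carrying 12 risk alleles in total.
example = [1, 0, 2, 1, 1, 0, 1, 2, 0, 1, 1, 0, 1, 1]
print(risk_group(example, family_history=False))
```

Every man in the same group receives the same OR, which is why method 1 can produce only 16 distinct risk estimates.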
In the second method (method 2), the combined effect of these 14 SNPs and family history was modeled by multiplying the OR of each individual risk SNP and family history, as described in a simple multiplicative (log-additive) model by Pharoah et al. (23). We briefly describe how the absolute risk was obtained in the following four steps: (a) the allelic OR assuming an additive model for each SNP was first estimated in CAPS using a logistic regression model; (b) a multiplicative model assuming no interaction was used to derive genotype relative risks from the allelic OR; (c) for each of the three genotypes at each SNP, we converted the genotype relative risk to the risk relative to the average risk in the population (23); and (d) we derived the overall risk relative to the population by multiplying the risks relative to the population of all SNPs as well as the family history of the individual. The absolute risk for each man was then estimated based on the overall risk relative to the population, the incidence rate of prostate cancer in the general population, and the mortality rate for all causes excluding prostate cancer in Sweden.
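Steps (b) through (d) can be sketched as follows. This is an illustration under stated assumptions, not the study's code: genotype frequencies are taken from Hardy-Weinberg proportions given a risk-allele frequency, the family-history term is passed in as an already rescaled relative risk, and the allelic ORs and frequencies in the example are invented.

```python
# Sketch of a multiplicative (log-additive) combination of SNP effects.
# Hardy-Weinberg genotype frequencies and all numeric inputs are assumptions of this sketch.

def genotype_relative_risks(allelic_or):
    """Step (b): relative risks for 0, 1, and 2 risk alleles under a multiplicative model."""
    return [1.0, allelic_or, allelic_or ** 2]

def risks_relative_to_population(allelic_or, risk_allele_freq):
    """Step (c): rescale genotype relative risks so the population-average risk equals 1."""
    p = risk_allele_freq
    freqs = [(1 - p) ** 2, 2 * p * (1 - p), p ** 2]
    rr = genotype_relative_risks(allelic_or)
    average = sum(f * r for f, r in zip(freqs, rr))
    return [r / average for r in rr]

def overall_relative_risk(genotypes, snps, family_history_rr=1.0):
    """Step (d): multiply the per-SNP risks relative to the population and the family-history term."""
    total = family_history_rr
    for g, (allelic_or, freq) in zip(genotypes, snps):
        total *= risks_relative_to_population(allelic_or, freq)[g]
    return total

# Two hypothetical SNPs given as (allelic OR, risk-allele frequency).
snps = [(1.2, 0.3), (1.5, 0.1)]
print(overall_relative_risk([2, 1], snps))
```

Because the product is unbounded as risk alleles accumulate, a carrier of many risk alleles can receive a very large overall relative risk under this kind of model.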
We used two approaches to compare the absolute risk estimates from methods 1 and 2. We used Spearman's rank correlation coefficient to assess the consistency of estimated absolute risk among study subjects between the two methods. We also used Kappa statistics to assess agreement between the two methods while correcting for chance agreement.
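Both agreement measures can be computed with a few lines of plain Python. The sketch below assumes a two-category (high risk versus not) Cohen's kappa and uses average ranks for ties; the risk values and the 0.22 cutoff in the example are for illustration only and are not individual-level CAPS data.

```python
import math

def average_ranks(values):
    """Ranks starting at 1, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks

def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(vx * vy)

def spearman(x, y):
    """Spearman's rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))

def cohen_kappa(a, b):
    """Cohen's kappa for two binary classifications of the same subjects."""
    n = len(a)
    observed = sum(1 for u, v in zip(a, b) if u == v) / n
    pa, pb = sum(a) / n, sum(b) / n
    expected = pa * pb + (1 - pa) * (1 - pb)
    return (observed - expected) / (1 - expected)

# Invented absolute risk estimates for five men under each method.
method1 = [0.08, 0.11, 0.24, 0.52, 0.11]
method2 = [0.05, 0.10, 0.19, 0.37, 0.12]
high1 = [r >= 0.22 for r in method1]
high2 = [r >= 0.22 for r in method2]
print(spearman(method1, method2), cohen_kappa(high1, high2))
```

Kappa is undefined when the expected chance agreement equals 1 (for example, when neither method classifies anyone as high risk), so a production version would need to handle that case.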
We used area under the curve (AUC) statistics of the receiver operating characteristic to assess the overall performance of estimated absolute risk in discriminating prostate cancer cases and controls. The receiver operating characteristic is a plot of the sensitivity versus (1-specificity) of classifying prostate cancer at various thresholds. AUC quantifies the overall ability to discriminate between those who have the disease and those who do not have the disease and ranges from 0.5 (useless) to 1 (perfect). A nonparametric approach developed by DeLong and colleagues was used to test for equality of the AUCs (24). The analysis was carried out using Stata software, version 8.2. We also assessed the performance of estimated absolute risk at specific cutoff values using sensitivity, specificity, and positive predictive value (PPV), where PPV was calculated based on sensitivity, specificity, and prevalence using Bayes' theorem. A range of prevalences (0.1, 0.15, and 0.2) was used (25).
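The AUC has a simple pairwise interpretation: it is the probability that a randomly chosen case has a higher estimated risk than a randomly chosen control, with ties counted as one half. The sketch below computes it that way on invented scores; it does not implement the DeLong test for comparing two AUCs, and it is not the Stata procedure used in the study.

```python
def auc(case_scores, control_scores):
    """AUC as the fraction of case-control pairs ranked correctly (ties count 0.5)."""
    if not case_scores or not control_scores:
        raise ValueError("need at least one case and one control")
    wins = 0.0
    for c in case_scores:
        for k in control_scores:
            if c > k:
                wins += 1.0
            elif c == k:
                wins += 0.5
    return wins / (len(case_scores) * len(control_scores))

# Invented absolute risk estimates, for illustration only.
cases = [0.12, 0.25, 0.09, 0.31]
controls = [0.08, 0.11, 0.14, 0.10]
print(auc(cases, controls))
```

The double loop is quadratic in the number of subjects; for a study of several thousand subjects, a rank-based formula would be the more practical choice.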
Results and Discussion
The absolute risk of developing prostate cancer between ages 55 and 74 years was estimated for each subject in CAPS using method 1 and method 2, respectively, and is presented in Fig. 1. There were 16 distinct values of absolute risk estimates derived from method 1 because the subjects fell into 16 risk groups: 8 groups of number of risk alleles (≤7, 8, 9, 10, 11, 12, 13, and ≥14) and 2 groups of family history (yes or no; Fig. 1A). The estimated absolute risks ranged from as low as 0.08 to as high as 0.52. Although the vast majority of subjects had an absolute risk at or near the average risk of 0.11, 26% of men had an absolute risk that was >2-fold the average risk. The absolute risk estimates derived from method 2 were continuous (Fig. 1B). Again, although the vast majority of men had an absolute risk that was at or near the average risk, 10% of subjects had absolute risk that was >2-fold the average risk. Interestingly, one subject had an absolute risk estimate of 1.42 using method 2. Although the actual risk (probability) for prostate cancer cannot exceed 1, it is numerically possible to have an absolute risk estimate of >1 in method 2 in rare situations in which subjects have many risk alleles. For example, only one subject in our study had an absolute risk estimate >1 (he inherited 19 risk alleles of these 14 SNPs) and no other subjects had an absolute risk estimate >0.8. This outlier was removed from further analysis in this study. In addition, we have provided the relative risks and corresponding absolute risks in Supplementary Table S2. For example, the relative risk is approximately 4 when the absolute risk is >0.5.
The absolute risks for the same subjects estimated from these two different methods are presented in a scatter plot (Fig. 2). The correlation coefficient (r2) of absolute risk between the two methods was estimated to be 0.941 (95% confidence interval, 0.938-0.944). For men in each of the 16 risk groups based on method 1, a large amount of variation was observed in the absolute risk estimates from method 2. The mean absolute risk estimates were all higher in method 1 than in method 2 for each of the 16 risk groups (Table 1). Assuming method 2 is more accurate because exact estimates of the OR of each predictor were used in the prediction model, this comparison suggests an upward bias in estimating absolute risk using method 1 in this specific example. When we examined the categorical concordance of men classified as having high risk of prostate cancer, defined by absolute risk of >2-fold (≥0.22) or 3-fold (≥0.33) of average risk, the Kappa statistics were estimated to be 0.56 and 0.74, respectively. Overall, these results suggest considerable differences between the two methods in estimating absolute risk and in defining high-risk individuals for prostate cancer.
Group | No. of subjects | Absolute risk, method 1 | Absolute risk, method 2, mean (SD) | Absolute risk, method 2, range |
---|---|---|---|---|
FH-, ≤7 alleles | 336 | 0.08 | 0.05 (0.01) | 0.02-0.11 |
FH-, 8 alleles | 341 | 0.08 | 0.06 (0.01) | 0.05-0.15 |
FH-, 9 alleles | 441 | 0.10 | 0.07 (0.01) | 0.06-0.13 |
FH-, 10 alleles | 585 | 0.11 | 0.09 (0.01) | 0.08-0.17 |
FH-, 11 alleles | 605 | 0.11 | 0.10 (0.01) | 0.07-0.2 |
FH-, 12 alleles | 577 | 0.12 | 0.12 (0.02) | 0.09-0.28 |
FH-, 13 alleles | 428 | 0.15 | 0.14 (0.02) | 0.1-0.32 |
FH-, ≥14 alleles | 595 | 0.24 | 0.19 (0.06) | 0.11-0.84 |
FH+, ≤7 alleles | 48 | 0.17 | 0.10 (0.02) | 0.05-0.19 |
FH+, 8 alleles | 53 | 0.18 | 0.13 (0.02) | 0.1-0.24 |
FH+, 9 alleles | 80 | 0.22 | 0.15 (0.02) | 0.13-0.22 |
FH+, 10 alleles | 84 | 0.23 | 0.18 (0.02) | 0.15-0.35 |
FH+, 11 alleles | 97 | 0.23 | 0.21 (0.03) | 0.16-0.32 |
FH+, 12 alleles | 119 | 0.26 | 0.24 (0.04) | 0.17-0.39 |
FH+, 13 alleles | 98 | 0.32 | 0.28 (0.04) | 0.21-0.45 |
FH+, ≥14 alleles | 135 | 0.52 | 0.37 (0.12) | 0.24-1.42 |
Abbreviations: FH-, without family history; FH+, with family history.
Finally, we assessed the performance of these two risk prediction methods in discriminating cases and controls in CAPS. We first compared the overall performance of these two methods in correctly discriminating case and control status using the AUC statistics. The AUC for method 2 (0.618) was slightly higher than that for method 1 (0.614), although the difference was not statistically significant (P = 0.20). Furthermore, considering that the primary utility of the risk prediction model is to identify men at considerably elevated risk for prostate cancer, we then compared the predictive performance of these two methods at two specific cutoff values of absolute risk: 2-fold and 3-fold of average risk. The sensitivity, specificity, and PPV of these two methods are presented in Table 2. Method 2 had 0.03 to 0.04 higher PPV when using 2-fold or 3-fold as the cutoff value, suggesting this method is more accurate than method 1 in predicting prostate cancer.
Method | Absolute risk cutoff | Sensitivity | Specificity | PPV (0.10)* | PPV (0.15)* | PPV (0.20)* |
---|---|---|---|---|---|---|
Method 1 | 0.11 | 0.56 | 0.60 | 0.14 | 0.20 | 0.26 |
Method 1 | 0.22 | 0.32 | 0.84 | 0.18 | 0.26 | 0.33 |
Method 1 | 0.33 | 0.04 | 0.99 | 0.35 | 0.46 | 0.54 |
Method 2 | 0.11 | 0.61 | 0.54 | 0.13 | 0.19 | 0.25 |
Method 2 | 0.22 | 0.16 | 0.93 | 0.21 | 0.30 | 0.37 |
Method 2 | 0.33 | 0.05 | 0.99 | 0.38 | 0.49 | 0.58 |
*PPV (assumed prevalence).
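The PPV columns follow from sensitivity, specificity, and an assumed prevalence through Bayes' theorem. The short sketch below recomputes the method 1 row at the 0.22 cutoff from the rounded sensitivity and specificity shown in the table; the function name is ours. Other rows recomputed from rounded inputs may differ from the published values in the last digit.

```python
def ppv(sensitivity, specificity, prevalence):
    """Positive predictive value from Bayes' theorem."""
    true_pos = sensitivity * prevalence
    false_pos = (1 - specificity) * (1 - prevalence)
    return true_pos / (true_pos + false_pos)

# Method 1 at the 0.22 cutoff: sensitivity 0.32, specificity 0.84 (from Table 2).
for prevalence in (0.10, 0.15, 0.20):
    print(prevalence, round(ppv(0.32, 0.84, prevalence), 2))
# prints 0.18, 0.26, and 0.33 for the three prevalences
```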
There are a number of limitations in these analyses. First, several newly reported prostate cancer risk–associated SNPs were not included in the risk prediction model (9-12). The omission of these SNPs may affect the estimates of absolute values of risk. Second, although we excluded the SNP (rs2735839) in the KLK3 gene, other reported prostate cancer risk–associated SNPs that were included in the risk prediction model may also be influenced by PSA detection bias (19, 26). The inclusion of SNPs influenced by PSA detection bias may affect the validity of risk prediction. Note that, as a sensitivity analysis, we included SNP rs2735839 in the analysis and the results were similar. Third, we used the OR estimated from the CAPS study population in risk prediction of CAPS subjects and then assessed the performance of risk prediction in the same study population. This circular approach likely led to an upward bias in the estimates of predictive performance. Finally, we used PPV to assess prediction performance in this case-control study. It is well known that PPV should be estimated from cohort studies because PPV is sensitive to the prevalence of disease. Here we addressed this issue by using Bayes' theorem and assuming a range of reasonable prevalences in the calculation. In addition, the emphasis in this study is on comparing the PPVs between the two methods, not on the absolute values of the PPVs. Altogether, these limitations will likely affect the absolute values of prostate cancer risk estimates and statistics of predictive performance. However, they are not likely to have a substantial impact on the interpretation of these results because the primary focus of this study is to compare the two risk prediction methods, both of which are subject to these limitations.
With these caveats, the results from this study provide needed information on the differences of these two commonly used methods in the estimation of prostate cancer absolute risk and in predictive and discriminative performance. These results indicate considerable differences between the two methods in terms of absolute risk estimates for the same individuals, even though the same SNPs are included in both methods. It is particularly interesting to find that absolute risk estimates from method 1 (weighing each risk allele equally) were generally higher than those of method 2 (weighing each risk SNP differently), especially for men at higher risk. However, results from this study indicate that the difference in overall discriminative performance between the two methods was small. The AUC statistics of these two methods were not statistically different and were in essence the same. These seemingly contradictory observations may be explained by the generally low discriminative ability of both risk prediction methods for the vast majority of men because they are at or near the average risk for prostate cancer. The predictive performance of these two methods in identifying high-risk individuals, on the other hand, differed considerably. For example, the predictive performance, measured by PPV, was 0.03 to 0.04 higher for method 2 than for method 1 among men who had 2-fold or 3-fold higher than average risk. Overall, these results suggest that method 2 is superior to method 1 in estimating absolute risk, especially if the purpose of risk prediction is to identify high-risk individuals. It is noted, however, that the point estimate of absolute risk from method 2 may not be reliable among men with an extremely elevated risk. For example, we found one subject who had an estimated absolute risk of >1 using method 2, because he carried most of the risk alleles of these SNPs. This problem may be more pronounced when the risk prediction model includes a larger number of SNPs.
Risk prediction for common diseases using risk-associated SNPs identified from GWAS has received a great deal of attention recently. An important feature of these SNPs is their weak individual effect but stronger cumulative effect on disease risk. Currently, two approaches are commonly used to model the combined effect, but their performance in estimating absolute risk of disease is unclear. Results from this study provide important information to address this question. It is important to note that the conclusions of this study may be influenced by the prevalence of the disease under study, the number of SNPs used in the model, and the characteristics of SNPs such as frequency and OR of each risk allele. Therefore, additional studies, especially large population-based cohort studies, are needed to further evaluate the performance of risk prediction using SNPs and the method used to assess their cumulative effect.
Disclosure of Potential Conflicts of Interest
A patent application for using a combination of the first five prostate cancer risk–associated SNPs for risk prediction was filed by Wake Forest University School of Medicine, Karolinska Institutet, and Johns Hopkins University School of Medicine.
Acknowledgments
We thank all of the study subjects who participated in the CAPS Study and the urologists who provided their patients to the CAPS Study. We acknowledge the contribution of multiple physicians and researchers in designing and recruiting study subjects, including Dr. Hans-Olov Adami.
Grant Support: National Cancer Institute (CA129684, CA105055, CA106523, CA95052 to J. Xu, CA112517, CA58236 to W.B. Isaacs).
The costs of publication of this article were defrayed in part by the payment of page charges. This article must therefore be hereby marked advertisement in accordance with 18 U.S.C. Section 1734 solely to indicate this fact.