Abstract
Prostate cancer susceptibility is a polygenic trait. We aimed to examine the controversial diagnostic utility of single-nucleotide polymorphisms (SNP) for prostate cancer. We analyzed two datasets collected from Europeans and one from Africans. These datasets were generated by the genome-wide association studies, that is, CGEMS, BPC3, and MEC-Africans, respectively. About 540,000 SNPs, including 61 risk markers that constitute a panel termed MK-61, were commonly genotyped. For each dataset, we augmented the MK-61 panel to generate an MK-61+ one by adding several thousands of SNPs that were moderately associated with prostate cancer occurrence in external dataset(s). We assessed the diagnostic utility of both panels by measuring their predictive strength for prostate cancer occurrence with AUC statistics. We calculated the theoretical AUCs using quantitative genetics model-based formulae and obtained the empirical estimates via 10-fold cross-validation using statistical and machine learning techniques. For the MK-61 panel, the 95% confidence intervals of the theoretical AUCs (AUC-CI.95) were 0.578–0.655, 0.596–0.656, and 0.539–0.596 in the CGEMS, BPC3, and MEC-Africans cohorts, respectively. For the MK-61+ panels, the corresponding AUC-CI.95 were 0.617–0.663, 0.527–0.736, and 0.547–0.565. The empirical AUCs largely fell within the theoretical interval. A promising result (AUC = 0.703, FNR = 0.354, FPR = 0.353) was obtained in the BPC3 cohort when the MK-61+ panel was used. In the CGEMS cohort, the MK-61+ panel complemented PSA in predicting the disease status of PSA ≥ 2.0 ng/mL samples. This study demonstrates that augmented risk SNP panels can enhance prostate cancer prediction for males of European ancestry, especially those with |{\rm{PSA}} \ge 2.0\ $|ng/mL.
This study demonstrates that augmented risk SNP panels can enhance prostate cancer prediction for males of European ancestry, especially those with PSA ≥ 2 ng/mL.
Introduction
Prostate cancer is the most commonly diagnosed non-skin cancer worldwide for males and the second leading cause of cancer-related death in American men (1–3). Early identification of males at high risk of prostate cancer is critical for managing the disease and reducing mortality rate (4). At present, prostate-specific antigen (PSA) level in serum is still the best biomarker for the detection of prostate cancer even though blood PSA testing has poor specificity and limited sensitivity (5–8). Over the past two decades, the appropriate clinical application of PSA testing remains a debatable topic, which has been motivating the cancer research community to explore additional factors, such as genetic variants, for complementing PSA in prostate cancer detection (9–15).
Twin studies indicated that genetic variants accounted for approximately 40% of the etiologic factors in prostate cancer (16). However, individual genes with high penetrance have not been identified yet for the disease. Numerous genome-wide association studies (GWAS) and quantitative genetic analyses have clearly shown that prostate cancer susceptibility is a polygenic complex trait, with the heritability estimates ranging from 0.25 to 0.78 (10, 17, 18). By 2018, about 140 common risk single-nucleotide polymorphism (SNP) markers for prostate cancer susceptibility, aggressiveness, and mortality have been reported (19, 20). However, the utility of germline SNPs in prostate cancer diagnosis has remained controversial to date. On one hand, in a few relevant studies, genetic panels integrating the genotypes of multiple risk SNP loci show significant predictive strength for prostate cancer occurrence (10, 12, 14). On the other hand, those results are not in favor of translational applications. In particular, the AUC statistics of such a genetic marker panel, obtained from receiver-operating characteristic (ROC) analysis of a case–control dataset, usually fall within the interval of 0.55–0.60, much lower than what can be achieved by using the PSA marker alone (10, 14, 21).
With regard to the aforementioned issue, it is evident that there is a controversy between the promise in getting public health benefits from using genomic information in prostate cancer detection and the drawback of existing marker panels in practical applications. We presume that the dilemma arising from this situation can be addressed by the appropriate integration of data augmentation procedure and statistical and machine learning methods. To test this hypothesis, we perform a comprehensive analysis of several GWAS datasets in the public domain. Our study is not an exhaustive investigation, but it is guided by an existing theory that outlines the dependence of the expected predictive strength of SNP markers for a binary complex trait on the size of the training dataset, the number of the actually used markers and their contribution ratio to the trait variance (22).
Materials and Methods
Data
Two datasets collected from subjects of European ancestry and one collected from subjects of African ancestry are used in this study. They were generated by three GWAS, that is, CGEMS (The Cancer Genetic Markers of Susceptibility), BPC3 (The Breast and Prostate Cancer Cohort Consortium), and MEC (Multiethnic Cohort Study), respectively (17, 23–25). A nested case–control design was adopted in these projects or experiments. The clinical and genomic data were downloaded from dbGaP (The database of Genotypes and Phenotypes, https://www.ncbi.nlm.nih.gov/gap/) with the Study Accessions being “phs000207.v1.p1”, “phs000812.v1.p1”, and “phs000306.v3.p1”, respectively.
The CGEMS project genotyped 1,172 patients with prostate cancer and 1,157 control subjects of European ancestry enrolled in the prostate, lung, colon, and ovarian (PLCO) Cancer Screening Trial (http://www.cancer.gov/prevention/plco/). In the cohort, aggressive cases amounted to 57% of the total cases. The control samples were chosen by the following three criteria: (i) same year of entry into the cohort as the case set; (ii) same 5-year age-at-entry interval (55–59, 60–64, 65–69, 70–74) as the case set; and (iii) alive and cancer-free through the year of follow-up in the case set. In our study, only the subjects (1,147 cases and 1,098 controls) genotyped in both phase 1 and 2 are used.
The BPC3 project (https://epi.grants.cancer.gov/bpc3/) generated data of multiple cancer types. All the prostate cancer cases (N = 2,758) in BPC3 either had high histologic grade (Gleason score ≥ 8) or extra-prostatic extension (stage C/D; confirmed through medical records). Controls (N = 4,482) were required to be cancer-free and alive at the matched cases' ages at diagnosis. The subjects of European ancestry involved in our analysis are distributed in three consent groups, that is, the C1, C2, and C4, by the authors of the data.
The MEC project generated a pool of prostate cancer data of multiple races, including African Americans (AA), Latino Americans, and Japanese in Hawaii. The case samples in MEC include both aggressive and non-aggressive patients with prostate cancer. Controls were required to be cancer-free and alive at the matched cases' ages at diagnosis. Only the AA data (MEC-AA), which contain 2,306 cases and 2,463 control samples, were used in our study. The subjects analyzed in this study are distributed into four consent groups, that is, the C1, C2, C4, and C5, by the authors of the data.
The genomic platforms for generating these three datasets are different. In these datasets, there are 541,519 SNPs in common, which constitute the MK-AW set (or panel) as termed in our analysis.
The use of these datasets in this study was approved by the National Institutes of Health. The original patient studies from which the data were generated were conducted in accordance with the U.S Common Rule. Each of those studies obtained informed written consent from the participants and the approval from their respective institutional review boards.
Genetic panel for prostate cancer risk and inter-cohort data augmentation
61 SNPs of the MK-AW panel are among the prostate cancer risk markers identified by the previous case–control cohort studies and are collected in Chen and colleagues (26). We consider these SNPs as a genetic panel, termed “MK-61”. By hypothesizing that, besides the index risk SNPs, there is still a large set of SNPs that may have relevance to prostate cancer occurrence, we generate an MK-61+ panel for each dataset. This is achieved by augmenting the MK-61 panel with a few thousands of SNPs that show a moderately significant association with prostate cancer occurrence in one (if the focused dataset is composed of subjects of European ancestry) or two (if the focused dataset is composed of subjects of African ancestry) external datasets. Fig. 1 depicts the inter-cohort data augmentation scheme.
The scheme for the proposed inter-cohort data augmentation of marker panels. All SNPs in the MK-61 panel are included in the MK-61+ panels, regardless of the significance level and direction of allele effects on prostate cancer occurrence. In addition to the SNPs included in the MK-61 panel, an SNP to be included in the MK-61+ panel of the MEC-AA cohort is required to meet the following two criteria: Meta-analysis P value (Pmeta, calculated using Fisher's method) less than 0.001 and the direction consistence of allele effects in the two external cohorts, i.e. CGEMS and BPC3. Note that different cutoffs of P value are used in CGEMS and BPC3 because CGEMS has a relatively small sample size compared with BPC3 (i.e., 2245 vs. 7240), leading to a lower statistical power. Had the criterion of P < 0.001 been used in CGEMS, only 697 SNPs could have obtained.
The scheme for the proposed inter-cohort data augmentation of marker panels. All SNPs in the MK-61 panel are included in the MK-61+ panels, regardless of the significance level and direction of allele effects on prostate cancer occurrence. In addition to the SNPs included in the MK-61 panel, an SNP to be included in the MK-61+ panel of the MEC-AA cohort is required to meet the following two criteria: Meta-analysis P value (Pmeta, calculated using Fisher's method) less than 0.001 and the direction consistence of allele effects in the two external cohorts, i.e. CGEMS and BPC3. Note that different cutoffs of P value are used in CGEMS and BPC3 because CGEMS has a relatively small sample size compared with BPC3 (i.e., 2245 vs. 7240), leading to a lower statistical power. Had the criterion of P < 0.001 been used in CGEMS, only 697 SNPs could have obtained.
Methods for theoretical prediction
On the basis of a quantitative genetic model, Dudbridge developed a methodology to estimate the predictive strength of a polygenic score (PGS) metric for complex traits, including prostate cancer susceptibility (22). In addition to serving as a main reference for the development of our approaches, this method is used here to obtain theoretical prediction results, which complement the results achieved by the cross-validation–based empirical analyses. The main points of this method are briefly outlined as follows.
Let TR-1 and TR-2 denote two complex traits. Assume that their variances are commonly contributed by the SNPs in a marker set of size m. Then, the SNP effects on TR-1 estimated from a training sample set of size n1 can be used to calculate a PGS vector |\tilde{S}$| regarding TR-2 for the samples in an independent (or validation) set of size n2. Given these assumptions and arguments, the determination coefficient of the PGSs on TR-2 phenotypic values, collected in the vector Y, of the validation set can be calculated by the following equation.
In Equation (i), |\sigma _A^2$| denotes the additive genetic variance (equivalent to heritability when Y is in the liability scale) contributed by the variants on the loci of the marker set.
The expected AUC, that is, the area under curve of a ROC analysis, on the validation set can be calculated as follows.
The conditional expectations and variances of |\tilde{S}$| in Equation (ii), including |E(\tilde{S}|y = 1)$| and others, can be calculated using Dudbridge's formulae for case–control studies. In addition to |\sigma _A^2,$| |{n_1}$| and |R_{\tilde{S},Y}^2,$| the required arguments include the prevalence of cases of a binary trait in the population of the validation set, the sampling proportion of cases in creating the validation set and the number (m) of the considered markers. Φ() denotes the standard normal probability function.
Methods for empirical prediction
In all the models used in this study, the binary-dependent response variable indicates a disease status, that is, case or control. Hereafter, the relevant descriptions are focused on the independent (predictive) variables or features.
Two statistical classification methods, that is, multiple-variable logistic regression (MLR) and support vector machine (SVM), are used to predict prostate cancer occurrence from SNP markers. Variables holding SNP genotypes are treated as the features in SVM or predictive variables in MLR modeling. Proceeding advanced analyses, the genotypes are encoded using an additive genetic model regarding individual locus. That is, the three possible genotypes are recoded as 0, 1, or 2 according to the number of the variant allele. For a validation/test sample, MLR method estimates a PGS, which is the sum of the genotype codes weighted by the regression coefficients, whereas SVM method computes a pseudo PGS, that is, a decision value as often termed in literature and software instructions. The obtained (pseudo) PGSs, together with the actually observed disease status (i.e., case or control), are used to calculate the ROC AUC statistic.
We use the 10-fold cross-validation to assess the performance of a model. To facilitate computation, a feature (or predictive variable) selection procedure is nested in each round of the cross-validation on the data in which the subjects in the holdout validation set are excluded. This is performed by testing the associations between the disease status (i.e., case or control) and genotypes of individuals. The cutoff P value is loosely specified so that the refined MK-61+ marker set includes 3,000–5,000 SNPs.
Dimension reduction of feature matrix for using MLR
Each of the MK61+ panels, even after the aforementioned refinement, contains too many SNPs to be modeled by the MLR. As such, we perform an additional dimension reduction using the singular value decomposition (SVD) method. The implementation procedure includes three steps. First, the (recoded) genotype matrix of t subjects and q SNPs in the training set (|{M_{t \times q}}$|) is factorized by |{M_{t \times q}}= {U_{t \times q}}{D_{q \times q\ }}{( {{V_{q \times q}}} )^{\prime}}$| when t > q or by |{M_{t \times q}}= {U_{t \times t}}{D_{t \times t\ }}{( {{V_{q \times t}}} )^{\prime}}$| when t < q. |{U_{t \times q}},\ {( {{V_{q \times q}}} )^{\prime}},\ {U_{t \times t\ }}$| and |{( {{V_{q \times t}}} )^{\prime}}$| are orthogonal matrices. |{D_{q \times q}}$| and |{D_{t \times t}}$| are a diagonal matrix with all the non-zero elements being positive and arranged from high to low. Second, the genotype matrix of v subjects in the validation set (|M_{v \times q}^*$|) is transformed using the formula |U_{v \times q}^* = M_{v \times q}^*{V_{q \times q}}{( {{D_{q \times q}}} )^{ - 1}}$| or |U_{v \times t}^*\ = \ M_{v \times q}^*{V_{q \times t}}{( {{D_{t \times t}}} )^{ - 1}}$|. Third, the first 100 columns of |{U_{t \times q}}$| and |U_{v \times q}^*$| in the case of t > q (|or\ {U_{t \times t}}$| and |U_{v \times t}^*$| in the case of t < q), called top Eigen-genotype vectors hereby, are used as the matrices of predictive variables in training the classification model and computing the PGSs of validation samples, respectively. Note that in the second step, the computation of Eigen-genotypes (contained in the corresponding row of the matrix |U_{v \times q}^*\ {\rm{or}}\ U_{v \times t}^*$|) of a specific validation sample is independent of the genotypes of other validation samples, implying that a test set can contain only one subject.
Methods for integrating PSA levels and SNP markers
For prostate cancer prediction, we use the following two methods to integrate PSA levels and genotype codes of SNPs.
MLR-psa&snp
The PSA levels and top Eigen-genotype vectors are simultaneously included as the predictive variables in a logistic model. The regression coefficients estimated on the training samples are used to calculate the psa-PGSs of the holdout validation samples.
SVM-MLR-psa&snp
This is a hybrid two-model algorithm. It includes three steps. First, the training data are randomly partitioned into two parts (i.e., P-A and P-B) with the sample size ratio of 9:1. Then, using the P-A data, an SVM model with SNP genotypes as features is trained to predict the decision values of samples in P-B and the holdout validation set. After that, using the P-B data, an MLR model with the PSA levels and SVM prediction values as two predictive variables is established to calculate psa-PGSs of the validation samples.
Software application and special notes of statistical methods
The tool PLINK 1.9 (27, 28) is used to manipulate SNP data and perform the association test for cancer occurrence and variants on individual locus. Another genetic data analysis software GCTA 1.91.7 (29, 30) is used to estimate the additive genetic variance (or heritability) of liability scaled prostate cancer susceptibility contributed by an SNP set. Other analyses and graphics are completed by using the relevant functions in the R packages, including “e1071,” “stats,” and “AUC” as well as the lab-owned R code. When implementing the svm() function, we use the sigmoid kernel and the class weights are specified as the reciprocals of the ratios of the positive (i.e., “1”) samples to the negative (i.e.,“−1”) samples in the training set.
The comparison of AUC estimates between models is performed using the two-tailed t test with unequal variances. In training a prediction model, artificial noises |x \sim norm( {0,001} )$| are added to the genomic code matrix of the training samples to alleviate potential overfitting as advocated by Holmstrom and colleagues (31).
Results
Theoretical predictions
Using the proposed inter-cohort data augmentation strategy (Fig. 1), we obtain an MK-61+ marker set (or panel) for each one of the CGEMS, BPC3, and MEC-AA cohorts. These sets contain 18,916, 6,180, and 9,793 SNPs, respectively.
Regarding each cohort, we estimate the liability-scale heritability of prostate cancer susceptibility attributed to its MK-61+ SNP set as well as the common MK-61 and MK-AW sets. In the used mixed model, SNPs are included as the random factors. The age of a subject at diagnosis (for “case” samples) or enrollment (for “control” samples) and 10 variables representing within-population genetic stratification, that is, the top-10 principal components of the MK-AW–based genomic relationship matrix (32), are considered as the fixed effect factors.
Using the information of MK-61, MK-61+ or MK-AW sets, we calculate a cohort-specific 95% confidence interval (CI) of the prediction accuracy in terms of AUC. When computing the upper (or lower) bound, the heritability (|{h^2})\ $|estimate plus (or minus) its two-time standard error is used as the additive genetic variance. The disequilibrium among SNPs is also considered. This is, in using the equation (i) to calculate the coefficient of determination (|R_{\tilde{S},Y}^2)$|, we value the m with the number of SNPs after the data are pruned using the PLINK software. For a cohort-specific MK-61+ panel, the three parameters for independently pairwise variants pruning, that is, window size in variant count units, variant count to shift the window at the end of each step and variance inflation factor (VIF), are specified as 10, 5, and 2, respectively. For the MK-AW panel, these parameters are specified as 30, 5, and 2, respectively.
The results from the theoretical prediction are summarized in Table 1. The array-wide SNP set (MK-AW) shows a modest predictive strength with the upper bound of the AUC statistic lower than 0.57 in all the three cohorts, although the SNPs can explain at least 22.7% of the liability variance (the heritability |{h^2}\ = \ 0.227)$|. The small MK-61 panel shows a significant predictive strength in the two European cohorts with the AUC values higher than 0.62, although its heritability estimates are less than 0.1. According to the heritability estimates, the genetic variances contributed by MK-61+ panels are about 70%–80% of the contribution from the MK-AW set. Their performance is slightly better than or similar to that of the MK-61 panel in CGEMS and MEC-AA cohorts. Because of the relatively large standard error of the heritability estimate, the predictive strength of the MK-61+ panel in BPC3 is elusive. That is, although the upper bound (0.736) of the AUC statistic is very promising for a translational application, the lower bound (0.527) indicates great uncertainty. This dilemma demands clarification by further analyses via empirical predictions.
Quantitative genetic model-based estimates of AUC statistics.
Data and marker set . | #Sample . | Case ratioa . | #SNP . | Cb . | h2 . | σh2 . | AUC . | AUC CI.95 . |
---|---|---|---|---|---|---|---|---|
CGEMS: MK-61 | 2245 | 0.51 | 61 | — | 0.075 | 0.018 | 0.621 | 0.578–0.655 |
CGEMS: MK-61+ | 2245 | 0.51 | 18,916 | 0.59 | 0.664 | 0.058 | 0.64 | 0.617–0.663 |
CGEMS: MK-AW | 2245 | 0.51 | 541,519 | 0.28 | 0.791 | 0.194 | 0.547 | 0.525–0.56 |
BPC3: MK-61 | 7240 | 0.38 | 61 | — | 0.069 | 0.014 | 0.629 | 0.596–0.656 |
BPC3: MK-61+ | 7240 | 0.38 | 6,180 | 0.64 | 0.252 | 0.107 | 0.642 | 0.527–0.736 |
BPC3: MK: AW | 7240 | 0.38 | 541,519 | 0.28 | 0.343 | 0.063 | 0.537 | 0.524–0.55 |
MEC-AA: MK-61 | 4769 | 0.48 | 61 | — | 0.031 | 0.009 | 0.572 | 0.539–0.596 |
MEC-AA: MK-61+ | 4769 | 0.48 | 9,793 | 0.77 | 0.158 | 0.013 | 0.556 | 0.547–0.565 |
MEC-AA: MK-AW | 4769 | 0.48 | 541,519 | 0.42 | 0.227 | 0.097 | 0.515 | 0.502–0.528 |
Data and marker set . | #Sample . | Case ratioa . | #SNP . | Cb . | h2 . | σh2 . | AUC . | AUC CI.95 . |
---|---|---|---|---|---|---|---|---|
CGEMS: MK-61 | 2245 | 0.51 | 61 | — | 0.075 | 0.018 | 0.621 | 0.578–0.655 |
CGEMS: MK-61+ | 2245 | 0.51 | 18,916 | 0.59 | 0.664 | 0.058 | 0.64 | 0.617–0.663 |
CGEMS: MK-AW | 2245 | 0.51 | 541,519 | 0.28 | 0.791 | 0.194 | 0.547 | 0.525–0.56 |
BPC3: MK-61 | 7240 | 0.38 | 61 | — | 0.069 | 0.014 | 0.629 | 0.596–0.656 |
BPC3: MK-61+ | 7240 | 0.38 | 6,180 | 0.64 | 0.252 | 0.107 | 0.642 | 0.527–0.736 |
BPC3: MK: AW | 7240 | 0.38 | 541,519 | 0.28 | 0.343 | 0.063 | 0.537 | 0.524–0.55 |
MEC-AA: MK-61 | 4769 | 0.48 | 61 | — | 0.031 | 0.009 | 0.572 | 0.539–0.596 |
MEC-AA: MK-61+ | 4769 | 0.48 | 9,793 | 0.77 | 0.158 | 0.013 | 0.556 | 0.547–0.565 |
MEC-AA: MK-AW | 4769 | 0.48 | 541,519 | 0.42 | 0.227 | 0.097 | 0.515 | 0.502–0.528 |
aThe ratio of the number of case subjects to the cohort size.
bCoefficients for adjusting the sizes of MK-61+ and MK-AW marker sets (or panels) used in calculating the AUC statistics. They are the ratios of the SNP numbers counted after the data are pruned using the PLINK software to the SNP numbers (#SNP) before the pruning. For example, in the scenario of “CGEMS: MK-61+,” the marker set size (m) used in the Equation (i) is 18,916 × 0.59 = 11,160. h2 is the trait heritability.
Empirical predictions
We conduct empirical predictions for prostate cancer on the CGEMS, BPC3, and MEC-AA cohort data using SVM and MLR methods. Regarding SNP markers, the analysis is focused on the MK-61 and MK-61+ sets. The MK-AW set is not involved in the analysis due to the following reasons. First, no previous study has shown that a whole-array SNP set has a significant predictive strength for prostate cancer. Second, the size of the MK-AW set is too large for implementing SVM and MLR. Finally, our preliminary study shows that a marker panel selected on the information of an intra-cohort training set performs poorly when applied to an independent sample set of the same cohort.
The empirical results are summarized in Table 2. The calculation of AUC is performed on the output (i.e., PGSs of validation samples) of each individual round of the 10-fold cross-validation. From the table, we have the following observations. First, the employed methods, that is, SVM and MLR, achieve similar performance on all the three cohorts. Second, MK-61+ sets have an advantage over the MK-61 set in the CGEMS and BPC3 cohorts but not in the MEC-AA cohort. Finally, all the empirical AUC statistics fall within the corresponding 95% CIs of the theoretical AUC statistics (see Table 1). In particular, for the scenario of the BPC3 cohort and MK-61+ set, the empirical AUC (0.703) is close to the upper bound of the theoretical AUC.
Empirical AUC statistics estimated using SVM or MLR methods via the 10-fold cross-validation strategy.
Data and marker set . | #Featuresa . | SVMb . | p-SVMc . | MLRb . | p-MLRd . |
---|---|---|---|---|---|
CGEMS: MK-61 | 21–29 | 0.597 (0.041) | 0.599 (0.042) | ||
CGEMS: MK-61+ | 2,370–2,517 | 0.63 (0.036) | 0.06 | 0.624 (0.046) | 0.20 |
BPC3: MK-61 | 40–45 | 0.634 (0.023) | 0.641 (0.02) | ||
BPC3: MK-61+ | 2,048–2,140 | 0.703 (0.015) | |3 \times {10^{ - 6}}$| | 0.695 (0.015) | |5 \times {10^{ - 5}}$| |
MEC-AA: MK-61 | 13–21 | 0.546 (0.025) | 0.57 (0.025) | ||
MEC-AA: MK-61+ | 1,218–1,291 | 0.541 (0.016) | 0.70 | 0.547 (0.018) | 0.05 |
Data and marker set . | #Featuresa . | SVMb . | p-SVMc . | MLRb . | p-MLRd . |
---|---|---|---|---|---|
CGEMS: MK-61 | 21–29 | 0.597 (0.041) | 0.599 (0.042) | ||
CGEMS: MK-61+ | 2,370–2,517 | 0.63 (0.036) | 0.06 | 0.624 (0.046) | 0.20 |
BPC3: MK-61 | 40–45 | 0.634 (0.023) | 0.641 (0.02) | ||
BPC3: MK-61+ | 2,048–2,140 | 0.703 (0.015) | |3 \times {10^{ - 6}}$| | 0.695 (0.015) | |5 \times {10^{ - 5}}$| |
MEC-AA: MK-61 | 13–21 | 0.546 (0.025) | 0.57 (0.025) | ||
MEC-AA: MK-61+ | 1,218–1,291 | 0.541 (0.016) | 0.70 | 0.547 (0.018) | 0.05 |
aThe range of the numbers of SNPs considered in all rounds of a 10-fold cross-validation process.
bThe mean and standard deviation of AUCs for the 10 hold-out validation subsets.
cP values for the comparison between the MK-61 and cohort-specific MK-61+ using SVM results and t test.
dP values for the comparison between the MK-61 and cohort-specific MK-61+ using MLR results and t test.
As showed in Fig. 2, we further examine the clinical utility of SNPs for prostate cancer prediction using the output of the SVM model. That is, for a specific combination of cohort and marker panel, the ROC, AUC, and other performance metrics such as false-positive rate (FPR; see Supplementary Text S1 for the definitions) are presented on the corresponding plots of this figure. Here, the (pseudo) PGSs of all subjects estimated from the 10-fold cross-validation are aggregated preceding the AUC computation. The default cutoff (i.e., zero) of PGSs for determining the disease status is used to calculate the FPR (1–specificity), false-negative rate (FNR, 1–sensitivity), positive prediction value (PPV), and negative prediction value (NPV). In the BPC3 cohort, the MK-61+ panel shows a promising outlook for practical applications with both the FPR and FNR being 0.35.
Performance evaluation of the MK-61 and MK-61+ panels in predicting prostate cancer occurrence. SVM and the 10-fold cross-validation strategy are used. The default cutoff (i.e., zero) for determining the disease status (i.e., case or control) is used to calculate the FPR, FNR, PPV, and NPV. The red and black curves represent the results obtained using MK-60 and MK-61+ panels, respectively.
Performance evaluation of the MK-61 and MK-61+ panels in predicting prostate cancer occurrence. SVM and the 10-fold cross-validation strategy are used. The default cutoff (i.e., zero) for determining the disease status (i.e., case or control) is used to calculate the FPR, FNR, PPV, and NPV. The red and black curves represent the results obtained using MK-60 and MK-61+ panels, respectively.
Integrating PSA and SNPs
Using the data of the CGEMS cohort, which is a subset of the PLCO project, we compare the prediction strength for prostate cancer occurrence using PSA, SNPs, or the combinations of them. The performances of the two aforementioned methods for integrating PSA and SNP information (i.e., MLR-psa&snp and SVM-MLR-psa&snp) are also compared. We retrieve the PSA levels from the PLCO database (Supplementary Text S2) and set the maximum PSA value to be 45 (ng/mL). Prediction and performance assessment procedures are repeated on 50 working datasets (WD). Each of these WDs contains 860 randomly sampled controls and 860 “semi-randomly” sampled cases from the CGEMS data (Supplementary Text S3). This analysis scheme is designed to alleviate the potential bias due to the deviation of distribution profile (median = 5.48 ng/mL) of PSA levels of the CGEMS case subjects from the profile (median = 4.43 ng/mL) of all the prostate cases in the PLCO data (Supplementary Fig. S1).
For a marker set (MK-61 or MK-61+), the 10-fold cross-validation and the two integration methods, that is, MLR-psa&snp and SVM-MLR-psa&snp, are applied to each WD. AUC statistic is calculated by the psa-PGSs of all the subjects in a WD. Moreover, a specific AUC for each one of four PSA-based subject classes, that is, |{\rm{PSA}} \,\lt \,2.0{\rm{ng}}/{\rm{mL}}$|, |{\rm{PSA}} \ge 2.0{\rm{\ ng}}/{\rm{mL}}$|, 4.0|{\rm{ng}}/{\rm{mL}} \gt {\rm{PSA}} \ge 2.0$|ng/mLand PSA ≥ |4.0{\rm{\ ng}}/{\rm{mL}}$|, is calculated. In each scenario (i.e., the combination of a method and a subject class), the AUCs from the 50 WDs are aggregated by computing the mean and standard deviation. It is worth noting that, in defining the PSA-based subject classes, we set the cutoffs of blood PSA levels at 2.0 and 4.0 ng/mL. The two cutoffs are the ends of the range of the typically adopted criterions (https://www.cancer.org/cancer/prostate-cancer/detection-diagnosis-staging/tests.html), above which a biopsy is often recommended.
Empirical results from the integrative analysis for the MK-61+ and MK-61 panels are summarized in Table 3 and Supplementary Table S1, respectively. As indicated by the AUC estimates, the MK-61+ panel is able to complement PSA levels in predicting the risk of prostate cancer for men with a PSA level over 2.0 ng/mL. In particular, for the subject class of |{\rm{PSA}} \ge 4.0\,{\rm{ng}}/{\rm{mL}}$|, further stratifying men on their PSA levels cannot lead to additional benefit for distinguishing case and control samples but the predictive strength of the MK-61+ panel still holds. As to integration methods, SVM-MLR-psa&snp slightly outperform MLR-psa&snp. Similar to the MK-61+ panel, the MK-61 panel maintains its AUC in the subject class of PSA ≥ 4.0 ng/mL. However, it is unable to complement PSA levels in achieving higher prediction performance in other subject classes.
. | LR-psa . | SVM-snp . | MLR-snp . | SVM-MLR-psa&snp . | SVD-MLR-psa&snp . |
---|---|---|---|---|---|
All subjects | 0.885 (0.003) | 0.623 (0.015)*** | 0.614 (0.013)*** | 0.886 (0.005) | 0.838 (0.007)*** |
PSA < 2.0 ng/mL | 0.680 (0.006) | 0.595 (0.02)*** | 0.581 (0.02)*** | 0.683 (0.016) | 0.652 (0.018)*** |
PSA |\ge $|2.0 ng/mL | 0.756 (0.008) | 0.616 (0.021)*** | 0.604 (0.014)*** | 0.762 (0.014)* | 0.701 (0.013)*** |
4.0 > PSA |\ge $|2.0 ng/mL | 0.623 (0.009) | 0.606 (0.024)*** | 0.593 (0.018)*** | 0.642 (0.019)*** | 0.608 (0.018)*** |
PSA |\ge \ $|4.0 ng/mL | 0.553 (0.012) | 0.617 (0.027)*** | 0.608 (0.023)*** | 0.598 (0.028)*** | 0.619 (0.021)*** |
. | LR-psa . | SVM-snp . | MLR-snp . | SVM-MLR-psa&snp . | SVD-MLR-psa&snp . |
---|---|---|---|---|---|
All subjects | 0.885 (0.003) | 0.623 (0.015)*** | 0.614 (0.013)*** | 0.886 (0.005) | 0.838 (0.007)*** |
PSA < 2.0 ng/mL | 0.680 (0.006) | 0.595 (0.02)*** | 0.581 (0.02)*** | 0.683 (0.016) | 0.652 (0.018)*** |
PSA |\ge $|2.0 ng/mL | 0.756 (0.008) | 0.616 (0.021)*** | 0.604 (0.014)*** | 0.762 (0.014)* | 0.701 (0.013)*** |
4.0 > PSA |\ge $|2.0 ng/mL | 0.623 (0.009) | 0.606 (0.024)*** | 0.593 (0.018)*** | 0.642 (0.019)*** | 0.608 (0.018)*** |
PSA |\ge \ $|4.0 ng/mL | 0.553 (0.012) | 0.617 (0.027)*** | 0.608 (0.023)*** | 0.598 (0.028)*** | 0.619 (0.021)*** |
aAUC statistics and the standard deviations (in parentheses) in the columns 2–6 are calculated according to the results obtained on 50 working datasets (WD) sampled from the CGEMS (PLOCO) cohort.
bLR-psa: a simple logistic regression model is used with PSA level as the only predictive variable. SVM-snp (or MLR-snp): SVM (or MLR) is used with SNPs in the MK-61+ panel as the predictive features (variables). SVM-MLR-psa&snp, SVD-MLR-psa&snp: see the descriptions in the Materials and Methods section.
cIn comparing the results of different methods, a t test is performed with LR-psa as the control and another method as the treatment. The best AUCs for entire cohort and a PSA-based subject category are marked with bold. *, **, and *** indicate P value being 0.05, <0.01, and <0.001, respectively.
To assess the clinical utility of the aforementioned complementation of the MK-61+ panel to PSA in predicting prostate cancer for the males with PSA ≥ 2.0 ng/mL, we used the output from the SVM-MLR-psa&snp algorithm to calculate the FNR when the cutoff of the psa-PGSs for determining the disease status is set in such a manner that FPR is at the level of 0.25. The mean (|\bar{\rm{ {x}}})$| and standard deviation (sd) of the FNR estimates over the 50 working datasets are 0.356 and 0.033, respectively. Given the same FPR setting, the corresponding descriptive statistics (|{\bar {\rm{x}}\ {\rm and}\ sd}}){\rm{\ }}$|of FNR estimates from LR.psa (i.e., using PSA level alone) are 0.377 and 0.02, respectively. The difference (0.356–0.377 = 0.021) between the two sets of FNR estimates is extremely significant (two-tailed t test P < 0.001). Such a comparative result still holds when the FPR is set at other reasonable levels such as 0.20 or 0.30.
Scatter chart illustration for clinical utility of SNP markers
In this subsection, we present a graphical method to integrate PSA levels and SNP information to stratify subjects for prostate cancer diagnosis. This strategy may facilitate the integrative application, especially in optimizing large-scale prostate cancer screening plans. SNP information is summarized with PGSs that can be estimated using any existing or to-be-developed methods. As the SVM-based estimates are somewhat better than the MLR-based estimates in complementing PSA to predict prostate cancer occurrence (Table 3), we use the SVM method in the following analysis.
Given a cohort data, the (pseudo) PGSs of the involved subjects are estimated via the 10-fold cross-validation. A scatter chart is created with PGSs on the x-axis and log2 (PSA levels ) − 1on the y-axis, where the PSA levels are limited with the maximum value of 32|\ {\rm{ng}}/{\rm{mL}}$|. The chart area is partitioned into four regions by the zero-crossing horizontal and vertical lines. The top-left region is further portioned by a diagonal line. In this way, the subjects can be divided into five clusters according to their position on the chart (Fig. 3A). The stratification can also be depicted using a tree model (Fig. 3B).
The chart-partition–based tree model for integrating PSA levels and SNP information to stratify males for the risk prediction of prostate cancer. PGS denotes (pseudo) PGS estimated using the SVM and 10-fold cross-validation strategy. t.PSA is calculated as |lo{g_2}( {{\rm{PSA\ level}}} ) - 1$|, where the PSA levels are limited with the maximum value being 32 |{\rm{ng}}/{\rm{mL}}$|. A, Red and black dots denote case and control subjects, respectively. 500 randomly selected case or control subjects are mapped on the plot.
The chart-partition–based tree model for integrating PSA levels and SNP information to stratify males for the risk prediction of prostate cancer. PGS denotes (pseudo) PGS estimated using the SVM and 10-fold cross-validation strategy. t.PSA is calculated as |lo{g_2}( {{\rm{PSA\ level}}} ) - 1$|, where the PSA levels are limited with the maximum value being 32 |{\rm{ng}}/{\rm{mL}}$|. A, Red and black dots denote case and control subjects, respectively. 500 randomly selected case or control subjects are mapped on the plot.
Taking a working dataset (one of those used in the “Integrating PSA and SNPs” subsection) sampled from the CGEMS cohort as the example, we calculate several statistics for each one of the clusters determined by the chart partition–based tree model. These statistics include the number of case subjects (|{N_{case}}$|), the number of control subjects (|{N_{ctr}}$|), the rate of cases |({R_s}\ = \ {N_{case}}/( {{N_{case}} + {N_{ctr}}} ))$| and the expected rate of men with prostate cancer diagnosis in European American (EA) population (|{R_p})$|. |{R_p}$| is calculated with |{R_p} = K{N_{case}}/( {K{N_{case}} + ( {1 - K} ){N_{ctr}}} )$|, in which K ( =0.14) denotes prostate cancer prevalence.
The clinical utility of the chart partition–based stratification can be scrutinized from a comparison with a stratification in which only PSA information is used (Supplementary Table S2). When the PSA cutoff for prostate cancer occurrence is set at 4.0 ng/mL, the FDR (1-PPV) in the cohort and in the population are 0.117 (1–0.883) and 0.486 (1–0.514), respectively. While these FDR measures are not very poor, the false omission rates (FOR, 1-NPV) are too high for prostate cancer diagnosis. Although the value in the population is only 0.048, it could lead to an FNR of 0.32 (|\frac{{279}}{{( {279 + 581} )}}$|), that is, a missed diagnosis for 32% cancer cases. On the other hand, when the PSA cutoff for prostate cancer occurrence is set at 2.0 ng/mL, the FOR as well as FNR are much better but the high FDR, which is 0.691 in the population, may lead to a great number of misdiagnoses. Using our method, the predicted negative subjects (clusters IV and V) have a FOR similar to the result achieved by the |PSA\,\lt \,2$| ng/mL criterion, and the predicted positive subjects (clusters I and II) have an FDR close to the result obtained by the |PSA \ge 4$| ng/mL criterion.
Discussions
We carried out a comprehensive analysis of three public case–control cohort datasets toward clarifying the utility of germline polymorphisms in prostate cancer diagnosis. As to the predictive strength of a pre-selected SNP panel for prostate cancer occurrence, our theoretical and empirical results suggest that an accuracy with |{\rm{AUC}} \gt 0.65$| is practically expectable even when the prediction is based on a training set with a modest sample size (N < 10,000). Our results also suggest the potential clinical utility of SNPs for the males of European ancestry. First, on the relatively large BPC3 cohort (N = 7,240), the MK-61+ panel achieved an encouraging prediction performance with both FPR and FNR simultaneously being 0.35. As SNP genotypes of a subject can be measured at any age, such a result has a special implication for prostate cancer prevention. That is, a male could be informed of his risk level for the disease when he is very young. Second, on the modestly sized (N = 2,245) CGEMS cohort with publically available PSA information, we found that MK-61+ panel could complement PSA in predicting prostate cancer for the males with PSA ≥ 2.0 ng/mL. Moreover, we proposed a simple graphic method to facilitate the integrative application of PSA and SNPs in prostate cancer screening.
However, there is a limitation regarding the clinical utility of SNPs presented in this study. Metastasis is a primary cause of morbidity and mortality for patients with prostate cancer. Nearly all men with metastatic prostate cancer develop resistance to androgen deprivation therapy, a state known as metastatic castration-resistant prostate cancer (mCRPC; refs. 33, 34). The genetic evaluation of a male for the risk of mCRPC and other aggressive forms of prostate cancer may be more important than predicting the occurrence of the cancer. Our results do not provide insights on this issue. A major challenge in using SNPs to predict mCRPC is that a substantial set of markers associated solely with this cancer form has not been established yet.
A genome-wide whole-array SNP panel (MK-WA) can explain over 20% of the variance of prostate cancer susceptibility but usually is lacking of predictive strength for cancer occurrence in individual man if it is not refined by a preceding feature selection step (10). In this study, we proposed an inter-cohorts data augmentation strategy to complement the risk SNP panel (MK-61) toward potential improvement in its practical utility. Theoretical and empirical prediction results show that a reinforced marker panel (MK-61+) obtained by this strategy has remarkable advantage over the MK-61 and MK-WS panels. The augmentation method can be flexibly implemented and has room for improvement, as scrutinized in the following. Given a cohort data (e.g., Cohort-A), the MK-61+ panel could have multiple versions derived from varied extra-cohort datasets, separately or jointly. The favorable version would be the one with a size comparable with others but can explain more variance of prostate cancer susceptibility in Cohort-A. Moreover, the functional relevance of an SNP, such as if this SNP is located on a microRNA-binding site or is in strong linkage disequilibrium with other variants at such a site, could be considered in prioritizing of SNPs on which the data augmentation is performed.
It is evident that the AUC improvement due to the marker augmentation is less significant when the MK-61+ panel is applied to the CGEMS cohort, compared with the BPC3 one. The underlying reason could be ascribed to the limited sample size of CGEMS such that a large feature set could not sufficiently demonstrate its advantage in cancer prediction, as suggested by the theory in the reference (22).
In the empirical prediction, we applied two analysis methods, that is, MLR and SVM, to each cohort dataset. The trivial differences between the results of MLR and SVM indicate that the predictive strength of an SNP panel may be robust to the employment of different classification algorithms. Nevertheless, this observation does not rule out the possibility of an enhanced polygenic prediction of the disease by using more advanced deep learning methods. Montañez and colleagues (35) recently published a relevant study, in which a multilayer feedforward neural network was adopted to predict human obesity using SNP information. From a marker panel of 2,465 SNPs, they obtained a very high-predictive result (AUC = 0.99). However, the work was somewhat flawed due to the ignorance of a well-known pitfall in predicting complex traits from SNPs (36). That is, the holdout validation (and test) samples were only excluded from the model training step but not from the feature selection step.
Theoretically, even adopting our feature selection strategy, the polygenic prediction of prostate cancer occurrence could not be completely free from potential overfitting. This problem arose from the fact that the information of all the subjects in the CGEMS and BPC3 cohorts was used when some members of the base marker set, that is, the MK-61, were initially identified. This problem could be further scrutinized from two aspects. First, each of the original GWAS studies of these cohorts (23, 24) contributed two SNPs to the MK-61 panel. Our additional analysis indicated that such a situation did not impact the validity of the conclusions of this study (see Supplementary Text S4). Second, about 30% of SNPs in the MK-61 panel were identified by a few comprehensive GWASs (e.g., refs. 37, 38) that combined the CGEMS and/or BPC3 datasets and the data of other cohorts. On the basis of the available information, we were unable to determine which significant associations would be missing if the CGEMS (or BPC3) cohort had been excluded from the analyses. Nevertheless, it should not be too bold to assume that the risk of overfitting due to this confusion in marker panel testing was trivial. The reason was that the effect of the marker SNPs identified by the meta-analyses on the disease status stratification of the subjects in the CGEMS (or BPC3) cohort was less significant than that of the SNPs directly identified by the original study, which only slightly influenced our results as shown by Supplementary Text S4.
It has been challenging to get clinical benefits from the integration of PSA and SNP information. Compared with PSA, the predictive strength of individual SNPs for risk of prostate cancer is much weak. If both of them are included in the same statistical classification model, the risk alleles' effects could be masked by the effect of PSA. This is indicated by our results from using the MLR-psa&snp method. Two-model approach may be a preferable option and was adopted in a recent publication, where a case–control data of approximately 5K Finnish men were analyzed (9). The authors first implemented a logistic regression model to scan 66 risk loci to get the estimates of risk allele effects on prostate cancer occurrence, from which the PGSs of case and control subjects were calculated. Then, another logistic regression model was used to integrate PSA and PGS to establish a classifier, on which the integrative risk scores of validation samples were estimated. Unfortunately, the holdout validation samples were not excluded from the step of estimating allele effects such that the estimated predictive accuracy could be too good (AUR = 0.967) to be accepted with confidence. Our analysis deliberately avoids such a pitfall when implementing the proposed two-model method, that is, SVM-MLR-psa&snp algorithm. Our algorithm makes use of the merits of SVM and logistic regression. The former is powerful in that it can simultaneously handle thousands of independent or correlated features and capture the non-linear relationships among variables. The latter is suitable for low-dimension data and the results are interpretable.
It is well-known that AA patients with prostate cancer have a higher mortality rate than EA patients (39). This racial disparity can be further aggravated by the in-equivalence of screening efficiency for the disease. As shown by this study, although SNP markers seem helpful to prostate cancer screening and/or diagnosis in EA men, their application prospect in AA men is still elusive due to the low predictive strength of an SNP panel when it is applied to the MEC-AA cohort. An explanation for this phenomenon is that, in the AAs' data, only a small fraction (<16%) of the variance of prostate cancer susceptibility can be explained by the used SNP panels, MK-61 or MK-61+. In an additional analysis, we compared the lists of top 200 (potential) risk loci for the three cohorts analyzed here. The results showed that five loci were shared by the lists of the CGEMS and DPC3, but only one locus (or two loci) was shared by the lists of the CGEMS and MEC-AA (or the BPC3 and MEC-AA). However, this does not imply that the SNPs associated with PCs in AAs are fundamentally different from those associated with the cancer in EAs. Although several published studies revealed differences between AAs' tumors and EAs' tumors in the expression levels of a few dozens of genes, some of which have cancer-related biological functions (40, 41), we did not find evidence that could relate these expression disparities to the SNPs that were solely strongly associated with prostate cancer occurrence in AAs or EAs. It is also worth noting that the expression disparities reported in literature could be subject to the bias due to the existence of confounding variables such as Gleason patterns and other clinicopathologic factors (42). Nevertheless, we believe that the situation may undergo some positive changes in the future. With the accumulation of more AA genomic data, more efficient SNP panels for different racial groups, especially AAs, can be pinpointed.
Authors' Disclosures
O. Sartor reports grants and personal fees from Advanced Accelerator Applications, Bayer, Constellation, Dendreon, Janssen, Progenics, and Sanofi; as well as reports personal fees from Astellas, Blue Earth Diagnostics, Bavarian Nordic, Bristol Myers Squibb, Clarity Pharmaceuticals, Clovis, EMD Serono, Fusion, Isotopen Technologien Meunhen, Myovant, Myriad, Noria Therapeutics, Inc., Novartis, Noxopharm, POINT Biopharma, Pfizer, Tenebio, Telix, Theragnostics; and reports grants from Invitae, AstraZeneca, Endocyte, Merck, SOTIO, during the conduct of the study. No disclosures were reported by the other authors.
Authors' Contributions
W. Zhang: Conceptualization, data curation, software, formal analysis, visualization, methodology, writing–original draft. Y. Dong: Validation, methodology, writing–review and editing. O. Sartor: Validation, methodology, writing–review and editing. K. Zhang: Conceptualization, formal analysis, supervision, funding acquisition, methodology, writing–original draft, writing–review and editing.
Acknowledgments
This research is supported by the NIH grants 5U54MD007595, 5P20GM103424, and 5U19AG055373, and the DOD ARO grant W911NF-20–1-0249. The authors thank the NIH and ARO for the funding support and the PLCO, CGEMS, BPC3, and MEC research groups for their datasets. The authors are grateful to the two reviewers for their constructive comments that have significantly improved this article.
The costs of publication of this article were defrayed in part by the payment of page charges. This article must therefore be hereby marked advertisement in accordance with 18 U.S.C. Section 1734 solely to indicate this fact.