Knowing whether a protein can be processed and the resulting peptides presented by major histocompatibility complex (MHC) is highly important for immunotherapy design. MHC ligands can be predicted by in silico peptide–MHC class-I binding prediction algorithms. However, prediction performance differs considerably, depending on the selected algorithm, MHC class-I type, and peptide length. We evaluated the prediction performance of 13 algorithms based on binding affinity data of 8- to 11-mer peptides derived from the HPV16 E6 and E7 proteins to the most prevalent human leukocyte antigen (HLA) types. Peptides from high to low predicted binding likelihood were synthesized, and their HLA binding was experimentally verified by in vitro competitive binding assays. Based on the actual binding capacity of the peptides, the performance of prediction algorithms was analyzed by calculating receiver operating characteristic (ROC) curves and the area under the curve (AROC). No algorithm outperformed others, but different algorithms predicted best for particular HLA types and peptide lengths. The sensitivity, specificity, and accuracy of decision thresholds were calculated. Commonly used decision thresholds yielded only 40% sensitivity. To increase sensitivity, optimal thresholds were calculated, validated, and compared. In order to make maximal use of prediction algorithms available online, we developed MHCcombine, a web application that allows simultaneous querying and output combination of up to 13 prediction algorithms. Taken together, we provide here an evaluation of peptide–MHC class-I binding prediction tools and recommendations to increase prediction sensitivity to extend the number of potential epitopes applicable as targets for immunotherapy.
Immunotherapy has emerged over the past decades to be a promising approach to personalize treatment of cancer patients. The key prerequisite of successful immunotherapy is a tumor-specific antigen that allows induction and focusing of an immune attack specifically against tumor cells. Such tumor-specific antigens could be either viral proteins, in the case of virus-driven malignancies, or mutation-derived neoantigens.
Not every possible antigenic peptide will be processed and presented on the cell surface by major histocompatibility complex (MHC) molecules and not all MHC-presented peptides are immunogenic T-cell epitopes. Immunogenicity is dependent on several factors, including protein expression, antigen processing and transport, peptide–MHC binding affinity and stability of the resulting complex, peptide competition for MHC binding, as well as the T-cell receptor (TCR) repertoire (1). Several methods have been developed to identify and to predict T-cell epitopes from a given source protein. Based on in vitro assays, such as competitive binding assays and ELISpot assays, MHC-binding affinity and immunogenicity of peptides can be experimentally assessed (2, 3). However, given the large variety of antigens and MHC allotypes, in vitro testing of all possible candidates is not feasible (4). For this reason, various in silico algorithms have been developed to predict the peptides resulting from the different steps involved in epitope presentation (5–10). The prediction of the binding likelihood of a given peptide to a specific MHC molecule is based on identified binding preferences and anchor motifs of MHC molecules (11). One of the first computational methods to predict MHC binding is SYFPEITHI (12). Other algorithms for the prediction of MHC class-I binding use various statistics, machine learning approaches, and training data sets. They facilitate prescreening of antigens and are therefore an element of (neo)epitope identification and prioritization strategies used in studies and clinical trials in the field of personalized immunotherapies (13–19).
Mass spectrometry (MS) is used to analyze the repertoire of peptides presented on the cell surface by detecting ligands eluted from MHC complexes. In contrast to MHC-binding prediction, MS identification of MHC ligands naturally includes the potentially selective steps of antigen processing and transport (reviewed in refs. 20 and 21). New MHC class-I binding predictors have been trained on MS data, with the aim to improve prediction of ligands presented by MHC (22–25). Common thresholds for discriminating binders from nonbinders indicated by the algorithms are a half-maximal inhibitory concentration (IC50) of ≤50 nM for identifying strong binders (“strong binding affinity threshold”), ≤500 nM for identifying strong and weak binders (“intermediate binding affinity threshold”), and < 5,000 nM for identifying any possible binder (“low binding affinity threshold”; ref. 26). A comparison of MS data and predicted HLA-binding affinity by Bassani-Sternberg and colleagues revealed that the most commonly used threshold for predicted binding affinity (IC50: 500 nM) is far too stringent for some HLA types (27).
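The three commonly used IC50 cutoffs described above can be expressed as a small classifier. This is an illustrative sketch, not part of any cited tool; the function name and category labels are ours.

```python
def classify_predicted_affinity(ic50_nm):
    """Map a predicted IC50 (nM) to the binding categories implied by the
    common decision thresholds: <=50 nM strong, <=500 nM weak (intermediate
    threshold), <5,000 nM any possible binder (low threshold)."""
    if ic50_nm <= 50:
        return "strong binder"
    if ic50_nm <= 500:
        return "weak binder"
    if ic50_nm < 5000:
        return "possible binder"
    return "nonbinder"

print(classify_predicted_affinity(30))    # strong binder
print(classify_predicted_affinity(1200))  # possible binder
```

Note that, as Bassani-Sternberg and colleagues observed, where exactly these cutoffs should sit is itself HLA-type dependent.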
Predictors are regularly updated, and methods such as MHCflurry, MSIntrinsic, or MixMHCPred are being developed (23, 24, 28). This prompted us to conduct a performance evaluation of several widely used MHC class-I binding prediction methods available online. Based on experimentally assessed binding affinity data of 743 peptides for the seven major HLA class-I types, we calculated the sensitivity, specificity, and accuracy of the predictors depending on HLA type and peptide length (29). MHC-binding predictors have been benchmarked and reviewed previously (30–34). Weekly automated benchmarking is performed for new entries to the immune epitope database (IEDB) but often covers only a small sample of peptides and HLA types (35). Here, we calculated how different common decision thresholds affect the sensitivity, specificity, and accuracy of peptide–MHC prediction results and how sensitivity can be increased by using new individually calculated decision thresholds. Additionally, we developed MHCcombine, a web application for the combination of output from several predictors, to facilitate the simultaneous use and comparison of multiple MHC class-I prediction algorithms. Thereby, we provide the means to increase the number of true HLA ligands in the prediction output, and thus the number of potential T-cell epitopes considered as candidates for immunotherapy.
Materials and Methods
The objective of this study was to provide a performance evaluation of publicly available MHC ligand prediction algorithms, based on a newly generated experimental MHC–peptide binding data set that is independent of any algorithm training data. The experimental binding data set was generated in the context of a project on therapeutic HPV16 vaccine design. Thus, it includes peptides derived from the HPV16 proteins E6 and E7, binding to the five major HLA class-I supertypes, represented by seven HLA class-I alleles. The web application MHCcombine was developed to facilitate simultaneous querying of multiple prediction methods and systematic combination of their output. Likelihood of binding to each of the seven HLA molecules was predicted by 13 algorithms—NetMHC 4.0 (36, 37), NetMHC 3.4 (38, 39), NetMHCpan 4.0 (22), NetMHCpan 3.0 (40, 41), NetMHCpan 2.8 (40, 42), NetMHCcons 1.1 (43), PickPocket 1.1 (44), IEDB recommended (45), IEDB consensus (46), IEDB SMMPMBEC (47, 48), IEDB SMM (49), MHCflurry 1.2 (28), and SYFPEITHI (12), all listed in Supplementary Table S1—for all 956 possible 8- to 11-mer peptides derived from E6 and E7. Depending on the respective output format of the predictor, the decision thresholds indicated by the algorithms are either an IC50 ≤ 500 nM or a median percentile rank ≤ 2. In a first step, all peptides predicted to be binders within these thresholds were experimentally assessed for binding. As this resulted in the identification of binders beyond the threshold of individual algorithms, we extended the experimental binding analysis. Data collection was stopped when only nonbinders could be detected by experimental validation. For this reason, the experimental data are concentrated on the top list of predicted binding affinities, and sample sizes vary for each HLA type and peptide length. Each experimental affinity determination was carried out with at least three biological replicates for binders and two biological replicates for nonbinders.
Positive and negative controls were included in each binding assay, and all data points from experiments with valid control results were included in the analysis. Overall, experimental binding affinity has been determined for 743 peptides in the context of a specific HLA type (Supplementary Table S2). Based on the experimental results, predictions were categorized into true positives, false positives, true negatives, and false negatives. To improve prediction sensitivity, we defined new threshold recommendations individually calculated for each prediction algorithm, HLA type, and peptide length. The performance of these new thresholds was cross-evaluated by bootstrapping (100× resampling) of the HPV data set, in order to generate results that are representative for the whole data population.
Development of the web-based tool “MHCcombine”
The web application MHCcombine was developed to allow simultaneous querying of multiple online available peptide–MHC binding prediction algorithms hosted on different servers. The tool automatically combines the output of up to 13 selectable methods. It returns the combined output in the format of comma-separated values (.csv), which can be read by any text and spreadsheet editor. MHCcombine can be accessed online via http://mhccombine.dkfz.de/mhccombine/.
For the prediction of HLA class-I ligands, 13 online accessible prediction algorithms were used (accessed March 23, 2018). This study included predictive methods based on artificial neural networks [ANN; NetMHC 4.0 (36, 37), NetMHC 3.4 (38, 39), NetMHCpan 4.0 (22), NetMHCpan 3.0 (40, 41), NetMHCpan 2.8 (40, 42), and MHCflurry 1.2 (28)], scoring matrices [PickPocket 1.1 (44)], the stabilized matrix method (IEDB SMM; ref. 49), SMM with a peptide:MHC binding energy covariance matrix (IEDB SMMPMBEC; refs. 47, 48), and SYFPEITHI (12), as well as two consensus methods [NetMHCcons 1.1 (43), IEDB consensus (46)]. We further used the method IEDB recommended, which aims to apply the best possible method for a given MHC molecule based on the availability of predictors and previously observed prediction performance [Consensus > NetMHC 4.0 > SMM > NetMHCpan 3.0 > CombLib (45)]. These algorithms were used to predict binding of all 8-, 9-, 10-, and 11-mer peptides derived from the model protein sequences HPV16 E6 and E7 (UniProtKB: P03126 and P03129, respectively) to the HLA class-I alleles A*01:01, A*02:01, A*03:01, A*11:01, A*24:02, B*07:02, and B*15:01. The output of the methods, the predicted likelihood of peptide–MHC binding, was expressed as a predicted half-maximal inhibitory concentration (IC50), a median percentile rank (by IEDB consensus and IEDB recommended), or a score (by SYFPEITHI). Initially, the threshold values for intermediate binding affinity indicated by the respective algorithms (IC50 ≤ 500 nM; median percentile rank ≤ 2) were applied to select peptides for synthesis and subsequent analysis. As we found binding peptides beyond the thresholds of individual algorithms (predicted to be binders by another algorithm), in a second step the thresholds were systematically lowered. This allowed testing of weaker predicted binding affinities until only nonbinders could be detected experimentally.
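The stated total of 956 candidate peptides follows directly from the lengths of the two source proteins (158 residues for E6 and 98 for E7, per the cited UniProtKB entries). A minimal sketch reproducing the count, using placeholder sequences since only length matters here:

```python
# Sketch: reproducing the count of all possible 8- to 11-mer peptides
# from HPV16 E6 and E7. Placeholder sequences stand in for the real ones;
# lengths (158 aa and 98 aa) are taken from UniProtKB P03126 / P03129.
def kmers(sequence, k):
    """Return all overlapping k-mer substrings of a sequence."""
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]

e6 = "X" * 158  # HPV16 E6, 158 aa
e7 = "X" * 98   # HPV16 E7, 98 aa

peptides = [p for protein in (e6, e7)
              for k in range(8, 12)       # lengths 8, 9, 10, 11
              for p in kmers(protein, k)]
print(len(peptides))  # 956, matching the number of peptides queried
```

Each of these 956 peptides was then submitted to all 13 predictors for each of the seven HLA alleles.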
Synthesis of peptides
All synthetic peptides used in this study were produced with a purity of >95%. For the solid-phase synthesis, the Fmoc strategy (50, 51) was used in a fully automated multiple synthesizer Syro II (MultiSyn Tech). The synthesis was carried out on preloaded Wang resins. 2-(1H-Benzotriazole-1-yl)-1,1,3,3-tetramethyluronium hexafluorophosphate (HBTU) was used as coupling agent. The material was purified by preparative HPLC on a Kromasil 100-10C 10 μm 120 Å reversed-phase column (20 × 150 mm) using an eluent of 0.1% trifluoroacetic acid in water (A) and 80% acetonitrile in water (B). Peptides were eluted with a successive linear gradient of 25% B to 80% B in 30 minutes at a flow rate of 10 mL/min. The fractions corresponding to the purified peptides were lyophilized. The purified material was characterized by analytical HPLC and MS (Thermo Finnigan LCQ). Peptides were dissolved in DMSO (Sigma) at 10 mg/mL and stored in small aliquots at −80°C.
The EBV-transformed B-lymphoblastic cell lines (B-LCLs) 1341-8346 (HLA-B*07:02), BSM (HLA-A*02:01, HLA-B*15:01), E481324 (HLA-A*01:01), EA (HLA-A*03:01), FH8 (HLA-A*11:01), LKT3 (HLA-A*24:02), and WT100BIS (HLA-A*11:01) were obtained from the International Histocompatibility Working Group Cell Bank (IHWG Cell Bank) in 2011 (BSM, EA, LKT3, WT100BIS), 2012 (FH8), and 2016 (1341-8346, E481324) and cultured in RPMI-1640 supplemented with 15% fetal bovine serum (both from Sigma), 1 mM sodium pyruvate, and 2 mM L-glutamine (both from Corning; B-LCL medium) under standard cell culture conditions. Cells were used for experiments from passages 2 to 12. Cell lines were regularly authenticated and confirmed to be free of Mycoplasma by SNP profiling and multiplex-PCR (latest in October 2018) by Multiplexion GmbH.
Competition-based peptide–HLA-binding assays
The binding affinity of synthesized test peptides to selected HLA class-I molecules was assessed in competition-based cellular binding assays as previously published (2, 52). These assays are based on the HLA class-I binding competition between a known high-affinity fluorescein-labeled reference peptide and the test peptide of interest. In brief, cells of a B-LCL with the desired HLA expression were stripped of naturally bound peptides by citric acid buffer treatment (0.263 mol/L citric acid and 0.123 mol/L Na2HPO4) at a specific pH (pH 3.1 for HLA-A*02:01, HLA-A*11:01, HLA-A*24:02, and HLA-B*07:02; pH 2.9 for HLA-A*03:01 and HLA-B*15:01). The cells were suspended at a concentration of 4 × 10⁵ cells/mL in B-LCL medium containing 2 μg/mL β2-microglobulin (MP Biomedicals) to reconstitute the HLA class-I complex. The cells were transferred to a 96-well plate, and a mixture of 150 nmol/L fluorescein-labeled reference peptide and serially diluted test peptide was added. Each test peptide was analyzed at eight different concentrations, ranging from 100 μmol/L to 0.78 μmol/L, in a minimum of three independent experiments for binders and a minimum of two for nonbinders. Fluorescence was measured by flow cytometry (FACS Canto II or FACS Accuri; BD Biosciences) and analyzed with FlowJo V10 (FlowJo, LLC). Background and maximum fluorescence were determined based on cells without peptide and cells with fluorescein-labeled reference peptide only, respectively. For every test peptide concentration, the mean percentage of reference peptide inhibition was calculated relative to the maximum fluorescence. The test peptide concentration that inhibits 50% of the binding of the fluorescein-labeled reference peptide was determined by nonlinear regression analysis (formula A; SigmaPlot V13.0, Systat Software). This half-maximal inhibitory concentration (IC50) indicates the binding affinity.
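The IC50 determination step can be sketched as follows. The exact regression model (formula A) is not reproduced in the text, so a standard one-site competition curve is assumed here for illustration, fitted by a crude log-grid least-squares search rather than SigmaPlot's nonlinear regression; the data are simulated, not from the study.

```python
# Hedged sketch of IC50 determination from a competition assay.
# Assumption: a one-site competition model stands in for formula A.
def inhibition(conc_um, ic50_um):
    """Percent inhibition of reference-peptide binding at a given
    test-peptide concentration (assumed one-site competition model)."""
    return 100.0 * conc_um / (conc_um + ic50_um)

# Eight 2-fold dilutions from 100 down to 0.78 umol/L, as in the assay.
conc = [100 / 2 ** i for i in range(8)]
measured = [inhibition(c, 12.0) for c in conc]  # simulated data, IC50 = 12

def sse(ic50):
    """Sum of squared errors between model and measured inhibition."""
    return sum((inhibition(c, ic50) - m) ** 2 for c, m in zip(conc, measured))

# Least-squares fit over a log-spaced grid of candidate IC50 values.
candidates = [10 ** (e / 200) for e in range(-200, 601)]  # 0.1 ... 1,000
ic50_fit = min(candidates, key=sse)
```

With real assay data, the fitted curve additionally has to account for background fluorescence and incomplete inhibition, which is why a dedicated regression model is used in practice.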
Peptides were classified as binders (IC50 ≤ 100 μmol/L) or nonbinders (IC50 > 100 μmol/L or nonlinear regression analysis not possible) according to ref. 52.
Performance evaluation of prediction algorithms
Based on the experimentally determined binding affinity of the peptides, predictions were assessed to be true (T) or false (F). This classification was used to generate receiver operating characteristic (ROC) curves for each analyzed method, HLA molecule, and peptide length (SigmaPlot V13.0). This curve plots the rate of true positives (TPR = TP/P; correctly predicted binders/total binders) on the y-axis versus the rate of false positives (FPR = FP/N; nonbinders incorrectly predicted as binders/total nonbinders) on the x-axis over all possible thresholds. Each possible decision threshold corresponds to a single point in ROC space with discrete values for sensitivity (=TPR) and 1 − specificity (=FPR). The capacity of each algorithm to discriminate between binders and nonbinders was analyzed by calculating the area under the ROC curve (AROC) as an estimate of prediction performance over the range of all possible decision thresholds. A good predictor should predict a high rate of true positives (TPR, sensitivity) and a low rate of false positives (FPR, 1 − specificity). The perfect ROC curve would pass through the coordinate (0, 1) and consequently have an AROC of 1. The use of AROC values for the evaluation of machine learning algorithms is well established (53). According to Lin and colleagues, AROC values above 0.9 indicate excellent prediction capability, whereas values between 0.9 and 0.8 show intermediate prediction performance and values below 0.8 indicate poor prediction capability (54).
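The AROC described above can equivalently be computed as the Wilcoxon–Mann–Whitney statistic: the probability that a randomly chosen binder receives a better (lower) predicted IC50 than a randomly chosen nonbinder. This is a sketch of that equivalence, not the SigmaPlot procedure used in the study; function and variable names are ours.

```python
def auc_roc(pred_ic50, is_binder):
    """AROC via the Wilcoxon-Mann-Whitney statistic: the fraction of
    (binder, nonbinder) pairs in which the binder receives the lower
    predicted IC50; ties count one half. Equivalent to the area under
    the TPR-vs-FPR curve swept over all decision thresholds."""
    pos = [s for s, b in zip(pred_ic50, is_binder) if b]
    neg = [s for s, b in zip(pred_ic50, is_binder) if not b]
    wins = sum((p < n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Toy example: one binder (700 nM) is ranked worse than one nonbinder
# (300 nM), so 3 of 4 pairs are ordered correctly -> AROC = 0.75.
print(auc_roc([10, 700, 300, 900], [True, True, False, False]))
```

A perfect separation of binders from nonbinders yields 1.0; random assignment yields approximately 0.5.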
According to the predictions using different binding affinity thresholds, peptides were categorized into predicted binders (positives, P) and predicted nonbinders (negatives, N) for each algorithm. Based on experimental validation, these predictions were categorized into true positives (TP), false positives (FP), true negatives (TN), and false negatives (FN). Different threshold values were evaluated by calculating the respective sensitivity, specificity, accuracy (formula B), and positive predictive value (PPV, formula C).
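Formulas B and C are not reproduced in the text; assuming the standard definitions of these metrics, they can be sketched as:

```python
def metrics(tp, fp, tn, fn):
    """Sensitivity, specificity, accuracy (formula B), and positive
    predictive value (formula C) from prediction outcome counts."""
    return {
        "sensitivity": tp / (tp + fn),            # TPR
        "specificity": tn / (tn + fp),            # 1 - FPR
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
        "ppv": tp / (tp + fp),
    }

# Example with the HLA-A2 counts reported in the Results (25 TP, 27 FP,
# 81 TN, 21 FN at the intermediate binding affinity threshold).
m = metrics(tp=25, fp=27, tn=81, fn=21)
```

For these counts, sensitivity is 25/46 ≈ 0.54 and PPV is 25/52 ≈ 0.48, illustrating how a threshold can leave nearly half of the true binders unpredicted.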
Calculation and validation of recommended decision thresholds.
Based on our experimental binding affinity data, we calculated new threshold values for each analyzed prediction algorithm, HLA type, and peptide length. These recommended threshold values were selected based on the following criteria: (i) specificity ≥0.66 (equal to FPR ≤ 0.33), (ii) TPR ≥ 2 × FPR, and (iii) the threshold yielding the highest possible sensitivity within the limits defined in (i) and (ii). Lacking a validation data set (a second set of similarly obtained binding affinity data), we used a bootstrapping algorithm to statistically validate our recommended threshold values and their respective sensitivity, specificity, and accuracy. In brief, the data set was randomly split into 2/3 training data and 1/3 test data per HLA allele. Applying the criteria mentioned above, the optimal threshold for each prediction method was calculated based on the training data. The calculated optimal threshold was applied to the test data, and sensitivity, specificity, and accuracy were calculated. This was repeated 100 times. From these 100 runs of resampling, the median optimal threshold and the bootstrap confidence intervals for sensitivity, specificity, and accuracy were calculated. To check the reliability of the validated threshold on arbitrary data sets, a second bootstrapping was performed. To this end, we performed 100 runs of resampling sets one-third the size of the total data set. The median optimal threshold from the first bootstrapping, as well as the strong, intermediate, and low binding affinity thresholds indicated by the methods, were applied to the 100 resampling sets. In each run and for each applied threshold, sensitivity, specificity, and accuracy were calculated. After 100 runs, the confidence intervals of sensitivity, specificity, and accuracy were calculated. Differences in mean sensitivity, specificity, and accuracy of applied thresholds were compared by one-way ANOVA for repeated measures followed by the Dunnett multiple comparisons test.
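The threshold selection criteria and the first bootstrapping step can be sketched as follows. This is a simplified illustration on hypothetical data, not the R script used in the study (available at the GitHub link below); all names and the candidate-threshold grid are ours.

```python
import random

def optimal_threshold(scores, labels, candidates):
    """Most sensitive threshold satisfying the selection criteria:
    specificity >= 0.66 (FPR <= 0.33) and TPR >= 2 * FPR.
    scores: predicted IC50-like values (lower = stronger predicted binder);
    labels: True for experimentally confirmed binders."""
    p = sum(labels)
    n = len(labels) - p
    best, best_tpr = None, -1.0
    for t in candidates:
        tp = sum(s <= t and l for s, l in zip(scores, labels))
        fp = sum(s <= t and not l for s, l in zip(scores, labels))
        tpr = tp / p if p else 0.0
        fpr = fp / n if n else 0.0
        if fpr <= 0.33 and tpr >= 2 * fpr and tpr > best_tpr:
            best, best_tpr = t, tpr
    return best

def validated_threshold(scores, labels, candidates, runs=100, seed=1):
    """Median optimal threshold over repeated 2/3 training splits
    (the first bootstrapping step)."""
    rng = random.Random(seed)
    data = list(zip(scores, labels))
    found = []
    for _ in range(runs):
        rng.shuffle(data)
        train = data[: 2 * len(data) // 3]
        t = optimal_threshold([s for s, _ in train],
                              [l for _, l in train], candidates)
        if t is not None:
            found.append(t)
    return sorted(found)[len(found) // 2]

# Hypothetical data: six binders, six nonbinders, four candidate thresholds.
scores = [20, 80, 150, 400, 900, 2000, 600, 3000, 5000, 8000, 12000, 20000]
labels = [True] * 6 + [False] * 6
candidates = [50, 500, 1000, 5000]
```

On this toy data set, the criteria select the 1,000 nM candidate: the 5,000 nM threshold admits too many false positives (FPR 0.5 > 0.33), while 500 nM sacrifices a true binder unnecessarily.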
Comparison of criteria-based and bootstrapping-validated thresholds.
Because of limited sample sizes for individual peptide lengths, the validation of recommended thresholds was only meaningful for the analysis of pooled peptide lengths. To show that prediction accuracies obtained by criteria-based thresholds are representative for the whole statistical population, they were compared with the accuracy obtained by validated thresholds by performing two-tailed Student t tests (significance, P < 0.05).
Comparison of sequence motifs of the HPV16 E6/E7 peptide data set and known HLA-binding motifs.
Peptides of the HPV data set were sorted according to HLA type and peptide length. Any peptide with predictions within the general thresholds of IC50 ≤ 500 nmol/L or a median percentile rank ≤2 was considered “predicted.” All peptides with experimentally determined actual binding affinity were considered “tested.” All tested peptides with HLA-specific binding were considered “binders.” Sequence motifs of these peptide sets were generated using the Seq2Logo 2.0 (55) webtool with default settings [Kullback–Leibler logo type; Hobohm1 clustering method with threshold 0.63; amino acid coloring: red, negatively charged side chains (D, E); green, polar uncharged side chains (N, Q, S, G, T, Y); blue, positively charged side chains (R, K, H); black, others (C, U, P, A, V, I, L, M, F, W)]. These motifs were compared with motifs generated from HLA-type– and length-specific human epitope entries of linear peptides from the IEDB (56).
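The raw input from which such a sequence logo is built is simply a per-position residue frequency table; Seq2Logo then applies Kullback–Leibler weighting and Hobohm1 clustering on top. A minimal sketch of the counting step, using hypothetical 9-mer binders (not peptides from the study):

```python
from collections import Counter

def position_frequencies(peptides):
    """Per-position residue counts for a set of equal-length peptides:
    the raw material of a sequence logo (weighting and clustering, as
    done by Seq2Logo, are omitted here)."""
    length = len(peptides[0])
    assert all(len(p) == length for p in peptides)
    return [Counter(p[i] for p in peptides) for i in range(length)]

# Hypothetical 9-mers sharing a leucine anchor at position 2:
binders = ["YLDPETAEV", "KLPDLCTEL", "GLAEIYVNV"]
freqs = position_frequencies(binders)
```

In a logo, such a conserved position (here, L at position 2) appears as a tall single-letter stack, which is what makes known HLA anchor motifs visually recognizable.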
Statistical testing was performed using GraphPad Prism Version 5.04 (GraphPad Software). Bootstrap procedures for calculating optimal thresholds and validation of respective sensitivity, specificity, and accuracy have been performed using R, version 3.4.3. The R-script is available via https://github.com/DKFZ-F130/cross-validation.
HLA class-I ligand prediction does not match experimentally validated binding capacity
To exploit the individual strengths of the various in silico methods, 13 prediction algorithms were used for HLA class-I binding predictions of peptides derived from the HPV16 proteins E6 and E7. The prediction methods are listed in Supplementary Table S1. We predicted binding affinity of all 956 possible 8-, 9-, 10-, and 11-mer peptides derived from these proteins to each of the major HLA class-I types A1, A2, A3, A11, A24, B7, and B15, represented by the HLA-A*01:01, A*02:01, A*03:01, A*11:01, A*24:02, B*07:02, and B*15:01 alleles. To facilitate querying of all the algorithms and to systematically combine the resulting prediction output, we developed a web application, MHCcombine.
The algorithms did not uniformly predict the same peptides to be binders: each algorithm returned a different value for a given peptide, and some predicted binding affinities fell beyond the decision threshold. Next, the actual binding affinity of the peptides was verified in vitro in competitive binding assays (2). Not all peptides that were predicted to be binders in silico bound experimentally, and vice versa. This justified a systematic lowering of thresholds individually for each prediction method and HLA type to increase the number of binders confirmed by experimental binding affinity assessment. We iterated this procedure until no more binders were detected. Finally, the data set comprised 743 peptide–HLA-binding assessments tested in at least two independent experiments: 44 for HLA-A1, 154 for HLA-A2, 105 for HLA-A3, 135 for HLA-A11, 128 for HLA-A24, 52 for HLA-B7, and 125 for HLA-B15 (Fig. 1; Supplementary Table S2). Sequence motifs of predicted, tested, and binding peptides of the data set resemble known motifs of HLA-type–specific epitopes included in the IEDB (Supplementary Table S3).
Applying the respective default thresholds to the results of all used predictors, 242 peptides were predicted to be binders. Experimental assessment identified 278 true binders. The predicted and experimentally verified binders only partially overlapped (Fig. 1). For example, 52 peptides were predicted to be binders to HLA-A2 (TP + FP). Of these positive predictions, 25 were true and 27 false. Additionally, 21 actual binders were falsely predicted to be nonbinders (FN), and 81 nonbinders were correctly predicted as such (TN). As we stopped the experimental binding assessment when no more binders were detected, it is fair to assume that the remaining peptides predicted as nonbinders (with even lower predicted likelihood of binding than the tested peptides) are TNs. The following percentages are therefore only given for positive predictions, as all of these were experimentally assessed. Across all analyzed HLA types, we found that only 151 of 242 (62%) positive predictions matched actual binding. Additionally, 127 of 278 (46%) true binders were not predicted within the given thresholds. This pronounced disparity between binding prediction and experimental assessment prompted us to analyze the prediction performance of the used algorithms based on our HPV16 E6/E7 data set.
Prediction algorithms depend on HLA type and peptide length to discriminate binders from nonbinders
To assess the predictive performance of the individual algorithms, predictions were sorted according to the predicted binding likelihood and classified by the experimental binding result. By considering each verified binder as a positive event and each nonbinder as a negative event, performance can be analyzed in ROC curves, and the predictive strength of an algorithm independent of a threshold can be evaluated (Supplementary Fig. S1A).
ROC curves for 12 predictors are shown in Fig. 2 for HLA-A2, A3, and A24, and in Supplementary Fig. S1B for HLA-A1, A11, B7, and B15. None of the methods perfectly discriminated binders from nonbinders. Considering all peptides regardless of their length, the 12 algorithms analyzed in these two figures differed only slightly in their prediction performance. However, differences became more pronounced when every peptide length and HLA type was analyzed separately. Performances for 9- and 10-mer peptides were still very similar among predictors; they were better than for the pooled lengths, and for HLA-A3, they reached excellent AROC values of >0.9. For 11-mers, prediction performance depended on the HLA molecule. Prediction of 11-mers binding to A24 yielded similar performance results between predictors, but only intermediate AROC values of ∼0.8. The poorest 11-mer prediction performance was observed for HLA-A3, for which AROC values dropped below 0.5, i.e., worse than a random assignment of binders and nonbinders. For HLA-A2, 11-mer peptides were discriminated poorly, with considerable performance differences between methods (AROC 0.59–0.82). The analysis of 8-mer peptides showed the most pronounced differences in method performance. Here, AROC values ranged from 0.17 (IEDB SMM for HLA-A24) to 0.91 (NetMHC 4.0 for HLA-A3). Similar observations could be made for the other analyzed HLA types.
Not all algorithms allow every peptide prediction. For example, IEDB SMM and IEDB SMMPMBEC cannot predict 8- and 11-mer peptides for HLA-B15. The 13th analyzed predictor, SYFPEITHI, only allows prediction of 9- and 10-mer peptides for most HLA types. Prediction of 11-mers by SYFPEITHI is possible only for HLA-A1, and it does not offer prediction for HLA-B15 at all. Because of these data limitations, a scoring system different from all other analyzed predictors, and an unclear definition of predicted binders and nonbinders, SYFPEITHI was analyzed separately (Supplementary Fig. S2). AROC values were below 0.8 for all HLA types and peptide lengths, indicating poorer performance than the more regularly updated prediction algorithms.
In general, ANN-based pan-specific algorithms showed the best prediction performance. The AROC values of the NetMHC family were always among the highest. In contrast, IEDB SMM and IEDB SMMPMBEC were among the poorly performing predictors for most of the analyzed settings. Surprisingly, the newest predictors, NetMHCpan 4.0 and MHCflurry 1.2, did not clearly outperform other methods. Indeed, no single algorithm performed outstandingly well. Thus, we recommend always choosing the most suitable algorithm for the specific HLA type and peptide length in question.
Commonly used decision thresholds result in low prediction sensitivity
As outlined above, we observed that actual binders could still be found beyond the commonly used binding affinity thresholds indicated by the prediction methods. Therefore, we were interested in the predictors' performance when different decision thresholds were applied. We compared the accuracy, defined as the ratio of all true predictions (TP + TN) over all data points, for the indicated thresholds predicting for strong (IC50 ≤ 50 nM; percentile rank ≤ 0.5), intermediate (IC50 ≤ 500 nM; percentile rank ≤ 2), and low binding affinity (IC50 ≤ 5,000 nM). Figure 3A shows the accuracies for predictions to HLA-A2, A3, and A24 for pooled and single peptide lengths from 8- to 11-mers; Supplementary Fig. S3A shows the same analysis for HLA-A1, A11, B7, and B15. The highest accuracy that can be achieved is 1.0. For pooled peptide lengths, accuracy for the different thresholds varied between 0.50 and 0.85. Here, the predictors showed very similar accuracy for the same threshold. Differences between predictors again became more pronounced when single peptide lengths were analyzed. Overall, the strong binding affinity threshold resulted in the most stable accuracy across different predictors, but also in the lowest accuracy values. For example, using the strong binding affinity threshold for the prediction of 8-mer binders to HLA-A24 yielded the lowest accuracy (0.24) among all predictors. Applying the intermediate binding affinity threshold showed slightly better accuracies in the pooled length analysis and some very high accuracy values in single length analysis. We did not observe any major improvement in accuracy for a specific peptide length, but did observe HLA-type-dependent differences. The highest variance in accuracy across methods was observed for the low binding affinity threshold. Using this threshold, accuracy dropped dramatically for single prediction methods such as PickPocket 1.1, IEDB SMM, and IEDB SMMPMBEC.
For A24, the low binding affinity threshold performed with superior accuracy compared with the intermediate and strong binding affinity thresholds. In contrast, for A2 and A3, the low binding affinity threshold generally performed least accurately. Similar observations could be made for the other analyzed HLA types. The more tolerant thresholds improved prediction performance for the HLA types A1, A24, B7, and B15 and worsened prediction performance for A2, A3, and A11 (Fig. 3A; Supplementary Fig. S3A).
As accuracy considers all true predictions, it may mask a low capability to predict TPs by correctly predicting a high number of TNs. We further analyzed the threshold-dependent prediction performance of methods by calculating sensitivity and specificity. The threshold-dependent sensitivity and specificity of binding predictions to HLA-A2, A3, and A24 are shown for pooled 8-mer to 11-mer peptides in Fig. 3B, and for HLA-A1, A11, B7, and B15 in Supplementary Fig. S3B. As expected, specificity was found to be highest and sensitivity lowest for the most stringent strong binding affinity threshold (column “strong” in the indicated figures). Specificity decreases and sensitivity increases naturally when using more tolerant thresholds (columns “inter” and “low” in the indicated figures). For HLA-A2, applying the commonly used intermediate binding affinity threshold of IC50 ≤ 500 nM resulted in a maximum sensitivity of only 0.39 (for PickPocket 1.1 and IEDB SMMPMBEC) with a coinciding minor reduction of specificity (Fig. 3B, panel HLA-A2, column “inter”). Using the low binding affinity threshold increased the sensitivity to > 0.54 (Fig. 3B, panel HLA-A2, column “low”). For predictions by PickPocket 1.1, the IC50 ≤ 5,000 nM threshold increased sensitivity to 0.93, but this was accompanied by a pronounced reduction of specificity to 0.36. This trend could be observed for all predictors, however to a lesser extent (Fig. 3B; Supplementary Fig. S3B, columns “low”).
For HLA-A3, using the intermediate binding affinity threshold resulted in highest sensitivity for IEDB consensus (0.52) and IEDB recommended (0.48; Fig. 3B, panel HLA-A3, column “inter”). These methods return a percentile rank as prediction score and do not indicate a third low binding affinity threshold. For HLA-A24, A1, B7, and B15, IEDB recommended and IEDB consensus showed the highest sensitivity across tested methods when the intermediate binding affinity threshold was applied (Fig. 3B; Supplementary Fig. S3B, columns “inter”).
Recommended individual decision thresholds increase prediction sensitivity
For projects aimed at finding all possible HLA-binding peptides from a given protein, the gain in sensitivity from using more tolerant decision thresholds is attractive if it can be balanced against the decrease in specificity. However, our analysis showed that it is not generally favorable to use the low binding affinity thresholds indicated by the prediction algorithms. We therefore calculated new threshold recommendations individually for each predictor, HLA type, and peptide length. Our recommendations are based on the following criteria: (i) a minimum specificity of 0.66 (equal to a maximum FPR of 0.33), (ii) prediction of at least twice as many TPs as FPs, and (iii) calculation of the threshold yielding the highest possible sensitivity within the limits defined in (i) and (ii). The application of these criteria is shown for HLA-A2 and two selected predictors (NetMHC 4.0 and IEDB SMM) in Fig. 4A. For predictors with high AROC values (e.g., the ANN methods predicting HLA-A2 binding), applying these criteria often resulted in thresholds even more tolerant than the low binding affinity thresholds indicated by the algorithms. For the more poorly performing methods (e.g., IEDB SMM predicting HLA-A2 binding), however, the threshold recommendations often lay between the intermediate and low binding affinity thresholds. For single-length predictions, all threshold recommendations and associated performance measures (AROC, PPV, specificity, and sensitivity) are listed in Table 1.
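The three criteria translate directly into a small search over candidate cutoffs. The sketch below assumes IC50-style scores (lower is better); the function name and toy data are illustrative assumptions and not the study's implementation:

```python
# Sketch of the criterion-based threshold selection described above: among
# candidate IC50 cutoffs, keep those with specificity >= 0.66 and at least
# twice as many TPs as FPs, then return the cutoff with the highest
# sensitivity. All names and data are illustrative assumptions.

def recommend_threshold(ic50_pred, is_binder, candidates):
    best = None
    for thr in sorted(candidates):
        tp = sum(p <= thr and b for p, b in zip(ic50_pred, is_binder))
        fp = sum(p <= thr and not b for p, b in zip(ic50_pred, is_binder))
        tn = sum(p > thr and not b for p, b in zip(ic50_pred, is_binder))
        fn = sum(p > thr and b for p, b in zip(ic50_pred, is_binder))
        spec = tn / (tn + fp) if tn + fp else 1.0
        sens = tp / (tp + fn) if tp + fn else 0.0
        if spec >= 0.66 and tp >= 2 * fp:          # criteria (i) and (ii)
            if best is None or sens > best[1]:     # criterion (iii)
                best = (thr, sens, spec)
    return best  # (cutoff in nM, sensitivity, specificity), or None

# Toy data: predicted IC50 values (nM) and measured binding status
pred = [45, 300, 800, 2500, 6000, 120, 4000, 9000]
true = [True, True, True, True, False, False, True, False]
print(recommend_threshold(pred, true, candidates=sorted(set(pred))))
```

On this toy set the criteria admit a cutoff of 4,000 nM, more tolerant than the conventional 500 nM threshold, illustrating how well-performing predictors can receive recommendations beyond the indicated low binding affinity threshold.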
In order to recommend thresholds applicable to any data set, we calculated criterion-based threshold recommendations for every HLA type and every predictor evaluated in this study, based on bootstrapping of the HPV data set. This method resamples the HPV data set multiple times to calculate the recommended threshold that best represents the entire statistical population. The median criterion-based threshold of 100 bootstraps was termed the “validated threshold.” In a second bootstrapping, the sensitivity, specificity, and accuracy associated with the validated threshold were compared with those of the intermediate and low binding affinity thresholds indicated by the prediction algorithms. Results for HLA-A2 are shown in Fig. 4B; results for all other analyzed HLA types are shown in Supplementary Fig. S4. For HLA-A2, A3, A11, and A24, all predictors yielded a significant and relevant (change by ≥0.1) increase in sensitivity when the respective bootstrapping-validated threshold was used instead of the intermediate binding affinity threshold. This was also true for the majority of prediction methods for HLA-A1, B7, and B15. For predictions to HLA-A2, A1, A11, and A24, the sensitivity of most predictors could be significantly increased by applying the bootstrapping-validated threshold instead of the low binding affinity threshold. For all HLA types, some predictors, such as IEDB SMMPMBEC and IEDB SMM, showed decreased sensitivity when the respective bootstrapping-validated thresholds were applied; in these cases, the respective validated threshold was more stringent than the low binding affinity threshold. As expected, when sensitivity was lost, specificity was concomitantly gained, and vice versa. However, in almost all cases the absolute gain outweighed the loss. Therefore, for HLA-A2, accuracy was relevantly increased (validated vs. low binding affinity threshold for PickPocket 1.1, IEDB SMMPMBEC, and IEDB SMM), not significantly changed (validated vs. intermediate binding affinity threshold for PickPocket 1.1 and IEDB recommended; validated vs. low binding affinity threshold for NetMHC 4.0, NetMHC 3.4, and MHCflurry 1.2; both comparisons for NetMHCpan 2.8 and NetMHCcons 1.1), or reduced only to a minor extent (Fig. 4B). For HLA types A24, B7, and B15, use of the validated threshold led to relevant increases in accuracy (Supplementary Fig. S4). Accuracy changes for HLA-A11 were either not significant or less relevant (<0.1). Comparing the validated threshold with the low binding affinity threshold, most methods showed an increase in accuracy for HLA-A3; for HLA-A1, changes in the resulting accuracy varied.
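The bootstrapping validation can be sketched as follows: resample the data set with replacement, recompute the criterion-based cutoff on each resample, and take the median over 100 bootstraps. The criterion logic from the text is repeated inline so the sketch is self-contained; all names and the toy data are illustrative assumptions, not the study's code:

```python
# Sketch of the bootstrapping validation described above. All names and the
# toy data are illustrative assumptions.
import random
import statistics

def criterion_threshold(pred, true, candidates):
    """Highest-sensitivity IC50 cutoff with specificity >= 0.66 and TP >= 2*FP."""
    best = None
    for thr in sorted(candidates):
        tp = sum(p <= thr and b for p, b in zip(pred, true))
        fp = sum(p <= thr and not b for p, b in zip(pred, true))
        tn = sum(p > thr and not b for p, b in zip(pred, true))
        fn = sum(p > thr and b for p, b in zip(pred, true))
        spec = tn / (tn + fp) if tn + fp else 1.0
        sens = tp / (tp + fn) if tp + fn else 0.0
        if spec >= 0.66 and tp >= 2 * fp and (best is None or sens > best[1]):
            best = (thr, sens)
    return best

def validated_threshold(pred, true, candidates, n_boot=100, seed=0):
    """Median criterion-based cutoff over n_boot bootstrap resamples."""
    rng = random.Random(seed)
    n = len(pred)
    cutoffs = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]   # resample with replacement
        rec = criterion_threshold([pred[i] for i in idx],
                                  [true[i] for i in idx], candidates)
        if rec is not None:                          # criteria unmet on resample
            cutoffs.append(rec[0])
    return statistics.median(cutoffs) if cutoffs else None

# Toy data: predicted IC50 values (nM) and measured binding status
pred = [45, 300, 800, 2500, 6000, 120, 4000, 9000]
true = [True, True, True, True, False, False, True, False]
print(validated_threshold(pred, true, sorted(set(pred))))
```

Taking the median over resamples dampens the influence of individual peptides on the recommended cutoff, which is what makes the validated threshold more transferable to unseen data than a cutoff fit once to the full sample.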
Criterion-based thresholds match well with bootstrapping-validated thresholds
Owing to the sample size of our experimental data set, this analysis was performed only for pooled peptide lengths. To investigate how well the performance calculated for the HPV data set sample represents the true performance of the predictors, we directly compared the respective accuracies of the criterion-based and bootstrapping-validated thresholds for each predictor. Results for HLA-A2, A3, and A24 are shown in Fig. 5, and for HLA-A1, A11, B7, and B15 in Supplementary Fig. S5. In most cases, criterion-based recommendation and bootstrapping validation returned the same threshold; in these cases, accuracy did not differ significantly. A minority of threshold pairs differed from each other, and the greater the difference between the two threshold values, the greater the difference between the associated accuracies. However, even when validation calculated a different threshold, the accuracy did not necessarily differ. This was true, e.g., for IEDB SMMPMBEC predicting HLA-A2 affinity and for NetMHC 4.0, NetMHCpan 4.0, and IEDB SMM predicting HLA-A24 affinity.
Applying the recommended thresholds increases the number of predicted true binders
The use of criterion-based thresholds for individual peptide lengths and of bootstrapping-validated thresholds for pooled lengths increased the number of true binders among the predicted peptides of the HPV data set. This is illustrated for HLA-A2 in Fig. 6 and for HLA-A1, A3, A11, A24, B7, and B15 in Supplementary Fig. S6. The first column of each heat map represents the number of peptides experimentally verified to be true binders (blue) or nonbinders (red). The following columns illustrate the categorized prediction output of each predictor as indicated in the figure legend, differentiating between peptides predicted within the respective intermediate threshold only (dark red), within the recommended threshold only (blue), within both thresholds (dark blue), or predicted to be nonbinders (red). For HLA-A2, applying the common thresholds of IC50 ≤ 500 nM and percentile rank ≤ 2 identified only one third of the assessed true binders at best (Fig. 6) and resulted in several false-positive predictions. For the 8- and 11-residue-long peptides, the common thresholds often covered only a few true binders; for 9- and 10-mers, the intermediate binding affinity thresholds were more suitable. The number of predicted true binders could be increased by using the more tolerant recommended thresholds (predicted binders in green, predicted nonbinders in red). However, applying the recommended thresholds also increased the number of FPs. This applies to the other HLA types as well (Supplementary Fig. S6).
Using an experimentally verified binding affinity data set of more than 750 peptides derived from the HPV16 E6 and E7 proteins, we evaluated the performance of 13 currently used MHC class-I binding predictors for seven major HLA types, for pooled and single-length 8-, 9-, 10-, and 11-mer peptides. We found that, overall, current ANN methods outperform matrix-based approaches, which is in line with previous findings (54). No single prediction method performed outstandingly in discriminating binders from nonbinders; rather, performance depended on the selected HLA type and peptide length. Even the ANN predictor MHCflurry 1.2 and the MS data-trained NetMHCpan 4.0 were not generally able to outperform the other methods. Thus, we recommend using either multiple methods or the algorithm that performs best for the specific HLA type and peptide length of interest. To make this feasible, we developed the web application MHCcombine, which allows querying 12 of the 13 algorithms analyzed in this study and combines the output in an ordered and searchable file. In contrast to a tool previously developed by Trost and colleagues, MHCcombine returns the individual output of several MHC-binding prediction methods and does not combine it heuristically. Further, MHCcombine allows predicting not only 9-mers, but also 8-, 10-, 11-, and 12-mers, if offered by the prediction algorithm, and is publicly available as a web-based tool.
Currently available MHC class-I binding prediction algorithms are trained on different sets of data derived from databases such as the IEDB. These databases comprise different numbers of biochemically determined MHC-binding peptides, MS-detected naturally processed and presented MHC ligands, or reported true T-cell epitopes for defined HLA types. In this study, we analyzed the prediction methods solely as predictors of MHC class-I binding. It is possible that if a predictor was trained on anything other than MHC-binding data, this may have negatively influenced the prediction performance as assessed herein. Apart from NetMHCpan 4.0 (mentioned above), MHC-NP and the previously developed MSIntrinsic and MixMHCPred prediction algorithms have been trained on MS-derived MHC ligand data sets (23–25). Although there is a role for MS data sets in prediction algorithms (reviewed in ref. 57), we did not include the latter three methods in our analysis, as they either do not provide predictions for our HLA types of interest or are not available as easy-to-use web applications. In addition, the MS data-based algorithms are trained on peptides that were naturally selected by peptide processing, MHC-binding affinity, peptide competition for MHC molecules, and bona fide peptide presentation. As our data set comprises only MHC-binding affinity data, we did not find it suitable for evaluating the performance of those methods, but focused on analyzing predictors of MHC binding.
As our approach was to determine binding affinities until no further binders could be found, our data set neither reflects the whole range of predicted binding affinities nor contains all possible E6- and E7-derived peptides. This selection of peptides might create a bias, as binding peptides with poor prediction scores may have been missed, and the predictors' ability to correctly predict positives could thus have been overestimated. Depending on the predictions and the binding assay results, sample sizes differ for each HLA type and peptide length. The HLA types A1 and B7, as well as 8-mer and 11-mer peptides, are underrepresented in this study, and results for these should be interpreted with caution. The preference for certain amino acids at anchor positions was comparable, although the HPV16 data set was too small to derive regions of variability or conservation. Based on the similarity in anchor positions, we do not expect significant bias to result from the data set being derived from only two viral proteins.
Previous findings indicate that many MHC ligands might be missed when the routinely used intermediate binding affinity decision thresholds (IC50 ≤ 500 nM, percentile rank ≤ 2) are applied (27, 58, 59). Indeed, comparing the sensitivity of the predictors at these intermediate binding affinity thresholds showed a maximum sensitivity of only 40%. We analyzed the impact of more tolerant thresholds on prediction performance and showed that applying the low binding affinity threshold (IC50 ≤ 5,000 nM) did not generally improve prediction performance. Therefore, we calculated new threshold recommendations based on our data set. Paul and colleagues have previously shown that allele-specific affinity thresholds increase predictive efficacy, and recommended HLA type–specific binding affinity thresholds for 38 HLA-A and HLA-B types (60). For 27 of these 38 HLA types, they recommended IC50 values > 500 nM and < 1,000 nM; for the remaining 11 types, among them A2 and A11, thresholds below 500 nM were recommended. However, that study focused only on 9-mer peptides and the SMM algorithm, and thus its thresholds are not generally applicable. With our study, we provide more comprehensive threshold recommendations, individually for 12 different currently used predictors, for seven major HLA types, and for pooled and single peptide lengths.
The historic threshold of IC50 ≤ 500 nM, based mostly on 9- and 10-mer HLA-A2 binders (26), has been commonly used across different HLA types and peptide lengths since the mid-1990s. We demonstrate here that this threshold is not generally suitable for identifying binders with MHC class-I binding prediction methods. Although strong binding affinity to MHC molecules has been described to correlate well with therapy outcome (e.g., tumor rejection), it might not be the best predictor of therapeutically relevant human “self” tumor-specific epitopes, as T cells specific to these have likely been deleted during negative selection in the thymus (59, 61). We previously showed that immunogenic HPV16-derived A2 peptides can be found with predicted IC50 > 500 nM (6/11 epitopes, up to an IC50 value of 4,351 nM predicted by NetMHC 4.0; ref. 62). It has also been reported that neoepitopes which elicit protective immunity in mice may have an MHC affinity well above an IC50 value of 500 nM (63). Thus, we question the continued suitability of the intermediate binding affinity threshold for predicting true T-cell epitopes.
Accurate prediction of MHC-binding affinity is a prerequisite for the identification of T-cell epitopes. After prediction, peptides need to be tested for actual immunogenicity to identify promising therapy candidates. We previously identified immunogenic epitopes among our data set of HLA-A2 binding peptides through IFNγ responses of peripheral blood mononuclear cells from healthy donors (62). Four selected epitopes were validated in vivo in an MHC-humanized mouse model (64). This demonstrates that MHC class-I binding prediction, applying the commonly used strong and intermediate binding affinity thresholds, provides high prediction specificity. In therapeutic settings where many different antigens are available, a few candidates per antigen might be enough to find a true epitope among them. However, for immunotherapeutic approaches in which possible antigens are rare, it becomes necessary to test every possible MHC binder in order to detect a true epitope. In this case, it is reasonable to increase the prediction sensitivity by applying the more tolerant thresholds recommended in this study, which also describes how to choose the best online available prediction method for HLA type– and peptide length–specific research questions. Ultimately, improvement of MHC class-I binding prediction methods will facilitate the identification of true T-cell epitopes and thereby advance the development of epitope-specific immunotherapies.
Data and Materials Availability
Disclosure of Potential Conflicts of Interest
No potential conflicts of interest were disclosed.
Conception and design: M. Bonsack, S. Hoppe, A.B. Riemer
Development of methodology: M. Bonsack, S. Hoppe, J. Winter, D. Tichy, C. Zeller, R. Blatnik
Acquisition of data (provided animals, acquired and managed patients, provided facilities, etc.): M. Bonsack, S. Hoppe, J. Winter, M.D. Küpper, E.C. Schitter, R. Blatnik
Analysis and interpretation of data (e.g., statistical analysis, biostatistics, computational analysis): M. Bonsack, S. Hoppe, D. Tichy, R. Blatnik
Writing, review, and/or revision of the manuscript: M. Bonsack, S. Hoppe, J. Winter, D. Tichy, R. Blatnik, A.B. Riemer
Study supervision: A.B. Riemer
Other (development and testing of web application MHCcombine): M. Bonsack, S. Hoppe, J. Winter, C. Zeller
We thank Martin Wühl and Alexandra Klevenz for technical assistance, and Cyril Mongis and Tobias Reber for their contributions to the implementation of MHCcombine. We gratefully acknowledge the flow cytometry core facility and the GMP unit of the German Cancer Research Center (DKFZ) for technical support and peptide synthesis, respectively. General funding and support was provided by DKFZ. A.B. Riemer was funded by the German Center for Infection Research (DZIF, grant number TTU 07.706), M. Bonsack and S. Hoppe by a PhD scholarship from the Helmholtz International Graduate School of the DKFZ, and R. Blatnik by a PhD scholarship from the Eduard and Melanie zur Hausen Foundation.
The costs of publication of this article were defrayed in part by the payment of page charges. This article must therefore be hereby marked advertisement in accordance with 18 U.S.C. Section 1734 solely to indicate this fact.