The success of cancer immunotherapy relies on the ability of cytotoxic T cells to specifically recognize and eliminate tumor cells based on peptides presented by HLA-I. Although the peptide epitopes that elicit the corresponding immune response often remain unidentified, it is generally assumed that neoantigens, due to tumor-specific mutations, are the most common targets. Here, we used a mass spectrometric approach to show an underappreciated class of epitopes that accounts for up to 15% of HLA-I peptides for certain HLA alleles in various tumors and patients. These peptides are translated from cryptic open reading frames in supposedly noncoding regions in the genome and are mostly unidentifiable with conventional computational analyses of mass spectrometry (MS) data. Our approach, Peptide-PRISM, identified thousands of such cryptic peptides in tumor immunopeptidomes. About 20% of these HLA-I peptides represented the C-terminus of the corresponding translation product, suggesting frequent proteasome-independent processing. Our data also revealed HLA-I allele–dependent presentation of cryptic peptides, with HLA-A*03 and HLA-A*11 presenting the highest percentage of cryptic peptides. Our analyses refute the reported frequent presentation of HLA peptides generated by proteasome-catalyzed peptide splicing. Thus, Peptide-PRISM represents an important step toward comprehensive identification of HLA-I immunopeptidomes and reveals cryptic peptides as an abundant class of epitopes with potential relevance for novel immunotherapeutic approaches.
Immunotherapeutic approaches, such as adoptive T-cell therapy or therapeutic peptide vaccination, are among the most promising strategies to treat cancer (1). Their success demonstrates that cytotoxic T cells are able to specifically recognize and eliminate tumor cells via peptides presented by HLA-I. These epitopes include peptides carrying a tumor-specific mutation (neoantigens) and peptides derived from germline antigens. However, other sources, such as epitopes generated by proteasome-catalyzed peptide splicing (PCPS), have been proposed (2). HLA-I–presented peptides can be analyzed by mass spectrometry (MS) of purified HLA-I complexes. However, the identification of peptides that are not encoded in the proteome is challenging. We demonstrate here that this can lead to erroneous identification of neoantigens, and, in line with previously published concerns (3), to unwarranted claims of frequent presentation of PCPS peptides. The lack of advanced approaches for proteogenomic identification of HLA-I epitopes precludes the comprehensive analysis of the contribution of aberrant transcription and translation to HLA-I immunopeptidomes.
Ribosome profiling (Ribo-seq) has provided compelling evidence for the translation of thousands of open reading frames (ORF) outside of the annotated proteome (4). These cryptic ORFs are encoded in 5′- and 3′-untranslated regions (UTR), noncoding RNAs, intronic and intergenic regions, and within coding sequences shifted with respect to the canonical reading frame. In spite of overwhelming evidence for their translation, only a limited number of their translation products have been detected by MS so far, even when specifically adapted isolation methods and database search approaches have been applied (5, 6). This suggests that cryptic translation products are short-lived but are nevertheless exploited as a source for HLA-I peptides, as has been described for defective ribosomal products (7–10). Alternatively, cryptic translation products might arise from noncanonical translation events, potentially translated from a specialized subpopulation of ribosomes (immunoribosomes; ref. 11). These might channel their translation products directly to the HLA-I antigen-processing machinery (12). Both the short half-life and direct channeling would explain why cryptic translation products are hardly detected in the proteome. Both mechanisms should lead to a higher incidence of cryptic HLA-I peptides. By analyzing HLA-I immunopeptidome data, we previously found that at least 2% of the peptides bound to HLA-I in human fibroblasts originate from cryptic (i.e., unknown) ORFs identified by Ribo-seq (13). This indicates that cryptic peptides are indeed enriched in the HLA-I immunopeptidome. Another study employed RNA sequencing (RNA-seq) in parallel to HLA immunopeptidomics and identified 168 cryptic peptides from an Epstein–Barr virus–transformed B-cell line (14). Both studies were based on established software developed for proteomics data and on sequence databases derived from the Ribo-seq or RNA-seq data.
Here, we present a fundamentally different strategy, termed Peptide-PRISM, to identify cryptic peptides based on mass spectrometric data alone. This enabled us to analyze a large collection of HLA immunopeptidomes without using additional sequencing data, and to identify 6,636 cryptic peptides from various types of tumors from several patients with high confidence. This reveals cryptic HLA peptides to be a substantial part of tumor immunopeptidomes.
Materials and Methods
Datasets for analysis
All immunopeptidome datasets analyzed in this study are listed in Supplementary Table S1. We included datasets from different types of tumor samples (melanoma, ref. 15; lung cancer, ref. 16; glioblastoma, ref. 17; triple-negative breast cancer, ref. 18; and mantle cell lymphoma, ref. 19). We also included two datasets from publications that report the identification of HLA-I peptides derived from PCPS (20, 21). We only included datasets with fragment ion spectra providing high mass accuracy (i.e., fragment ion spectra acquired with Orbitrap mass analyzer). For all analyses, we exclusively used raw MS data.
De novo sequencing
De novo sequencing was performed with PEAKS X (refs. 22, 23; Bioinformatics Solutions Inc.). Raw data refinement was performed with the following settings: (i) Merge Options: no merge; (ii) Precursor Options: corrected; (iii) Charge Options: no correction; (iv) Filter Options: no filter; (v) Process: true; (vi) Default: true; and (vii) Associate Chimera: yes. De novo sequencing was performed with Parent Mass Error Tolerance set to 15 ppm for dataset PXD004894 (melanoma) and with 10 ppm for all other datasets. Fragment Mass Error Tolerance was set to 0.015 Da, and Enzyme was set to none. The following variable modifications were used: Oxidation (M), pyro-Glu from Q (N-term Q), and carbamidomethylation (C). A maximum of three variable posttranslational modifications were allowed per peptide. Up to 10 de novo sequencing candidates were reported for each identified fragment ion mass spectrum, with their corresponding average local confidence (ALC) score. Because we applied the chimeric spectra option of PEAKS X, two or more TOP10 candidate lists could be assigned to a single fragment ion spectrum. Two tables (“all de novo candidates” and “de novo peptides”) were exported from PEAKS for further analysis.
To efficiently search ultralarge sequence databases, first, a keyword trie was built from all de novo candidate sequences. A trie is a data structure that stores keywords and allows a text to be searched for all keywords simultaneously using the Aho–Corasick algorithm (24). To account for additional variable modifications (pyro-Glu from N-terminal Glu and deamidation at asparagine when followed by glycine), and for the isobaric Leu and Ile, all possible combinations of the corresponding sequences were inserted into the trie for each de novo candidate. Then the Aho–Corasick algorithm was employed, scanning through the 6-frame translation of the genome (reference assembly HG38) and 3-frame translation of the transcriptome (Ensembl 90). Optionally, on-the-fly generated proteasome-spliced peptides (normal and reverse cis-spliced, with a maximal intervening sequence length of 25 amino acids, from all annotated proteins, as described in ref. 20), all possible on-the-fly generated single amino acid substitutions from annotated proteins (i.e., at each position in each protein each of the 18 substitutions, excluding Leu), and all possible on-the-fly generated frameshift peptides (-3,-2,-1,1,2,3 nucleotide shifts at each position in each annotated protein) were scanned as well. All considered sequences were additionally reversed to form the decoy database, which was scanned in the same way.
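The keyword search described above can be illustrated with a small, self-contained sketch. It is not the Peptide-PRISM implementation: the function names (`il_variants`, `build_automaton`, `search`) and the toy sequences are hypothetical, only the Ile/Leu expansion is shown (the modification variants are omitted), and a short string stands in for the translated genome.

```python
from collections import deque
from itertools import product


def il_variants(seq):
    """Return every sequence obtained by exchanging the isobaric residues I and L."""
    options = [("I", "L") if aa in "IL" else (aa,) for aa in seq]
    return {"".join(choice) for choice in product(*options)}


def build_automaton(keywords):
    """Build trie transitions, failure links, and output sets (Aho-Corasick)."""
    goto, fail, out = [{}], [0], [set()]
    for kw in keywords:
        node = 0
        for ch in kw:
            if ch not in goto[node]:
                goto.append({})
                fail.append(0)
                out.append(set())
                goto[node][ch] = len(goto) - 1
            node = goto[node][ch]
        out[node].add(kw)
    queue = deque(goto[0].values())  # depth-1 nodes keep the root as failure link
    while queue:
        node = queue.popleft()
        for ch, child in goto[node].items():
            f = fail[node]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[child] = goto[f].get(ch, 0)
            out[child] |= out[fail[child]]  # inherit keywords ending at the suffix
            queue.append(child)
    return goto, fail, out


def search(text, automaton):
    """Scan text once and return (start position, keyword) for every match."""
    goto, fail, out = automaton
    node, hits = 0, []
    for i, ch in enumerate(text):
        while node and ch not in goto[node]:
            node = fail[node]
        node = goto[node].get(ch, 0)
        for kw in out[node]:
            hits.append((i - len(kw) + 1, kw))
    return hits


# Toy example: one de novo candidate, one short "translated" target sequence.
candidates = ["KLWDPLDLK"]
keywords = set().union(*(il_variants(c) for c in candidates))
automaton = build_automaton(keywords)
target = "MMKIWDPLDIKAA"
target_hits = search(target, automaton)
decoy_hits = search(target[::-1], automaton)  # reversed sequence as decoy
```

Inserting all Ile/Leu variants multiplies the number of keywords, but the scan time of the automaton depends on the text length and the number of matches rather than on the number of keywords, which is what makes this strategy attractive for very large databases.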
All identified string matches from the 6-frame translated genome and 3-frame translated transcriptome were categorized as follows: (i) Coding sequence (CDS): in-frame with annotated protein; (ii) 5′-UTR: contained in annotated mRNA, consistently with its introns, overlapping with 5′-UTR; (iii) Off-Frame: off-frame contained in the coding sequence, consistently with its introns; (iv) 3′-UTR: all others that are contained in an mRNA, consistently with its introns; (v) noncoding (nc) RNA: consistently contained in an annotated ncRNA; (vi) Intronic: intersecting any annotated intron; or (vii) Intergenic. For each fragment ion mass spectrum, the category with the highest priority (CDS > 5′-UTR > Off-Frame > Frameshift > 3′-UTR > ncRNA > Substitution > Intronic > Intergenic > PCPS) was identified, and all other hits among the 10 de novo candidates were discarded.
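The prioritization step can be sketched as follows. The tuple layout of a hit and the function name `keep_top_category` are assumptions made for illustration; only the priority order is taken from the text.

```python
# Priority order as stated in the text, highest priority first.
PRIORITY = [
    "CDS", "5'-UTR", "Off-Frame", "Frameshift", "3'-UTR",
    "ncRNA", "Substitution", "Intronic", "Intergenic", "PCPS",
]
RANK = {category: i for i, category in enumerate(PRIORITY)}


def keep_top_category(hits):
    """Keep only the hits of one spectrum that fall into its highest-priority category.

    hits: list of (sequence, category, alc_score) tuples (hypothetical layout).
    """
    if not hits:
        return []
    best = min(RANK[category] for _, category, _ in hits)
    return [hit for hit in hits if RANK[hit[1]] == best]


spectrum_hits = [
    ("SLDGKLQAV", "5'-UTR", 88),
    ("SIDGKIQAV", "CDS", 85),
    ("SLDGKQLAV", "Intergenic", 91),
]
kept = keep_top_category(spectrum_hits)
```

Note that the category decides before the score does: in the toy input the intergenic candidate has the highest ALC score but is still discarded in favor of the CDS hit.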
Next, if more than one fragment ion mass spectrum for the same peptide was seen in the dataset, only the fragment ion mass spectrum with maximal score among all these was retained. Finally, if the remaining candidate for a spectrum was $a$, the originally best candidate (irrespective of whether it was found in the sequence database or not) was $f$, and the next best remaining candidate in the same category with a distinct sequence was $n$, the spectrum was discarded if $\mathrm{ALC}(a) < \mathrm{ALC}(f) - \delta_f$ or $\mathrm{ALC}(a) < \mathrm{ALC}(n) + \delta_n$. Here, we set $\delta_f = 15$ (the maximal difference to the originally top candidate), and $\delta_n = 16$ (the minimal difference to the next best candidate sequence from the same category). Isobaric Ile/Leu and additionally considered modifications were not treated as distinct here.
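As a minimal sketch, the two discard conditions can be written as a single predicate. The function name and the convention of passing `None` when no competing candidate exists are assumptions; the thresholds are the values given above.

```python
def keep_spectrum(alc_a, alc_f, alc_n=None, delta_f=15, delta_n=16):
    """Return True if the retained candidate a passes both score-gap filters.

    alc_a: ALC of the retained database-matched candidate a
    alc_f: ALC of the originally best de novo candidate f
    alc_n: ALC of the next best distinct candidate n in the same category,
           or None if there is no such candidate (assumed convention)
    """
    if alc_a < alc_f - delta_f:  # too far below the original top candidate
        return False
    if alc_n is not None and alc_a < alc_n + delta_n:  # not separated from runner-up
        return False
    return True
```

For example, a candidate with ALC 80 is kept when the top candidate scored 90 and the runner-up 60, but discarded when the runner-up scored 70, because the gap of 10 is below the required 16.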
This procedure resulted in a list of unique peptide sequences, each annotated with its best ALC score, its category, and whether it originated from the target or decoy database. In principle, after discarding all categories but CDS, a standard target-decoy approach can be utilized to filter proteome-derived peptides by the classic FDR.
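For orientation, a classic target-decoy filter can be sketched as below. The estimator used here (decoys at or above the threshold divided by targets at or above the threshold) is one common variant and is an assumption; the text does not specify which variant was applied, and the scores are invented.

```python
def target_decoy_threshold(target_scores, decoy_scores, max_fdr=0.01):
    """Return the lowest score threshold whose estimated FDR is <= max_fdr.

    FDR(t) is estimated as (#decoys with score >= t) / (#targets with score >= t).
    Returns None if no threshold satisfies the bound.
    """
    best = None
    for t in sorted(set(target_scores), reverse=True):
        n_targets = sum(s >= t for s in target_scores)
        n_decoys = sum(s >= t for s in decoy_scores)
        if n_decoys / n_targets <= max_fdr:
            best = t
    return best


# Toy ALC scores for CDS hits in the target and the reversed (decoy) database.
targets = [99, 95, 90, 80, 60, 50]
decoys = [55, 40]
loose = target_decoy_threshold(targets, decoys, max_fdr=0.2)
strict = target_decoy_threshold(targets, decoys, max_fdr=0.1)
```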
FDR control in Peptide-PRISM
Peptide-PRISM was built on the following mixture modeling approach. For a given length-specific score distribution with density $p_l(x)$ modeling true hits (i.e., peptides that actually were in the sample), and a second length-specific score distribution with density $n_l(x)$ modeling false hits, the overall score distribution of all filtered peptides of a given length $l$ and category/stratum $c$ is given by the density $f_{l,c}(x) = w_{l,c}\,p_l(x) + (1 - w_{l,c})\,n_l(x)$. Here, $w_{l,c}$ is the total fraction of true hits with length $l$ from category $c$. Note that the same two component distributions were used for all categories. For the rationale of this model, see Supplementary Data S1 (2, 3, 20, 25–30).
For each observed peptide length, category, and target/decoy status, Peptide-PRISM first built histograms of the (integer) ALC scores of the filtered peptides. Then, for each peptide length, all decoy histograms were summed up to fit the false-hit score distribution using unimodal penalized B-spline regression (31). The true-hit score distribution was fit in the same manner after subtracting the CDS decoy histogram from the CDS target histogram. For the rationale of these approaches, see Supplementary Data S1. Then, for each peptide length $l$ and category $c$, $w_{l,c}$ was estimated by maximum likelihood (30) based on the respective score histograms of hits in target sequences. The expected number of true and false targets per ALC score was computed by multiplying $w_{l,c}$ and $(1 - w_{l,c})$, respectively, with the total number of identified peptides with length $l$ in category $c$. Finally, these expected numbers of true and false targets were used to compute FDRs per peptide length and category.
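The weight estimation can be illustrated with a short sketch for a single length and category. Several points are assumptions: the two component densities are passed in as already fitted dictionaries (the B-spline fitting is not reproduced), the likelihood is maximized with a simple expectation-maximization iteration rather than whatever optimizer the authors used, and the FDR at a threshold is taken as the expected false hits divided by all expected hits at or above that score.

```python
def estimate_weight(hist, p, n, iters=200):
    """Maximum-likelihood estimate of the true-hit fraction w in w*p + (1-w)*n.

    hist: {score: count} of target hits for one peptide length and category
    p, n: {score: probability} for the true-hit and false-hit components
    """
    total = sum(hist.values())
    w = 0.5
    for _ in range(iters):
        responsibility = 0.0
        for score, count in hist.items():
            true_part = w * p.get(score, 0.0)
            mixture = true_part + (1.0 - w) * n.get(score, 0.0)
            if mixture > 0:
                responsibility += count * true_part / mixture
        w = responsibility / total
    return w


def fdr_at(threshold, w, p, n):
    """Expected fraction of false hits among all hits with score >= threshold."""
    false_hits = (1.0 - w) * sum(v for s, v in n.items() if s >= threshold)
    true_hits = w * sum(v for s, v in p.items() if s >= threshold)
    if false_hits + true_hits == 0:
        return 0.0
    return false_hits / (false_hits + true_hits)


# Toy components on three score values; the histogram was built with w = 0.3.
p_true = {80: 0.5, 90: 0.5}
n_false = {40: 0.5, 80: 0.5}
observed = {40: 35, 80: 50, 90: 15}
w_hat = estimate_weight(observed, p_true, n_false)
```

Because the components are shared across categories and only the weight is category specific, a stratum with few true hits receives a small weight and therefore a stricter score cutoff at the same nominal FDR.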
Putative ORF identification
Each remaining peptide was identified to originate from a location on the genome, or potentially several locations if the same sequence (or any of its Ile/Leu or variable modification variants) occurred multiple times in its category. For each location, all potential ORFs were identified. More than one ORF can exist because all paths via exon–exon junctions of annotated transcript isoforms were considered in addition to the path on the genome. Because translation might initiate at non-AUG start codons (13), two kinds of ORFs were considered. For nonprioritized ORFs, along each path in the genome or transcriptome upstream of the peptide location, the closest in-frame start codon candidate was identified (one of AUG, CUG, GUG, ATC, and ACG). For prioritized ORFs, any closest in-frame AUG was identified. If none was found, the closest CUG was identified, followed by GUG, ATC, and ACG. From all remaining start codon candidates, the one closest to the peptide was chosen for both kinds of ORFs. The same procedure was repeated downstream of the peptide location to identify the stop codon (only nonprioritized). Finally, for peptides with more than one location, the location giving rise to the shortest prioritized ORF was chosen.
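The two start codon rules can be sketched for a single linear path. This is a simplified illustration under several assumptions: the sequence is given in the DNA alphabet (so AUG, CUG, and GUG appear as ATG, CTG, and GTG), the upstream scan stops at the first in-frame stop codon, splice paths are ignored, and all function names are hypothetical.

```python
START_PRIORITY = ["ATG", "CTG", "GTG", "ATC", "ACG"]  # order taken from the text
STOP_CODONS = {"TAA", "TAG", "TGA"}


def upstream_starts(seq, pep_start):
    """In-frame start codon candidates upstream of the peptide, closest first.

    The scan ends at the first upstream in-frame stop codon (assumed behavior).
    """
    found = []
    for pos in range(pep_start - 3, -1, -3):
        codon = seq[pos:pos + 3]
        if codon in STOP_CODONS:
            break
        if codon in START_PRIORITY:
            found.append((pos, codon))
    return found


def nonprioritized_start(seq, pep_start):
    """Closest in-frame candidate of any of the five codon types."""
    candidates = upstream_starts(seq, pep_start)
    return candidates[0] if candidates else None


def prioritized_start(seq, pep_start):
    """Closest ATG if one exists, otherwise closest CTG, and so on."""
    candidates = upstream_starts(seq, pep_start)
    for wanted in START_PRIORITY:
        for pos, codon in candidates:
            if codon == wanted:
                return pos, codon
    return None


def downstream_stop(seq, pep_end):
    """Position of the first in-frame stop codon at or after the peptide end."""
    for pos in range(pep_end, len(seq) - 2, 3):
        if seq[pos:pos + 3] in STOP_CODONS:
            return pos
    return None


# Toy locus: stop, ATG, AAA, CTG, GCC, then the peptide codons, then a stop.
locus = "TAAATGAAACTGGCC" + "AAGCTGTGG" + "TAG"
peptide_start, peptide_end = 15, 24
```

In this toy locus the nonprioritized rule selects the CTG at position 9, whereas the prioritized rule skips it in favor of the more distant ATG at position 3.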
Prediction of HLA-I peptide binding
NetMHCpan 4.0 (32) was run on all final remaining peptide sequences with the HLA-I alleles given in the original publication for the corresponding patient or cell line. The allele with the minimal rank reported by NetMHCpan was annotated. As per default, we used a cutoff of 0.5% rank for strong binders and 2% rank for weak binders. For samples with unknown HLA-I genotype, Gibbs clustering was performed with GibbsCluster 2.0 (33), and alleles were manually assigned by comparing motifs from clustering results with known HLA-I peptide motifs (34).
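The annotation rule can be sketched as a small helper that works on already computed percentile ranks. It does not call NetMHCpan; the dictionary layout, the function name, the rank values, and the treatment of values exactly at a cutoff as passing are assumptions.

```python
def classify_binding(ranks, strong=0.5, weak=2.0):
    """Pick the allele with the minimal percentile rank and label the peptide.

    ranks: {allele: percent rank}, e.g. parsed from prediction output
    Returns (allele, rank, label), label in {"strong", "weak", "non-binder"}.
    """
    allele = min(ranks, key=ranks.get)
    rank = ranks[allele]
    if rank <= strong:
        label = "strong"
    elif rank <= weak:
        label = "weak"
    else:
        label = "non-binder"
    return allele, rank, label


# Invented ranks for one peptide and three alleles of a hypothetical patient.
example = classify_binding({"HLA-A*03:01": 0.12, "HLA-B*07:02": 14.0, "HLA-C*07:02": 6.5})
```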
MS analysis of synthetic peptides
Peptides were ordered from JPT Peptide Technologies. Peptides were dissolved in 2% acetonitrile, and nanoLC-MS/MS analyses were performed on an Orbitrap Fusion (Thermo Fisher Scientific) equipped with a PicoView Ion Source (New Objective) and coupled to an EASY-nLC 1000 (Thermo Fisher Scientific). Peptides were loaded on capillary columns (PicoFrit, 30 cm × 150 μm ID, New Objective) self-packed with ReproSil-Pur 120 C18-AQ, 1.9 μm (Dr. Maisch GmbH), and separated with a 30-minute linear gradient from 3% to 30% acetonitrile and 0.1% formic acid at a flow rate of 500 nL/minute. Both MS and MS-MS scans were acquired in the Orbitrap analyzer with a resolution of 60,000 for MS scans and 15,000 for MS-MS scans. Higher energy collisional dissociation (HCD) with 35% normalized collision energy was applied. A top speed data-dependent MS-MS method with a fixed cycle time of 3 seconds was used. Dynamic exclusion was applied with a repeat count of 1 and an exclusion duration of 30 seconds. Singly charged precursors were excluded from selection. EASY-IC was used for internal calibration.
The data and scripts to reproduce all figures are available at Zenodo (DOI: 10.5281/zenodo.3775934); they can be accessed by visiting https://doi.org/10.5281/zenodo.3775934. Peptide-PRISM is available free of charge for academic use at http://software.erhard-lab.de.
Sensitive and reliable identification of HLA-I peptides by Peptide-PRISM
The standard analysis workflow for MS-based proteomics relies on sequence database search coupled with FDR estimation using the target-decoy approach (26). Commonly applied search engines, such as Mascot, Andromeda, or Comet, try to identify the best peptide-spectrum match (PSM) by matching experimental and theoretical fragment ion spectra of sequence candidates. This approach works well for tryptic peptides, the most frequently analyzed sample type in proteomics, but shows only moderate sensitivity for HLA-I peptides. On the one hand, this is because of the limited length of HLA-I peptides, which limits the maximum number of matching fragment ions and the maximum matching score. On the other hand, the moderate sensitivity is caused by the missing cleavage specificity (nontryptic), which greatly increases the search space. Because de novo peptide sequencing performs well for short peptides, we hypothesized that the length restriction of HLA-I peptides of predominantly 8 to 11 amino acids could enable reliable de novo peptide sequencing based on state-of-the-art high mass accuracy tandem mass spectra. To test this, we developed a computational pipeline by combining de novo peptide sequencing, highly efficient string search, mixture modeling, and stratification of the search space for the analysis of HLA-I immunopeptidomes and termed it Peptide-PRISM (Fig. 1A). Because the quality of fragment ion spectra is often not high enough to derive a single definite peptide sequence, we generated up to 10 sequence candidates by de novo sequencing per spectrum. In a second step, all candidates were matched against a database. For the majority of spectra, not more than one of the sequence candidates was found in the database, and the correct peptide sequence could be identified by database matching of de novo sequencing candidates.
In cases with multiple matching database peptides, we used a biologically motivated heuristic to select a single candidate: we stratified the database into biologically meaningful categories and prioritized these categories to select the most parsimonious candidate. For instance, if a matching sequence was found in the proteome and as part of a 5′-UTR, we selected the proteome hit as the single candidate. This is equivalent to first searching the proteome, then searching the 5′-UTR with all so far unmapped sequences, etc. (see Materials and Methods for the specifics of the prioritization). To test the performance of our new approach, we applied Peptide-PRISM for the analysis of a tumor immunopeptidome sample from a patient with melanoma (MM15; ref. 15). Database matching of de novo sequencing candidates resulted in approximately 35% more high-confidence, conventional peptides derived from proteins (1% FDR based on classic target-decoy approach) compared with the previously employed database searching approach (Fig. 1B; ref. 15). Of note, the identified peptides exhibited the same frequency of HLA-I binders predicted by NetMHCpan 4.0 (32) as the previously identified peptides (Fig. 1C). This provides evidence that the improved sensitivity was not accompanied by a loss of specificity.
In contrast to conventional database search tools, de novo peptide sequencing, in principle, allows the systematic identification of cryptic HLA peptides. To achieve this, Peptide-PRISM utilizes a highly efficient string search algorithm to search for millions of peptides in sequence databases on the order of gigabases in less than an hour. For FDR filtering, we designed a statistical method that extends the central idea from Peptide Prophet (29). Peptide-PRISM utilizes mixture modeling to deconvolute the overall de novo score distribution into components of false and true identifications. Here, we combined this approach with the stratification of the sequence search space introduced above, which was essential for maintaining statistical power. Our Peptide-PRISM approach identified 1,563 cryptic peptides at 1% FDR, corresponding to more than 4.5% of all identified peptides for MM15 (Fig. 1B), and the predicted binding affinities of the cryptic peptides to HLA-I were indistinguishable from those of the conventional peptides (Fig. 1C). The median intensities of the identified cryptic peptides obtained from peak areas of extracted ion chromatograms did not show any differences compared with conventional peptides (Supplementary Fig. S1). Twenty-four of twenty-four identified cryptic peptides were successfully validated by reference spectra of synthetic peptides (Supplementary Fig. S2). In summary, these results confirmed the stringent FDR control of our approach.
To map de novo sequencing candidates, we generated a database consisting of the 6-frame translated human genome, the 3-frame translated transcriptome from Ensembl (including all coding and noncoding splice variants), a PCPS peptide database (normal and reverse cis-spliced peptides with a maximal intervening distance of 25), the human proteome with all possible substitutions of a single amino acid, and any peptide that could be generated from ribosomal frameshifting in the human proteome (for details see Materials and Methods). Peptides in this database can be stratified into nine categories, in addition to conventional peptides from CDSs (Fig. 1D). Each of these strata had a distinct size and likelihood for identifying peptides, which had an impact on the estimation of FDRs (Supplementary Data S1). Peptide-PRISM allowed for FDR control per stratum as opposed to the global FDR control by the classic target-decoy approaches. As a consequence, the relative frequencies of the peptides above and below the thresholds used for FDR filtering were distinct among strata. Therefore, counting the peptides retained after filtering was insufficient for assessing the relative composition of a sample by peptides from different strata. However, Peptide-PRISM allowed for estimating the number of peptides in each stratum without FDR filtering based on the mixture modeling approach (Fig. 1E).
PCPS peptides do not represent a large fraction of immunopeptidomes
It has been suggested that proteasomes can catalyze the ligation of two peptides, generating epitopes that do not occur consecutively in the proteome (35, 36). The contribution of such peptides generated by PCPS to HLA-I immunopeptidomes of cell lines or primary samples has raised major controversies. The percentage of PCPS peptides for the same dataset (GR-LCL) was estimated to be as high as 33% (20), 16% (2), or not more than 2% to 6% (3). We identified two major issues with the previous analyses. First, Liepe and colleagues (33% and 16% PCPS peptides; refs. 2, 20) used Mascot to perform sequence database search and the target-decoy approach for controlling the FDR. To reduce the computational burden for the search, only peptides with a precursor mass observed in the experimental data were included in the sequence database as concatemeric pseudoprotein sequences. Unfortunately, the randomization of the pseudoproteins used to generate decoys resulted in drastically underestimated FDRs (see Supplementary Data S1). Second, we found that the spectra of 517 (49%) of the reported 1,056 PCPS peptides (2) either matched to conventional peptides (n = 377; 36%) or cryptic peptides (n = 140; 13%) in our analysis (10% FDR, only binders predicted by NetMHCpan 4.0; Supplementary Table S2). These included 204 (19%) reported PCPS peptides with a sequence differing only by I/L from conventional nonspliced peptides. In our reanalysis of the GR-LCL dataset with Peptide-PRISM, only 10 spliced peptides, less than 0.1% of the identified HLA peptides, survived 1% FDR filtering.
A variety of peptides undermine FDR control
Even after stringent FDR filtering (1% FDR), Peptide-PRISM identified a number of PCPS peptides for MM15 (n = 442, 1.3% of all filtered peptides; Fig. 1D), as well as for the other datasets (Supplementary Table S1; Supplementary Fig. S3; 1% FDR). However, the remaining PCPS peptide candidates had a lower fraction of predicted HLA-I binders than that of conventional or cryptic peptides, which was in the range of decoy peptides for MM15 (Fig. 2A) and the other datasets (Supplementary Fig. S4). The de novo score (ALC) distribution of PCPS peptides resembled the ALC score distribution of decoys. For instance, approximately 30% of the PCPS peptides and approximately 26% of decoy peptides had an ALC score of ≥50 (A50 score), indicating that most were false identifications (Fig. 2B). After removing the expected false identifications by subtracting their estimated distribution from the mixture distribution, the remaining ALC scores of PCPS peptides, but not of cryptic peptides, still showed significant differences from conventional peptides (Fig. 2C; peptides with ALC score ≥90: A90 PCPS = 13%; A90 cryptic = 38%; and A90 conventional = 40%). The length distribution of PCPS peptides was significantly shifted toward longer peptides (Fig. 2D). All three phenomena indicated that the majority of the remaining PCPS peptides represented false identifications.
Not all fragment ion mass spectra contain full y ion series. In such cases, two or more sequences explain the spectrum equally well, and the wrong sequence might be identified because of spurious peaks in the spectrum. Indeed, the spliced peptide IEVGNLPSAMR (Fig. 2E) clearly represented such a misidentification due to a missing y10 fragment ion (Fig. 2F), and the parsimonious explanation for the spectrum clearly was the consecutive peptide EIVGNLPSAMR in the same locus. These isobaric ambiguities were not an issue of de novo sequencing, but inherent to MS. The majority of PCPS peptides (>73%) identified at 1% FDR for MM15 resembled this example (Fig. 2G), and the majority of these had a matching conventional peptide in the corresponding locus (Supplementary Fig. S5A and S5B). This was also true for the two datasets used in studies on PCPS (refs. 20, 37; Supplementary Fig. S5C and S5G). After filtering to 1% FDR and removing the most obvious false identifications, less than 0.3% potential PCPS peptides remained for MM15 and even less for the other datasets.
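The ambiguity between the two sequences can be checked with a short calculation of singly charged b and y fragment masses. The monoisotopic residue masses are standard values rounded to five decimals; the function names and the comparison tolerance are choices made for this sketch.

```python
# Monoisotopic residue masses (Da) for the residues needed here.
MONO = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "L": 113.08406, "I": 113.08406, "N": 114.04293,
    "E": 129.04259, "M": 131.04049, "R": 156.10111,
}
WATER, PROTON = 18.01056, 1.00728


def b_ions(seq):
    """Singly charged b1..b(n-1) m/z values (N-terminal fragments)."""
    total, ions = PROTON, []
    for aa in seq[:-1]:
        total += MONO[aa]
        ions.append(total)
    return ions


def y_ions(seq):
    """Singly charged y1..y(n-1) m/z values (C-terminal fragments)."""
    total, ions = WATER + PROTON, []
    for aa in reversed(seq[1:]):
        total += MONO[aa]
        ions.append(total)
    return ions


def differing(ions_a, ions_b, tol=1e-6):
    """1-based indices of fragment ions whose masses differ by more than tol."""
    return [i + 1 for i, (x, y) in enumerate(zip(ions_a, ions_b)) if abs(x - y) > tol]


spliced, consecutive = "IEVGNLPSAMR", "EIVGNLPSAMR"
y_diff = differing(y_ions(spliced), y_ions(consecutive))
b_diff = differing(b_ions(spliced), b_ions(consecutive))
```

Only y10 and b1 distinguish the two sequences, so a spectrum lacking both peaks supports them equally well.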
Interchanging the first two amino acids has a strong effect on predicted binding affinities to most HLA alleles. Replacing PCPS peptide sequences by corresponding conventional sequences in the respective locus, as suggested by the above analyses, improved the predicted binding affinities from below the level of decoy peptides to the same level as conventional and cryptic peptides (Supplementary Fig. S5H). This was not only true for cases with two amino acids at the N- or C-terminal part, but also for the remaining peptides. Taken together, this showed that most remaining PCPS peptides undermined FDR control by the target-decoy approach.
We observed the same three phenomena (low predicted HLA-I binding affinity, left-shifted de novo score distribution, and longer than expected peptides) for peptides spanning putative ribosomal frameshifting events (see Materials and Methods) and for peptides with single amino acid substitutions. Closer inspection of peptide spectra of the latter category revealed that the majority of identifications could be readily explained by unconsidered modifications, such as deamidation, N-terminal acetylation, and oxidation of cysteine to cysteic acid, or by erroneous identification of the precursor mass (Supplementary Fig. S6). We conclude that frameshift, substitution, and PCPS peptides all predominantly represented pseudoidentifications.
HLA peptides harboring a single amino acid exchange caused by tumor-specific nonsynonymous DNA mutations (neoantigens) are promising candidates for antitumor immunotherapeutic approaches in melanoma and other malignancies (38, 39). We reasoned that neoantigen identification might also suffer from false-positive identifications caused by isobaric ambiguities. Indeed, we found that the MS-MS spectrum of the previously identified neoantigen candidate ASWVVPIDIK of MM15 corresponded to the cryptic peptide KLWDPLDLK originating from the 5′-UTR of a different gene. Assignment of the spectrum to the cryptic peptide was unequivocally confirmed with synthetic peptides (Supplementary Fig. S7). The cryptic peptide was predicted to be a strong binder to the patient's HLA-A*03:01 allele. Both peptides had the exact same mass, and the C-terminal part of their sequence was identical (except for I/L), explaining why the incorrect neoantigen sequence gained a relatively high matching score from the Andromeda search engine. This example demonstrates that isobaric ambiguities can easily lead to false-positive neoantigen identification if the correct (cryptic) peptide sequence is excluded from the search.
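That the two sequences are isobaric can be verified from standard monoisotopic residue masses. The sketch below is only a consistency check of the statement above; the function names are hypothetical.

```python
# Monoisotopic residue masses (Da) for the residues occurring in both peptides.
MONO = {
    "A": 71.03711, "S": 87.03203, "P": 97.05276, "V": 99.06841,
    "L": 113.08406, "I": 113.08406, "D": 115.02694, "K": 128.09496,
    "W": 186.07931,
}
WATER = 18.01056


def peptide_mass(seq):
    """Neutral monoisotopic mass of an unmodified peptide."""
    return sum(MONO[aa] for aa in seq) + WATER


def common_suffix_len(a, b):
    """Length of the shared C-terminal stretch, treating I and L as identical."""
    a, b = a.replace("I", "L"), b.replace("I", "L")
    n = 0
    while n < min(len(a), len(b)) and a[-1 - n] == b[-1 - n]:
        n += 1
    return n


neoantigen_candidate = "ASWVVPIDIK"
cryptic_peptide = "KLWDPLDLK"
mass_difference = peptide_mass(neoantigen_candidate) - peptide_mass(cryptic_peptide)
shared_c_terminus = common_suffix_len(neoantigen_candidate, cryptic_peptide)
```

The residues that differ between the two peptides (A, S, V, V versus K, L, D) have the same summed elemental composition, so the mass difference is zero up to floating-point rounding, and the last five residues agree except for I/L.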
Cryptic peptides are a substantial part of tumor immunopeptidomes
After excluding PCPS peptides, peptides with single amino acid substitutions, and frameshift peptides, we concentrated on the remaining six categories of cryptic peptides. In a collection of HLA-I tumor immunopeptidome datasets (Supplementary Table S1), we identified more than 6,500 cryptic HLA-I peptides after stringent filtering (n = 6,636, FDR 10%, strong binders as predicted by NetMHCpan 4.0; Fig. 3A; Supplementary Table S3). The largest category with 2,798 peptides consisted of cryptic peptides located in the 5′-UTR of coding transcripts. The first 50 nucleotides (nt) of the UTR were mostly devoid of peptides, and peptides were uniformly distributed across the rest of the UTR (Fig. 3B). This is consistent with these peptides being translated from upstream ORFs (uORF) scattered throughout 5′-UTRs. The second largest category consisted of peptides located inside, but out-of-frame of, protein-coding ORFs (“Off-Frame”). Of those, the majority were located within the first 200 nt of the protein ORF. This indicated that they are encoded either by uORFs ending within the protein ORF or by internal ORFs due to leaky scanning through the AUG of the protein ORF (Fig. 3C). A total of 276 cryptic peptides were derived from 3′-UTRs (Fig. 3D), of which 101 had no additional upstream in-frame stop codon and thus might originate from stop codon read-through (Fig. 3E; Frame = 0) or frameshifting (Frame = 1, −1). Of the 133 cryptic peptides located within introns, 19 had no upstream in-frame stop codon and were likely generated by translation into a retained intron (Fig. 3F; Supplementary Fig. S8 for an example), consistent with a report identifying 17 HLA-I peptides from retained introns in cell lines (40). Almost 1,000 peptides were encoded by annotated ncRNAs. Most of the peptides were located within the first 700 nt (Fig. 3G) of the ncRNA, indicating that they were indeed translated. Finally, we identified 103 HLA-I peptides from intergenic regions.
For all peptides, we identified the first downstream in-frame stop codon and the closest upstream canonical (AUG) or noncanonical (CUG, GUG, ATC, and ACG) start codon (either in an unbiased manner or by prioritizing AUG > CUG > GUG > ATC > ACG; see Materials and Methods). For all cryptic categories, we observed the same expected distribution of start codon frequencies (Fig. 4A), providing strong evidence that all categories originated from bona fide ORFs and that translation initiation at noncanonical start codons could produce HLA-I–presented peptides.
The canonical pathway for HLA-I peptide presentation includes protein degradation by the proteasome into 9 to 20 amino acid–long peptides, transport into the endoplasmic reticulum (ER) by the transporter associated with antigen processing (TAP), N-terminal trimming by ERAP1/2 to 9 to 10 amino acids, and binding to HLA-I. For most cryptic peptides (n = 3,634; 54%) independent of their category, the predicted start codon (with prioritization) was within 10 amino acids from the N-terminus of the peptide (Fig. 4B), and for 1,240 (>18%) cases, the peptide represented the C-terminus of the translation product (Fig. 4C). Thus, at least 10% (n = 710) of cryptic peptides could directly enter the ER via TAP independently of processing by the proteasome or any cytoplasmic peptidases. This is in contrast to conventional peptides, which had uniform distance distributions for translation initiation sites (TIS) and stop codons (Fig. 4B and C) with the exception that about twice as many peptides as expected from the uniform distribution were located at the C-terminus of the protein. We speculate that this reflects the number of cleavage events necessary to produce the peptide.
The percentage of cryptic peptides varies among different alleles
The amino acid frequency distributions of conventional and cryptic peptides showed clear differences for both anchor residues (B and F pocket), and most prominently indicated more frequent basic residues in the F pocket for cryptic peptides (Supplementary Fig. S9A). This was also true after controlling for different background distributions of amino acid frequencies in and outside of the proteome (Fig. 4D). The HLA-I locus is highly polymorphic, and each HLA-I allele has a distinct sequence preference for ligands. Intriguingly, the percentage of cryptic peptides was highly HLA-I allele–specific and varied greatly (<1% to >15%) between different HLA-I alleles (Fig. 4E). For instance, consistently across experiments (Supplementary Table S4), approximately 10% of the ligands bound to alleles from the common A03 HLA-I supertype were cryptic peptides. The major determinant for ligand specificity of A03 is a basic residue at the C-terminus. We hypothesized that this might be due to a specific processing mechanism of cryptic peptides, such as a protease with specificity for basic amino acids. However, the percentage of processing-independent peptides (peptides ending directly at a stop codon) was constant for all alleles including A03 (Supplementary Fig. S9B), and the percentage of basic residues at the C-terminus of processed and processing-independent cryptic peptides was the same (Supplementary Fig. S9C). Thus, the higher frequency of cryptic peptides from A03 is not due to a distinct processing mechanism. In summary, we found that the percentage of cryptic peptides presented is allele specific. The reasons for the preference of certain alleles for cryptic peptides, however, remain elusive.
The success of cancer immunotherapies, such as immune checkpoint blockade (41) or adoptive T-cell transfer (42), demonstrates that cytotoxic T cells can efficiently recognize and eliminate tumor cells in vivo. However, the tumor-specific antigens that are recognized by the corresponding cytotoxic T cells mediating the antitumor effect are mostly unknown. Neoantigens came into focus as a potentially relevant class of tumor-specific targets (38, 39). Despite tremendous efforts, identified neoantigens are scarce. In contrast, cryptic HLA peptides that were first discovered more than 30 years ago (43) have largely been neglected as potential tumor-specific targets. One crucial reason for this is the lack of computational methods and tools for their efficient identification. Laumont and colleagues identified a number of tumor-specific cryptic HLA peptides in murine cancer cell lines, as well as in human tumor tissue. Their individualized proteogenomic approach was based on customized databases derived from RNA-seq of tumor cells and medullary thymic epithelial cells (16). In contrast, our de novo sequencing–based approach enables more efficient and more sensitive identification of cryptic peptides without the need for DNA or RNA-sequencing of the corresponding tumor sample. This enabled us to uncover that the frequency of these epitopes was highly HLA-I allele–specific and varied considerably between different HLA-I allomorphs. A large fraction of cryptic peptides apparently takes a shortcut in antigen processing. Their C-terminus is defined by translation termination at the stop codon directly downstream of the peptide and not by the proteasome. It is unclear whether the proteasome is required to cleave upstream of the peptide. However, a large number of cryptic ORFs are translated into polypeptides shorter than 25 amino acids. Thus, in contrast to all known peptides from large protein-coding ORFs, they might be presented in a proteasome-independent manner. 
A subset of the cryptic peptides might derive from aberrant, tumor-specific expression, such as intron retention (40), frameshift mutations (44), alternative splicing (45), or translation events such as translation from unconventional 5′ start sites (46) that might generate cryptic neoantigens. Further evidence for the existence of tumor-specific translation events comes from a study in which comparative Ribo-seq analyses of hepatocellular carcinoma versus normal tissue were performed (47). It has been shown that knockdown of the 40S ribosomal protein S28 increases translation from noncanonical ORFs and initiation from non-AUG start codons (48). This indicates that variations in the translation machinery of tumor cells, for example, by mutation of ribosomal proteins or translation initiation factors, can lead to tumor-specific (cryptic) translation events. Hence, similar to classic neoantigens, all these types of peptides may result from tumor-specific mutations (e.g., by frameshift mutations, deletions/insertions, or by mutations generating new transcription or translation start sites). However, in contrast to classic neoantigens, they do not resemble self-peptides that only differ by a single amino acid and are, thus, more likely to induce tumor-specific immune responses (49). Thus, cryptic HLA peptides that can now be efficiently and reliably identified with Peptide-PRISM provide a rich source of potential targets for cancer immunotherapy that can be tested for tumor specificity and immunogenicity.
Peptides originating from PCPS were claimed to contribute up to >30% of HLA-I ligandomes (20). However, the contribution of PCPS to immunopeptidomes has raised considerable controversy in the field. Here, we unequivocally showed that >99% of the reported PCPS peptides are false identifications and resulted from various kinds of methodologic errors. Our work refutes reported evidence that PCPS occurs frequently in vivo (2, 20).
Disclosure of Potential Conflicts of Interest
F. Erhard reports grants from Deutsche Forschungsgemeinschaft during the conduct of the study and a patent for EP 20 170 185.1 pending to the European Patent Office. B. Schilling reports grants from Interdisciplinary Center for Clinical Research (IZKF) Würzburg during the conduct of the study and a patent for EP 20 170 185.1 pending to the European Patent Office. A. Schlosser reports a patent for EP 20 170 185.1 pending to the European Patent Office. No potential conflicts of interest were disclosed by the other author.
F. Erhard: Conceptualization, data curation, software, formal analysis, funding acquisition, visualization, methodology, writing–original draft. L. Dölken: Conceptualization, funding acquisition, writing–review and editing. B. Schilling: Conceptualization, funding acquisition, writing–review and editing. A. Schlosser: Conceptualization, data curation, formal analysis, funding acquisition, investigation, visualization, methodology, writing–original draft.
The authors thank Wolfgang Kastenmüller, Georg Gasteiger, and Elmar Wolf for critical comments on this article. This work was supported by a grant from the Interdisciplinary Center for Clinical Research (IZKF) Würzburg (to B. Schilling and A. Schlosser) and a grant from the Deutsche Forschungsgemeinschaft (FOR 2830, DO 1275/7-1; to F. Erhard and L. Dölken).
The costs of publication of this article were defrayed in part by the payment of page charges. This article must therefore be hereby marked advertisement in accordance with 18 U.S.C. Section 1734 solely to indicate this fact.