Tumor antigens can emerge through multiple mechanisms, including translation of noncoding genomic regions. This noncanonical category of tumor antigens has recently gained attention; however, our understanding of how they recur within and between cancer types is still in its infancy. Therefore, we developed a proteogenomic pipeline based on deep learning de novo mass spectrometry (MS) to enable the discovery of noncanonical MHC class I–associated peptides (ncMAP) from noncoding regions. Considering that the emergence of tumor antigens can also involve posttranslational modifications (PTM), we included an open search component in our pipeline. Leveraging the wealth of MS-based immunopeptidomics, we analyzed data from 26 MHC class I immunopeptidomic studies across 11 different cancer types. We validated the de novo identified ncMAPs, along with the most abundant PTMs, using spectral matching and controlled their FDR to 1%. The noncanonical presentation appeared to be 5 times enriched for the A03 HLA supertype, with a projected population coverage of 55%. The data reveal an atlas of 8,601 ncMAPs with varying levels of cancer selectivity and suggest 17 cancer-selective ncMAPs as attractive therapeutic targets according to a stringent cutoff. In summary, the combination of the open-source pipeline and the atlas of ncMAPs reported herein could facilitate the identification and screening of ncMAPs as targets for T-cell therapies or vaccine development.

The accelerated adoption of mass spectrometry (MS) for high-throughput profiling of immunopeptidomes in cancer has led to several discoveries. Leveraging these studies to improve cancer immunotherapy involves connecting the wealth of immunopeptidomic data to immunogenomics, where the goal is to carefully choose effective targets for T-cell therapies or vaccine development.

The discovery of cancer antigens has mainly focused on mutated tumor-specific antigens (neoantigens) arising from patient-specific somatic mutations. It has been shown that only a small percentage of the numerous nonsynonymous mutations in a tumor actually produce neoantigens (1, 2). The challenging task of identifying those that can evoke a suitable tumor rejection was addressed by Ebrahimi-Nik and colleagues (3). Using a combination of genomics, shotgun MS immunopeptidomics, and targeted MS, they found that (i) MS-identified neoepitopes are a rich source of tumor rejection–mediating antigens, (ii) neoantigens derive from passenger mutations, and (iii) binding affinity and CD8+ T-cell responses in tumor-bearing hosts are poor predictors of antitumor activity in vivo. Although neoantigens confer an advantage to patients undergoing immunotherapy (4), their patient-specific nature is a major bottleneck when producing off-the-shelf treatments for a large number of individuals. Alternatively, shared neoantigens (ref. 5; i.e., recurrent mutations in cancer) could offer a new line of population-level immunotherapy. However, high-throughput experimental profiling of such broadly presented neoantigens across the human population is a long-term goal with many milestones to be achieved.

Recently, tumor antigens that exceed the exome boundaries (i.e., noncanonical) have attracted attention as potential targets as a result of their immunogenicity and recurrence among cancer patients (6). These antigens find their way to the cell surface through rapid degradation (7) of “noncoding” translation products stemming from novel open reading frames (nORF; ref. 8). In addition, “noncoding” translation products can originate from other sources (9), including intron retention (IR; ref. 10), ribosomal slippage (11), and frameshift mutations (12). In 2016, Laumont and colleagues (13) demonstrated their association with MHC molecules using a reductionist approach based on 6-frame translation and subsequently their recurrence between patients (14). Ribo-Seq has proven to be an immensely valuable tool for identifying noncanonical MHC class I–associated peptides (ncMAP) as it provides experimental evidence for their noncanonical translation and MHC class I presentation when combined with MS immunopeptidomics (6, 15, 16). Despite previous efforts to study noncanonical immunopeptidomes, the requirements of such multi-level experimental data (Ribo-seq and/or RNA sequencing) or computational struggles when dealing with large MS databases have hindered their large-scale profiling in a harmonized manner across multiple cancer types from hundreds of samples.

With these considerations in mind, we developed Closed Open De novo – deep immunopeptidomics pipeline (COD-dipp), a pipeline based on deep learning de novo MS to enable the discovery of ncMAPs. Owing to the potential involvement of posttranslational modifications (PTM) in this process (1), we added an open search component for their discovery. We applied COD-dipp to a large-scale dataset using immunopeptidome profiles of over 772 samples from 26 (1, 2, 13, 14, 17–39) published studies and 11 cancer types. We identified a range of PTMs of potential interest from a therapeutic standpoint and tackled the noncanonical immunopeptidome. We validated the de novo identified ncMAPs and controlled their FDR to 1% using a second-round search with tuned PTM parameters, in addition to a series of quality-control steps. Our large-scale analysis revealed 8,601 ncMAPs, accounting for 1.7% of immunopeptidomes. These peptides had varying levels of tumor selectivity, defined by their parent gene expression levels in normal tissues. We suggest 17 ncMAPs as attractive therapeutic targets using a stringent tumor-selectivity cutoff.

Dataset selection

Twenty-four studies were selected on the basis of a list of keywords related to immunopeptidomics (Supplementary Method S1). Low-resolution analyses were eliminated, and MHC class I–related datasets conducted with at least one of the following instruments were kept: Q Exactive, Q Exactive plus/HF/HFX, LTQ Orbitrap Velos, LTQ Orbitrap Elite, Orbitrap Fusion, and Orbitrap Fusion Lumos (Supplementary Table S1). An additional study was considered from the MassIVE (RRID:SCR_013665) database, as it incorporates 95 HLA-A, -B, -C, and -G mono-allelic cell lines (28, 39). An auxiliary immunopeptidomic dataset (38) covering 30 healthy tissues from 21 healthy individuals was also used to partly assess cancer selectivity.

Proteogenomic database generation

Canonical protein database for MS database search

A protein database was downloaded using ENSEMBL r94 BioMart (RRID:SCR_002344); decoy sequences were appended by reversing the target sequences, and 116 contaminant proteins were added (40).

Noncanonical protein database for alignment using BLAST-like alignment tool

A pre-mRNA 3-frame translation (3FT) database was generated from genes with a protein-coding biotype based on ENSEMBL r94 (RRID:SCR_002344) using the AnnotationHub and Biostrings (RRID:SCR_016949) R packages.

COSMIC mutated protein database for BLAST-like alignment tool alignment

COSMIC (RRID:SCR_002260) coding Mutants (41) VCF v95 was downloaded along with ENSEMBL v94 CDS and GTF files. An in-house Python (RRID:SCR_008394) package was used along with the previously mentioned inputs to generate a FASTA file containing the corresponding mutated protein sequences.

MS computational analysis

Algorithms representing three main philosophies of peptide-spectrum matching including open search, de novo sequencing, and closed search were used. The open search approach allowed the identification of distantly related peptides and could identify PTMs and single amino acid variations. The de novo sequencing approach derived sequences from first-principle analysis of the MS2 spectra. The closed search approach, used as a validation step, assumed a specific set of reference protein sequences and allowed for limited PTMs. Although each approach has its own limitations, our strategy addressed them by combining a closed search with a de novo sequencing approach and implementing multiple filtering steps for accuracy control and quality control checkpoints (see Supplementary Fig. S1).

Data conversion

The proprietary RAW files acquired from the selected instruments were converted to mzML and MGF formats using msconvert (ProteoWizard version 3.0.19295. c8b8b470d, RRID:SCR_012056) with the peak-picking and TPP compatibility filters.

Open search analysis

The MSFragger (42) v2.2 search engine was used to conduct an open search analysis against the ENSEMBL r94 protein database in combination with PTMiner (43) v1.1.2, to apply a transfer FDR and a false localization rate of 1% (FLR, the rate of falsely localizing the site of modification). Unspecific cleavage generating peptides 8 to 25 amino acids long with no fixed/variable PTMs was considered. Further analysis revealed that the frequent unexplained mass shifts observed during the open-search annotations were caused by nonspecific cleavage. To address this issue, an open-search postprocessing algorithm, PTMiner, was employed to effectively corrects for mass shifts introduced by in-source fragmentation, nonspecific digestion, or missed cleavage, by adding or deleting amino acids from the peptide N- or C-termini. For instance, a deviation of −128.1 to −128.08 Dalton on lysine residues was frequently detected on the first 2 or last 2 amino acids of peptides. The deviation was caused by nonspecific cleavage during the open search and resulted in an incorrect assignment of a negative mass shift of a lysine due to the presence of an additional lysine in the sequence. As these cases are not biologically meaningful, unexplained mass shifts were removed from the final results of the study.

De novo analysis

DeepNovoV2 (44) is a neural-network-based de novo peptide sequencing model that integrates convolutional neural networks and long short-term memory (LSTM) architectures. This deep-learning design extracts features from both the spectrum and the language of the presented peptides. DeepNovo has demonstrated improved performance compared with the state-of-the-art de novo sequencing algorithms by large margins (44). The model can be tuned on a restricted peptide space to improve its performance. The training, testing, and validation sets were derived from MS-GF+ (v2019.04.18, RRID:SCR_015646) database search results for each sample. The search used the ENSEMBL v94 protein database and 8 to 25 amino acid peptides with unspecific cleavage, no fixed/variable PTMs and an FDR of 1% applied by Scavager (45). The trained models were used to perform de novo (prediction) on the remaining unmatched spectra of each sample (from MS-GF+ after 1% FDR control). Accuracy was calculated by comparing the de novo predicted sequences and MS-GF+ results on the validation set. A de novo score threshold that controlled the accuracy at 90% within the validation set was applied to the predicted sequence in a sample-specific manner.

De novo peptide annotation

De novo peptides from canonical human proteins were identified using BLAST-like alignment tool (BLAT; ref. 46; RRID:SCR_011919) alignment against the target-decoy protein database. Sequences perfectly matching any protein sequence were considered exonic (one mismatch allowed for the isobaric amino acids leucine and isoleucine). All remaining sequences unexplained by proteins were considered potential noncanonical peptides and were aligned against the pre-mRNA 3FT database. Stringently, peptides perfectly matching a 3FT sequence without any mismatch were required to have at least three mismatches with any known protein sequence before being considered noncanonical. Because peptide-spectrum matches (PSM) can be assigned without complete sequencing accuracy, requiring a 3 amino acid difference alongside the 90% accuracy cutoff above increases the confidence that the peptides assigned fall far outside the standard human proteome. Remaining de novo peptides without any canonical or noncanonical annotation were labeled as ‘unmapped peptides’ and discarded.

Second-round search

A second-round search was performed using the FragPipe (40, 42) headless pipeline, which includes MSFragger v3.4, MSBooster (bioRxiv 2022.10.19.512904), and Philosopher (40). Noncanonical peptides from all samples were concatenated with the ENSEMBL v94 protein into a custom database. Only four of the most abundant PTMs were considered to avoid a large search space complexity, inflated FDR, and decreased sensitivity. The following variable PTMs were included: methionine oxidation, N-terminal acetylation, cysteinylation, and cysteine carbamidomethylation (for samples treated with iodoacetamide). Unspecific cleavage generating peptides 7 to 15 amino acids long was considered. The ion, PSM, and peptide-level FDR were maintained at 1%.

Alignment of immunopeptides to the genome

Second-round search noncanonical peptide coordinates were retrieved from the 3FT database FASTA headers and stored in BED format.

Open reading frame analysis

Upstream genomic sequences of ncMAPs were scanned for start codons up to the first encounter with a stop codon. Sequences were centered around the detected start codons and stretches of 100 nucleotides from each side were extracted. Translation initiation site (TIS) scores were predicted for each sequence using TITER (47), a deep-learning-based framework for accurately predicting TIS on a genome-wide scale based on QTI-seq data. A TIS score greater than 0.5 was considered a positive prediction.

IR analysis

For each intron in the UCSC hg38 KnownGene table (RRID:SCR_005780), the first codon coordinates of the corresponding upstream exon in-frame with the canonical translation were extracted and stored in BED format (see Pseudocode 1). Intronic coordinates from the generated BED file were intersected with the ncMAPs BED file using pybedtools (ref. 48; RRID:SCR_021018). Intronic retention events were considered possible when ncMAPs within introns were in-frame with their upstream exons (see Pseudocode 2).

// Pseudo-code 1: extracts the start coordinate of the first in-frame codon for each exon (inframeCoordinate variable)

for each transcript

 remainderValue = θ

 for each exon

  if strand is positive

   if downstream intron exists

leftoverBases = remainder of (ExonEndCoordinate - remainderValue - ExonStart + 1) / 3

    if remainderValue is equal to θ

     inframeCoordinate = ExonStartCoordinate

else

 inframeCoordinate = ExonStartCoordinate – remainderValue

if leftoverBases is greater than θ

 remainderValue = 3 - leftoverBases

addToTable(transcript, chromosome, ExonStart, ExonEnd, inframeCoordinate, IntronStart, IntronEnd)

if strand is negative

 if downstream intron exists

  leftoverBases = remainder of (ExonSart - ExonEndCoordinate + remainderValue + 1) / 3

 if remainderValue is equal to θ

  inframeCoordinate = ExonEndCoordinate

 else

  inframeCoordinate = ExonEndCoordinate + remainderValue

 if leftoverBases is greater than θ

  remainderValue = 3 - leftoverBases

addToTable(transcript, chromosome, ExonStart, ExonEnd, inframeCoordinate, IntronStart, IntronEnd)

// Pseudo-code 2: checks if each intronic ncMAP is in-frame with its upstream exon.

ncMAPIsInFrame = False

if strand is positive

 // firstCoordinate = start coordinate of ncMAP

 // secondCoordinate = start coordinate of the first inframe codon from previous exon

 coordinateDifference = firstCoordinate – secondCoordinate

 if remainder of (coordinateDifference / 3) is equal to θ

  ncMAPIsInFrame = True

else:

 // firstCoordinate = start coordinate of first inframe codon from previous exon

 // secondCoordinate = end coordinate of ncMAP

 coordinateDifference = firstCoordinate – secondCoordinate

 if remainder of (coordinateDifference / 3) is equal to θ

  ncMAPIsInFrame = True

Frameshift mutation analysis

The COSMIC (41) v95 coding mutations (RRID:SCR_002260) VCF file was downloaded and converted into a protein FASTA file using a VCF-to-Proteogenomics toolkit (https://github.com/immuno-informatics/VCFtoProteogenomics) ncMAPs were then aligned to the resulting 16 GB FASTA using BLAT v35 (46). Only hits with exact matches to sequences from frameshift mutations were considered.

Comparison of the identified ncMAP between studies

ncMAPs from 4 different studies (6, 13, 16, 49) were collected. First, sequences were aligned to the human proteome (ENSEMBL v94) using BLAT v35 (46). Sequences found in human proteins were discarded, and the remaining sequences were aligned to the 3FT database with one mismatch allowance for the isobaric amino acids leucine and isoleucine, as allowed for COD-dipp ncMAPs. Genomic coordinates of the sequences found in the 3FT database were extracted and overlapped between studies using the ChIPpeakAnno (50) R package (RRID:SCR_012828). A minimum overlap of 21 nucleotides (7 amino acids) between two sequences was required.

Cancer selectivity of the ncMAP

Tumor specificity has been previously implied when peptide parent genes are either completely absent or present in trace amounts in healthy tissues (6, 14, 16) because MHC class I presentation is preferentially derived from highly abundant transcripts (28, 30). While tumor specificity implies the expression of an antigen solely in tumor samples, the experimental design of this study cannot guarantee this constraint. Instead, cancer-selective ncMAPs were conservatively identified through three iterative steps:

Step 1: panel of normal immunopeptidomes

In addition to the 88 healthy MS samples from the initial set of the 25 considered studies, the HLA Ligand Atlas (38) was used to extend the panel of normal immunopeptidomes and partly assess the cancer selectivity of the 8,601 identified ncMAPs. The HLA Ligand Atlas is a pan-tissue immunopeptidomic reference for 30 healthy tissue types obtained from 21 human subjects. The resulting 334 healthy samples (see Supplementary Table S1) were analyzed in the same manner as in the second-round search (see Second-round search above). ncMAPs identified in the panel of normal immunopeptidomes were labeled as non-cancer selective.

Dimensionality reduction of the HLA-binding motif space:

Binding affinity prediction was employed to identify similarities and differences in HLA-binding motifs among the 65 healthy and 51 tumor-only HLA alleles. NetMHCpan-4.1 was used to evaluate the binding of 1,000,000 random peptides to each allele, which resulted in a binding matrix (BM) of 116 alleles and 1,000,000 peptides. A value of 1 was assigned to strong binders (EL rank ≤ 0.5%) in the BM; otherwise, a value of 0 was assigned. A pairwise cosine distance matrix (DM) was then calculated to assess the similarity of binding between alleles. The DM was then reduced using t-SNE to visualize the data in 2D with a perplexity of 20 and 500 iterations.

Step 2: parental gene expression levels in healthy tissue

The gene expression levels of the identified ncMAPs were retrieved from the GTEx v8 (51) dataset, consisting of 29 tissues from 948 healthy donors and 17,382 overall samples. Considering all individuals, the 90th percentile value of normalized expression was assigned to each gene per tissue as a strict step to guarantee the upper-end gene expression in healthy tissues. A stringent cutoff for cancer selectivity was used to shortlist ncMAPs whose parent genes fell below a 1 TPM expression cutoff (excluding the testis tissue given its immune-privileged status). It is worth noting that this stringent threshold removes 92% of protein-coding genes that show expression above 1 TPM in any tissue within the GTEx v8.

Step 3: protein expression levels in healthy tissue

The protein expression levels of ncMAPs passing the 1 TPM cutoff were retrieved from the Human Protein Atlas V22.0 database (52). ncMAPs without parent protein expression in healthy tissues were labeled as cancer-selective (excluding the testis tissue given its immune-privileged status).

Code availability

The COD-dipp code, intended for high-performance computing, is available on the GitHub repository: https://github.com/immuno-informatics/COD-dipp.

Data availability

The data analyzed in this study were obtained from PRIDE at PXD004746, PXD014017, PXD012308, PXD011628, PXD012083, PXD011766, PXD013057, PXD011723, PXD007203, PXD004233, PXD003790, PXD001898, PXD007860, PXD011257, PXD007935, PXD009749, PXD009753, PXD009750, PXD009751, PXD009752, PXD009754, PXD009755, PXD004023, PXD007596, PXD009531, PXD010808, PXD008937, PXD009738, PXD006939, PXD005231, PXD000394, PXD004894, PXD019643 and from massIVE at MSV000080527, MSV000084172, MSV000084442. The results of this study are available within the article and its supplementary data files and are accessible on the following figshare repository: https://doi.org/10.6084/m9.figshare.16538097.

Immunopeptidomic MS datasets

We selected 25 immunopeptidomic MS studies (see Supplementary Table S1) to create a cancer-centered dataset of MHC class I presentation. Data-dependent acquisition (DDA) studies covered eleven cancer types distributed across the brain (Glioblastoma and Meningioma), lung, skin, liver, blood (Leukemia and Lymphoma), colon, ovaries, kidneys, and breast. Moreover, tumor and healthy samples were derived from either cell lines or patient tissues (Fig. 1A and Supplementary Method S1). We selected publicly available studies with data generated using high-resolution MS instruments (LTQ Orbitrap, Q Exactive Plus/HF/HFX, and Fusion Lumos) to minimize the bias associated with older tandem MS instrumentation (Fig. 1B). Within our dataset, the most commonly used monoclonal antibody for HLA class I immunoprecipitation (IP) was W6/32 in comparison with the other antibodies (BB7.2 and G46–2.6; Fig. 1C, see Supplementary Table S1). The selected studies covered five different HLA class I genes, with HLA-A, B, and C being the most studied compared with HLA-E and -G (Fig. 1D). Furthermore, the included MS samples covered 114 HLA alleles (Fig. 1E).

Figure 1.

Infographics of immunopeptidomic datasets included in this study. A, Different types of cancer considered in this study with the number of samples and sample types per cancer type. B, Proportions of different MS instruments used in this study. C, Antibodies used for IP. D, Overall count of HLA alleles per HLA gene. E, Overall count of MS immunopeptidomic samples per HLA allele.

Figure 1.

Infographics of immunopeptidomic datasets included in this study. A, Different types of cancer considered in this study with the number of samples and sample types per cancer type. B, Proportions of different MS instruments used in this study. C, Antibodies used for IP. D, Overall count of HLA alleles per HLA gene. E, Overall count of MS immunopeptidomic samples per HLA allele.

Close modal

COD-dipp

We present COD-dipp, an open-source high-throughput pipeline with novel postprocessing steps, to deeply interrogate immunopeptidomic datasets (Fig. 2). We used this pipeline to screen for ncMAPs in datasets using DDA due to its widespread use. To identify posttranslationally modified MHC class I–associated peptides (ptmMAP), we performed an open-search analysis with MSFragger (42) and controlled both FDR and the FLR to 1% with PTMiner (43). To identify ncMAPs, we used DeepNovoV2 (44) for de novo analysis. In combination with the PSM level information of MS-GF+ (53), DeepNovoV2 was trained to interpret the raw MS data in a sample-specific manner. The training step for such a deep learning approach is crucial for learning the features of tandem mass spectra, fragment ions, and leveraging sequence patterns in the immunopeptidome to impute missing MS2 fragments. All high-quality de novo peptides (90% accuracy) were sequentially mapped (46) to the human reference proteome to reveal the de novo-based canonical MHC class I–associated peptides, and to a 3FT database to reveal the de novo-based ncMAPs. Finally, an orthogonal validation step was performed by a second-round search to control a 1% FDR for the de novo identified ncMAPs while considering the most abundant PTMs found by the open-search strategy. Applying the COD-dipp pipeline across the dataset revealed the breadth of ptmMAPs and ncMAPs.

Figure 2.

COD-dipp: A new high-throughput pipeline for a deep interrogation of immunopeptidomic datasets. Samples are first analyzed with an open search strategy to detect the landscape of PTMs. An FLR for the PTMs and FDR of 1% are applied. Simultaneously, the samples are analyzed using a novel de novo approach to identify noncanonical peptides. The de novo strategy trains a model per sample using quality-controlled PSMs from the MS-GF+ search engine to learn the direct interpretation of sample-specific mass spectra. The MS-GF+ results are split into three groups: training and testing to tune the hyperparameters and account for overfitting, and a validation group to approximate the accuracy per sample. De novo predicted peptides with an accuracy of at least 90% are sequentially mapped against the Human proteome (HP) then a 3-frame translation (3FT) database of protein-coding genes (1 mismatch allowed between leucine/isoleucine, i.e., Xle). Predicted de novo peptides matching any known protein are labeled “canonical”. Peptides mapping to the 3FT database with at least 3 amino acids mismatches from any known protein sequence are labeled “noncanonical”. Finally, a second-round search is performed as a validation approach. Four of the most abundantly identified PTMs and a custom database consisting of ENSEMBL proteins and noncanonical peptides are considered. The resulting canonical and noncanonical peptides are controlled to an FDR of 1% and aligned to the hg38 human genome.

Figure 2.

COD-dipp: A new high-throughput pipeline for a deep interrogation of immunopeptidomic datasets. Samples are first analyzed with an open search strategy to detect the landscape of PTMs. An FLR for the PTMs and FDR of 1% are applied. Simultaneously, the samples are analyzed using a novel de novo approach to identify noncanonical peptides. The de novo strategy trains a model per sample using quality-controlled PSMs from the MS-GF+ search engine to learn the direct interpretation of sample-specific mass spectra. The MS-GF+ results are split into three groups: training and testing to tune the hyperparameters and account for overfitting, and a validation group to approximate the accuracy per sample. De novo predicted peptides with an accuracy of at least 90% are sequentially mapped against the Human proteome (HP) then a 3-frame translation (3FT) database of protein-coding genes (1 mismatch allowed between leucine/isoleucine, i.e., Xle). Predicted de novo peptides matching any known protein are labeled “canonical”. Peptides mapping to the 3FT database with at least 3 amino acids mismatches from any known protein sequence are labeled “noncanonical”. Finally, a second-round search is performed as a validation approach. Four of the most abundantly identified PTMs and a custom database consisting of ENSEMBL proteins and noncanonical peptides are considered. The resulting canonical and noncanonical peptides are controlled to an FDR of 1% and aligned to the hg38 human genome.

Close modal

ptmMAPs

The open search analysis reported 4.03% of the MS spectra showing PTMs (Fig. 3A). Some identified PTMs were confirmatory, representing chemical modifications from sample preparation methods (cysteine carbamidomethylation) or common chemical derivatives (methionine oxidation and di-oxidation). We also observed PTMs that are extremely common in proteins, such as protein N-terminal acetylation, affecting multiple properties such as half-life time, folding, and interaction. On the other hand, some of the identified PTMs have been reported previously to increase immunogenicity against diseases (54) and protect against degradation [tri-oxidation of cysteine (55), cysteinylation (56), and N-term serine acetylation, see Fig. 3B and Supplementary Table S2]. Furthermore, 1.12% of spectra from open search showed unknown mass shifts, as illustrated in Fig. 3A (green and red). This category was partly populated by computational artifacts and was excluded from the final results. To validate these findings, we performed an independent post-search by crosschecking the identifications from our open search with those of the original studies. The results showed 96.1% agreement in PSMs, which are detailed in Supplementary Method S2: Validation 1 and Supplementary Fig. S2.

Figure 3.

Landscape of posttranslationally modified and ncMAPs. Open search: A, Overview of PTMs identified by open search (blue: spectra without PTMs, orange: spectra with a known UNIMOD PTM localized on a specific amino acid on the peptide. Green: The mass shift is localized, however the known PTM options do not fit the modified residue. Red: Otherwise). B, Most abundant “annotated PTMs” grouped by type. Second-round search: C, Fraction of canonical (dark gray) and noncanonical (light gray) MAPs in the immunopeptidome. D, Proportion of canonical (dark gray) and noncanonical (light gray) MAPs with/without PTMs. E, Fraction of binders versus nonbinders for both canonical and noncanonical MAPs using NetMHCpan 4.1.

Figure 3.

Landscape of posttranslationally modified and ncMAPs. Open search: A, Overview of PTMs identified by open search (blue: spectra without PTMs, orange: spectra with a known UNIMOD PTM localized on a specific amino acid on the peptide. Green: The mass shift is localized, however the known PTM options do not fit the modified residue. Red: Otherwise). B, Most abundant “annotated PTMs” grouped by type. Second-round search: C, Fraction of canonical (dark gray) and noncanonical (light gray) MAPs in the immunopeptidome. D, Proportion of canonical (dark gray) and noncanonical (light gray) MAPs with/without PTMs. E, Fraction of binders versus nonbinders for both canonical and noncanonical MAPs using NetMHCpan 4.1.

Close modal

ncMAPs

We explored the ncMAP landscape in cancer using our workflow (Fig. 2) and identified 10,413 unique de novo–based ncMAPs from intragenic noncoding regions (before the second-round search validation), which accounted for 3.7% of the identified de novo sequences. We took two additional validation steps, including checking the identification scores as well as the correlation between the experimental and theoretical liquid chromatography retention times, to guarantee the correctness of these identifications (see Supplementary Method S2: Validation 2 and 3, and Supplementary Fig. S2). The de novo noncanonical peptides showed strong evidence of high-quality identification (i.e., correctly predicted complete peptide sequences). Even with this strong evidence, it was possible that chromatic behavior remained unchanged in certain instances where neighboring amino acids were in flipped positions, or that a 90% accuracy rate still led to an uncertain FDR percentage. Hence, we confirmed the identified 10,413 de novo–based ncMAPs by performing a second-round search for additional validation and controlling the FDR at 1%. Several PTMs were also considered in the parameters from the a priori knowledge provided by the open search strategy. Of the 516,382 uniquely identified peptides in the second-round search, 1.7% (8, 601) were noncanonical (Fig. 3C and Supplementary Table S3). The PTM profiles (Fig. 3D) of canonical (dark gray) and noncanonical (light gray) peptides appeared to be similar, with M oxidation being the most prevalent modification. The identified ncMAPs showed comparable spectra from patients within the same studies and from different studies (Supplementary Figs. S3, S4, and S5 provide examples of such similarities). The binding affinities of all 8,601 ncMAPs resulting from the second-round search were further investigated using NetMHCpan 4.1 (57). The binding prediction analysis showed a comparable binding rate for both the canonical (90%) and noncanonical (93%) MAPs, as depicted in Fig. 3E. We further took four additional independent post-search validation steps, including checking retention time shifts induced by PTMs, mass accuracy, and spectra comparison with those of the original studies, guaranteeing the correctness of the ncMAPs identified by the second-round search (see Supplementary Method S2: Validation 4, 5, 6, and 7, and Supplementary Fig. S2).

Comparison of COD-dipp ncMAPs with the literature

To assess the performance of our COD-dipp method, we conducted a comparison with the results of peptide-PRISME by Erhard and colleagues 2020 (49). Our comparison was based on three common studies (1, 14, 34) and resulted in 3,453 at 1% FDR from COD-dipp along with 4,576 ncMAPs at 10% FDR from Erhard and colleagues. We first aligned Erhard and colleagues ncMAPs to the human proteome and eliminated a small fraction (1.4%) that matched the canonical protein sequences (Fig. 4A, left-hand side). Because the COD-dipp ncMAPs were restricted to the 3FT of protein-coding genes, we aligned the remaining ncMAPs from Erhard and colleagues to the same 3FT database for comparison purposes. Fig. 4A (left-hand side) shows that 68.25% of ncMAPs were successfully mapped to the 3FT database. The rest (30.35%) that did not align to any of the human proteome or the 3FT database are shown in yellow on Fig. 4A left-hand side. This unmapped fraction consisted of ncMAPs from regions of the genome not studied herein, such as intergenic regions, antisense translation, etc. The successfully mapped fraction to the 3FT database (navy) of 3,123 ncMAPs along with 3,453 ncMAPs from COD-dipp were then compared, as shown in Fig. 4A right-hand side (see Supplementary Table S4). peptide-PRISME shared 38% (1,197) of its ncMAPs (intersection) with COD-dipp (Fig. 4A right-hand side) and showed 62% (1,926) of exclusive ones. Adjusting the higher FDR used by peptide-PRISME from 10% to 1% increased the shared fraction to 48.9% (Fig. 4B), along with a ∼2.4-fold decrease in total ncMAPs (from 4,576 to 1,916). At an FDR of 1%, COD-dipp identified 2.34 times more exclusive ncMAPs (2,298 vs. 979) from the 3FT of protein-coding genes.

Figure 4.

Comparison of COD-dipp ncMAPs with other studies. Because the COD-dipp ncMAPs are restricted to the 3-frame translation (3FT) of protein-coding genes, sequences from the literature were aligned to the same 3FT database for comparison purposes. The intersection is based on genomic coordinates to deal with sequences that partially match (i.e., longer, shorter, or partially overlapping). Because the Venn is generated by overlapping genomic coordinates of the ncMAPs, the original counts for each study are listed from left to right (i.e., on the right-hand side of panel C, the notation 29/41 refers to 29 instances for Chong and colleagues 2020 and 41 for COD-dipp). A, Comparison with peptide-PRISM published ncMAPs at a 10% FDR. COD-dipp ncMAPs were restricted to 3 studies in common with Erhard and colleagues 2020. B, Comparison with peptide-PRISM published ncMAPs at a 1% FDR. COD-dipp ncMAPs were restricted to 3 studies in common with Erhard and colleagues 2020. C, Comparison of the atlas of ncMAPs revealed by COD-dipp to 3 previous studies.

Figure 4.

Comparison of COD-dipp ncMAPs with other studies. Because the COD-dipp ncMAPs are restricted to the 3-frame translation (3FT) of protein-coding genes, sequences from the literature were aligned to the same 3FT database for comparison purposes. The intersection is based on genomic coordinates to deal with sequences that partially match (i.e., longer, shorter, or partially overlapping). Because the Venn is generated by overlapping genomic coordinates of the ncMAPs, the original counts for each study are listed from left to right (i.e., on the right-hand side of panel C, the notation 29/41 refers to 29 instances for Chong and colleagues 2020 and 41 for COD-dipp). A, Comparison with peptide-PRISM published ncMAPs at a 10% FDR. COD-dipp ncMAPs were restricted to 3 studies in common with Erhard and colleagues 2020. B, Comparison with peptide-PRISM published ncMAPs at a 1% FDR. COD-dipp ncMAPs were restricted to 3 studies in common with Erhard and colleagues 2020. C, Comparison of the atlas of ncMAPs revealed by COD-dipp to 3 previous studies.

Close modal

To contextualize our findings from COD-dipp within the existing literature on ncMAPs, we compared our results with those of three previous studies: (i) Laumont and colleagues 2016 (13), (ii) Chong and colleagues 2020 (6), and (iii) Ouspenskaia and colleagues 2021 (16), as shown in Fig. 4C. We used the same mapping procedure that was applied to peptide-PRISME results. We eliminated a fraction of sequences mapping to known proteins, which was 4%, 5%, and 3% of sequences for Chong and colleagues 2020, Laumont and colleagues 2016, and Ouspenskaia and colleagues 2021, respectively (see Fig. 4C left-hand side). Fig. 4C left-hand side shows in navy the fractions of ncMAPS that were successfully mapped to the 3FT database, which was 34.38% for Chong and colleagues 2020, 63.69% for Laumont and colleagues 2016, and 72.74% for Ouspenskaia and colleagues 2021. The remaining ncMAPs that did not align (Fig. 4C left-hand side in yellow) to any of the human proteome or the 3FT database originate from sources not studied herein. For instance, Laumont and colleagues 2016 included 6-frame translation in their MS search database, which accounts for intergenic regions, antisense translation, long noncoding RNA, and retroelement sources. Both Chong and colleagues 2020 and Ouspenskaia and colleagues 2021 added Ribo-Seq detected proteins to their MS database searches, accounting for all possible nORFs, even those outside of known genes. The fractions successfully mapped to the 3FT database (navy) from these three studies, along with the 8,601 ncMAPs from COD-dipp, were then compared, as shown in Fig. 4C right-hand side (Supplementary Table S4). Intersections with COD-dipp were 31.42% for Chong and colleagues 2020, 38.3% for Ouspenskaia and colleagues 2021, and 45.8% for Laumont and colleagues 2016, respectively. In contrast, intersections with all other studies were 40% for Chong and colleagues 2020, 38.66% for Ouspenskaia and colleagues 2021, and 65.93% for Laumont and colleagues 2016. Hence, COD-dipp ncMAPs alone accounted for 78.55% of Chong and colleagues 2020’s intersection, 96.07% of Ouspenskaia and colleagues 2021’s intersection, and 69.47% of Laumont and colleagues 2016’s intersection. COD-dipp ncMAPs accounted, on average, for 81.36% of the intersection when comparing three previously published ncMAP sets, thus validating our approach. With 2,168 ncMAPs (25%) shared with the literature and 6,433 new ncMAPs, we have revealed an atlas of noncanonical MHC class I presentation.

Properties and origins of ncMAPs

We compared the sequence lengths of canonical and noncanonical MAPs (Fig. 5A) and found them to be similar, with a slight skew of the noncanonical category toward longer lengths. This could be due either to an actual preference of ncMAPs toward longer sequence lengths or simply the consequence of requiring 3 amino acid differences from any known proteins favoring longer sequences. Next, we inspected ncMAPs according to their relative positions within protein-coding genes (Fig. 5B). Exonic regions translated in alternative frames were the main source of ncMAPs (19.2%). These events could arise from frameshift mutations, initiation codon readthrough (58), nORFs, or ribosomal slippage (11) during translation (i.e., ribosome frameshifting). Intronic regions were the second most abundant source of ncMAPs (12.2%). These events can arise from frameshift mutations, nORFs, or IR. Interestingly, 5′-UTRs contributed to 10.2% of ncMAPs and have been shown to produce translation products through upstream ORFs along with a non-AUG start codon (59). Finally, 3′-UTRs contributed the least toward ncMAPs (3.2%), potentially through stop codon read-through (60). It is important to note that these categorizations are not mutually exclusive and that an ncMAP sequence may have multiple assignments due to the overlapping nature of transcripts. We conducted three analyses to estimate how well the nORFs (i), IR (ii), and frameshift mutations (iii) could explain the detected ncMAPs. (i) ncMAPs with upstream start codons (AUG, CUG, UUG, GUG, and ACG) accounted for 63.4%, and 41.5% were predicted to be TIS using TITER (Fig. 5C, left-hand side; ref. 47). The breakdown of the TIS start codon distribution (Fig. 5C, right-hand side) showed CUG (L) as the most abundant nORF start codon, and 70% of the predicted TIS showed non-AUG start codons, in line with previous findings (15). (ii) Translation frames of ncMAPs from intronic regions were checked for compatibility with upstream exons, and 49.4% were found in-frame with upstream exons, making IR events a possible source (Fig. 5D). (iii) A total of 597 ncMAPs were found in aberrant proteins from frameshift mutations in cancer (Fig. 5E and Supplementary Table S5; ref. 41). Eventually, 70.1% of ncMAPs were explicable by novel ORFs, IR, or frameshift mutations (Fig. 5F). ncMAPs were found to be presented by all 113 alleles in our dataset, except for the HLA-C*07:17 allele, mostly because of low sample coverage by MS for this allele (see Supplementary Fig. S6). Furthermore, the average noncanonical presentation per HLA supertype (61) was 1%, except for A03, which was 5% (see Supplementary Fig. S6).

Figure 5.

Origins of ncMAPs. A, Peptide length distribution of canonical (dark gray) and noncanonical (light gray) MAPs. B, Annotation of ncMAPs across gene features. C, Analysis of ncMAPs that could originate from nORF. Upstream start codons of noncanonical MAPs are analyzed for their potential to initiate translation and produce ORFs (left-hand side) as a source of ncMAPs. The frequencies of different start codons for positively predicted TIS are shown on the right-hand side. D, Analysis of ncMAPs from intronic regions that may originate from IR events. Translation of MAPs from IR sources should be in-frame with the corresponding upstream exons. E, Analysis of ncMAPs that could originate from frameshift mutations in cancer. ncMAPs are aligned to an in-silico translated protein database of COSMIC somatic frameshift mutations. F, Summary indicating whether the ncMAPs can be accounted for by any of the analyses conducted in panels C, D, or E.

Figure 5.

Origins of ncMAPs. A, Peptide length distribution of canonical (dark gray) and noncanonical (light gray) MAPs. B, Annotation of ncMAPs across gene features. C, Analysis of ncMAPs that could originate from nORF. Upstream start codons of noncanonical MAPs are analyzed for their potential to initiate translation and produce ORFs (left-hand side) as a source of ncMAPs. The frequencies of different start codons for positively predicted TIS are shown on the right-hand side. D, Analysis of ncMAPs from intronic regions that may originate from IR events. Translation of MAPs from IR sources should be in-frame with the corresponding upstream exons. E, Analysis of ncMAPs that could originate from frameshift mutations in cancer. ncMAPs are aligned to an in-silico translated protein database of COSMIC somatic frameshift mutations. F, Summary indicating whether the ncMAPs can be accounted for by any of the analyses conducted in panels C, D, or E.

Close modal

Cancer selectivity of ncMAPs

Of the 8,601 identified ncMAPs, 2,758 were detected in the panel of normal healthy tissue by MS and were labeled as noncancer–selective. The panel of normals originally consisted of healthy MS samples from all considered studies, which we extended by adding the HLA Ligand Atlas (38), a pan-tissue immunopeptidomic reference of 30 healthy tissue types obtained from 21 human subjects. Fig. 6A shows the ability of the extended panel of normals to capture several more ncMAPs (12.85%) in healthy tissues that were not observed in our original panel of normals (19.22%). We assessed the coverage of tumor-only HLA alleles in healthy samples using the panel of normal samples. The 334 healthy samples covered 53% of the HLA alleles expressed in the tumor samples. Analysis of a subset of ncMAPs represented by the 57 shared alleles (i.e., present in both healthy and tumor samples) revealed a substantial overlap in HLA-binding motifs between the panel of normal samples and other samples. This was demonstrated by the majority of identified ncMAPs being retained (7,513 of 8,601) and a comparable percentage of ncMAPs being detected in healthy samples through MS (36.46% with shared alleles versus 32% with all alleles; see Supplementary Fig. S7). To better understand the similarity between the HLA-binding motifs of the alleles represented in tumor-only samples and those represented in healthy samples, we generated a matrix of cosine distances of binding affinities and used t-SNE to reduce the dimensionality and visualize the data. Our results indicated a high level of similarity between the two, further supporting the notion that the 65 alleles in the panel of normal samples were representative of the tumor-only alleles (Supplementary Fig. S7).

Figure 6.

Cancer selectivity of ncMAPs. A, Percentage of ncMAPs that were solely in healthy and/or tumor samples by MS (blue) and ncMAPs undetected in healthy samples by MS (red). B, Parent gene expression of ncMAPs in TPM in 29 healthy tissues from 17,382 samples (GTEx v8 dataset). ncMAPs are distributed over two groups: ncMAPs detected in healthy samples by MS in blue, and ncMAPs undetected in healthy samples by MS in red. C, Parent gene expression of ncMAPs in TPM in 29 healthy tissues from 17,382 samples (GTEx v8 dataset). A limit on the gene expression (y-axis) of 1.2 TPM was set to visualize cancer-selective ncMAPs in black.

Figure 6.

Cancer selectivity of ncMAPs. A, Percentage of ncMAPs that were solely in healthy and/or tumor samples by MS (blue) and ncMAPs undetected in healthy samples by MS (red). B, Parent gene expression of ncMAPs in TPM in 29 healthy tissues from 17,382 samples (GTEx v8 dataset). ncMAPs are distributed over two groups: ncMAPs detected in healthy samples by MS in blue, and ncMAPs undetected in healthy samples by MS in red. C, Parent gene expression of ncMAPs in TPM in 29 healthy tissues from 17,382 samples (GTEx v8 dataset). A limit on the gene expression (y-axis) of 1.2 TPM was set to visualize cancer-selective ncMAPs in black.

Close modal

However, the lack of ncMAP detection in the panel of normals does not confirm cancer selectivity owing to the sensitivity limitations of MS. Proper cancer selectivity assessment should be performed at the gene expression level in healthy tissues. Hence, we retrieved the parent gene expression values (in TPM) of the remaining ncMAPs from the Genotype-Tissue Expression project (GTEx v8; ref. 51). We first compared the gene expression levels of the following two groups: ncMAPs detected in the panel of normals by MS (blue), and remaining ncMAPs without detection in the panel of normals (red). Fig. 6B shows significantly higher gene expression for ncMAPs detected in healthy tissues (blue) than for those that were left undetected (red). Moreover, to ensure low toxicity levels in normal tissues, we filtered ncMAPs to retain those with parent genes expressed below 1 TPM and without evidence of protein expression in any healthy tissue except the testis (immune-privileged site; Fig. 6C). By applying this stringent filter, we identified 24 ncMAPs derived from genes not expressed or expressed only in trace amounts in healthy tissues. Of these, 17 were associated with proteins not detected in healthy tissues. Table 1 provides a summary of these 17 cancer-selective ncMAPs, which we suggest as promising targets for clinical applications (see Supplementary Table S3 for more details).

Table 1.

List of cancer-selective noncanonical MHC-associated peptides.

IDPeptideGene nameMean expression in healthy tissues (TPM)Number of healthy tissues with protein expressionAnnotation
AFAPFPTQF CXorf49B 0.01 0 of 56 Cancer selective 
AFAPFPTQF CXorf49 0.01 0 of 56 Cancer selective 
AFAPFPTQF RP11–402P6.15 0.10 0 of 56 Cancer selective 
DYIHFVHHF RP11–325B23.2 0.00 0 of 56 Cancer selective 
EALSASQALYTR HIST1H4L 0.04 43 of 56  
ELIKAFSK GNGT1 0.05 1 of 56  
ESAGLFQVPR SUN3 0.13 3 of 56  
EVEKILIQY KCNU1 0.05 0 of 56 Cancer selective 
EVPGAQGQQGPR CTAG2 0.15 0 of 56 Cancer selective 
EVPGAQGQQGPR CTAG1B 0.03 0 of 56 Cancer selective 
EVPGAQGQQGPR CTAG1A 0.06 0 of 56 Cancer selective 
FPVDVDHAVL CTAG2 0.15 0 of 56 Cancer selective 
FPVDVDHAVL CTAG1B 0.03 0 of 56 Cancer selective 
FPVDVDHAVL CTAG1A 0.06 0 of 56 Cancer selective 
ILSDNIRNL C1orf94 0.14 0 of 56 Cancer selective 
10 IPKDKSKNK C2orf83 0.02 0 of 56 Cancer selective 
11 KLLELIKAFSK GNGT1 0.05 1 of 56  
12 KNNIYAFKI RP11–231I13.2 0.01 0 of 56 Cancer selective 
13 KTLHLTIVK C12orf50 0.07 0 of 56 Cancer selective 
14 KYLSRFRPK TRPC5 0.08 0 of 56 Cancer selective 
15 MVRSPEEGSLR TEX19 0.13 0 of 56 Cancer selective 
16 MVRSVSAAAR HIST1H2BB 0.26 44 of 56  
17 MVRSVSAAARR HIST1H2BB 0.26 44 of 56  
18 REEAPRGVRM CTAG2 0.15 0 of 56 Cancer selective 
18 REEAPRGVRM CTAG1B 0.03 0 of 56 Cancer selective 
18 REEAPRGVRM CTAG1A 0.06 0 of 56 Cancer selective 
19 SAGLFQVPR SUN3 0.13 3 of 56  
20 SQVHKFFLL OR9Q1 0.04 0 of 56 Cancer selective 
21 SYGIYIYTY SLC15A5 0.06 0 of 56 Cancer selective 
22 TVSHQIIFY EXD1 0.06 0 of 56 Cancer selective 
23 VIQKVILVV MGAT4D 0.03 0 of 56 Cancer selective 
24 YYFILEHAKY SOX30 0.29 0 of 56 Cancer selective 
IDPeptideGene nameMean expression in healthy tissues (TPM)Number of healthy tissues with protein expressionAnnotation
AFAPFPTQF CXorf49B 0.01 0 of 56 Cancer selective 
AFAPFPTQF CXorf49 0.01 0 of 56 Cancer selective 
AFAPFPTQF RP11–402P6.15 0.10 0 of 56 Cancer selective 
DYIHFVHHF RP11–325B23.2 0.00 0 of 56 Cancer selective 
EALSASQALYTR HIST1H4L 0.04 43 of 56  
ELIKAFSK GNGT1 0.05 1 of 56  
ESAGLFQVPR SUN3 0.13 3 of 56  
EVEKILIQY KCNU1 0.05 0 of 56 Cancer selective 
EVPGAQGQQGPR CTAG2 0.15 0 of 56 Cancer selective 
EVPGAQGQQGPR CTAG1B 0.03 0 of 56 Cancer selective 
EVPGAQGQQGPR CTAG1A 0.06 0 of 56 Cancer selective 
FPVDVDHAVL CTAG2 0.15 0 of 56 Cancer selective 
FPVDVDHAVL CTAG1B 0.03 0 of 56 Cancer selective 
FPVDVDHAVL CTAG1A 0.06 0 of 56 Cancer selective 
ILSDNIRNL C1orf94 0.14 0 of 56 Cancer selective 
10 IPKDKSKNK C2orf83 0.02 0 of 56 Cancer selective 
11 KLLELIKAFSK GNGT1 0.05 1 of 56  
12 KNNIYAFKI RP11–231I13.2 0.01 0 of 56 Cancer selective 
13 KTLHLTIVK C12orf50 0.07 0 of 56 Cancer selective 
14 KYLSRFRPK TRPC5 0.08 0 of 56 Cancer selective 
15 MVRSPEEGSLR TEX19 0.13 0 of 56 Cancer selective 
16 MVRSVSAAAR HIST1H2BB 0.26 44 of 56  
17 MVRSVSAAARR HIST1H2BB 0.26 44 of 56  
18 REEAPRGVRM CTAG2 0.15 0 of 56 Cancer selective 
18 REEAPRGVRM CTAG1B 0.03 0 of 56 Cancer selective 
18 REEAPRGVRM CTAG1A 0.06 0 of 56 Cancer selective 
19 SAGLFQVPR SUN3 0.13 3 of 56  
20 SQVHKFFLL OR9Q1 0.04 0 of 56 Cancer selective 
21 SYGIYIYTY SLC15A5 0.06 0 of 56 Cancer selective 
22 TVSHQIIFY EXD1 0.06 0 of 56 Cancer selective 
23 VIQKVILVV MGAT4D 0.03 0 of 56 Cancer selective 
24 YYFILEHAKY SOX30 0.29 0 of 56 Cancer selective 

Note: The mean parent gene expression in TPM was derived from 29 healthy tissues from the GTEx v8 dataset. The number of healthy tissues with protein expression was obtained from the Human Protein Atlas v22.0.

The cartography of noncanonical antigen presentation revealed in our study arose from a harmonized large-scale analysis of immunopeptidomic data mapped to the human genome. Our innovations over the most recent trends in computational MS identified a diversity of peptides mapping to canonical and noncanonical translation products. We mapped deviations away from the reference proteome as mass shifts (PTMs) and applied a sequential approach to tackle the noncanonical immunopeptidome. Our proteogenomic pipeline allowed the identification of thousands of ncMAPs (8,601) derived from noncoding regions of protein-coding genes with an FDR of 1%. This was accomplished by analyzing a large collection of publicly available studies using COD-dipp, a highly modular large-scale pipeline that bypasses the challenge of multi-omics requirements and large MS databases when identifying ncMAPs.

Recent studies have suggested that the immunopeptidome is rich in PTMs (62), which can have profound effects on immune tolerance. T cells can discriminate between modified and nonmodified epitopes, which has been demonstrated in the case of ubiquitination (63), glycosylation (64), phosphorylation (1, 65). T-cell reactivity to PTMs is an effect of their central tolerance escape from the thymus (66). PTMs may also alter proteolytic activity, and consequently, peptide presentation by the MHC system (67). The open-search component sheds light on several PTMs implicated in immunogenicity (serine N-terminal acetylation, cysteinylation, and cysteine tri-oxidation) and could provide insights for future studies on PTM-based epitopes. For instance, tri-oxidation of cysteine has the potential to alter the immune response (55); however, its mechanism of interaction with HLA molecules and T cells is still in its infancy. Additionally, T cells can discriminate between cysteinylated and unmodified cysteine residues (56). Likewise, N-terminal serine acetylation is known for multifunctional regulation, acting as a protein degradation signal, inhibitor of endoplasmic reticulum (ER) translocation, and mediator of protein complex formation. Methionine sulfone (methionine dioxidation) has been found to occur in vivo in Proteus mirabilis (68), a Gram-negative bacterium present in malignant cancers (69), although it can result from the use of a strong oxidizing agent.

The validity of ncMAPs was rigorously tested using retention time correlation (experimental vs. theoretical), orthogonal second-round search, mass accuracy, PTM retention time shifts, HLA binding prediction, and PSM comparison with previously published results. Twenty-five percent of the identified ncMAPs accounted, on average, for 81.36% of intersections when compared with three other high-profile studies (6, 13, 16). In addition, COD-dipp revealed 6,433 new ncMAPs from protein-coding genes. Considering the high-quality and rigorous computational validation, the identification rate discrepancy is partly due to the performance of COD-dipp and the size of our dataset collection, making it the most exhaustive noncanonical library of MHC class I–associated peptides.

Our survey of the possible sources of ncMAPs revealed that 70.1% could be attributed to nORFs, IR, or frameshift mutations. We identified 597 ncMAPs downstream of known frameshift mutations in COSMIC, an understudied source of antigens in immunopeptidomic studies. Certainly, other biological processes not accounted for in this study could generate ncMAPs. For instance, mechanisms such as ribosomal slippage (11) and stop codon readthrough could explain some of the remaining ncMAPs (29.9%).

This study focuses on peptides from noncoding regions of the genome, referred to as noncanonical peptides. Unlike neoantigens, which derive from patient-specific mutations in cancer, these noncanonical peptides are not mutated and are present in both cancer and healthy individuals. Although their presence in healthy samples makes their tumor specificity less clear, noncanonical peptides tend to be more abundant in cancer cells than in healthy cells. Over two decades ago, Ishii and colleagues (70) purified an octamer noncanonical antigen (IPGLPLSL or pRL1a) associated with heat shock proteins (HSP) and validated their findings using MS. The isolated octamer noncanonical antigen pRL1a was derived from the 5′-untranslated region of the AKT gene in leukemia and induced tumor rejection. To the best of our knowledge, this was the first demonstration of a noncanonical antigen that confers immunity. Subsequent studies have shown that HSPs are beneficial for anticancer vaccines (71) because they bind canonical/noncanonical antigens with tumor rejection properties that end up being presented by MHC I and II molecules (72).

Numerous studies have suggested various possible candidates for cancer vaccines over the past two to three decades, and each has failed, at least partly, due to the issue of specificity. We used a conservative definition of cancer selectivity that follows three iterative steps. We searched for the identified ncMAPs over a panel of 334 normal MS samples and confirmed a fraction (32%) of noncancer–selective ncMAPs. The remaining fraction (5,843, 68%) contained both cancer-selective ncMAPs and noncancer–selective ncMAPs that were not detected by MS. We used the expression levels of the ncMAPs’ parent genes across 29 healthy tissues as a means of prioritization (6, 14, 16). ncMAPs whose parent genes were expressed in any normal tissue above a threshold of 1 TPM were not considered cancer selective. However, we caution that this definition excludes the consideration of 92% of protein-coding genes. We revealed 17 rigidly defined candidates as cancer-selective ncMAPs, originating from genes and proteins that were completely absent or available in trace amounts in healthy tissues. We hope that this offers a sufficiently stringent approach to reducing toxicity in clinical applications. We provide a complete breakdown of all detected ncMAPs in Supplementary Table S3. We report the parent gene and protein expression values across healthy tissue types from the GTEx cohort and Human Protein Atlas, respectively. Moreover, we report their cancer-selectivity status conditioned on a gene expression cutoff (1 TPM) and lack of protein expression in healthy tissues. This will allow the research community to make decisions regarding the peptides that should be retained or removed from their analyses. It is particularly important that we do not filter all peptides, as aberrant intron-retention and frame-shift mutations that are certainly cancer-specific may lie within these results and would not need this stringent filtering if found in subsequent studies.

Here, we provide a free and open-source informatics pipeline to study noncanonical peptides, along with a reservoir of potential targets that could be used in combination with T-cell therapies or cancer vaccines. We anticipate that this will help pave the way for future research on antigens from noncanonical sources and engage further oncology research on alternative sources of antigens.

We acknowledge that our study presents several limitations. First, our approach relies on a DDA MS, which is known for its dynamic range limitations. Thus, only the most abundant ncMAPs were identified. Moreover, owing to the technical limitations of MS, we require that our ncMAPs be at least 3 amino acids different from any known human protein. Thus, a substantial fraction could be eliminated, leading to underestimation of the noncanonical fraction. Second, because immunogenicity prediction is still in its infancy, the identified ncMAPs require further validation to qualify as tumor rejection–mediating antigens for clinical applications. Despite our efforts to identify cancer-selective targets, the toxicity of these peptides in healthy tissues requires further investigation.

D.J. Harrison reports grants, personal fees, and other support from Nucana PLC; and personal fees from ILC Therapeutics Ltd outside the submitted work. A. Palkowski reports The study was supported by the KATY project, which has received funding from the European Union's Horizon 2020 research and innovation program under grant agreement No 101017453 and by the International Centre for Cancer Vaccine Science project, which is carried out within the International Research Agendas programme of the Foundation for Polish Science cofinanced by the European Union under the European Regional Development Fund. The authors would like to thank CI TASK and the PLGrid infrastructure (Poland) for providing their hardware and software resources. K. Litchfield reports personal fees from Roche Tissue Diagnostics, Monopteros Therapeutics, Kynos Therapeutics; grants from CRUK TDL/Ono/LifeArc alliance; and grants from Genesis Therapeutics outside the submitted work. S. Kote reports this study was supported by (i) the project ‘International Centre for Cancer Vaccine Science’ which is carried out within the International Agendas Program of the Foundation for Polish Science, cofinanced by the European Union under the European Regional Development Fund, (ii) the KATY Consortium which is supported by the EU programme Horizon 2020/H2020-SCI-FA-DTS-2020-1 (contract number 101017453), and (iii) the United Kingdom Research and Innovation (grant EP/S02431X/1), UKRI Centre for Doctoral Training in Biomedical AI at the University of Edinburgh, School of Informatics. The authors would like to thank ‘CI-TASK, Gdansk’, and the ‘PL-Grid Infrastructure, Poland’ for providing their hardware and software resources. J.A. Alfaro reports this study was supported by (i) the project ‘International Centre for Cancer Vaccine Science’ which is carried out within the International Agendas Program of the Foundation for Polish Science, cofinanced by the European Union under the European Regional Development Fund, (ii) the KATY Consortium which is supported by the EU program Horizon 2020/H2020-SCI-FA-DTS-2020-1 (contract number 101017453), and (iii) the United Kingdom Research and Innovation (grant EP/S02431X/1), UKRI Centre for Doctoral Training in Biomedical AI at the University of Edinburgh, School of Informatics. The authors would like to thank ‘CI-TASK, Gdansk’, and the ‘PL-Grid Infrastructure, Poland’ for providing their hardware and software resources. No disclosures were reported by the other authors.

G. Bedran: Conceptualization, resources, data curation, software, formal analysis, validation, investigation, visualization, methodology, writing–original draft, project administration, writing–review and editing. H.-C. Gasser: Formal analysis, visualization. K. Weke: Writing–review and editing. T. Wang: Formal analysis. D. Bedran: Writing–review and editing. A. Laird: Writing–review and editing. C. Battail: Writing–review and editing. F.M. Zanzotto: Writing–review and editing. C. Pesquita: Writing–review and editing. H. Axelson: Writing–review and editing. A. Rajan: Writing–review and editing. D.J. Harrison: Writing–review and editing. A. Palkowski: Writing–review and editing. M. Pawlik: Writing–review and editing. M. Parys: Writing–review and editing. J.R. O'Neill: Writing–review and editing. P.M. Brennan: Writing–review and editing. S.N. Symeonides: Writing–review and editing. D.R. Goodlett: Writing–review and editing. K. Litchfield: Writing–review and editing. R. Fahraeus: Writing–review and editing. T.R. Hupp: Writing–review and editing. S. Kote: Supervision, writing–review and editing. J.A. Alfaro: Conceptualization, supervision, visualization, writing–original draft, writing–review and editing.

G. Bedran, D. Bedran, K. Weke, A. Palkowski, R. Fahraeus, T.R. Hupp, S. Kote, and J.A. Alfaro received support from Fundacja na rzecz Nauki Polskiej (FNP; grant ID: MAB/3/2017). D.R. Goodlett received support from Genome Canada & Genome BC (grant ID: 264PRO). D.J. Harrison received support from NuCana plc (grant ID: SMD0-ZIUN05). H. Axelson received support from Swedish Cancer Foundation (grant ID: 211709). H.C. Gasser received support from United Kingdom Research and Innovation (UKRI; grant ID: EP/S02431X/1). C. Pesquita received support from Fundação para a Ciência e a Tecnologia (FCT) through LASIGE Research Unit (grant ID: UIDB/00408/2020 and UIDP/00408/2020). A. Laird, F.M. Zanzotto, C. Pesquita, A. Rajan, A. Palkowski, and J.A. Alfaro received support from European Union's Horizon 2020 research and innovation programme (grant ID: 101017453). C. Battail received support from Agence Nationale de la Recherche (ANR) through GRAL LabEX (grant ID: ANR-10-LABX-49–01) and CBH-EUR-GS 32 (grant ID: ANR-17-EURE-0003). S.N. Symeonides received support from Cancer Research UK (CRUK) and the Chief Scientist's Office of Scotland (CSO): Experimental Cancer Medicine Centre (ECMC; grant ID: ECMCQQR-2022/100017). A. Laird received support from Chief Scientist's Office of Scotland (CSO) NRS Career Researcher Fellowship. R. O'Neill received support from CRUK Cambridge Centre Thoracic Cancer Programme (grant ID: CTRQQR-2021\100012).

This work was supported by the International Centre for Cancer Vaccine Science (Fundacja na rzecz Nauki Polskiej: MAB/3/2017) project is carried out within the International Research Agendas programme of the Foundation for Polish Science co-financed by the European Union under the European Regional Development Fund. This project has received funding from the European Union's Horizon 2020 research and innovation programme under grant agreement No 101017453. This work was supported by the United Kingdom Research and Innovation (grant EP/S02431X/1), UKRI Centre for Doctoral Training in Biomedical AI at the University of Edinburgh, School of Informatics. For the purpose of open access, the author has applied a creative commons attribution (CC BY) licence to any author accepted manuscript version arising. The authors would like to thank CI-TASK, Gdansk, and the PLGrid Infrastructure, Poland for providing their hardware and software resources.

The publication costs of this article were defrayed in part by the payment of publication fees. Therefore, and solely to indicate this fact, this article is hereby marked “advertisement” in accordance with 18 USC section 1734.

Note: Supplementary data for this article are available at Cancer Immunology Research Online (http://cancerimmunolres.aacrjournals.org/).

1.
Bassani-Sternberg
M
,
Bräunlein
E
,
Klar
R
,
Engleitner
T
,
Sinitcyn
P
,
Audehm
S
, et al
.
Direct identification of clinically relevant neoepitopes presented on native human melanoma tissue by mass spectrometry
.
Nat Commun
2016
;
7
:
1
16
.
2.
Newey
A
,
Griffiths
B
,
Michaux
J
,
Pak
HS
,
Stevenson
BJ
,
Woolston
A
, et al
.
Immunopeptidomics of colorectal cancer organoids reveals a sparse HLA class I neoantigen landscape and no increase in neoantigens with interferon or MEK-inhibitor treatment
.
J Immunother Cancer
2019
;
7
:
309
.
3.
Ebrahimi-Nik
H
,
Michaux
J
,
Corwin
WL
,
Keller
GLJ
,
Shcheglova
T
,
Pak
H
, et al
.
Mass spectrometry–driven exploration reveals nuances of neoepitope-driven tumor rejection
.
JCI Insight
2019
;
4
:
e129152
.
4.
Blass
E
,
Ott
PA
.
Advances in the development of personalized neoantigen-based therapeutic cancer vaccines
.
Nat Rev Clin Oncol
2021
;
18
:
215
29
.
5.
Pearlman
AH
,
Hwang
MS
,
Konig
MF
,
Hsiue
EH-C
,
Douglass
J
,
DiNapoli
SR
, et al
.
Targeting public neoantigens for cancer immunotherapy
.
Nat Cancer
2021
;
2
:
487
97
.
6.
Chong
C
,
Müller
M
,
Pak
H
,
Harnett
D
,
Huber
F
,
Grun
D
, et al
.
Integrated proteogenomic deep sequencing and analytics accurately identify noncanonical peptides in tumor immunopeptidomes
.
Nat Commun
2020
;
11
:
1293
.
7.
Malabat
C
,
Feuerbach
F
,
Ma
L
,
Saveanu
C
,
Jacquier
A
.
Quality control of transcription start site selection by nonsense-mediated-mRNA decay
.
eLife
2015
;
4
:
e06722
.
8.
Aspden
JL
,
Eyre-Walker
YC
,
Phillips
RJ
,
Amin
U
,
Mumtaz
MAS
,
Brocard
M
, et al
.
Extensive translation of small Open Reading Frames revealed by Poly-Ribo-Seq
.
eLife
2014
;
3
:
e03528
.
9.
Rivero-Hinojosa
S
,
Grant
M
,
Panigrahi
A
,
Zhang
H
,
Caisova
V
,
Bollard
CM
, et al
.
Proteogenomic discovery of neoantigens facilitates personalized multi-antigen targeted T-cell immunotherapy for brain tumors
.
Nat Commun
2021
;
12
:
6689
.
10.
Smart
AC
,
Margolis
CA
,
Pimentel
H
,
He
MX
,
Miao
D
,
Adeegbe
D
, et al
.
Intron retention is a source of neoepitopes in cancer
.
Nat Biotechnol
2018
;
36
:
1056
8
.
11.
Zook
MB
,
Howard
MT
,
Sinnathamby
G
,
Atkins
JF
,
Eisenlohr
LC
.
Epitopes derived by incidental translational frameshifting give rise to a protective CTL response
.
J Immunol
2006
;
176
:
6928
34
.
12.
Fang
W
,
Wu
C-H
,
Sun
Q-L
,
Gu
Z-T
,
Zhu
L
,
Mao
T
, et al
.
Novel tumor-specific antigens for immunotherapy identified from multi-omics profiling in thymic carcinomas
.
Front Immunol
2021
;
12
:
748820
.
13.
Laumont
CM
,
Daouda
T
,
Laverdure
J-P
,
Bonneil
E
,
Caron-Lizotte
O
,
Hardy
M-P
, et al
.
Global proteogenomic analysis of human MHC class I—associated peptides derived from noncanonical reading frames
.
Nat Commun
2016
;
7
:
10238
.
14.
Laumont
CM
,
Vincent
K
,
Hesnard
L
,
Audemard
E
,
Bonneil
E
,
Laverdure
J-P
, et al
.
Noncoding regions are the main source of targetable tumor-specific antigens
.
Sci Transl Med
2018
;
10
:
eaau5516
.
15.
Ruiz
Cuevas MV
,
Hardy
M-P
,
Hollý
J
,
Bonneil
É
,
Durette
C
,
Courcelles
M
, et al
.
Most noncanonical proteins uniquely populate the proteome or immunopeptidome
.
Cell Rep
2021
;
34
:
108815
.
16.
Ouspenskaia
T
,
Law
T
,
Clauser
KR
,
Klaeger
S
,
Sarkizova
S
,
Aguet
F
, et al
.
Unannotated proteins expand the MHC-I-restricted immunopeptidome in cancer
.
Nat Biotechnol
2022
;
40
:
209
17
.
17.
Olsson
N
,
Schultz
LM
,
Zhang
L
,
Khodadoust
MS
,
Narayan
R
,
Czerwinski
DK
, et al
.
T-cell immunopeptidomes reveal cell subtype surface markers derived from intracellular proteins
.
Proteomics
2018
;
18
:
1700410
.
18.
Demmers
LC
,
Heck
AJR
,
Wu
W
.
Pre-fractionation extends but also creates a bias in the detectable HLA Class Ι ligandome
.
J Proteome Res
2019
;
18
:
1634
43
.
19.
Khodadoust
MS
,
Olsson
N
,
Chen
B
,
Sworder
B
,
Shree
T
,
Liu
CL
, et al
.
B-cell lymphomas present immunoglobulin neoantigens
.
Blood
2019
;
133
:
878
81
.
20.
Komov
L
,
Kadosh
DM
,
Barnea
E
,
Milner
E
,
Hendler
A
,
Admon
A
.
Cell surface MHC class I expression is limited by the availability of peptide-receptive “empty” molecules rather than by the supply of peptide ligands
.
Proteomics
2018
;
18
:
1700248
.
21.
Zeiner
PS
,
Zinke
J
,
Kowalewski
DJ
,
Bernatz
S
,
Tichy
J
,
Ronellenfitsch
MW
, et al
.
CD74 regulates complexity of tumor cell HLA class II peptidome in brain metastasis and is a positive prognostic marker for patient survival
.
Acta Neuropathol Commun
2018
;
6
:
18
.
22.
Bichmann
L
,
Nelde
A
,
Ghosh
M
,
Heumos
L
,
Mohr
C
,
Peltzer
A
, et al
.
MHCquant: automated and reproducible data analysis for immunopeptidomics
.
J Proteome Res
2019
;
18
:
3876
84
.
23.
Chong
C
,
Marino
F
,
Pak
H
,
Racle
J
,
Daniel
RT
,
Müller
M
, et al
.
High-throughput and sensitive immunopeptidomics platform reveals profound interferonγ-mediated remodeling of the human leukocyte antigen (HLA) ligandome
.
Mol Cell Proteomics
2018
;
17
:
533
48
.
24.
Koumantou
D
,
Barnea
E
,
Martin-Esteban
A
,
Maben
Z
,
Papakyriakou
A
,
Mpakali
A
, et al
.
Editing the immunopeptidome of melanoma cells using a potent inhibitor of endoplasmic reticulum aminopeptidase 1 (ERAP1)
.
Cancer Immunol Immunother
2019
;
68
:
1245
61
.
25.
Marino
F
,
Mommen
GPM
,
Jeko
A
,
Meiring
HD
,
van Gaans-van den Brink
JAM
,
Scheltema
RA
, et al
.
Arginine (Di)methylated human leukocyte antigen class I peptides are favorably presented by HLA-B*07
.
J Proteome Res
2017
;
16
:
34
44
.
26.
Narayan
R
,
Olsson
N
,
Wagar
LE
,
Medeiros
BC
,
Meyer
E
,
Czerwinski
D
, et al
.
Acute myeloid leukemia immunopeptidome reveals HLA presentation of mutated nucleophosmin
.
PLoS One
2019
;
14
:
e0219547
.
27.
Shraibman
B
,
Kadosh
DM
,
Barnea
E
,
Admon
A
.
Human leukocyte antigen (HLA) peptides derived from tumor antigens induced by inhibition of DNA methylation for development of drug-facilitated immunotherapy
.
Mol Cell Proteomics
2016
;
15
:
3058
70
.
28.
Abelin
JG
,
Keskin
DB
,
Sarkizova
S
,
Hartigan
CR
,
Zhang
W
,
Sidney
J
, et al
.
Mass spectrometry profiling of HLA-associated peptidomes in mono-allelic cells enables more accurate epitope prediction
.
Immunity
2017
;
46
:
315
26
.
29.
Di Marco
M
,
Schuster
H
,
Backert
L
,
Ghosh
M
,
Rammensee
H-G
,
Stevanovic
S
.
Unveiling the peptide motifs of HLA-C and HLA-G from naturally presented peptides and generation of binding prediction matrices
.
J Immunol Baltim Md 1950
2017
;
199
:
2639
51
.
30.
Bassani-Sternberg
M
,
Pletscher-Frankild
S
,
Jensen
LJ
,
Mann
M
.
Mass spectrometry of human leukocyte antigen class I peptidomes reveals strong effects of protein abundance and turnover on antigen presentation
.
Mol Cell Proteomics MCP
2015
;
14
:
658
73
.
31.
Khodadoust
MS
,
Olsson
N
,
Wagar
LE
,
Haabeth
OAW
,
Chen
B
,
Swaminathan
K
, et al
.
Antigen presentation profiling reveals recognition of lymphoma immunoglobulin neoantigens
.
Nature
2017
;
543
:
723
7
.
32.
Andreatta
M
,
Nicastri
A
,
Peng
X
,
Hancock
G
,
Dorrell
L
,
Ternette
N
, et al
.
MS-Rescue: a computational pipeline to increase the quality and yield of immunopeptidomics experiments
.
Proteomics
2019
;
19
:
e1800357
.
33.
Bassani-Sternberg
M
,
Chong
C
,
Guillaume
P
,
Solleder
M
,
Pak
H
,
Gannon
PO
, et al
.
Deciphering HLA-I motifs across HLA peptidomes improves neo-antigen predictions and identifies allostery regulating HLA specificity
.
PLoS Comput Biol
2017
;
13
:
e1005725
.
34.
Ternette
N
,
Olde Nordkamp
MJM
,
Muller
J
,
Anderson
AP
,
Nicastri
A
,
Hill
AVS
, et al
.
Immunopeptidomic profiling of HLA-A2–positive triple-negative breast cancer identifies potential immunotherapy target antigens
.
Proteomics
2018
;
18
:
e1700465
.
35.
Pearson
H
,
Daouda
T
,
Granados
DP
,
Durette
C
,
Bonneil
E
,
Courcelles
M
, et al
.
MHC class I–associated peptides derive from selective regions of the human genome
.
J Clin Invest
2016
;
126
:
4690
701
.
36.
Erhard
F
,
Halenius
A
,
Zimmermann
C
,
L'Hernault
A
,
Kowalewski
DJ
,
Weekes
MP
, et al
.
Improved Ribo-seq enables identification of cryptic translation events
.
Nat Methods
2018
;
15
:
363
6
.
37.
Loffler
MW
,
Mohr
C
,
Bichmann
L
,
Freudenmann
LK
,
Walzer
M
,
Schroeder
CM
, et al
.
Multi-omics discovery of exome-derived neoantigens in hepatocellular carcinoma
.
Genome Med
2019
;
11
:
28
.
38.
Marcu
A
,
Bichmann
L
,
Kuchenbecker
L
,
Kowalewski
DJ
,
Freudenmann
LK
,
Backert
L
, et al
.
HLA ligand atlas: a benign reference of HLA-presented peptides to improve T cell–based cancer immunotherapy
.
J Immunother Cancer
2021
;
9
:
e002071
.
39.
Sarkizova
S
,
Klaeger
S
,
Le
PM
,
Li
LW
,
Oliveira
G
,
Keshishian
H
, et al
.
A large peptidome dataset improves HLA class I epitope prediction across most of the human population
.
Nat Biotechnol
2020
;
38
:
199
209
.
40.
da Veiga Leprevost
F
,
Haynes
SE
,
Avtonomov
DM
,
Chang
H-Y
,
Shanmugam
AK
,
Mellacheruvu
D
, et al
.
Philosopher: a versatile toolkit for shotgun proteomics data analysis
.
Nat Methods
2020
;
17
:
869
70
.
41.
Tate
JG
,
Bamford
S
,
Jubb
HC
,
Sondka
Z
,
Beare
DM
,
Bindal
N
, et al
.
COSMIC: the catalogue of somatic mutations in cancer
.
Nucleic Acids Res
2019
;
47
:
D941
7
.
42.
Kong
AT
,
Leprevost
FV
,
Avtonomov
DM
,
Mellacheruvu
D
,
Nesvizhskii
AI
.
MSFragger: ultrafast and comprehensive peptide identification in mass spectrometry–based proteomics
.
Nat Methods
2017
;
14
:
513
.
43.
An
Z
,
Zhai
L
,
Ying
W
,
Qian
X
,
Gong
F
,
Tan
M
, et al
.
PTMiner: localization and quality control of protein modifications detected in an open search and its application to comprehensive
posttranslational
modification characterization in human proteome*
.
Mol Cell Proteomics
2019
;
18
:
391
405
.
44.
Qiao
R
,
Tran
NH
,
Xin
L
,
Chen
X
,
Li
M
,
Shan
B
, et al
.
Computationally instrument-resolution-independent de novo peptide sequencing for high-resolution devices
.
Nat Mach Intell
2021
;
3
:
420
5
.
45.
Ivanov
MV
,
Levitsky
LI
,
Bubis
JA
,
Gorshkov
MV
.
Scavager: a versatile postsearch validation algorithm for shotgun proteomics based on gradient boosting
.
Proteomics
2019
;
19
:
1800280
.
46.
Kent
WJ
.
BLAT—The BLAST-Like Alignment Tool
.
Genome Res
2002
;
12
:
656
64
.
47.
Zhang
S
,
Hu
H
,
Jiang
T
,
Zhang
L
,
Zeng
J
.
TITER: predicting translation initiation sites by deep learning
.
Bioinformatics
2017
;
33
:
i234
42
.
48.
Dale
RK
,
Pedersen
BS
,
Quinlan
AR
.
Pybedtools: a flexible Python library for manipulating genomic datasets and annotations
.
Bioinformatics
2011
;
27
:
3423
4
.
49.
Erhard
F
,
Dölken
L
,
Schilling
B
,
Schlosser
A
.
Identification of the cryptic HLA-I immunopeptidome
.
Cancer Immunol Res
2020
;
8
:
1018
26
.
50.
Zhu
LJ
,
Gazin
C
,
Lawson
ND
,
Pagès
H
,
Lin
SM
,
Lapointe
DS
, et al
.
ChIPpeakAnno: a Bioconductor package to annotate ChIP-seq and ChIP-chip data
.
BMC Bioinf
2010
;
11
:
237
.
51.
Carithers
LJ
,
Ardlie
K
,
Barcus
M
,
Branton
PA
,
Britton
A
,
Buia
SA
, et al
.
A novel approach to high-quality postmortem tissue procurement: The GTEx Project
.
Biopreservation Biobanking
2015
;
13
:
311
9
.
52.
Uhlén
M
,
Fagerberg
L
,
Hallström
BM
,
Lindskog
C
,
Oksvold
P
,
Mardinoglu
A
, et al
.
Tissue-based map of the human proteome
.
Science
2015
;
347
:
1260419
.
53.
Kim
S
,
Pevzner
PA
.
MS-GF+ makes progress towards a universal database search tool for proteomics
.
Nat Commun
2014
;
5
:
5277
.
54.
Ojha
R
,
Prajapati
VK
.
Cognizance of posttranslational modifications in vaccines: a way to enhanced immunogenicity
.
J Cell Physiol
2021
;
236
:
8020
34
.
55.
Trujillo
JA
,
Croft
NP
,
Dudek
NL
,
Channappanavar
R
,
Theodossis
A
,
Webb
AI
, et al
.
The cellular redox environment alters antigen presentation
.
J Biol Chem
2014
;
289
:
27979
91
.
56.
Parker
R
,
Partridge
T
,
Wormald
C
,
Kawahara
R
,
Stalls
V
,
Aggelakopoulou
M
, et al
.
Mapping the SARS-CoV-2 spike glycoprotein-derived peptidome presented by HLA class II on dendritic cells
.
Cell Rep
2021
;
35
:
109179
.
57.
Reynisson
B
,
Alvarez
B
,
Paul
S
,
Peters
B
,
Nielsen
M
.
NetMHCpan-4.1 and NetMHCIIpan-4.0: improved predictions of MHC antigen presentation by concurrent motif deconvolution and integration of MS MHC eluted ligand data
.
Nucleic Acids Res
2020
;
48
:
W449
54
.
58.
Bullock
TN
,
Eisenlohr
LC
.
Ribosomal scanning past the primary initiation codon as a mechanism for expression of CTL epitopes encoded in alternative reading frames
.
J Exp Med
1996
;
184
:
1319
29
.
59.
Starck
SR
,
Tsai
JC
,
Chen
K
,
Shodiya
M
,
Wang
L
,
Yahiro
K
, et al
.
Translation from the 5′ untranslated region shapes the integrated stress response
.
Science
2016
;
351
:
aad3867
.
60.
Goodenough
E
,
Robinson
TM
,
Zook
MB
,
Flanigan
KM
,
Atkins
JF
,
Howard
MT
, et al
.
Cryptic MHC class I-binding peptides are revealed by aminoglycoside-induced stop codon read-through into the 3′ UTR
.
Proc Natl Acad Sci
2014
;
111
:
5670
5
.
61.
Sidney
J
,
Peters
B
,
Frahm
N
,
Brander
C
,
Sette
A
.
HLA class I supertypes: a revised and updated classification
.
BMC Immunol
2008
;
9
:
1
.
62.
Yi
X
,
Liao
Y
,
Wen
B
,
Li
K
,
Dou
Y
,
Savage
SR
, et al
.
caAtlas: an immunopeptidome atlas of human cancer
.
iScience
2021
;
24
:
103107
.
63.
Gavali
S
,
Liu
J
,
Li
X
,
Paolino
M
.
Ubiquitination in T-cell activation and checkpoint inhibition: new avenues for targeted cancer immunotherapy
.
Int J Mol Sci
2021
;
22
:
10800
.
64.
Malaker
SA
,
Ferracane
MJ
,
Depontieu
FR
,
Zarling
AL
,
Shabanowitz
J
,
Bai
DL
, et al
.
Identification and characterization of complex glycosylated peptides presented by the MHC class II processing pathway in melanoma
.
J Proteome Res
2017
;
16
:
228
37
.
65.
Penny
SA
,
Abelin
JG
,
Malaker
SA
,
Myers
PT
,
Saeed
AZ
,
Steadman
LG
, et al
.
Tumor-infiltrating lymphocytes target HLA-I phosphopeptides derived from cancer signaling in colorectal cancer
.
Front Immunol
2021
;
12
:
723566
.
66.
Raposo
B
,
Merky
P
,
Lundqvist
C
,
Yamada
H
,
Urbonaviciute
V
,
Niaudet
C
, et al
.
T cells specific for posttranslational modifications escape intrathymic tolerance induction
.
Nat Commun
2018
;
9
:
353
.
67.
Kacen
A
,
Javitt
A
,
Kramer
MP
,
Morgenstern
D
,
Tsaban
T
,
Shmueli
MD
, et al
.
Posttranslational modifications reshape the antigenic landscape of the MHC I immunopeptidome in tumors
.
Nat Biotechnol
2022
;
41
:
239
251
68.
Buzy
A
,
Bracchi
V
,
Sterjiades
R
,
Chroboczek
J
,
Thibault
P
,
Gagnon
J
, et al
.
Complete amino acid sequence of Proteus mirabilis PR catalase. Occurrence of a methionine sulfone in the close proximity of the active site
.
J Protein Chem
1995
;
14
:
59
72
.
69.
Lin
L
,
Jia
L
,
Fu
Y
,
Zhao
R
,
Huang
Y
,
Tang
C
, et al
.
A comparative analysis of infection in patients with malignant cancer: a clinical pharmacist consultation study
.
J Infect Public Health
2019
;
12
:
789
93
.
70.
Ishii
T
,
Udono
H
,
Yamano
T
,
Ohta
H
,
Uenaka
A
,
Ono
T
, et al
.
Isolation of MHC class I—restricted tumor antigen peptide and its precursors associated with heat shock proteins hsp70, hsp90, and gp96
.
J Immunol
1999
;
162
:
1303
9
.
71.
Bloch
O
,
Lim
M
,
Sughrue
ME
,
Komotar
RJ
,
Abrahams
JM
,
O'Rourke
DM
, et al
.
Autologous heat shock protein peptide vaccination for newly diagnosed glioblastoma: impact of peripheral PD-L1 expression on response to therapy
.
Clin Cancer Res
2017
;
23
:
3575
84
.
72.
Binder
RJ
.
Immunosurveillance of cancer and the heat shock protein-CD91 pathway
.
Cell Immunol
2019
;
343
:
103814
.
This open access article is distributed under the Creative Commons Attribution 4.0 International (CC BY 4.0) license.

Supplementary data