Abstract
Anticancer immunotherapies demand optimal epitope targets, which could include proteasome-generated spliced peptides if tumor cells were to present them. Here, we show that spliced peptides are widely presented by MHC class I molecules of colon and breast carcinoma cell lines. The peptides derive from hot spots within antigens and enlarge the antigen coverage. Spliced peptides also represent a large number of antigens that would otherwise be neglected by patrolling T cells. These antigens tend to be long, hydrophobic, and basic. Thus, spliced peptides can be a key to identifying targets in an enlarged pool of antigens associated with cancer.
Introduction
Adoptive T-cell therapy (ATT) uses CD8+ T lymphocytes to selectively recognize and eliminate cancer cells. Ideal markers for cancer cell recognition are epitopes either carrying cancer-driver somatic mutations or presenting cancer-germline antigens. The number of these epitopes is, however, limited by the number of known cancer-germline genes, by the mutation frequency, and by the fact that epitopes need to have specific motifs to be presented on major histocompatibility complex class I (MHC-I) molecules and to pass all steps of the antigen presentation pathway. The identification of targetable tumor-specific epitopes is therefore one of the most challenging and yet promising quests in anticancer immunotherapies (1, 2).
MHC-I–bound epitopes are generally produced by the proteasome, which can break proteins and release peptide fragments or re-ligate them in a process called proteasome-catalyzed peptide splicing (PCPS; ref. 3). PCPS forms new (spliced) peptides with sequences that do not recapitulate the parental protein and is driven and regulated by factors that have been only partially determined (4–8). Despite our limited knowledge about PCPS, spliced peptides constitute a large portion of the antigenic peptide pool for human EBV-transformed cell lines and primary fibroblasts (9) and can be presented by MHC-I complexes in amounts comparable to canonical nonspliced peptides (9, 10). Like nonspliced epitopes, proteasome-generated spliced epitopes can trigger in vivo CD8+ T-cell–mediated responses toward tumor-associated or pathogen-derived antigens (10–12) and can be targets for effective anticancer ATT (13, 14).
The extent to which targeting spliced epitopes could benefit ATT is an open question with relevance for translational medicine. On the one hand, the theoretically large variety of spliced peptide sequences suggests that recurrent driver mutations, which could not be efficiently presented on predominant MHC-I variants by nonspliced peptides because of sequence limitations, could instead be presented by spliced peptides (15). The preliminary observation that a large portion of antigens is represented at the surface of nontumor cells only by spliced peptides (9) suggests that PCPS could permit the presentation by MHC-I molecules of overlooked tumor-associated antigens. On the other hand, the few examples of CD8+ T cells specific for spliced epitopes described so far, all derived from tumor-associated antigens (4, 7, 10, 13, 14, 16), might call into question the relevance of PCPS in generating a large number of tumor-associated epitopes. Although this question could have been answered already by studying the antigenic peptides bound to the MHC-I molecules (i.e., the MHC-I immunopeptidome) of cancer cell lines, technical difficulties inherent in PCPS itself hindered that approach (15). These difficulties have been resolved through development of a strategy for the identification of spliced peptides in the MHC-I immunopeptidome by mass spectrometry (MS; ref. 9). With that method, we could identify a portion of the immunopeptidome that consisted of spliced peptides generated by proteasomal ligation of two peptide fragments derived from the same molecule by cis PCPS (Supplementary Fig. S1) and separated in the antigen by no more than 20 residues (9). That study found that an unexpectedly large fraction of spliced peptides is available for T-cell recognition, which makes them a promising aid for the identification of novel targets for anticancer immunotherapy.
Here, we have developed a novel method for the identification of spliced peptides in the immunopeptidome, applied it to cancer cells, and show that PCPS enlarges the antigenic landscape in cancer cells.
Materials and Methods
Further details of the methods and the explanations of the outcomes are described in the Supplementary Experimental Procedures section.
Cell lines
The HCT116 and HCC1143 cell lines are derived from colon and breast carcinoma, respectively (Supplementary Table S1). HCC1143 cells have been grown in RPMI medium with 10% FCS, 2 mmol/L glutamine, penicillin and streptomycin, 1× MEM, and 1× sodium pyruvate, in a 5% CO2 atmosphere at 37°C. They have been purchased from ATCC one year prior to use; they have been tested for mycoplasma and have not been reauthenticated.
Peptide synthesis and proteasome purification.
The polypeptide substrates have been synthesized using Fmoc solid phase chemistry. The sequence enumeration for the substrate polypeptides is reported in Supplementary Table S2. The mutated peptides (neoepitopes) identified in the MHC-I immunopeptidome of the HCT116 cell line and in the in vitro digestions of the synthetic substrates by purified proteasome are reported in Supplementary Table S3.
20S proteasome has been purified from the peripheral blood of a healthy donor, as previously described (10). Proteasome concentration has been measured by Bradford staining and verified by Coomassie staining in an SDS-PAGE gel. The purity of standardized proteasome preparations has been previously shown (17).
In vitro digestions and MS analysis.
Synthetic polypeptides (20 μmol/L) have been digested by 3 μg 20S proteasome in 100 μL TEAD buffer for 20 hours at 37°C as previously described (17). In vitro digestion samples have been measured by MS as follows: 10 μL digested sample has been concentrated for 5 minutes on a trap column (PepMap C18, 5 mm × 300 μm × 5 μm, particle size 100 Å, Thermo Fisher Scientific) with 2:98 (v/v) acetonitrile/water containing 0.1% (v/v) TFA at a flow rate of 20 μL/minute and then analyzed by nanoscale LC-MS/MS using an Ultimate 3000 and Q Exactive Plus mass spectrometer (Thermo Fisher Scientific). The system is composed of a 75 μm i.d. × 250 mm nano LC column (Acclaim PepMap C18, 2 μm; 100 Å; Thermo Fisher Scientific). The mobile phase (A) consisted of 0.1% (v/v) formic acid in water and (B) 80:20 (v/v) acetonitrile/water containing 0.1% (v/v) formic acid. The elution has been carried out using a gradient 3%–50% B in 30 minutes with a flow rate of 300 nL/minute. Full MS spectra (m/z 200–2000) have been acquired on a Q Exactive at a resolution of 70,000 (FWHM) followed by a data-dependent MS/MS of the top10 precursor ions (resolution 17,500, 4–8+ charge state excluded, 1 μscans). Fragment ions have been generated in a HCD cell and detected in an Orbitrap Mass Analyzer. Dynamic exclusion has been enabled with 30-s exclusion duration. The maximum ion injection time for MS scans has been set to 50 ms and for MS/MS scans to 80 ms. Background ions at m/z 391.2843 and 445.1200 have acted as lock mass. Peptides have been identified using the search engine Mascot version 2.6.1 (Matrix Science). The MS outcomes of the in vitro digestions of the synthetic substrates have been analyzed with the aim of identifying target peptides as previously described (18). Compared with the analysis method used for the analysis of the MHC-I immunopeptidomes, no restrictions for either the peptide product length or the intervening sequence length have been applied.
Extraction, processing, and analysis of proteins (>30 kDa) from the HCC1143 cell lysate.
HCC1143 cell pellet (3 × 10⁶ cells) has been lysed in 6 mol/L urea/2 mol/L thiourea in 10 mmol/L HEPES (pH 8.0) by repeated thawing and freezing. The samples have been centrifuged at 20,000 × g for 15 minutes at 4°C. Protein concentration in the supernatant has been quantified by BCA. Proteins larger than 30 kDa have been separated by NanoSep Centrifugal 30 kDa (Pall Life Sciences), centrifuging the sample for 15 minutes at 14,000 × g. Then, we diluted 15 μg protein in ABC buffer (50 mmol/L ammonium bicarbonate, 6 mmol/L DTT, 5% ACN). The sample was then alkylated by the addition of iodoacetamide (12 mmol/L final concentration) and left in the dark at room temperature for 30 minutes. Proteins were digested by 0.3 μg LysC for 3 hours at room temperature, further diluted in ABC buffer, and digested by 0.3 μg trypsin overnight at room temperature. The sample was then purified by SepPak C18 (Waters), eluted with a buffer of 80% ACN and 0.1% TFA, and concentrated by speedvac. The digested sample (10 μL) was concentrated for 4 minutes on a trap column (PepMap C18, 5 mm × 300 μm × 5 μm, particle size 100 Å, Thermo Fisher Scientific) with 2:98 (v/v) acetonitrile/water containing 0.1% (v/v) TFA at a flow rate of 20 μL/minute and then analyzed by nanoscale LC-MS/MS using an Ultimate 3000 and Q Exactive Plus mass spectrometer (Thermo Fisher Scientific). The system was composed of a 75-μm i.d. × 250 mm nano LC column (Acclaim PepMap C18, 2 μm; 100 Å; Thermo Fisher Scientific). The mobile phase consisted of (A) 0.1% (v/v) formic acid in water and (B) 80:20 (v/v) acetonitrile/water containing 0.1% (v/v) formic acid. The elution was carried out using a gradient 3%–30% B in 85 minutes with a flow rate of 300 nL/minute. Full MS spectra (m/z 200–2000) were acquired on a Q Exactive Orbitrap at a resolution of 70,000 (FWHM) followed by a data-dependent MS/MS of the top10 precursor ions (resolution 17,500, 4–8+ charge state excluded, 1 μscans).
Fragment ions were generated in an HCD cell and detected in an Orbitrap Mass Analyzer. Dynamic exclusion was enabled with 20-second exclusion duration. The maximum ion injection time for MS scans was set to 50 ms and for MS/MS scans to 2,000 ms. Background ions at m/z 391.2843 and 445.1200 acted as lock mass. Peptides were identified with the search engine Mascot version 2.6.1 (Matrix Science), applying the same method used for the MHC-I immunopeptidome analysis (described below).
MHC-I–peptide binding affinity.
The binding affinity of two neoepitopes identified in the HCT116 MHC-I immunopeptidome to the four MHC-I variants (HLA-A*01:01, -A*02:01, -B*45:01, -B*18:01) of the HCT116 cell line was measured using purified MHC-I molecules, as described elsewhere (9).
Identification of spliced and nonspliced peptides
The analysis of the MS data sets, generated from the immunopeptidome of the cell lines HCC1143 and HCT116 published by Bassani-Sternberg and colleagues (19), was carried out with Mascot version 2.6.1 (Matrix Science). MS/MS scans were searched with no enzyme specificity and 6 ppm peptide precursor mass tolerance, 20 ppm MS/MS mass tolerance and HCD fragmentation.
We computed for each protein entry in the human Swissprot database all 9- to 12-mer normal and reverse cis-spliced peptides with a maximum intervening sequence length of 25 residues (see Supplementary Fig. S1 for the PCPS nomenclature and Supplementary Fig. S2A and S2B for the peptide identification pipeline). All spliced sequences that could be generated by simple peptide-bond hydrolysis (i.e., nonspliced peptides) of any human protein were removed from the database. For each resulting spliced peptide, the molecular weight (MW) was computed. Similarly, all 9–12-mer nonspliced peptides and their MWs were computed. We then matched the observed precursor masses in the MS data with the MW of all theoretical spliced and nonspliced peptides that could be expected from an instrument with an accuracy of 6 ppm, and thereby reduced the overall database to a dataset-specific database. In order to generate a database that has a structure similar to that of the human proteome and to make our search strategy as similar as possible to previous studies, we transformed our spliced and nonspliced databases into the structure of the human proteome database. We concatenated the N-mer spliced peptide sequences to longer sequences, thereby generating new “protein” entries, which have a length distribution that followed that of the human proteome. For easier annotation, we concatenated spliced and nonspliced sequences in separate “proteins.” On the basis of this combined spliced and nonspliced database structure, we then computed the decoy database via randomization of the “protein” sequences, while ensuring that none of the target N-mer spliced and nonspliced sequences is present in the decoy database. All spectra were simultaneously searched against the spliced, nonspliced and decoy database using the Mascot search engine. The search results were extracted from Mascot and filtered using an ion score cutoff, which resulted in 1% false discovery rate (FDR). 
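As an illustration of the database construction described above, the following Python sketch enumerates normal and reverse cis-spliced peptides for a single protein. It is not the authors' R implementation. The function name, the minimum splice-reactant length of one residue, and the convention that reverse splicing allows an intervening sequence of zero residues are assumptions made for this example. The sketch also removes only those candidates that are contiguous substrings of the same protein, whereas the pipeline removes candidates that match a nonspliced peptide of any human protein.

```python
def cis_spliced_peptides(protein, min_len=9, max_len=12, max_gap=25):
    """Enumerate normal and reverse cis-spliced peptides of one protein.

    Assumptions (not taken from the source): each splice reactant has at
    least one residue; normal splicing needs an intervening sequence of at
    least one residue; reverse splicing may have an intervening length of 0.
    """
    n = len(protein)
    # Contiguous substrings are nonspliced peptides and are excluded later.
    nonspliced = {
        protein[i:i + k]
        for k in range(min_len, max_len + 1)
        for i in range(n - k + 1)
    }
    spliced = set()
    for length in range(min_len, max_len + 1):
        for len_up in range(1, length):            # upstream fragment length
            len_down = length - len_up             # downstream fragment length
            for start_up in range(n - len_up + 1):
                end_up = start_up + len_up
                for gap in range(max_gap + 1):     # intervening sequence length
                    start_down = end_up + gap
                    end_down = start_down + len_down
                    if end_down > n:
                        break
                    up = protein[start_up:end_up]
                    down = protein[start_down:end_down]
                    if gap >= 1:
                        spliced.add(up + down)     # normal cis splicing
                    spliced.add(down + up)         # reverse cis splicing
    return spliced - nonspliced


# Toy example with short lengths so the output stays small.
toy = cis_spliced_peptides("ACDEFG", min_len=3, max_len=3, max_gap=2)
print(sorted(toy)[:5])
```

Even this toy case shows why the full database is large: the number of candidates grows with the product of protein length, allowed peptide lengths, fragment split points, and intervening lengths.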
Merging short peptides into new “proteins” results in several database entries that are artificial junctions between the short peptides; these are neither spliced peptides nor nonspliced peptides. We have considered these database entries as decoy sequences, in case they matched an MS/MS spectrum.
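The ion score cutoff that yields a 1% FDR can be sketched as follows. This is a minimal illustration in Python, not the authors' code. It assumes that the FDR at a given cutoff is estimated as the number of decoy matches divided by the number of target matches at or above that score, which is one common target-decoy estimator; the source does not state which estimator was used. Matches to the artificial junction entries would simply be flagged as decoys in the input.

```python
def ion_score_cutoff(psms, max_fdr=0.01):
    """Return the lowest ion score cutoff whose estimated FDR is <= max_fdr.

    psms: iterable of (ion_score, is_decoy) pairs, one per spectrum match.
    The FDR estimate decoys/targets is an assumption of this sketch.
    Returns None if no cutoff satisfies the threshold.
    """
    ranked = sorted(psms, key=lambda p: p[0], reverse=True)
    targets = decoys = 0
    best = None
    i = 0
    while i < len(ranked):
        score = ranked[i][0]
        # Consume all matches tied at this score before evaluating the FDR.
        while i < len(ranked) and ranked[i][0] == score:
            if ranked[i][1]:
                decoys += 1
            else:
                targets += 1
            i += 1
        if targets and decoys / targets <= max_fdr:
            best = score
    return best


# Hypothetical scores: 200 targets and 1 decoy at or above 40, more decoys below.
example = ([(60, False)] * 100 + [(40, True)] + [(40, False)] * 100
           + [(20, True)] * 5 + [(20, False)] * 10)
print(ion_score_cutoff(example))
```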
The protocol described so far is identical to the analysis protocol described in Liepe and colleagues (9). In order to further increase the certainty in the identification of spliced peptide sequences, we have here introduced a minimum delta score (δ) of 0.3, which describes the relative deviation of a spliced peptide ion score (s1) from the lower ion score (s2) of a competing spliced or nonspliced peptide: δ = 1 − s2/s1, where s1 > s2. Applying this delta score avoids the annotation of MS/MS spectra with spliced peptide sequences that are not certain because very similar (spliced or nonspliced) sequences could be almost equally well matched. If δ < 0.3 between two spliced peptide sequences, we did not annotate the MS/MS spectrum. If δ < 0.3 between a spliced and a nonspliced peptide, we considered the nonspliced peptide as the correct assignment (provided that no other higher-scoring sequences were matched to this MS/MS spectrum). By doing so, we have made the conservative assumption that a nonspliced peptide with a score less than 30% below that of a spliced peptide is more likely to be the correct assignment of this MS/MS spectrum.
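The delta-score rule can be written as a small decision function. The Python sketch below is illustrative only. It takes s1 to be the higher (top-ranked spliced) ion score so that δ lies between 0 and 1. It also assumes that a close-scoring nonspliced candidate takes precedence when a close-scoring spliced candidate is present as well, a case the text does not spell out. The function and variable names are hypothetical.

```python
def delta_score(s1, s2):
    """Relative deviation of the top score s1 from a lower score s2."""
    return 1.0 - s2 / s1


def assign_spectrum(candidates, min_delta=0.3):
    """Pick the annotation for one MS/MS spectrum.

    candidates: list of (sequence, ion_score, kind), kind being "spliced"
    or "nonspliced". Returns (sequence, kind), or None if the spectrum is
    left unannotated.
    """
    if not candidates:
        return None
    ranked = sorted(candidates, key=lambda c: c[1], reverse=True)
    top_seq, top_score, top_kind = ranked[0]
    if top_kind == "nonspliced":
        return top_seq, top_kind
    # Top hit is spliced: look for competitors scoring within min_delta.
    close = [c for c in ranked[1:] if delta_score(top_score, c[1]) < min_delta]
    close_nonspliced = [c for c in close if c[2] == "nonspliced"]
    if close_nonspliced:
        # Conservative choice: prefer the best close-scoring nonspliced peptide.
        seq, _, kind = close_nonspliced[0]
        return seq, kind
    if close:
        # Two spliced sequences are almost equally good: do not annotate.
        return None
    return top_seq, top_kind


print(assign_spectrum([("PEPTIDEAA", 50, "spliced"), ("PEPTIDEAC", 40, "nonspliced")]))
```

With scores of 50 and 40, δ = 1 − 40/50 = 0.2, which is below 0.3, so the nonspliced sequence is chosen.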
Because it is almost impossible to distinguish leucine (L) and isoleucine (I) from each other, we have incorporated this uncertainty in our pipeline, by checking that a spliced peptide sequence carrying either L or I could not be explained by a nonspliced peptide sequence through exchange of I and L.
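One simple way to implement this check is to collapse I and L into a single symbol before comparing sequences. The Python sketch below shows the idea; the function names and example sequences are hypothetical and the authors' actual implementation may differ.

```python
def collapse_il(sequence):
    """Map isoleucine to leucine, since the two residues have the same mass."""
    return sequence.replace("I", "L")


def explained_by_nonspliced(spliced_seq, nonspliced_seqs):
    """True if some nonspliced peptide equals spliced_seq up to I/L exchange."""
    collapsed = {collapse_il(s) for s in nonspliced_seqs}
    return collapse_il(spliced_seq) in collapsed


# Hypothetical sequences: the candidate differs from a nonspliced peptide
# only by I/L exchanges, so it would not be reported as spliced.
print(explained_by_nonspliced("KLIDAEFGL", {"KILDAEFGI"}))
```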
We introduced a last step in the pipeline by comparing the predicted MS-HPLC retention time (or more precisely the hydrophobicity) of the peptides with the MS-HPLC retention time of the MS/MS spectrum assigned to the corresponding peptide. The hydrophobicity of peptides can be predicted depending on the MS separation system and is correlated with the measured peptide retention times. We have made use of such a predictor, SSRCalc (20), and predicted the hydrophobicity of all assigned nonspliced peptides. Assuming that the nonspliced peptide assignments are correct, we have computed the running average of the predicted hydrophobicity as a function of the measured nonspliced peptide retention time and determined the observed variance. We then applied this running average and variance to the spliced peptides. We removed all spliced peptide assignments whose predicted hydrophobicity shows a larger discrepancy from the running average of nonspliced peptides than the variance of the nonspliced peptides. This retention time filter is only applied to remove possibly wrong spliced peptide assignments. No nonspliced peptides were removed, regardless of their predicted hydrophobicity.
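The retention-time filter can be sketched in Python as shown below. This is a rough illustration under several assumptions that are not in the source: predicted hydrophobicity values are supplied as inputs (SSRCalc itself is not called), the running average is taken over a fixed number of nonspliced peptides nearest in retention time, and the tolerance is expressed as a multiple of the local standard deviation. The window size and the multiplier are placeholders.

```python
from bisect import bisect_left
from statistics import mean, pstdev


def rt_filter(nonspliced, spliced, window=50, n_sd=2.0):
    """Keep spliced peptides whose predicted hydrophobicity fits the trend.

    nonspliced: list of (retention_time, predicted_hydrophobicity); must be
                non-empty and serves as the trusted reference.
    spliced:    list of (sequence, retention_time, predicted_hydrophobicity).
    window, n_sd: assumed parameters of this sketch, not values from the source.
    """
    ref = sorted(nonspliced)
    rts = [rt for rt, _ in ref]
    kept = []
    for seq, rt, hydro in spliced:
        # Take a window of reference peptides centred on this retention time.
        centre = bisect_left(rts, rt)
        lo = max(0, centre - window // 2)
        hi = min(len(ref), lo + window)
        lo = max(0, hi - window)
        local = [h for _, h in ref[lo:hi]]
        mu, sd = mean(local), pstdev(local)
        if abs(hydro - mu) <= n_sd * sd:
            kept.append((seq, rt, hydro))
    return kept


# Synthetic reference: hydrophobicity tracks retention time with small noise.
reference = [(float(t), float(t) + (1 if t % 2 else -1)) for t in range(100)]
candidates = [("FITS", 50.0, 50.5), ("OUTLIER", 50.0, 80.0)]
print([seq for seq, _, _ in rt_filter(reference, candidates, window=10)])
```

As in the text, only spliced assignments pass through the filter; the nonspliced reference set is never pruned.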
Peptide sequence assignment using semi-inverted databases
To estimate the possible false assignment rate of spliced and nonspliced peptide sequences in the MHC-I immunopeptidome, we generated artificial spliced peptide databases in which the sequence of either the N-terminal splice-reactant or the C-terminal splice-reactant was inverted. These semi-inverted databases contain almost the same number of sequences as the spliced human proteome database. Partially inverting sequences does not alter the MW of the sequence. Therefore, the m/z-matched semi-inverted databases contain as many entries as the m/z-matched spliced human proteome database. Because the latter two types of databases are constructed in similar ways, all database entries contain an N-mer long sequence that is identical to a peptide sequence found in the human proteome. If the identification of spliced peptides were due to an artifact appearing in the database construction and/or size, we would expect to identify similar numbers of semi-inverted peptide sequences and spliced peptides.
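The statement that semi-inversion leaves the MW unchanged follows from the peptide mass being a sum over residues. The Python sketch below inverts one splice-reactant and confirms that the mass still matches within a 6 ppm tolerance. The residue masses are standard monoisotopic values; the function names and the example sequence are hypothetical. A reactant that is a single residue or a palindrome is unchanged by inversion, which may be one reason the semi-inverted databases hold almost, rather than exactly, the same number of distinct sequences.

```python
# Standard monoisotopic residue masses (Da).
MONO = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056


def peptide_mass(sequence):
    """Monoisotopic mass of an unmodified peptide."""
    return sum(MONO[aa] for aa in sequence) + WATER


def semi_invert(n_reactant, c_reactant, which):
    """Invert the N-terminal ("N") or C-terminal ("C") splice-reactant."""
    if which == "N":
        return n_reactant[::-1] + c_reactant
    if which == "C":
        return n_reactant + c_reactant[::-1]
    raise ValueError("which must be 'N' or 'C'")


def within_ppm(observed, theoretical, tol_ppm=6.0):
    """True if the two masses agree within tol_ppm parts per million."""
    return abs(observed - theoretical) / theoretical * 1e6 <= tol_ppm


original = "KLDE" + "TGNSW"
n_inverted = semi_invert("KLDE", "TGNSW", "N")   # "EDLKTGNSW"
c_inverted = semi_invert("KLDE", "TGNSW", "C")   # "KLDEWSNGT"
print(within_ppm(peptide_mass(original), peptide_mass(n_inverted)))
```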
The semi-inverted databases were used to analyze one technical replicate (sample 20120617_EXQ0_MiBa_SA_HCT116_2_mHLA_2hr.raw) of the HCT116 immunopeptidome. The number of identified semi-inverted peptides was compared with the total number of peptides identified (sum of semi-inverted sequences and nonspliced peptides). Among the identified sequences, we checked which sequences could be spliced peptides with intervening sequences longer than 25 residues (which is the restriction we applied for the construction of the spliced human proteome database in our pipeline). In parallel, we adopted a similar strategy, in which we generated two databases where the sequences of either N-terminal or C-terminal portions of the nonspliced peptides were inverted. Because nonspliced peptides do not consist of two splice-reactants, of which we could invert one of the splice-reactants, we randomly sampled an artificial peptide splicing site based on a uniform distribution along the peptide sequence. Finally, we searched the same technical replicate of the HCT116 immunopeptidome against the nonspliced proteome database together with the semi-inverted nonspliced database and considered any assignment of sequences present in the semi-inverted databases as random.
Mutations, neoepitopes, and antigen presentation in the MHC-I immunopeptidome
To evaluate the prevalence of mutated antigens presented by MHC-I complexes, we calculated the number of expressed mutated and nonmutated antigens, which are represented by spliced peptides only, nonspliced peptides only, or both. The total number of expressed proteins is based on the data published by Klijn and colleagues (21) and the mutations identified by RNA-seq in independent studies (22). Further details are reported in Supplementary Experimental Procedures.
The spatial localization of the spliced and nonspliced peptides within the 3D structures of the antigens CHMP7 and RBBP7 was graphically presented using PyMol. The tertiary protein structure of CHMP7 and RBBP7 was predicted using i-Tasser (23).
Quantification
Quantification of the amount of spliced and nonspliced peptides in the immunopeptidomes by label-free MS was done as previously described (9). Briefly, we extracted the MS ion current peak area for each identified peptide (using Mascot Distiller's label-free quantification tools) and used this information to estimate the distribution of the amount of the spliced and nonspliced peptides. Potential bias of this method due to differences in the chemical features of spliced compared with nonspliced peptides has been previously excluded (9). Further details are reported in Supplementary Experimental Procedures.
Statistical analysis.
If not described otherwise, all statistical tests have been done in R, and differences in distributions have been tested using the Kolmogorov–Smirnov test. Where appropriate, P values have been adjusted with Bonferroni correction. We computed the odds ratio (OR) of mutated antigens being represented by peptides versus nonmutated antigens being represented by peptides by performing the Fisher exact test. In this latter statistical analysis, we distinguished between all antigens represented by any nonspliced peptide and antigens represented by any spliced peptide. Correlation analysis was conducted using the Pearson correlation coefficient, where the test statistic follows a t distribution.
Dataset availability
The MHC-I immunopeptidome data sets have been obtained from the PRIDE archive (identifier: PXD000394; files: 20120321_EXQ1_MiBa_SA_HCC1143_1.raw, 20120321_EXQ1_MiBa_SA_HCC1143_2.raw, 20120322_EXQ1_MiBa_SA_HCC1143_1_A.raw, 20120515_EXQ3_MiBa_SA_HCT116_mHLA-1.raw, 20120515_EXQ3_MiBa_SA_HCT116_mHLA-2.raw, 20120617_EXQ0_MiBa_SA_HCT116_1_mHLA_2hr.raw, 20120617_EXQ0_MiBa_SA_HCT116_2_mHLA_2hr.raw) or the Datadryad.org archive (doi:10.5061/dryad.r984n) and were generated by Bassani-Sternberg and colleagues (19) and Mommen and colleagues (24). The cell source characteristics are described in Supplementary Table S1. The RNA-seq data sets for the HCT116 and HCC1143 cell lines were obtained from Klijn and colleagues (21). The mutations' database for the HCC1143 and HCT116 refers to the Cosmic database (version August 17, 2016; ref. 22).
All other MS files (.mgf and/or .RAW) generated for the study, the peptide spectrum matches for the immunopeptidome data sets, and the mutation lists of the two cancer cell lines are available in the Mendeley database (http://dx.doi.org/10.17632/y2cvb5nvgn.1; see Supplementary Table S4).
Software and computing infrastructure.
All algorithms, spliced and nonspliced peptide databases, decoy databases and data analysis, and data plotting tools have been implemented in R on a Linux cluster system with 120 CPU-cores used for the construction of the entire human spliced peptide database (total data volume of database stored as binary RData files: 107 Gb) and for the construction of all dataset–specific databases (total data volume of database stored as binary RData files per MS RAW file: 45 Gb; total data volume of database stored as FASTA file per MS RAW file: 447 Gb; Mascot compiled databases results in approximately 1.5 Tb storage space needed per RAW file analysis).
The scripts for the MHC-I–spliced peptides' database generation are available in the Mendeley database http://dx.doi.org/10.17632/y2cvb5nvgn.1.
Commercial software.
MS RAW data were converted into Mascot generic file format using Mascot Distiller. Using the Mascot search engine (standard 1cpu license, uses 4 CPU-cores in parallel), the total search time per replicate of the HCC1143 and the HCT116 cell lines, respectively, was approximately 15 days (this varied depending on the dataset analyzed).
The 3D representation of the 2 antigens represented by neoepitopes was carried out using the software i-Tasser (23) to predict the tertiary protein structure, and by PyMOL for graphic visualization (The PyMOL Molecular Graphics System, Version 1.7.4 Schrödinger, LLC).
Results
SPI-delta method detects MHC-I–spliced immunopeptidome of cancer cell lines
To investigate the spliced immunopeptidome of cancer cells, we developed the SPI-delta (Spliced Peptide Identifier, version delta) method (Supplementary Fig. S2). In particular, we allowed, in the new human spliced proteome database, spliced 9–12-mer peptides generated with an intervening sequence between the splice reactants of 25 residues or less (see Supplementary Fig. S1 for PCPS nomenclature), allowing entries 5 residues longer than the previous version of the method (9). We introduced this modification because the study on MHC-I immunopeptidomes of nontumoral human cells did not show a prevalence of spliced peptides with short intervening sequences (9). We also introduced a minimum delta score to identify a peptide as spliced peptide. We included a final step in the pipeline to remove from the final annotation all spliced peptides that have an MS-HPLC retention time discordant to what is expected (see Materials and Methods for details and Supplementary Fig. S2).
We applied SPI-delta to the MHC-I immunopeptidomes of the HCT116 and HCC1143 cell lines. These cell lines were chosen because they derive from two of the most common and lethal tumors in the world (i.e., colon and breast cancers, respectively), and colorectal cancer can be cured by ATT by targeting neoepitopes (25).
Among the assigned sequences of the two cancer cell lines' immunopeptidomes, 1,230 peptides are spliced peptides, which account for 23.6% of the variety of the immunopeptidomes (Fig. 1A; Supplementary Table S5). We could search the immunopeptidome samples without considering cell-specific mutations detected in these two cancer cell lines (21, 22), and not allowing the identification of the most common posttranslational modification (PTM) of nonspliced peptides. In this case, the absolute number of both spliced and nonspliced peptide identifications, as well as the relative frequency of spliced peptides, is increased (Fig. 1A). Including PTMs in the MS data analysis increases the target and therefore the decoy database, which results in more stringent cutoffs for peptide identifications to ensure 1% FDR. As a consequence, fewer peptides are identified. Ignoring PTMs can result in the assignment of MS/MS spectra as spliced peptides, even though the better assignment would be a posttranslationally modified nonspliced peptide. This phenomenon explains the higher frequency of spliced peptides when we do not consider PTMs.
When investigating the spliced peptide quantity, we observed that, on average, each MHC-I–restricted spliced peptide is represented by fewer molecules than a nonspliced peptide. Nevertheless, spliced peptides account for 19.3% and 19.6% of the bulk of peptide molecules detected in the HCT116 and HCC1143 immunopeptidomes, respectively (Fig. 1B).
Aspects of the label-free quantification method applied to estimate the spliced and nonspliced peptide amounts are further described in Supplementary Experimental Procedures and Supplementary Fig. S3A–S3G, where further general features of the spliced peptide pool are also reported.
Validation of the cancer cell immunopeptidome assignment
One of the concerns about spliced peptide identification in the immunopeptidome is the large size of the theoretical spliced peptide database, which might result in false sequence assignments despite the strict FDR of 1% and quality control steps in the SPI-delta pipeline. To test this hypothesis, we carried out two control experiments.
In the first experiment, we generated a proteasome-independent complex peptide mixture by LysC and trypsin degradation of the HCC1143 intracellular proteome. We analyzed this data set following the same protocol applied to the immunopeptidome, thereby using the same size of the spliced peptide database used for the immunopeptidome analysis. The sample has thousands of 9–12-mer peptides and an ion charge distribution similar to the immunopeptidomes (Fig. 1C–E). The sample somewhat mimics the cancer cell immunopeptidome data sets. Nonetheless, only 2.4% of peptides are annotated as spliced peptides (Fig. 1C).
In the second experiment, as a representative example, we considered the technical replicate of the HCT116 immunopeptidome in which we identified the largest number of peptides. We analyzed it using the spliced and nonspliced human proteome databases to match the peptide precursors in the data set (m/z matching). For those spliced peptide precursors that have been matched, we generated two databases, in which the sequence of all N-terminal or C-terminal splice-reactants has been inverted. To note, those semi-inverted databases each are approximately the size of the normal spliced peptide database generated to analyze this data set. The HCT116 immunopeptidome data set was then reanalyzed using the nonspliced proteome database together with either the inverted N-terminal or inverted C-terminal splice-reactant databases. Any assignment of sequences present in the semi-inverted databases is considered as randomly assigned. In parallel, we generated two databases where the sequences of all possible N-terminal or C-terminal portions (of randomly chosen length) of the nonspliced peptides have been inverted. We then searched the same technical replicate of the HCT116 immunopeptidome against the nonspliced proteome database together with the semi-inverted nonspliced database and considered any assignment of sequences present in the semi-inverted databases as randomly assigned. These additional four searches provide us with an estimation, in the same data set used for the MHC-I immunopeptidome identification, of the number of potentially wrongly assigned spliced peptides, depending on the database size. By searching against the semi-inverted spliced peptide databases, we assigned 4.2% C-terminally inverted spliced peptides and 3.6% N-terminally inverted spliced peptides to MS/MS spectra relative to the total number of peptides assigned (Fig. 1F). 
Some of the sequences present in the semi-inverted databases can, however, be cis-spliced peptides with the intervening sequence longer than 25 residues. If we do not consider these latter semi-inverted peptides, 3.0% of C-terminally inverted spliced peptides and 2.5% of N-terminally inverted spliced peptides are assigned. Furthermore, we assigned 0.2% of C-terminally inverted nonspliced peptides and 1.5% of N-terminally inverted nonspliced peptides to MS/MS spectra relative to the total number of peptides assigned (Fig. 1F).
These results show that the number of identified spliced peptides in the cancer immunopeptidomes is higher than that of the negative controls, which supports the conclusion that the spliced peptide identification in the MHC-I immunopeptidome is not an artifact due to the spliced peptide database size or structure.
Comparison of sequence motifs in MHC-I immunopeptidomes
In the MHC-I immunopeptidomes of nontumoral cells, spliced and nonspliced peptides differ in terms of sequence motifs (9). This phenomenon could be due to the fact that PCPS seems to prefer sequence motifs in substrate polypeptides that are not those preferred for the normal peptide-bond hydrolysis (6). After their generation by the proteasome, peptides are subjected to sequential steps of the antigen presentation pathway that select sequence motifs (15). We would therefore expect that sequence motifs of the MHC-I–bound spliced and nonspliced peptides would cluster together because the downstream steps of the antigen presentation pathway are the same. We also expect mild differences in the sequence motifs within the clusters because of the different preferences for substrate sequence motifs of peptide hydrolysis and PCPS reactions.
Accordingly, we applied an in silico unsupervised approach to assign nonspliced peptides to the cancer cell lines' MHC-I variants (Supplementary Table S1). We identified four nonspliced peptide clusters based on their amino acid characteristics (Fig. 2; Supplementary Fig. S4). When we assigned each spliced peptide to the cluster possessing the most similar characteristics, we found that (i) spliced peptides could be clustered similarly to the nonspliced peptides in both cancer cells and significantly differently to randomly assigned sequences, (ii) the resulting clusters show similar cluster statistics (Fig. 2A; Supplementary Fig. S4A), and (iii) the relative distribution of spliced and nonspliced peptides in the different clusters is similar (Fig. 2B; Supplementary Fig. S4B). However, specific sequence differences emerge, despite the common overall characteristics of the grouped spliced and nonspliced peptides (see also Supplementary Experimental Procedures for additional details). In fact, the amino acid frequencies vary slightly between the same cluster of the spliced and nonspliced immunopeptidomes (Fig. 2B; Supplementary Fig. S4B), confirming our initial hypothesis. Within the same cluster, the frequency of the P1 splicing site (see Supplementary Fig. S1 for the nomenclature) does not always correspond to the intensity of the motif differences between spliced and nonspliced peptides. We find this unsurprising as residues around the P1 splicing site seem to influence the PCPS efficiency (8).
PCPS enlarges the antigenic landscape of cancer cell lines
PCPS enlarges the antigenic landscape of cancer cells not only in terms of peptide variety in the immunopeptidomes but also in terms of the number of antigens presented by MHC-I peptides. Indeed, almost 800 antigens identified in the MHC-I immunopeptidome of cancer cell lines are represented only by spliced peptides (Fig. 3A). These antigens could be targets for immunotherapies, if their expression is associated with the tumor or if they contain tumor-specific mutations.
Of the nonspliced and spliced peptides bound to MHC-I molecules, 99% and 92%, respectively, are assigned to antigens detected at the transcriptional level (Fig. 3B; Supplementary Table S6; see also Supplementary Experimental Procedures for additional details). Among the 97 spliced peptides that putatively derive from antigens not detectable in the transcriptome, 18 could derive from antigens detectable in the transcriptome if we allowed intervening sequences longer than 25 residues. Furthermore, all the remaining spliced peptides could also derive from antigens detectable in the transcriptome if we allowed PCPS between different antigens, a phenomenon called trans PCPS (3) that is excluded in our human spliced proteome database.
In terms of mutated antigens, 652 proteins that are identified at the RNA level in the HCT116 cell line (21) carry one or more missense mutations (22). Among them, 76% of the mutated antigens are not detected in the MHC-I immunopeptidome. The other 24%, on the contrary, are represented mainly by either spliced or nonspliced peptides, where the mutated antigens that are represented only by spliced peptides represent 25% of the mutated antigens' pool in the MHC-I immunopeptidome of the HCT116 cell line (Fig. 3C, left). Both spliced and nonspliced peptides more often represent mutated antigens (131 and 51 out of 695 mutated antigens are represented by nonspliced and spliced peptides, respectively, in the merged HCC1143 and HCT116 data set) than nonmutated antigens (2,705 and 830 out of 21,753 nonmutated antigens are represented by nonspliced and spliced peptides, respectively, in the merged HCC1143 and HCT116 data set; spliced peptides OR = 2.0, P = 2.34 × 10−5; nonspliced peptides OR = 1.6, P = 1.77 × 10−6). However, among those mutated antigens that are represented in the immunopeptidomes, only two mutations are actually carried by (nonspliced) peptides that have been identified in our analysis (Fig. 3D and E; Supplementary Table S3). These two neoepitopes, so named even though their recognition by T cells remains to be proved, are CHMP7[A324T]316–325 and RBBP7[N17D]12–20. Both efficiently bind HLA-A*01:01 and HLA-B*18:01 molecules (Supplementary Table S3). Also, they can be generated by the proteasome, as confirmed by in vitro digestions of the corresponding mutated antigenic polypeptide sequences carried out with purified proteasome (Supplementary Fig. S5A and S5B; Supplementary Table S2; see also Supplementary Experimental Procedures). Transcription of both CHMP7 and RBBP7 antigens can be detected, as demonstrated in independent studies (21, 22).
Both antigens are represented in the MHC-I immunopeptidome by other nonspliced peptides, which, however, do not carry the mutations (Fig. 3D and E).
The mutation load of the HCC1143 cell line is, on the contrary, much smaller (21, 22), and none of the mutations are carried by spliced or nonspliced peptides.
The fact that only 0.3% of the missense mutations (2 of 695) are represented in the cancer cell lines' MHC-I immunopeptidomes confirms that tumor-specific epitopes are rare. We speculate that their identification will also be facilitated by searching for spliced peptides, even though no tumor-specific spliced peptides have been detected in these cancer cell line MHC-I immunopeptidomes (possibly due to limited sample size).
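As a check on the arithmetic, the odds ratios quoted for the merged HCC1143 and HCT116 data set can be recomputed from the reported counts. The following is a minimal sketch, not the authors' analysis code: the function name `odds_ratio` is hypothetical, and the P values are not recomputed because the statistical test used is not named in this section.

```python
def odds_ratio(hits_a, total_a, hits_b, total_b):
    """Odds of being represented in group A divided by the odds in group B."""
    odds_a = hits_a / (total_a - hits_a)
    odds_b = hits_b / (total_b - hits_b)
    return odds_a / odds_b


# Counts as reported in the text for the merged HCC1143 and HCT116 data set.
MUTATED = 695
NONMUTATED = 21753

# Mutated versus nonmutated antigens represented by nonspliced peptides.
or_nonspliced = odds_ratio(131, MUTATED, 2705, NONMUTATED)
# Mutated versus nonmutated antigens represented by spliced peptides.
or_spliced = odds_ratio(51, MUTATED, 830, NONMUTATED)

print(round(or_nonspliced, 1), round(or_spliced, 1))  # prints: 1.6 2.0
```

Both values agree with the odds ratios given in the text (1.6 for nonspliced and 2.0 for spliced peptides).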
Common features of antigens in the cancer MHC-I immunopeptidomes
Regarding the antigen features in the cancer cell lines' MHC-I immunopeptidomes, the number of spliced peptides per antigen correlates with both antigen length (Fig. 4A) and intracellular abundance (Fig. 4B), as also shown for nonspliced peptides (Fig. 4A and B) and by others (19, 26, 27). Furthermore, the MHC-I sampling probability (D), which considers the antigen length and indicates the likelihood of an antigen to be represented by a peptide at the cell surface, increases for both spliced and nonspliced peptides with increasing antigen abundance (Fig. 4C and D and ref. 19). For those antigens that are represented by both spliced and nonspliced peptides, the spliced peptides' D correlates with the nonspliced peptides' D (Fig. 4E).
As also observed for nonspliced peptides, the likelihood of an antigen being represented in these cancer cell lines by MHC-I–spliced peptide complexes is inversely correlated with the antigen half-life. This correlation emerged from computing the fold overrepresentation (D/D′), where D′ is the expected sampling probability, i.e., the average sampling probability of all antigens with the same abundance (see Supplementary Experimental Procedures for details). Indeed, the D/D′ ratio inversely correlates with the antigen half-life, independently of the half-life database used in the analysis (Fig. 4F).
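The fold overrepresentation can be illustrated with a short sketch. The exact definition of D is given only in the Supplementary Experimental Procedures, so the code below makes two loudly labeled assumptions: D is approximated as the number of peptides per residue of antigen, and "same abundance" is approximated by a precomputed abundance bin. The function name and the field names (`n_peptides`, `length`, `abundance_bin`) are hypothetical.

```python
from collections import defaultdict
from statistics import mean


def fold_overrepresentation(antigens):
    """Return D/D' per antigen.

    ASSUMPTION: D = peptides per residue; D' = mean D within an abundance bin.
    `antigens` is a list of dicts with keys
    'name', 'n_peptides', 'length', and 'abundance_bin'.
    """
    d = {a["name"]: a["n_peptides"] / a["length"] for a in antigens}

    # Group the D values of antigens that fall in the same abundance bin.
    by_bin = defaultdict(list)
    for a in antigens:
        by_bin[a["abundance_bin"]].append(d[a["name"]])
    expected = {b: mean(values) for b, values in by_bin.items()}

    ratios = {}
    for a in antigens:
        d_prime = expected[a["abundance_bin"]]
        # The ratio is undefined when no antigen in the bin is sampled.
        ratios[a["name"]] = d[a["name"]] / d_prime if d_prime > 0 else float("nan")
    return ratios


# Toy data, invented for illustration only.
toy = [
    {"name": "A", "n_peptides": 2, "length": 100, "abundance_bin": 1},
    {"name": "B", "n_peptides": 1, "length": 100, "abundance_bin": 1},
    {"name": "C", "n_peptides": 3, "length": 300, "abundance_bin": 2},
]
ratios = fold_overrepresentation(toy)
```

In this toy input, antigen A is sampled more often than other antigens of the same abundance (ratio above 1), antigen B less often (ratio below 1), and antigen C exactly as expected (ratio of 1). Correlating such ratios with antigen half-life would then correspond to the analysis described above.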
Antigen features favoring antigen coverage by spliced peptides
To identify antigenic features that result in efficient spliced peptide presentation independently of the cell types studied, we generated an extended immunopeptidome data set by combining the data sets derived from the two cancer cell lines with that derived from the EBV-transformed lymphoblastoid cell line GR-LCL, which was generated by adopting a prefractioning of the peptide elution (2D strategy). The latter data set is the most informative we have, because of the large number of identified peptides and the validation of the identifications by comparison with synthetic peptides (9). We reanalyzed the GR-LCL 2D immunopeptidome data set by applying SPI-delta. As expected, we identified a smaller number of spliced and nonspliced peptides (Supplementary Table S5), thereby confirming that SPI-delta is more stringent than the previous version (9) and results in larger numbers of nonannotated MS/MS spectra.
From the extended MHC-I immunopeptidome data set, 1,096 antigens (almost 19% of all detected antigens) were identified that are represented by only spliced peptides (Fig. 5A). This extended immunopeptidome accounts for 11,655 unique peptides, of which 9,372 are nonspliced and 2,283 are spliced peptides, the latter representing around 20% of the whole immunopeptidome variety. From this extended immunopeptidome data set, we see that the presence of spliced peptides increases not only the number of antigens presented by MHC-I peptides but also the number of peptides presented per antigen (Fig. 5A and B).
With this extended data set, derived from 16 different MHC-I haplotypes (Supplementary Table S1), we can study the spatial distribution of spliced and nonspliced peptides by using a sliding window approach across the proteome and counting the number of observed peptides in each window (see Supplementary Experimental Procedures for details and Supplementary Fig. S6). We observed that spliced peptides cover a similarly small fraction of the represented antigens as nonspliced peptides (Fig. 5C), although the presentation of both spliced and nonspliced peptides broadens the antigen coverage and the number of MHC-I–bound peptides per window (Fig. 5C and D). Furthermore, spliced peptides, like nonspliced peptides, cluster together in specific regions of the antigen, i.e., in “hotspots” (Fig. 5E). The observation that the coverage of spliced and nonspliced peptides together is smaller than the sum of the coverages of the two peptide types taken individually (14% instead of 9.7% + 7.2% = 16.9%; Fig. 5D) indicates that they could be locally clustered together. Accordingly, the distances between spliced and nonspliced peptides are significantly smaller than the distances between randomly placed peptides (Fig. 5E and further details in Supplementary Experimental Procedures), hinting toward the existence of local antigenic regions prone to be represented by both spliced and nonspliced peptides.
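The coverage comparison follows from inclusion-exclusion: if the two peptide types cover 9.7% and 7.2% individually and 14% jointly, then 9.7 + 7.2 − 14 = 2.9 percentage points are covered by both. A minimal sketch of a sliding window count and of a coverage calculation is given below. It is an illustration under stated assumptions, not the pipeline described in the Supplementary Experimental Procedures: peptides are reduced to hypothetical (start, end) spans in 1-based antigen coordinates, each fragment of a spliced peptide is treated as its own span, and the window size and step are arbitrary.

```python
def covered_positions(spans):
    """Set of 1-based residue positions covered by inclusive (start, end) spans."""
    covered = set()
    for start, end in spans:
        covered.update(range(start, end + 1))
    return covered


def window_counts(spans, length, window, step):
    """Number of spans overlapping each window slid along an antigen."""
    counts = []
    for w_start in range(1, length - window + 2, step):
        w_end = w_start + window - 1
        counts.append(sum(1 for s, e in spans if s <= w_end and e >= w_start))
    return counts


# Toy antigen of 100 residues; all spans are invented for illustration.
LENGTH = 100
nonspliced = [(10, 18), (14, 22)]
spliced_fragments = [(16, 20), (60, 64)]  # two fragments of one spliced peptide

cov_ns = len(covered_positions(nonspliced)) / LENGTH
cov_sp = len(covered_positions(spliced_fragments)) / LENGTH
cov_both = len(covered_positions(nonspliced + spliced_fragments)) / LENGTH
counts = window_counts(nonspliced + spliced_fragments, LENGTH, window=20, step=20)

print(cov_ns, cov_sp, cov_both)  # prints: 0.13 0.1 0.18
print(counts)                    # prints: [3, 1, 1, 1, 0]
```

In the toy example the joint coverage (0.18) is smaller than the sum of the individual coverages (0.23) because one spliced fragment lies inside the region already covered by nonspliced peptides, which is the same signature of local clustering discussed above.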
The question, however, remains: why are some antigens represented exclusively by only spliced or only nonspliced peptides, and what characteristics differentiate them?
Antigens represented only by nonspliced peptides, for example, are significantly shorter than those represented only by spliced peptides or by both. More hydrophobic antigens are preferentially represented by spliced peptides rather than by nonspliced peptides. Antigens represented only by nonspliced peptides show decreasing hydrophobicity with increasing antigen length. Conversely, antigens represented only by spliced peptides generally have higher hydrophobicity than those represented only by nonspliced peptides, independently of their length, and show decreasing hydrophobicity with increasing length, although only up to a length of approximately 1,000 residues (Fig. 6A).
The average isoelectric point (IP) of antigens represented only by spliced peptides does not differ from that of antigens represented only by nonspliced peptides or by both types of peptides (Fig. 6B). However, clear differences emerge when considering the whole trimodal IP distribution (for which the average is not representative) and computing the so-called IP bias (Fig. 6C and D; see Supplementary Experimental Procedures for the detailed analysis). This latter analysis suggests that spliced peptides are generated more efficiently from basic antigens than from acidic antigens.
In summary, length, hydrophobicity, and IP of an antigen are parameters that can determine whether an antigen is represented by MHC-I–spliced peptide complexes or not.
The two antigens from which we have identified nonspliced peptides carrying a tumor-specific mutation, CHMP7 and RBBP7, have characteristics that favor their representation through nonspliced peptides only. They are relatively short (453 and 425 amino acids, respectively), are rather hydrophilic (hydrophobicity index of −0.48 and −0.53, respectively) and acidic (IP of 4.99 and 4.68, respectively), all characteristics disfavoring the representation by spliced peptides. Indeed, we did not detect any spliced peptide representing these two antigens.
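The hydrophobicity scale behind the index values is not named in this section. As an illustration only, the sketch below assumes a GRAVY-style index, i.e., the mean Kyte-Doolittle hydropathy over all residues, which yields negative values for hydrophilic proteins as in the figures quoted above. The function name `gravy` and the toy sequences are hypothetical, and the sketch is not expected to reproduce the reported values unless the same scale was used.

```python
# Kyte-Doolittle hydropathy values (ASSUMED scale; not stated in the text).
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def gravy(sequence):
    """Mean Kyte-Doolittle hydropathy of a protein sequence.

    Raises ValueError for an empty sequence and KeyError for residues
    outside the 20 standard amino acids.
    """
    if not sequence:
        raise ValueError("sequence must not be empty")
    values = [KYTE_DOOLITTLE[aa] for aa in sequence.upper()]
    return sum(values) / len(values)


# Toy sequences, invented for illustration.
hydrophobic = gravy("AAAA")  # all alanine
hydrophilic = gravy("RK")    # arginine and lysine
```

Under this assumption, a positive index indicates a predominantly hydrophobic sequence and a negative index a predominantly hydrophilic one, so antigens such as CHMP7 and RBBP7 with negative indices would fall on the hydrophilic side.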
Discussion
Despite the limited knowledge about PCPS, identification of CD8+ T cells specific for spliced epitopes and able to reduce tumor growth (13, 14) hints at the value of spliced epitopes as targets for anticancer ATTs. This hypothesis is now supported by our demonstration here that in the breast and colon cancer cell lines, around 20% of the MHC-I immunopeptidome variety and quantity seem to be represented by spliced peptides.
This estimation depends on the technique and the statistics adopted. Indeed, the identification of spliced peptides in the immunopeptidome presents technical issues that need to be considered (15). To tackle the technical implications of the large spliced peptide database of the human proteome, we developed and applied in this study an amended identification pipeline (SPI-delta) that is stringent in terms of identification confidence. For example, from the analysis of the same four MS replicates of the MHC-I immunopeptidome of the HCT116, Bassani-Sternberg and colleagues (19) identified five nonspliced neoepitopes, whereas only two of them passed our pipeline. Our results show that the increase in stringency in our identification strategy has resulted in less sensitivity for the identification of nonspliced and spliced peptides.
Another aspect to consider is that the success in peptide identification correlates, for both spliced and nonspliced peptides, with the number of replicates analyzed. This correlation is stronger for spliced peptides, possibly because spliced peptides were less common in the cancer cell line immunopeptidomes than nonspliced peptides, as we previously observed in noncancer immunopeptidomes (9).
Despite our stringent pipeline, our two control experiments indicate that we still have an experimental FDR for spliced peptide identification of about 2% to 4%, and for nonspliced peptides of about 1%. However, none of our controls are completely free of spliced peptides. Indeed, LysC and trypsin can also catalyze peptide splicing (3, 28). Furthermore, several of the peptides assigned as spliced peptides using the inverted splice-reactant database could be the outcome of a PCPS reaction between noncontiguous peptides either with intervening sequences longer than 25 residues or derived from distinct antigens (i.e., trans-spliced peptides). Thus, we cannot exclude the possibility that some of the peptides assigned as spliced peptides in our control experiments are genuine spliced peptides rather than false identifications.
The same concept is applicable to the group of spliced peptides that, according to our mapping, are derived from antigens not detected as transcripts: several of them could be derived from antigens detected in the cancer cell transcriptome if we allowed intervening sequences longer than 25 residues and all of them could be the product of trans PCPS involving two antigens detected in the cancer cell transcriptome.
The exclusion of spliced peptides derived from either trans PCPS or cis PCPS with long intervening sequences, which was based on a preliminary study on one spliced epitope (29), could be misleading in the analysis of the entire MHC-I–spliced immunopeptidome, as suggested by Faridi and colleagues (30). For example, we do not observe a correlation between the number of unique spliced peptides and their intervening sequence length. This restriction could be overcome by adopting a de novo sequencing strategy in the identification pipeline, for instance, as has been done by Faridi and colleagues (30).
The general picture that emerges from our study is that a large portion of the MHC-I immunopeptidome is populated by spliced peptides in cancer cell lines. Particularly relevant for anticancer immunotherapy could be the fact that PCPS allows representation of antigens that otherwise would be overlooked. For example, around one fourth of the mutated antigens represented in the immunopeptidome of the colon cancer cell line are represented only by spliced peptides, which could be relevant when searching for target neoepitopes and neoantigens suitable for ATTs. The antigens that are represented at the cell surface by spliced peptides are preferentially long, hydrophobic, and basic. Why those antigens have those characteristics warrants further investigation. However, we speculate that short antigens are less likely to produce spliced peptides simply due to fewer combinatorial possibilities. Furthermore, spliced peptides can be produced more easily if more hydrophobic amino acid residues are present in the antigen (8), because the C-terminal splice-reactant competes with a molecule of water for the nucleophilic attack on the acyl-enzyme intermediate (4). On the other hand, for nonspliced peptides, less hydrophobic antigens are preferred, because the interaction with water molecules is needed for proteasomal peptide-bond hydrolysis. This means that the longer an antigen is, the more hydrophilic residues it needs to produce mainly nonspliced peptides. Likewise, the longer an antigen is, the higher the chance that it generates spliced peptides, so that its hydrophobicity can be progressively lower. There is, however, a hydrophobicity threshold, which is on average around 0.48 (hydrophobicity index) in our data set, below which antigen hydrophobicity would favor the production of nonspliced peptides. Therefore, once the antigens represented only by spliced peptides reach that threshold, they start to increase their average hydrophobicity.
In weighing the pros and cons of targeting spliced epitopes by anticancer ATTs, we should consider that spliced peptides cluster similarly to nonspliced peptides with respect to amino acid characteristics. Some mild differences in the spliced and nonspliced sequence motifs are detectable in the cancer immunopeptidomes. On the one hand, this result supports the hypothesis that spliced peptides follow the same antigen presentation pathway as nonspliced peptides and are selected by their affinity to the MHC-I cleft. On the other hand, it underlines that spliced peptides are different from nonspliced peptides. Indeed, PCPS is a very different process from peptide hydrolysis and seems to follow different rules and to be driven by different factors (3, 6, 8). Only by understanding in detail those factors and dynamics can we predict spliced peptide generation and streamline our efforts by targeting those spliced (neo)epitope candidates that are most likely to be efficiently produced and presented.
Disclosure of Potential Conflicts of Interest
No potential conflicts of interest were disclosed.
Authors' Contributions
Conception and design: J. Liepe, M. Mishto
Development of methodology: J. Liepe, J. Sidney, M. Mishto
Acquisition of data (provided animals, acquired and managed patients, provided facilities, etc.): J. Liepe, F.K.M. Lorenz, A. Sette, M. Mishto
Analysis and interpretation of data (e.g., statistical analysis, biostatistics, computational analysis): J. Liepe, J. Sidney, F.K.M. Lorenz, M. Mishto
Writing, review, and/or revision of the manuscript: J. Liepe, J. Sidney, F.K.M. Lorenz, A. Sette, M. Mishto
Administrative, technical, or material support (i.e., reporting or organizing data, constructing databases): J. Liepe, M. Mishto
Study supervision: J. Liepe, J. Sidney, M. Mishto
Acknowledgments
We thank K. Textoris-Taube and the Shared Facility Mass Spectrometry of the Charité for support in data acquisition and P. Henklein and the Peptide Synthesis Facility of the Charité for peptide synthesis. We thank D. Muharemagic for proofreading the manuscript and Prof. L. Smith for the useful discussion, which helped us develop SPI-delta. The study has in part been supported by NIH to A. Sette (R21AI134127), and by Cancer Research UK King's Health Partners Centre at King's College London (Development Fund 2018) to M. Mishto. The experiments reported in Supplementary Fig. S5 were performed by M. Mishto while he was appointed at Charité, Universitätsmedizin Berlin; his contract was financially supported by the Berlin Institute of Health grant awarded to P.M. Kloetzel (BIH, CRG1-TP1).
The costs of publication of this article were defrayed in part by the payment of page charges. This article must therefore be hereby marked advertisement in accordance with 18 U.S.C. Section 1734 solely to indicate this fact.