Immune repertoire deep sequencing allows comprehensive characterization of antigen receptor–encoding genes in a lymphocyte population. We hypothesized that this method could enable a novel approach to diagnose disease by identifying antigen receptor sequence patterns associated with clinical phenotypes. In this study, we developed statistical classifiers of T-cell receptor (TCR) repertoires that distinguish tumor tissue from patient-matched healthy tissue of the same organ. The basis of both classifiers was a biophysicochemical motif in the complementarity determining region 3 (CDR3) of TCRβ chains. To develop each classifier, we extracted 4-mers from every TCRβ CDR3 and represented each 4-mer using biophysicochemical features of its amino acid sequence combined with quantification of 4-mer (or receptor) abundance. This representation was scored using a logistic regression model. Unlike typical logistic regression, the classifier is fitted and validated under the requirement that at least 1 positively labeled 4-mer appears in every tumor repertoire and no positively labeled 4-mers appear in healthy tissue repertoires. We applied our method to publicly available data in which tumor and adjacent healthy tissue were collected from each patient. Using a patient-holdout cross-validation, our method achieved classification accuracy of 93% and 94% for colorectal and breast cancer, respectively. The parameter values for each classifier revealed distinct biophysicochemical properties for tumor-associated 4-mers within each cancer type. We propose that such motifs might be used to develop novel immune-based cancer screening assays.

Significance:

This study presents a novel computational approach to identify T-cell repertoire differences between normal and tumor tissue.

See related commentary by Zoete and Coukos, p. 1299

The immune system actively responds to solid tumors, resulting in tumor-infiltrating lymphocytes (TIL). Natural immune control is often unsuccessful, however, because the tumor microenvironment contains a mix of immune-activating and immune-suppressing signals (1). But given the right environmental cues, cytotoxic T lymphocytes in the tumor have the capacity to mediate tumor cell killing in virtue of bearing antigen receptors, T-cell receptors (TCR), with specificity for tumor-associated antigens (1, 2). Although there is tremendous heterogeneity between patients' antigen landscapes due to patient-specific tumor neoantigens, there is also overlap (2, 3). Therefore, we reasoned that patients with the same cancer type or subtype may have cytotoxic T-cell responses against a common set of antigens. Indeed, there is evidence for shared immunoreactivity, as well as for shared TCR sequences (4–10). We further reasoned that if these T-cell responses could be detected, particularly early in the disease course, they could serve as an important addition to the suite of methods under development for the early detection of cancer. As a first step in this direction, we designed this study to determine whether antitumor T-cell responses have a cancer-specific signature that can reliably distinguish cancer-associated repertoires from those associated with healthy tissue of the same organ.

We leveraged publicly available TCR deep sequencing data and the Multiple Instance Learning (MIL) machine-learning framework. The genes encoding TCRs are somatically generated through a process that creates essentially unique gene sequences at the relevant loci (11). This results in a tremendously diverse TCR repertoire, in which each TCR has its own distinct profile of antigens it can bind. Immune repertoire deep sequencing has made it possible to comprehensively profile the TCRs of a lymphocyte population and has been widely applied to TILs (12). The technology has enabled novel approaches for diagnosing and prognosticating diseases with a driving immune component by identifying repertoire patterns associated with clinical phenotypes. Most studies have been purely descriptive and looked for shared amino acid sequences among patients with a common phenotype (7, 8), looked for clusters of sequences overrepresented in one phenotype relative to another (13), or compared repertoire-level summary statistics, such as diversity, between phenotypes (reviewed in refs. 12, 14, 15). In the last case, these features have prognostic value for some cancers and therapies (16–18). We are aware of only a handful of studies developing predictive models (19–23). With the exception of our study in multiple sclerosis (19), the studies were in the context of distinguishing infected from uninfected individuals or immunized from unimmunized or adjuvant-only immunized individuals, where large mono- or oligoclonal lymphocyte expansion is expected.

Our approach relies on MIL, which provides a rigorous, established framework for relating immune repertoires to phenotypes. Most of the individual receptors in a person are unrelated to any specific phenotype and exist to maintain a diverse set of specificities as a contingency against any possible antigen. Only a small number of receptors are relevant to any specific phenotype. Mapping pertinent receptors in a repertoire to a single phenotype label can formally be described as MIL, which treats problems as bags of instances where the bags are labeled but the instances are not (24). The receptors from a single repertoire can be thought of as the instances, the repertoire as the bag, and the phenotype as the label. The goal is to predict the phenotype from the receptors.

In this study, we applied MIL to publicly available TCR deep sequencing data from tumor and healthy tissue from patients with colorectal or breast cancer (25, 26). The locus encoding the TCR β chain (TCRβ) was sequenced in all samples. In order to capture features of the TCRβs' antigen-binding capabilities, we represented each sequence using biophysicochemical features. We represented the somatically generated portion of each gene, which is the primary determinant of the antigen-binding specificity encoded by the gene. We then developed a statistical classifier for each cancer type and obtained classification accuracies by leave-one-out cross-validation of 93% and 94% for colorectal and breast cancer, respectively. Permutation analyses resulted in classification accuracies of 49% for both datasets. These results demonstrate distinct, biophysicochemical motifs in the TCRβ sequences of TILs that are specific to the cancer type and reliably distinguish cancer-associated repertoires from those associated with healthy tissue of the same organ.

Datasets

We used publicly available TCRβ deep sequencing data from 14 colorectal cancer patients (Table 1; ref. 26) and 16 breast cancer patients (Table 2; ref. 25). In both original studies, tumor and adjacent healthy tissue was biopsied from each patient, and genomic DNA was extracted and sent to Adaptive Biotechnologies for sequencing using a proprietary technology that provides accurate measurements of each receptor sequence's abundance (27). The sequences can be downloaded from the immuneACCESS database (https://doi.org/10.21417/B7PP46; https://doi.org/10.21417/B7NW5B; ref. 28).

Table 1.

Microsatellite instability status and the number of unique TCRβ CDR3 sequences from the tumor and healthy tissue samples are shown for the 14 colorectal cancer patients

Colorectal samples from ref. 26

| Patient # (patient ID) | MSI status | Unique TCRβs (tumor) | Unique TCRβs (healthy) |
|---|---|---|---|
| 1 (400464) | MSS | 1,836 | 1,466 |
| 2 (400480) | MSS | 2,432 | 1,773 |
| 3 (400488) | MSI-H | 2,090 | 699 |
| 4 (400600) | MSS | 862 | 984 |
| 5 (400712) | MSS | 203 | 667 |
| 6 (400728) | MSS | 41 | 1,110 |
| 7 (401144) | MSS | 1,390 | 1,040 |
| 8 (401176) | MSS | 723 | 883 |
| 9 (401248) | MSS | 391 | 1,844 |
| 10 (401256) | MSS | 1,711 | 1,068 |
| 11 (401264) | MSS | 3,849 | 910 |
| 12 (401304) | MSS | 1,659 | 1,612 |
| 13 (401320) | MSS | 2,933 | 1,667 |
| 14 (401336) | MSI-H | 988 | 1,228 |

Abbreviations: MSI-H, microsatellite instability detected at two or more markers; MSS, no microsatellite instabilities detected.

Table 2.

Breast cancer type, receptor status, and the number of unique TCRβ CDR3 sequences from the tumor and healthy tissue samples are shown for the 16 breast cancer patients

Breast samples from ref. 25

| Patient # (patient ID) | Type | ER/PR/HER2 | Unique TCRβs (tumor) | Unique TCRβs (healthy) |
|---|---|---|---|---|
| 1 (BR01) | IDC | +/+/− | 50,667 | 18,848 |
| 2 (BR05) | IDC | +/+/− | 21,559 | 7,923 |
| 3 (BR07) | IDC | +/+/− | 22,345 | 12,334 |
| 4 (BR13) | IDC | +/+/− | 8,276 | 2,609 |
| 5 (BR14) | ILC | +/+/− | 34,203 | 5,577 |
| 6 (BR15) | IDC | +/+/− | 16,341 | 3,316 |
| 7 (BR16) | IDC | +/+/− | 8,237 | 22,483 |
| 8 (BR17) | IDC | +/+/− | 8,686 | 7,748 |
| 9 (BR18) | IDC | +/+/− | 5,324 | 812 |
| 10 (BR19) | ILC | +/+/− | 8,571 | 8,865 |
| 11 (BR20) | ILC | +/+/− | 15,956 | 13,611 |
| 12 (BR21) | IDC | +/+/− | 18,597 | 10,593 |
| 13 (BR22) | IMC | +/+/− | 51,097 | 22,774 |
| 14 (BR24) | IDC | −/−/− | 45,953 | 10,903 |
| 15 (BR25) | IDC | −/−/+ | 16,004 | 4,276 |
| 16 (BR26) | ILC | +/+/− | 6,250 | 3,397 |

Abbreviations: ER, estrogen receptor; IDC, invasive ductal carcinoma; ILC, invasive lobular carcinoma; IMC, invasive mucinous carcinoma; PR, progesterone receptor.

Representing TCRs

We utilized a representation of TCRβ sequence that captures features relevant to its antigen-binding capabilities. We focused on complementarity determining region 3 (CDR3), because it is the somatically generated portion of the gene and the primary determinant of antigen-binding specificity. CDR3 residues that directly contact peptide in a peptide–MHC complex are expected to make the largest contribution to a TCR's antigen-binding specificity. To determine which TCRβ CDR3 residues contact peptide, we analyzed X-ray crystallographic structures of human TCRs bound to peptide–MHC complex (Fig. 1A) obtained from the Protein Data Bank (29). We extracted the TCRβ CDR3 sequence from each structure. After removing duplicates, 55 were left for analysis (Supplementary Table S1 and Supplementary Fig. S1). We annotated each TCRβ CDR3 residue as being in contact with peptide or not being in contact with peptide. Being in contact was defined as being ≤ 5Å from a peptide residue. We used the annotations to perform a multiple sequence alignment, aligning contact positions using clustalw (http://www.genome.jp/tools-bin/clustalw; Fig. 1B). The alignment revealed that TCRβ CDR3 residues in contact with peptide tend to lie adjacent to each other, forming a contiguous strip (Fig. 1A and B). The size and relative location of this strip varied, but the average length was four, and it rarely included any of the first or last three TCRβ CDR3 residues (Fig. 1B). For 23 of the structures, the strip was longer than four, and for 33, the strip was shorter than four. In ∼1/3 of the structures, an additional one or two residues were also in contact. To represent each TCRβ CDR3, we excluded the first and last three residues and partitioned the remaining sequence into every possible contiguous strip of four amino acids (4-mer; Fig. 1C). Thus, every CDR3 was represented by multiple 4-mers. 
Based on the alignment, we expect 4-mers to include most of the contact residues for the vast majority of TCRβ CDR3 sequences. Our expectation is that, for each TCRβ CDR3, at least one of its 4-mers contacts the peptide component of the receptor's cognate antigen.
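The partitioning step described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code; the function name `extract_4mers` and its parameters `k` and `trim` are hypothetical. The example CDR3 is one that appears in the Figure 3 caption.

```python
def extract_4mers(cdr3, k=4, trim=3):
    """Return every contiguous k-mer of a CDR3 after dropping `trim`
    residues from each end (hypothetical helper, for illustration)."""
    core = cdr3[trim:len(cdr3) - trim]
    # A core shorter than k yields an empty list.
    return [core[i:i + k] for i in range(len(core) - k + 1)]


kmers = extract_4mers("CASSMYREVEAFF")
print(kmers)  # ['SMYR', 'MYRE', 'YREV', 'REVE']
```

A 13-residue CDR3 leaves a 7-residue core after trimming, which gives four overlapping 4-mers. CDR3 sequences of nine residues or fewer contribute no 4-mers under this scheme.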

Figure 1.

A, X-ray crystallographic structure of a human TCRβ chain (gray) bound to a peptide (blue) in complex with MHC (not shown). The CDR3 is shown in green, and the portion of the CDR3 in direct contact (≤5Å) with the peptide is shown in red. The MHC complex and α-chain are omitted for clarity. B, CDR3 sequences extracted from 55 X-ray crystallographic structures of human TCRs bound to peptide:MHC. Residues ≤5Å from peptide (red) are used to align the sequences. The alignment was created using clustalw. The bar chart shows the proportion of structures in which the corresponding CDR3 position was in direct contact with peptide. C, To profile the specificity of a CDR3 sequence, the CDR3 is cut into every possible 4-mer excluding the first and last three residues. D, Each 4-mer is converted into a biophysicochemical representation. For each residue, there are 5 Atchley factor values describing the residue's biophysicochemical properties, resulting in a 4-mer representation consisting of 20 numeric values.

To identify 4-mers with different amino acid sequences but similar antigen-binding capabilities, we represented each 4-mer using numerical values for the biophysicochemical properties of its component amino acids. There are currently at least 566 amino acid indices one could choose from (https://www.genome.jp/aaindex/). Many are highly correlated and contain redundant information. At least two efforts have applied dimensionality reduction to large numbers of amino acid indices to derive small numbers of orthogonal properties (factors) that maintain most of the information contained in the original set. Kidera and colleagues derived 10 factors from 188 amino acid indices (30), and Atchley and colleagues derived 5 factors from 494 amino acid indices (31). We used Atchley factors, as they were derived from the largest number of indices and would require half as many model parameters as the Kidera factors. The five Atchley factors correspond loosely to polarity, secondary structure, molecular size/volume, codon diversity, and electrostatic charge. For input into our model, each amino acid in a 4-mer is represented by a vector of its five Atchley factor values (Fig. 1D).
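As a rough sketch of this representation, the snippet below concatenates a five-value vector per residue into a 20-value feature vector. The function `encode_4mer`, the residue-by-residue ordering of the features, and the lookup table `toy_table` are assumptions made for illustration. The numbers in `toy_table` are placeholders and are not the published Atchley factor values.

```python
def encode_4mer(kmer, factor_table):
    """Concatenate per-residue factor vectors: 4 residues x 5 factors = 20 values."""
    if len(kmer) != 4:
        raise ValueError("expected a 4-mer")
    features = []
    for residue in kmer:
        features.extend(factor_table[residue])
    return features


# PLACEHOLDER numbers for illustration only; these are NOT the published
# Atchley factor values, which should be loaded from the original reference.
toy_table = {
    "M": [0.1, 0.2, 0.3, 0.4, 0.5],
    "Y": [-0.1, -0.2, -0.3, -0.4, -0.5],
    "R": [1.0, 0.0, -1.0, 0.0, 1.0],
    "E": [0.5, 0.5, 0.5, 0.5, 0.5],
}

vector = encode_4mer("MYRE", toy_table)
print(len(vector))  # 20
```

Because the encoding depends only on residue properties, two 4-mers with different letters but similar factor values end up close together in feature space, which is the point of the representation.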

T cells undergo clonal expansion in response to antigen stimulation. Thus, receptor quantity is an important feature for our statistical classifier. We used the logarithm of the 4-mers' relative abundances as a feature in the model. We considered two approaches. The first we refer to as calculating “the 4-mer relative abundance.” First, we identify every TCRβ sequence containing the 4-mer in its CDR3 and sum over their template counts, $C^{\mathrm{TCR}\beta}$. (We treat TCRβ sequences with identical CDR3 sequences as being the same TCRβ sequence, ignoring differences upstream of CDR3.) This provides the 4-mer count, $C^{\mathrm{4mer}}$, for the sample. We then divide by the total count of all 4-mers in the sample, $T^{\mathrm{4mer}}$, to get the 4-mer relative abundance, $RA$:

$$RA = \frac{C^{\mathrm{4mer}}}{T^{\mathrm{4mer}}} \qquad \text{(Eq. A)}$$

The second approach is to consider only the most abundant TCRβ sequence containing the 4-mer in its CDR3. This we refer to as calculating “the TCRβ relative abundance.” First, we sum over the TCRβ template counts, $C^{\mathrm{TCR}\beta}$, of every TCRβ in a sample to get the total count, $T^{\mathrm{TCR}\beta}$. We then divide $C^{\mathrm{TCR}\beta}$ for the most abundant TCRβ containing the 4-mer by $T^{\mathrm{TCR}\beta}$ to get the relative abundance, $RA^{\mathrm{TCR}\beta}$, ignoring all less abundant TCRβs containing the 4-mer:

$$RA^{\mathrm{TCR}\beta} = \frac{C^{\mathrm{TCR}\beta}}{T^{\mathrm{TCR}\beta}} \qquad \text{(Eq. B)}$$

It is unclear a priori which approach is better, so we assessed the performance of our classifiers using both.
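The two abundance calculations can be compared side by side with a small sketch. The helper names, the template counts, and two details are assumptions not stated in the text: each CDR3 is counted once per distinct 4-mer it contains, and the total 4-mer count is taken as the sum of all per-4-mer counts.

```python
from collections import defaultdict


def four_mers(cdr3, k=4, trim=3):
    core = cdr3[trim:len(cdr3) - trim]
    return [core[i:i + k] for i in range(len(core) - k + 1)]


def relative_abundances(template_counts):
    """template_counts maps CDR3 sequence -> template count for one sample.

    Returns two dicts keyed by 4-mer: the 4-mer relative abundance and
    the TCRbeta relative abundance.
    """
    count_4mer = defaultdict(int)  # summed counts of all TCRbeta containing the 4-mer
    top_tcr = defaultdict(int)     # count of the most abundant TCRbeta containing it
    for cdr3, count in template_counts.items():
        for kmer in set(four_mers(cdr3)):  # assumption: one contribution per CDR3
            count_4mer[kmer] += count
            top_tcr[kmer] = max(top_tcr[kmer], count)
    total_4mer = sum(count_4mer.values())      # assumption: sum of all 4-mer counts
    total_tcr = sum(template_counts.values())  # total TCRbeta template count
    ra_4mer = {k: c / total_4mer for k, c in count_4mer.items()}
    ra_tcr = {k: c / total_tcr for k, c in top_tcr.items()}
    return ra_4mer, ra_tcr


# Hypothetical template counts, chosen only to make the arithmetic easy to follow.
sample = {"CASSMYREVEAFF": 6, "CASSRERFYEQYF": 2, "CASSAMYREEAFF": 2}
ra_4mer, ra_tcr = relative_abundances(sample)
print(ra_4mer["MYRE"], ra_tcr["MYRE"])
```

In this toy sample, MYRE occurs in two receptors with counts 6 and 2, so its 4-mer relative abundance is 8/40 = 0.2, while its TCRβ relative abundance uses only the larger clone and is 6/10 = 0.6.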

It is important to normalize the features of a classifier to be on the same scale. We normalized the Atchley factor values so that each has zero mean and unit variance. It was unclear a priori whether it would be appropriate to normalize the 4-mer abundance term, because its values are potentially unbounded. Therefore, we assessed classifier performance with and without normalizing this term.
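One way to standardize the factor values is sketched below. The text does not say over which set the mean and variance are computed; this sketch assumes they are taken across the residues in the lookup table and uses the population standard deviation. The function name and the toy numbers are hypothetical.

```python
import statistics


def zscore_columns(table):
    """table maps residue -> list of factor values.

    Standardizes each factor (column) across residues to zero mean and
    unit variance. Assumes no column is constant.
    """
    residues = list(table)
    n_factors = len(table[residues[0]])
    out = {r: [] for r in residues}
    for j in range(n_factors):
        column = [table[r][j] for r in residues]
        mean = statistics.fmean(column)
        sd = statistics.pstdev(column)
        for r in residues:
            out[r].append((table[r][j] - mean) / sd)
    return out


# Placeholder values, not real amino acid factors.
toy = {"A": [1.0, 10.0], "B": [2.0, 20.0], "C": [3.0, 60.0]}
scaled = zscore_columns(toy)
```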

Logistic regression model

The extracted 4-mers were scored using a logistic regression function that predicts whether a 4-mer is tumor-associated. We used this function because of its widespread use and simplicity, and because it models a binary dependent variable. First, a biased, weighted sum of the 4-mer features (the logit) is computed:

$$\mathrm{logit} = b_0 + \sum_{i=1}^{20} W_i f_i + W_{21} \log RA \qquad \text{(Eq. C)}$$

$f_1$ through $f_{20}$ represent the five Atchley factor values for the four 4-mer residues. $RA$ represents the 4-mer's relative abundance calculated using either Eq. A or B. The bias term $b_0$ and weights $W_1$ through $W_{21}$ are the model parameters and are fit by maximum likelihood using gradient optimization techniques (described below). The same weights $W_1$ through $W_{21}$ and bias term $b_0$ are used for all 4-mers. Once the logit is computed, the sigmoid function is applied to obtain a value between $0$ and $1$:

$$\mathrm{score} = \frac{1}{1 + e^{-\mathrm{logit}}} \qquad \text{(Eq. D)}$$

The score represents the probability that the 4-mer is tumor-associated.
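The scoring step can be written as a short function. This is a sketch under stated assumptions: the names `sigmoid` and `score_4mer` are hypothetical, and the natural logarithm is used for the abundance term because the text does not specify the base.

```python
import math


def sigmoid(x):
    # Numerically stable logistic function.
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def score_4mer(features, ra, weights, bias):
    """features: 20 factor values; ra: relative abundance in (0, 1];
    weights: 21 values (the last multiplies log RA); bias: b0."""
    if len(features) != 20 or len(weights) != 21:
        raise ValueError("expected 20 features and 21 weights")
    logit = bias + sum(w * f for w, f in zip(weights[:20], features))
    logit += weights[20] * math.log(ra)  # assumption: natural log
    return sigmoid(logit)


# With all weights zero, the score depends only on the bias term.
print(score_4mer([0.0] * 20, 0.01, [0.0] * 21, 0.0))  # 0.5
```

Because the same 21 weights and bias are applied to every 4-mer, the number of parameters stays at 22 regardless of how many 4-mers a repertoire contains.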

Multiple instance learning

The problem of predicting repertoire-level labels from the 4-mers in each repertoire can be formally described as MIL in which the 4-mers are instances, the repertoires are bags, and the bag label is the tissue source of the repertoire (i.e., tumor or healthy; ref. 24). MIL relies on aggregating instance-level scores to assign a bag-level label. Thus, we need to aggregate the scores from all 4-mers in a repertoire into a single value that predicts whether the repertoire came from tumor or healthy tissue. Only a small number of 4-mers are expected to interact with relevant antigens. Accordingly, under the standard MIL assumption, at least one 4-mer per tumor repertoire must have a high score, whereas none from healthy tissue repertoires should have high scores. This was implemented by taking the maximum 4-mer score as the repertoire score. Thus, the probability that a repertoire with $n$ 4-mers came from tumor tissue given the 4-mer scores is defined as:

$$P(\mathrm{tumor} \mid \mathrm{score}_1, \ldots, \mathrm{score}_n) = \max_{j \in \{1, \ldots, n\}} \mathrm{score}_j \qquad \text{(Eq. E)}$$

The predicted label is tumor when at least one 4-mer scores $\ge 0.5$, whereas the predicted label is healthy when every 4-mer scores $< 0.5$. The model's parameter values were fit to maximize the assignment of correct labels.
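The max-based aggregation is simple enough to show directly. The function names below are hypothetical, and the scores are made-up numbers used only to exercise the rule.

```python
def repertoire_score(scores):
    """Bag-level probability: the maximum instance-level score."""
    return max(scores)


def predict_label(scores, threshold=0.5):
    """Tumor if at least one 4-mer reaches the threshold, otherwise healthy."""
    return "tumor" if repertoire_score(scores) >= threshold else "healthy"


print(predict_label([0.10, 0.70, 0.20]))  # tumor
print(predict_label([0.10, 0.49, 0.20]))  # healthy
```

A single high-scoring 4-mer is enough to call a repertoire tumor-associated, which matches the expectation that only a few receptors in a sample are relevant to the phenotype.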

Gradient optimization

Specific values for $W_1$ through $W_{21}$ and $b_0$ were determined using repertoires with known labels (i.e., tumor vs. healthy). The values were selected to maximize the likelihood that each prediction from Eq. E is correct. To search for optimal values, gradient optimization was used as in ref. 19. The initial values for $b_0$ and $W_1$ through $W_{20}$ were selected as in ref. 19. Two different protocols for initializing $W_{21}$ were tried (Table 3). We ran gradient optimization from between 100,000 and 375,000 different sets of initial weight values (Table 3). To save compute cycles, only models with good or better performance were run a large number of times.
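To make the fitting procedure concrete, the sketch below maximizes the bag-level log-likelihood with plain gradient ascent. This is a stand-in, not the optimizer of ref. 19: the update rule, step size, single explicit starting point, and one-feature toy data are all assumptions. Because the repertoire score is a maximum, only the top-scoring instance in each bag receives gradient at each step, so the result can depend on the starting point, which is one motivation for trying many initializations.

```python
import math


def sigmoid(x):
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def bag_score(bag, w, b):
    """Repertoire score: the maximum instance score in the bag."""
    return max(sigmoid(b + sum(wj * xj for wj, xj in zip(w, x))) for x in bag)


def fit_max_mil(bags, labels, w0, b0, steps=2500, lr=0.5):
    """Gradient ascent on sum_bags [y*log(p) + (1-y)*log(1-p)], p = max score."""
    w, b = list(w0), b0
    for _ in range(steps):
        grad_w = [0.0] * len(w)
        grad_b = 0.0
        for bag, y in zip(bags, labels):
            logits = [b + sum(wj * xj for wj, xj in zip(w, x)) for x in bag]
            top = max(range(len(bag)), key=lambda i: logits[i])
            err = y - sigmoid(logits[top])  # d(log-likelihood)/d(logit of top instance)
            for j, xj in enumerate(bag[top]):
                grad_w[j] += err * xj
            grad_b += err
        w = [wj + lr * gj / len(bags) for wj, gj in zip(w, grad_w)]
        b += lr * grad_b / len(bags)
    return w, b


# Toy bags with one feature per instance (label 1 = tumor, 0 = healthy).
bags = [[[2.0], [0.1]], [[1.8], [0.0]], [[0.2], [0.1]], [[0.3], [0.0]]]
labels = [1, 1, 0, 0]
w, b = fit_max_mil(bags, labels, w0=[0.0], b0=0.0)
print([round(bag_score(bag, w, b), 3) for bag in bags])
```

In the toy data, each tumor bag contains one instance with a large feature value, and the fitted model learns a positive weight and a negative bias that separate the two groups.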

Table 3.

Model variations considered for each cancer type

| Calculation of RA (i.e., relative abundance) | Normalization of “Log RA” | Initial value of W21 (the weight on “Log RA”) | Miscellaneous | Number of initializations | Patient-holdout cross-validation |
|---|---|---|---|---|---|
| Colorectal cancer, Sherwood et al. (26) | | | | | |
| | | | | 125,000 | 10/28 ≈ 36% |
| Equation A | Unnormalized | W21 = 0 | | 250,000 | **26/28 ≈ 93%** |
| Equation A | Unnormalized | W21 = 0 | Batch normalization | 125,000 | 16/28 ≈ 57% |
| Equation A | Unnormalized | W21 = 0 | Batch normalization and early stopping | 125,000 | 19/28 ≈ 68% |
| Equation A | μ = 0, σ = 1 | W21 ∼ N(0, 1/21) | | 250,000 | 18/28 ≈ 64% |
| Equation B | Unnormalized | W21 = 0 | | 375,000 | 21/28 ≈ 75% |
| Equation B | Unnormalized | W21 = 0 | Early stopping | 375,000 | 23/28 ≈ 82% |
| Equation B | Unnormalized | W21 ∼ N(0, 1/21) | | 125,000 | 18/28 ≈ 64% |
| Breast cancer, Beausang et al. (25) | | | | | |
| | | | | 250,000 | 21/32 ≈ 66% |
| Equation A | Unnormalized | W21 = 0 | | 125,000 | 14/32 ≈ 44% |
| Equation A | Unnormalized | W21 = 0 | Early stopping | 125,000 | 23/32 ≈ 72% |
| Equation A | Unnormalized | W21 = 0 | Smaller step size | 100,000 | 13/32 ≈ 41% |
| Equation A | Unnormalized | W21 = 0 | Smaller step size and early stopping | 100,000 | 22/32 ≈ 69% |
| Equation B | Unnormalized | W21 = 0 | | 250,000 | **30/32 ≈ 94%** |
| Equation B | Unnormalized | W21 ∼ N(0, 1/21) | | 250,000 | 27/32 ≈ 84% |
| Equation B | Unnormalized | W21 ∼ N(0, 1/21) | Early stopping | 250,000 | 28/32 ≈ 87% |

NOTE: First column, strategy for computing 4-mer relative abundance; 2nd column, approach to normalization of the relative abundance term; 3rd column, different schemes for initializing the weight W21; 4th column, other variations of the model that were considered; 5th column, the number of initializations that were run; 6th column, the performance of each variation of the model. The performances of the best-performing models are shown in bold font.

Overfitting is a concern with any statistical classifier, especially when using small amounts of labeled data. Because our approach uses the same weights for every 4-mer in each sample, our approach has fewer parameters than labeled data points (Supplementary Tables S2 and S3), which helps alleviate the concern of overfitting. Still, we applied early stopping to regularize the model, and we assessed model generalization using leave-one-out cross-validation (see below). We found that the best performing models generalize best to the holdout data on the last training step (Table 3), indicating that they had not begun to overfit the data. Previously, we applied L1/L2 regularization and dropout to the same models on a different disease and found that both worsened model performance (19). Therefore, we do not apply them here.

Model development and validation

We applied this approach to the colorectal and breast cancer datasets. Each was treated separately, resulting in one model for each dataset. To assess model performance, we performed patient-holdout cross-validation, where the tumor and patient-matched healthy sample for a single patient were simultaneously excluded during parameter fitting and then scored after selection of the best model (Fig. 2). The two samples were scored independently; the model had no knowledge that one was from a tumor and the other was from healthy tissue.
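The patient-holdout scheme can be sketched as a generator that withholds both samples from one patient at a time. The function name, the tuple layout, and the patient identifiers are hypothetical.

```python
def patient_holdout_splits(samples):
    """samples: list of (patient_id, tissue, repertoire) tuples.

    Yields (held_out_patient, training_samples, held_out_samples) so that
    both the tumor and the healthy sample of one patient are withheld together.
    """
    patients = sorted({patient_id for patient_id, _, _ in samples})
    for held_out in patients:
        train = [s for s in samples if s[0] != held_out]
        test = [s for s in samples if s[0] == held_out]
        yield held_out, train, test


# Hypothetical identifiers; None stands in for the repertoire data.
samples = [(pid, tissue, None)
           for pid in ("P1", "P2", "P3")
           for tissue in ("tumor", "healthy")]
folds = list(patient_holdout_splits(samples))
print([(pid, len(train), len(test)) for pid, train, test in folds])
# [('P1', 4, 2), ('P2', 4, 2), ('P3', 4, 2)]
```

Holding out the pair together prevents the model from seeing either of a patient's repertoires during fitting, which matters because tumor and healthy tissue from the same person may share receptors.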

Figure 2.

Workflow for model selection and parameter fitting. The diagram shows how the data were used to train and validate each model. The performance of each model was assessed by a patient-holdout cross-validation, where the tumor and healthy samples from the same patient were excluded for validation. Data from the remaining N-1 patients were used to fit the model. For each model, between 100,000 and 375,000 initial sets of weights were generated. To save compute cycles, only models with good or better performance were run a large number of times. Each set of weights was used for exhaustive leave-one-out cross-validation over all N patients. Each run of cross-validation with each set of initial weights was run for 2,500 iterations of gradient optimization. The best fit to the N-1 training samples from among all runs was used to evaluate the excluded validation data.

Several variations of the model were considered, including different methods for calculating the relative abundance term, for normalizing the relative abundance term, and for initializing its weight $W_{21}$ (Table 3). Results for the best performing models are described below.

Colorectal cancer

The number of 4-mers per sample in the colorectal cancer dataset ranged from 186 to 7,112, with an average of 3,789, giving ∼79,566 features per sample (Supplementary Table S2). The best model used 4-mer (rather than TCRβ) relative abundance with this term unnormalized and its weight ($W_{21}$) initialized to 0 (Table 3). This model correctly categorized 93% (26/28) of held-out samples with an average log-likelihood of −0.316 bits. The model always scored the tumor sample above the healthy sample despite having no knowledge that the two samples were from the same patient (Fig. 3A). To estimate the probability of correctly classifying 26 of 28 samples by chance, we performed a permutation analysis. For each permutation, a patient-holdout cross-validation was performed where the labels on the training data were permuted but those on the holdouts were not (Supplementary Table S4). The classification accuracies of all 20 permutations were <93%, allowing us to assign P < 0.05 to the observed accuracy. The average accuracy over all permutations is 49%, and the average log-likelihood fit is −2.66 bits.
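One common convention for turning a permutation analysis into a P value is sketched below; whether the authors used exactly this estimator is an assumption. The function name is hypothetical, and the permuted accuracies are placeholders rather than the values in Supplementary Table S4.

```python
def permutation_p_value(observed, permuted):
    """Empirical one-sided P value with the usual +1 correction."""
    at_least_as_good = sum(1 for accuracy in permuted if accuracy >= observed)
    return (at_least_as_good + 1) / (len(permuted) + 1)


# Placeholder accuracies: 20 permutations, all below the observed 26/28.
permuted_accuracies = [0.5] * 20
p = permutation_p_value(26 / 28, permuted_accuracies)
print(p < 0.05)  # True
```

With 20 permutations and none reaching the observed accuracy, this estimator gives 1/21 ≈ 0.048, which is consistent with the reported bound of P < 0.05 and also shows that 20 permutations is close to the minimum needed to reach that threshold.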

Figure 3.

Colorectal cancer results. A, Classification accuracy obtained by patient-holdout cross-validation, where the tumor and healthy tissue from the same patient are excluded for validation. B, Illustration of the classifier weights after fitting the model to all 14 patients. For each of the five Atchley factors, the weights are shown for the four residue positions. The weight for the log-frequency of the 4-mer is also shown. Positive weight values are shown pointing up, and negative weight values are shown pointing down. The length of the arrow corresponds to the weight's magnitude. C, All 4-mers with a score above 0.5 (middle column) shown for each of the 14 patients (leftmost column). Each 4-mer is shown in the context of its respective CDR3. When the 4-mer appears in multiple CDR3 sequences, the CDR3 with the largest relative abundance is shown. The CDR3 sequences are ranked according to their relative abundance in the sample (rightmost column). A rank of 1 indicates the largest relative abundance in the sample. In patient 6, there are two CDR3s that each have two high-scoring 4-mers. MYRE and YREV are both found in the TCRβ CDR3 sequence CASSMYREVEAFF, and the 4-mers ERFY and RERF are both found in the TCRβ CDR3 sequence CASSRERFYEQYF.

To discern the 4-mer biophysicochemical features that increase the probability of a tumor categorization, we examined the model weights with parameters fit on all 14 patients. The weights reveal how each Atchley factor contributes to the score and the relative importance of each 4-mer position (Fig. 3B). We observe negative weights for almost every 4-mer position for Atchley factors II and IV, indicating an increased probability that a sample is tumor-derived if it contains 4-mers comprising residues with a propensity to participate in α-helical segments and that appear infrequently among the space of all protein sequences. We also observe only positive weights for Atchley factor V, indicating an increased probability that a sample is tumor-derived if it contains 4-mers enriched with positively charged residues. The weight on the abundance term favors 4-mers with a large relative abundance. Analysis of the weights for each holdout model reveals that the model weights are highly consistent across holdout patients and with the weights fit on all 28 samples (Supplementary Fig. S2 and Supplementary Table S5).

The high-scoring 4-mers from each holdout patient also scored high with the model fit to all 14 patients. We aligned all 4-mers that scored highly enough to categorize a sample as being tumor-associated and found that the amino acids vary considerably at each 4-mer position (Fig. 3C). These 4-mers would not have been found by looking for shared amino acid sequences. The 4-mers are detected by our method, because they share similar biophysicochemical properties at key positions, as selected by the weights of the model. Several of the TCRβ CDR3 sequences correspond to large clones, but many of them do not (Fig. 3C). These CDR3 sequences would not have been found by examining only the most abundant clones.

Breast cancer

For the breast cancer dataset, the number of 4-mers ranged from 2,518 to 39,354 with an average of 20,261, giving ∼425,487 features per sample (Supplementary Table S3). The best performing model uses the TCRβ (rather than 4-mer) relative abundance (Table 3). It is otherwise like that for colorectal cancer (Table 3). The model correctly categorized 94% (30/32) of held-out samples with an average log-likelihood error of −0.283 bits (Fig. 4A). As with colorectal cancer, the model always scores the tumor sample above the patient-matched healthy sample (Fig. 4A). Permutation analysis gave a classification accuracy of 49% and an average log-likelihood fit of −2.71 bits (Supplementary Table S6). The classification accuracies of all 20 permutations were <94%, allowing us to assign P < 0.05 to the observed accuracy.
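The step from "all 20 permutations scored below the observed accuracy" to P < 0.05 can be illustrated with an empirical permutation p-value. The sketch below assumes the common add-one estimator; the text does not state which estimator was used, and the function name and the permuted accuracies in the usage note are hypothetical.

```python
def permutation_p_value(observed: float, permuted: list[float]) -> float:
    """Empirical one-sided p-value with the add-one correction.

    Counts permuted scores at least as large as the observed score and
    treats the observed labeling as one additional permutation.
    """
    as_extreme = sum(1 for score in permuted if score >= observed)
    return (1 + as_extreme) / (1 + len(permuted))
```

With 20 permutations that all fall below the observed accuracy, for example `permutation_p_value(0.94, [0.5] * 20)`, the estimate is 1/21, which is just under 0.05. Under this estimator, 20 is close to the smallest number of permutations that can support a claim of P < 0.05.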

Figure 4.

Breast cancer results. A, Classification accuracy obtained by patient-holdout cross-validation, where the tumor and healthy tissue from the same patient are excluded for validation. B, Illustration of the classifier weights after fitting the model to all 16 patients. For each of the five Atchley factors, the weights are shown for the four residue positions. The weight for the log-frequency of the receptor is also shown. Positive weight values are shown pointing up, and negative weight values are shown pointing down. The length of the arrow corresponds to the weight's magnitude. C, All 4-mers with a score above 0.5 (middle column) shown for each of the 16 patients (leftmost column). Each 4-mer is shown in the context of its respective CDR3. When the 4-mer appears in multiple CDR3 sequences, the CDR3 with the largest relative abundance is shown. The CDR3 sequences are ranked according to their relative abundance in the sample (rightmost column). A rank of 1 indicates the largest relative abundance in the sample. As with colorectal cancer, we observed TCRβ CDR3 sequences containing multiple high-scoring 4-mers. In patient 1, LSRS and RSNQ appear in the TCRβ CDR3 sequence CASSLSRSNQPQHF. In patient 10, SSPH, AYNQ, and AAYN appear in the TCRβ CDR3 sequence CASSSPHRAAYNQPQHF.

We examined the model weights with parameters fit on all 16 patients (Fig. 4B). The direction and magnitude of the weights differed considerably from those obtained on the colorectal samples, indicating that the model is specific to cancer type. For all Atchley factors, the weights are position-dependent. For example, 4-mer scores are increased for 4-mers with hydrophobic residues at the first two positions and hydrophilic residues at the last two positions (Fig. 4B). The one similarity with the colorectal results is that the model assigns a high score to 4-mers with a high relative abundance. The weights of all models (holdout models and the model fit on all samples) cluster more tightly than for the colorectal cancer models (Supplementary Fig. S3 and Supplementary Table S7), possibly due to the much larger number of features per sample.

The high-scoring 4-mers from each holdout tumor sample also scored high with the model fit to all 16 patients except for the 4-mer GSYN, which is one of two high-scoring 4-mers for patient 3 during cross-validation. We aligned all 4-mers that scored high enough to categorize a sample as tumor and found, as with the colorectal cancer model, that the amino acids vary considerably at each 4-mer position (Fig. 4C). In contrast to what was observed for colorectal cancer, however, we observe that almost every TCRβ CDR3 sequence containing a high-scoring 4-mer corresponds to a top clone (Fig. 4C).

Improved methods for cancer early detection are urgently needed. For the vast majority of cancers, there is currently no test that is both sensitive enough to detect early stage disease and specific enough to mitigate overdiagnosis and overtreatment. For those cancers, screening of average-risk populations is not recommended, and cancer is typically detected only after it has progressed enough to cause symptoms. For the small number of cancers for which screening of average-risk populations is advised, there are still significant downsides. Current, guideline-endorsed approaches have sensitivities and specificities lower than ideal and therefore require frequent rescreening and have the potential for overdiagnosis and overtreatment. Furthermore, they primarily involve detecting abnormal tissue changes by imaging or cytology and require follow-up by invasive tissue collection, which has associated risks.

The holy grail of cancer detection is a highly specific, highly sensitive test that detects early stage disease and does not require invasive tissue collection. Many potential blood-borne biomarkers are under investigation, including protein markers and circulating cell-free tumor DNA (32). Some of them have also been found in other tissues accessible by minimally invasive procedures, such as cervical cytology samples (33). The most promising results have been obtained by assaying for combinations of markers (34). Although the results are promising, the sensitivities are highly variable, depend strongly on organ site, and are less promising for early stage disease (33–38). Thus, complementary biomarkers are needed.

Given the specificity of adaptive immune responses, lymphocyte recirculation, and the sensitivity of tests for detecting rare lymphocyte clones (39, 40), we find it plausible that antitumor lymphocyte responses could provide one such complementary biomarker. Success of the approach would require that antitumor lymphocyte responses have cancer-specific signatures, that the signatures appear early in the disease, and that they be detectable in readily accessible tissue. We designed the current study to address the first requirement by determining whether tumor-associated T-cell repertoires have a cancer-specific signature that can reliably distinguish cancer-associated TCR repertoires from those associated with healthy tissue of the same organ.

With the results presented here, we have demonstrated that TIL repertoires include TCRs with cancer-specific, biophysicochemical motifs. Specifically, we detected distinct biophysicochemical motifs for colorectal cancer and breast cancer tumor-infiltrating T lymphocytes and demonstrated that these motifs distinguish tumor repertoires from healthy tissue repertoires with classification accuracies of 93% and 94%, respectively, by patient-holdout (leave-one-patient-out) cross-validation. We further show by permutation analysis that the probability of obtaining these accuracies by chance is <0.05. These results suggest that the first requirement given above, that antitumor lymphocyte responses have cancer-specific signatures, could be met, at least for some cancer types. A definitive answer will require follow-up studies on larger patient cohorts and in additional cancer types.
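The cross-validation scheme behind these accuracies holds out both samples from one patient at a time, so no patient contributes to both fitting and validation. A minimal sketch of the split is given below; the tuple layout of `samples` and the function name are assumptions made for illustration, and model fitting is deliberately left out.

```python
def patient_holdout_splits(samples):
    """Yield (held_out_patient, train, test) for each patient.

    samples: list of (patient_id, tissue_label, repertoire) tuples, where
    each patient contributes one tumor and one healthy-tissue repertoire.
    Both samples from the held-out patient go to `test`.
    """
    patients = sorted({patient_id for patient_id, _, _ in samples})
    for held_out in patients:
        train = [s for s in samples if s[0] != held_out]
        test = [s for s in samples if s[0] == held_out]
        yield held_out, train, test
```

For the 16 breast cancer patients this gives 16 folds and 32 held-out samples in total, which is the denominator of the reported 30/32 accuracy.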

As in our original approach, we represented each TCRβ's CDR3 by partitioning it into every possible contiguous strip of four amino acids (4-mer) and representing each 4-mer using the Atchley factor values for its component residues. We improved the approach by adding a feature that quantifies relative abundance (of the 4-mer for colorectal cancer and of the receptor for breast cancer). This feature is critical to model performance, increasing accuracy on the colorectal and breast cancer datasets from 36% to 93% and 67% to 94%, respectively (Table 3). Our approach is not merely identifying highly expanded T-cell clones, however. In the colorectal cancer dataset, the TCRβ CDR3 sequences containing high-scoring 4-mers correspond to a top-ten most abundant clone for only 4 of the 14 patients.
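The partitioning of a CDR3 into 4-mers, together with one plausible way to compute 4-mer relative abundance, can be sketched as follows. The abundance definition used here (each 4-mer's share of all 4-mer occurrences, weighted by the template count of the CDR3 it comes from) is an assumption for illustration and may differ from the definition used in the study; the function names and the counts in the usage note are hypothetical.

```python
from collections import Counter

def extract_kmers(cdr3: str, k: int = 4) -> list[str]:
    """Return every contiguous k-residue window of a CDR3 sequence."""
    return [cdr3[i:i + k] for i in range(len(cdr3) - k + 1)]

def kmer_relative_abundance(repertoire: dict[str, int], k: int = 4) -> dict[str, float]:
    """Map each k-mer to its share of all k-mer occurrences in a repertoire.

    repertoire: CDR3 amino acid sequence -> template count (assumed input format).
    """
    counts = Counter()
    for cdr3, n_templates in repertoire.items():
        for kmer in extract_kmers(cdr3, k):
            counts[kmer] += n_templates
    total = sum(counts.values())
    return {kmer: count / total for kmer, count in counts.items()}
```

For example, `extract_kmers("CASSMYREVEAFF")` yields ten overlapping 4-mers, including MYRE and YREV, the two high-scoring 4-mers reported for this CDR3 in patient 6. Calling `kmer_relative_abundance({"CASSMYREVEAFF": 3, "CASSRERFYEQYF": 1})` with these made-up counts assigns MYRE an abundance of 3/40.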

The biophysicochemical motifs detected by our method are specific to a cancer type. Four-mers with the following properties are classified as tumor-associated by the colorectal cancer model: hydrophilic residues in the 2nd and 3rd 4-mer positions with hydrophobic residues in the 1st and 4th positions; amino acids that tend to form α-helices at all four 4-mer positions; small residues in the 1st, 2nd, and 4th positions with large residues in the 3rd position; and positively charged residues at all four 4-mer positions. In contrast, 4-mers with the following properties are classified as tumor-associated by the breast cancer model: hydrophilic residues in the 3rd and 4th positions with hydrophobic residues in the 1st and 2nd positions; amino acids that tend to form α-helices in the 1st 4-mer position with amino acids that tend to form bends, coils, or turns in the remaining positions; large residues in the 1st, 2nd, and 4th 4-mer positions with small residues in the 3rd position; and negatively charged amino acids in the 2nd and 4th 4-mer positions with positively charged residues in the 1st and 3rd positions. Furthermore, both cancer motifs are quite different from the one reported for multiple sclerosis (19). Note that the Atchley factors have some moderate, interfactor correlation. Thus, two factors could have large weights at a particular position, but only one is important to antigen binding.

Finding biophysicochemical TCRβ motifs that, within a cancer type, are shared across the TIL repertoires of all patients but are largely absent from healthy tissue repertoires is consistent with the hypothesis that we are detecting TCRs with specificity for a tumor-associated antigen that is shared between patients. We are not the first to hypothesize that patients may have shared TCRs responding to a common antigen, nor are we the first to look for the corresponding TCRs in colorectal and breast cancer TILs (7, 10, 25, 26, 41, 42). To address this hypothesis, other investigators have searched for CDR3 amino acid sequences that were shared between multiple patient TIL repertoires (7, 10, 25, 26, 41, 42). None of the studies found sequences shared across all patients in a study, and the degree of sharing varied tremendously. In the breast cancer studies, it was concluded that the shared TCRβ CDR3 sequences most likely correspond to public clones rather than receptors responding to common cancer antigens, because the same TCRβ CDR3 sequences were frequently found in databases of TCRβ repertoires from presumed healthy individuals (7, 25, 41, 42). In addition, Beausang and colleagues analyzed the sequence properties of the shared sequences and found that they had many features in common with the TCRβ CDR3 sequences of public clones, such as being shorter in length and having few insertions (25). The study by Munson and colleagues stands out because they sequenced transcripts from both TCR chains (7). They identified 14 TCR α and β pairs that were present in ≥7 of 20 patient TIL repertoires (including one present in 15 TIL repertoires) but not present in peripheral blood repertoires from six presumed healthy individuals (7).

Public clones are typically defined as TCRs with identical amino acid sequences observed across multiple individuals of a given species (43). In our case, the TCRβ CDR3s bearing high-scoring 4-mers have different amino acid sequences across patients and therefore are not formally public TCRs. We conducted our analysis using biophysicochemical representations of amino acid sequence in order to find TCRβ CDR3s that could be expected to have similar antigen-binding capabilities, even without identical amino acid sequences. Despite their similar biophysicochemical features, however, we cannot say that the TCRβ CDR3s we have identified are capable of binding to the same antigen. This needs to be tested experimentally.

The motifs we found are tumor-associated, suggesting that, if TCRs bearing the motif can bind the same antigen, the antigen is tumor-associated. The permutation analysis indicates that our method does not find a shared biophysicochemical motif that separates arbitrary, randomly assigned groupings of TCRβ repertoires. In addition, the motif we have identified in each case is unique to the TIL repertoires. This suggests that the motif is related to the fact that the T cells are found in the tumor. This may mean that they have specificity for a cancer antigen shared across the patients. There are alternative explanations, however. The T cells may be responding to tissue damage in the tumor, or they may be T regulatory cells contributing to immunosuppression. There are likely other interpretations. Again, experimental follow-up studies are needed.

It may seem surprising that our statistical classifier generalizes across patients with presumably different HLA genes. We expect this is because TCR:MHC interaction primarily happens via contacts in CDRs 1 and 2, whereas peptide contacts are primarily via CDR3 (44–47). We further expect that using a 4-mer (rather than a longer k-mer) allows us to isolate the residues of the TCRβ CDR3 that are responsible for peptide interaction. In addition, we speculate that having patient-matched healthy tissue is important, because it provides HLA-matched controls, enabling the model to achieve good performance in the absence of HLA information. It is quite possible, however, that the models' performances would improve if HLA type were included for each patient. This is something we will determine in future studies.

For both colorectal and breast cancers, we observed multiple TCRβ CDR3 sequences that contain multiple high-scoring 4-mers. If these 4-mers are the residues that contact peptide, then this suggests that the corresponding receptors may interact with cognate antigen via multiple 4-mers (Supplementary Fig. S4). There are at least three possibilities. The same receptor may bind the same peptide:MHC complex in multiple ways. The same receptor may bind the same peptide in the context of different MHC molecules using different 4-mers. Or the receptor may bind different peptides with the different 4-mers, in which case we would expect the different peptides to be highly similar in terms of their biophysicochemical properties. In all three scenarios, the TCRβ CDR3 loop must exhibit considerable conformational flexibility. This kind of loop flexibility has been demonstrated (48, 49).

Our study has several limitations. First, our approach is designed to detect biophysicochemical motifs in TCRs with specificity for shared antigens. It cannot identify TCRs with specificity for patient-specific tumor neoantigens. Second, the number of crystal structures examined is quite small relative to the space of possible TCR–peptide–MHC interactions. Because we examined all Protein Data Bank structures from which we could extract the CDR3 amino acid sequence, increasing the number will require that additional crystal structures become available. Third, we used TCRβ sequences and thus have only part of the TCR antigen-binding site. Studies incorporating both TCRβ and TCRα are needed. It is not clear whether a model using both would find the same TCRβ biophysicochemical motifs together with a TCRα motif, or whether a novel motif would be detected. Fourth, we do not know the T lymphocyte subset to which the T cells bearing the TCRs with high-scoring 4-mers belong. Thus, we do not know their functional dispositions (e.g., cytotoxic, regulatory) or whether they could participate in tumor cell killing. Fifth, our patient sets are relatively small. Thus, follow-up studies on larger patient cohorts are needed. Finally, although we have successfully used our method on three different diseases and with both B-cell and T-cell receptors, our approach is not yet a turnkey solution for identifying biomarkers in immune repertoires. In each case, we considered several variations of the model to determine which one worked best for each disease. Each time we identify a new biomarker, however, we move one step closer to developing automated approaches that relate immune repertoires to labeled clinical phenotypes.

Our approach also has benefits. It requires only collecting tissue from a small number of patients, sequencing the immune receptors, and fitting the model. No prior knowledge of the antigens or receptor specificities is needed, and no additional experiments are required to enrich disease-specific receptors. The model's predictions are easily interpretable and may elucidate individual immune receptors associated with diseases such as cancer. Our future work will include improving our methodology and applying it to other diseases and tissue types.

No potential conflicts of interest were disclosed.

Conception and design: J. Ostmeyer, L.G. Cowell

Development of methodology: J. Ostmeyer, L.G. Cowell

Analysis and interpretation of data (e.g., statistical analysis, biostatistics, computational analysis): J. Ostmeyer, S. Christley, I.T. Toby, L.G. Cowell

Writing, review, and/or revision of the manuscript: J. Ostmeyer, S. Christley, I.T. Toby, L.G. Cowell

Study supervision: L.G. Cowell

We are grateful that the colorectal and breast cancer datasets have been made available online, and we appreciate Adaptive Biotechnologies for hosting the data on immuneACCESS (https://clients.adaptivebiotech.com/immuneaccess). Computing time on the UT Southwestern BioHPC computing cluster was made available through the Harold C. Simmons Comprehensive Cancer Center.

This project was supported by a National Institute of Allergy and Infectious Diseases–funded R01 (AI097403) to L.G. Cowell, a training grant to the Simmons Comprehensive Cancer Center at UT Southwestern from the Cancer Prevention and Research Institute of Texas (RP160157), and funding to L.G. Cowell from UT Southwestern and the Simmons Comprehensive Cancer Center.

The costs of publication of this article were defrayed in part by the payment of page charges. This article must therefore be hereby marked advertisement in accordance with 18 U.S.C. Section 1734 solely to indicate this fact.

1. Chen DS, Mellman I. Elements of cancer immunity and the cancer-immune set point. Nature 2017;541:321–30.
2. Kvistborg P, van Buuren MM, Schumacher TN. Human cancer regression antigens. Curr Opin Immunol 2013;25:284–90.
3. Dhodapkar K, Dhodapkar M. Harnessing shared antigens and T-cell receptors in cancer: opportunities and challenges. Proc Natl Acad Sci U S A 2016;113:7944–5.
4. Romero P, Dunbar PR, Valmori D, Pittet M, Ogg GS, Rimoldi D, et al. Ex vivo staining of metastatic lymph nodes by class I major histocompatibility complex tetramers reveals high numbers of antigen-experienced tumor-specific cytolytic T lymphocytes. J Exp Med 1998;188:1641–50.
5. Dhodapkar KM, Gettinger SN, Das R, Zebroski H, Dhodapkar MV. SOX2-specific adaptive immunity and response to immunotherapy in non-small cell lung cancer. Oncoimmunology 2013;2:e25205.
6. Dhodapkar MV, Sexton R, Das R, Dhodapkar KM, Zhang L, Sundaram R, et al. Prospective analysis of antigen-specific immunity, stem-cell antigens, and immune checkpoints in monoclonal gammopathy. Blood 2015;126:2475–8.
7. Munson DJ, Egelston CA, Chiotti KE, Parra ZE, Bruno TC, Moore BL, et al. Identification of shared TCR sequences from T cells in human breast cancer using emulsion RT-PCR. Proc Natl Acad Sci U S A 2016;113:8272–7.
8. Massa C, Robins H, Desmarais C, Riemann D, Fahldieck C, Fornara P, et al. Identification of patient-specific and tumor-shared T cell receptor sequences in renal cell carcinoma patients. Oncotarget 2017;8:21212–28.
9. Bai X, Zhang Q, Wu S, Zhang X, Wang M, He F, et al. Characteristics of tumor infiltrating lymphocyte and circulating lymphocyte repertoires in pancreatic cancer by the sequencing of T cell receptors. Sci Rep 2015;5:13664.
10. Nakanishi K, Kukita Y, Segawa H, Inoue N, Ohue M, Kato K. Characterization of the T-cell receptor beta chain repertoire in tumor-infiltrating lymphocytes. Cancer Med 2016;5:2513–21.
11. Fugmann SD, Lee AI, Shockett PE, Villey IJ, Schatz DG. The RAG proteins and V(D)J recombination: complexes, ends, and transposition. Annu Rev Immunol 2000;18:495–527.
12. Kirsch I, Vignali M, Robins H. T-cell receptor profiling in cancer. Mol Oncol 2015;9:2063–70.
13. Galson JD, Trück J, Fowler A, Clutterbuck EA, Münz M, Cerundolo V, et al. Analysis of B cell repertoire dynamics following hepatitis B vaccination in humans, and enrichment of vaccine-specific antibody sequences. EBioMedicine 2015;2:2070–9.
14. Miho E, Yermanos A, Weber CR, Berger CT, Reddy ST, Greiff V. Computational strategies for dissecting the high-dimensional complexity of adaptive immune repertoires. Front Immunol 2018;9:224.
15. Chaudhary N, Wesemann DR. Analyzing immunoglobulin repertoires. Front Immunol 2018;9:462.
16. Jia Q, Zhou J, Chen G, Shi Y, Yu H, Guan P, et al. Diversity index of mucosal resident T lymphocyte repertoire predicts clinical prognosis in gastric cancer. Oncoimmunology 2015;4:e1001230.
17. Postow MA, Manuel M, Wong P, Yuan J, Dong Z, Liu C, et al. Peripheral T cell receptor diversity is associated with clinical outcomes following ipilimumab treatment in metastatic melanoma. J Immunother Cancer 2015;3:23.
18. Hosoi A, Takeda K, Nagaoka K, Iino T, Matsushita H, Ueha S, et al. Increased diversity with reduced "diversity evenness" of tumor infiltrating T-cells for the successful cancer immunotherapy. Sci Rep 2018;8:1058.
19. Ostmeyer J, Christley S, Rounds WH, Toby I, Greenberg BM, Monson NL, et al. Statistical classifiers for diagnosing disease from immune repertoires: a case study using multiple sclerosis. BMC Bioinformatics 2017;18:401.
20. Emerson RO, DeWitt WS, Vignali M, Gravley J, Hu JK, Osborne EJ, et al. Immunosequencing identifies signatures of cytomegalovirus exposure history and HLA-mediated effects on the T cell repertoire. Nat Genet 2017;49:659–65.
21. Sun Y, Best K, Cinelli M, Heather JM, Reich-Zeliger S, Shifrut E, et al. Specificity, privacy, and degeneracy in the CD4 T cell receptor repertoire following immunization. Front Immunol 2017;8:430.
22. Cinelli M, Sun Y, Best K, Heather JM, Reich-Zeliger S, Shifrut E, et al. Feature selection using a one dimensional naive Bayes' classifier increases the accuracy of support vector machine classification of CDR3 repertoires. Bioinformatics 2017;33:951–5.
23. Thomas N, Best K, Cinelli M, Reich-Zeliger S, Gal H, Shifrut E, et al. Tracking global changes induced in the CD4 T-cell receptor repertoire by immunization with a complex antigen using short stretches of CDR3 protein sequence. Bioinformatics 2014;30:3181–8.
24. Carbonneau M-A, Cheplygina V, Granger E, Gagnon G. Multiple instance learning: a survey of problem characteristics and applications. Pattern Recognition 2018;77:329–53.
25. Beausang JF, Wheeler AJ, Chan NH, Hanft VR, Dirbas FM, Jeffrey SS, et al. T cell receptor sequencing of early-stage breast cancer tumors identifies altered clonal structure of the T cell repertoire. Proc Natl Acad Sci U S A 2017;114:E10409–E10417.
26. Sherwood AM, Emerson RO, Scherer D, Habermann N, Buck K, Staffa J, et al. Tumor-infiltrating lymphocytes in colorectal tumors display a diversity of T cell receptor sequences that differ from the T cells in adjacent mucosal tissue. Cancer Immunol Immunother 2013;62:1453–61.
27. Carlson CS, Emerson RO, Sherwood AM, Desmarais C, Chung MW, Parsons JM, et al. Using synthetic templates to design an unbiased multiplex PCR assay. Nat Commun 2013;4:2680.
28. DeWitt WS, Lindau P, Snyder TM, Sherwood AM, Vignali M, Carlson CS, et al. A public database of memory and Naive B-cell receptor sequences. PLoS One 2016;11:e0160853.
29. Rose PW, Prlić A, Bi C, Bluhm WF, Christie CH, Dutta S, et al. The RCSB Protein Data Bank: views of structural biology for basic and applied research and education. Nucleic Acids Res 2015;43:D345–56.
30. Kidera A, Konishi Y, Oka M, Ooi T, Scheraga HA. Statistical-analysis of the physical-properties of the 20 naturally-occurring amino-acids. J Protein Chem 1985;4:23–55.
31. Atchley WR, Zhao J, Fernandes AD, Drüke T. Solving the protein sequence metric problem. Proc Natl Acad Sci U S A 2005;102:6395–400.
32. Babayan A, Pantel K. Advances in liquid biopsy approaches for early detection and monitoring of cancer. Genome Med 2018;10:21.
33. Kinde I, Bettegowda C, Wang Y, Wu J, Agrawal N, Shih I-M, et al. Evaluation of DNA from the Papanicolaou test to detect ovarian and endometrial cancers. Sci Transl Med 2013;5:167ra4.
34. Cohen JD, Li L, Wang Y, Thoburn C, Afsari B, Danilova L, et al. Detection and localization of surgically resectable cancers with a multi-analyte blood test. Science 2018;359:926–30.
35. Krimmel JD, Schmitt MW, Harrell MI, Agnew KJ, Kennedy SR, Emond MJ, et al. Ultra-deep sequencing detects ovarian cancer cells in peritoneal fluid and reveals somatic TP53 mutations in noncancerous tissues. Proc Natl Acad Sci U S A 2016;113:6005–10.
36. Fernandez-Cuesta L, Perdomo S, Avogbe PH, Leblay N, Delhomme TM, Gaborieau V, et al. Identification of circulating tumor DNA for the early detection of small-cell lung cancer. EBioMedicine 2016;10:117–23.
37. Bettegowda C, Sausen M, Leary RJ, Kinde I, Wang Y, Agrawal N, et al. Detection of circulating tumor DNA in early- and late-stage human malignancies. Sci Transl Med 2014;6:224ra24.
38. Newman AM, Bratman SV, To J, Wynne JF, Eclov NCW, Modlin LA, et al. An ultrasensitive method for quantitating circulating tumor DNA with broad patient coverage. Nat Med 2014;20:548–54.
39. Korde N, Roschewski M, Zingone A, Kwok M, Manasanch EE, Bhutani M, et al. Treatment with carfilzomib-lenalidomide-dexamethasone with lenalidomide extension in patients with smoldering or newly diagnosed multiple myeloma. JAMA Oncol 2015;1:746–54.
40. Wu D, Emerson RO, Sherwood A, Loh ML, Angiolillo A, Howie B, et al. Detection of minimal residual disease in B lymphoblastic leukemia by high-throughput sequencing of IGH. Clin Cancer Res 2014;20:4540–8.
41. Levy E, Marty R, Calderón VG, Woo B, Dow M, Armisen R, et al. Immune DNA signature of T-cell infiltration in breast tumor exomes. Sci Rep 2016;6:30064.
42. Wang T, Wang C, Wu J, He C, Zhang W, Liu J, et al. The different T-cell receptor repertoires in breast cancer tumors, draining lymph nodes, and adjacent tissues. Cancer Immunol Res 2017;5:148–56.
43. Venturi V, Price DA, Douek DC, Davenport MP. The molecular basis for public T-cell responses? Nat Rev Immunol 2008;8:231–8.
44. Garcia KC, Adams JJ, Feng D, Ely LK. The molecular basis of TCR germline bias for MHC is surprisingly simple. Nat Immunol 2009;10:143–7.
45. Rossjohn J, Gras S, Miles JJ, Turner SJ, Godfrey DI, McCluskey J. T cell antigen receptor recognition of antigen-presenting molecules. Annu Rev Immunol 2015;33:169–200.
46. Rudolph MG, Stanfield RL, Wilson IA. How TCRs bind MHCs, peptides, and coreceptors. Annu Rev Immunol 2006;24:419–66.
47. Zhang H, Lim HS, Knapp B, Deane CM, Aleksic M, Dushek O, et al. The contribution of major histocompatibility complex contacts to the affinity and kinetics of T cell receptor binding. Sci Rep 2016;6:35326.
48. Reiser JB, Grégoire C, Darnault C, Mosser T, Guimezanes A, Schmitt-Verhulst AM, et al. A T cell receptor CDR3beta loop undergoes conformational changes of unprecedented magnitude upon binding to a peptide/MHC class I complex. Immunity 2002;16:345–54.
49. Ayres CM, Scott DR, Corcelli SA, Baker BM. Differential utilization of binding loop flexibility in T cell receptor ligand selection and cross-reactivity. Sci Rep 2016;6:25070.