Abstract
Background: Serum proteomic biomarkers offer a promising approach for early detection of cancer. In this study, we aimed to identify proteomic profiles that could distinguish colon cancer cases from controls using serial prediagnostic serum samples.
Methods: This was a nested case–control study of active duty military members. Cases consisted of 264 patients diagnosed with colon cancer between 2001 and 2009. Controls were matched to cases on age, gender, race, serum sample count, and collection date. We identified peaks that discriminated cases from controls using random forest data analysis with a 2/3 training and 1/3 validation dataset. We then included epidemiologic data to see whether further improvement of model performance was obtainable. Proteins that corresponded to discriminatory peaks were identified.
Results: Peaks with m/z values of 3,119.32, 2,886.67, 2,939.23, and 5,078.81 were found to discriminate cases from controls with a sensitivity of 69% and a specificity of 67% in the year before diagnosis. When smoking status was included, sensitivity increased to 76%, while histories of other cancer and tonsillectomy raised specificity to 76%. Peaks at 2,886.67 and 3,119.32 m/z were identified as histone acetyltransferases, while the peak at 2,939.23 m/z was a transporting ATPase subunit.
Conclusions: Proteomic profiles in the year before cancer diagnosis have the potential to discriminate colon cancer patients from controls, and the addition of epidemiologic information may increase the sensitivity and specificity of discrimination.
Impact: Our findings indicate the potential value of using serum prediagnostic proteomic biomarkers in combination with epidemiologic data for early detection of colon cancer. Cancer Epidemiol Biomarkers Prev; 26(5); 711–8. ©2016 AACR.
Introduction
Proteomic profiling of prediagnostic serum samples offers a promising approach for early detection and prediction of cancer that has the potential to minimize expense and invasiveness of current cancer screening methods. As cancer develops from aberrant DNA mutations that change protein expression patterns through modification of protein structure and functions, proteomic profiling of serum may reflect the pathologic status of organs (1).
Many studies have been conducted to profile proteomic patterns for early detection of colorectal cancer using serum samples. In one case–control study, two protein peaks (13,732.4 and 13,912.3 m/z) distinguished colorectal cancer cases from healthy controls (2). Another case–control study reported peak m/z values of 1,208, 1,467, 1,505, 1,618, and 4,215 that differentiated cases from controls with an accuracy close to 100% (3). Furthermore, one study using MALDI-TOF MS protein fractionation methods reported a sensitivity of 87% and a specificity of 85% using four peaks (2,870.7, 3,084.0, 9,180.5, and 13,748.8 m/z; ref. 4), while another study reported a sensitivity of 94.4% and a specificity of 75.5% using four peaks (1,778.97, 1,866.16, 1,934.65, and 2,022.46 m/z; ref. 5).
Beyond proteomic peaks, specific blood-derived proteins have also been identified as potential biomarkers in predicting colorectal cancer (6–11). Increased plasma levels of Apo AI (AUC = 0.621) and decreased levels of C9 (AUC = 0.730) were found in colorectal cancer patients compared with healthy controls (7). Another case–control study reported a decrease in fragments of complement C3f among serum samples of colorectal cancer patients compared with healthy controls (5). Two immunoreactive antigens in serum, MAPKAPK3 and ACVR2B, discriminated cases from controls with a sensitivity and specificity of 83.3% and 73.9%, respectively (12). In addition, a unique serum peptide and protein biomarker signature was reported in a large study of 126 colorectal cancer and 277 control serum samples. Results were validated among 50 cases and 82 controls (AUC = 0.93; ref. 8). Although most, but not all (13), studies identified proteomic peaks or specific proteins that distinguished colorectal cancer patients from controls, the identified biomarkers differed largely among studies.
Inconsistencies in biomarker detection may be due to the timing of sample collection. Blood, serum, and plasma samples collected at clinical diagnosis or later may identify patterns that result from anxiety or stress hormones related to the cancer diagnosis rather than the products of cancer itself (14). Thus, samples collected prior to diagnosis are necessary for true early detection of cancer. A colon cancer proteomic profile may appear a few years before clinical detection of the cancer, in which case serial prediagnostic samples may help identify the earliest point at which a profile specific to colon cancer becomes detectable. In addition, sample collection, storage, and processing might differ between clinically diagnosed cancer patients and controls (11, 15, 16), which may influence the validity of results. Furthermore, many previous studies on cancer proteomics used only blood samples (3, 4). Demographic and epidemiologic characteristics may affect proteomic profiles and have the potential to improve detection and prediction of colon cancer.
The aim of this study was to identify serum protein profiles for detecting and predicting colon cancer using serial prediagnostic samples stored at the Department of Defense Serum Repository (DoDSR). Specifically, the study aimed to (i) investigate whether there are proteomic profiles in prediagnostic serum samples that discriminate colon cancer cases from controls; (ii) use serial serum samples from the same study subject to detect the earliest time point at which identified proteomic profiles appear and assess whether profiles vary with time from diagnosis of colon cancer; as well as (iii) evaluate whether there are epidemiologic factors that may improve proteomic discrimination of colon cancer patients from controls.
Materials and Methods
We conducted a case–control study nested within the U.S. military population under surveillance by the Armed Forces Health Surveillance Center (AFHSC). The AFHSC hosts a central repository of longitudinally collected medical data for the U.S. armed forces and the Department of Defense Serum Repository (DoDSR) that collects blood samples from all members approximately every other year (17). This study was approved by the Uniformed Services University of the Health Sciences and the Walter Reed National Military Medical Center Institutional Review Boards.
Study subjects
Cases consisted of men and women diagnosed with colon cancer between January 1, 2001 and December 31, 2009 who were between 17 and 79 years of age at diagnosis of colon cancer, active duty at time of diagnosis, and had at least one prediagnostic serum sample of ≥1 mL. Controls were men and women who did not have colon cancer, were active duty in the reference year (corresponding to diagnosis year of matched case), had at least one serum sample ≥1 mL, and were alive at the start of the study. They were matched to cases on age (within one year of birth), racial background (European-American, African-American, and Other), and time points at which serum samples were collected (within ± 60 days of requested case sample). All controls were randomly selected from the AFHSC database.
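The matching criteria above can be expressed as a simple eligibility check. The following is a minimal sketch, not the procedure used by AFHSC: the `Subject` record, the field names, and the rule that every case sample needs a control sample within the window are illustrative assumptions.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class Subject:
    """Hypothetical record holding only the matching variables."""
    birth_year: int
    race: str            # "European-American", "African-American", or "Other"
    sample_dates: list   # dates of serum collection


def is_eligible_control(case, control, max_days=60):
    """Return True if the control meets the three matching criteria."""
    # Age: within one year of birth.
    if abs(case.birth_year - control.birth_year) > 1:
        return False
    # Racial background: same category.
    if case.race != control.race:
        return False
    # Sample timing (assumed rule): each case sample needs a control
    # sample collected within +/- max_days of it.
    return all(
        any(abs((c - d).days) <= max_days for d in control.sample_dates)
        for c in case.sample_dates
    )


case = Subject(1965, "Other", [date(2000, 1, 10), date(2002, 3, 1)])
close = Subject(1966, "Other", [date(2000, 2, 20), date(2002, 1, 15)])
older = Subject(1968, "Other", [date(2000, 2, 20), date(2002, 1, 15)])
print(is_eligible_control(case, close), is_eligible_control(case, older))
```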
At the first step of study subject enrollment, AFHSC identified and provided a list of all eligible cases according to the 9th Revision International Classification of Diseases (ICD-9) coding system. Before contacting eligible cases, we verified survival status through the Defense Manpower Data Center (DMDC), which manages and updates all personnel information, as well as CDC's National Death Index and the Social Security Administration. An informative letter describing the study purpose and procedures, an informed consent document, and a questionnaire packet were mailed to all surviving cases. For each case who participated, five potential controls were identified using procedures similar to those used for cases. If a potential control did not respond or did not want to participate, the next control was contacted until one was successfully enrolled.
After excluding deceased cases and individuals without confirmed colon cancer, there were 431 eligible surviving cases, out of whom 234 participated and completed a questionnaire (54%). We then contacted 560 controls who were matched to the surviving case participants. Out of the contacted controls, 222 participated (40%) and 215 had a matched case. The remaining 19 cases that did not have a matched control were subsequently matched to randomly selected controls from AFHSC and did not have questionnaire data. In addition, 163 randomly selected controls were matched to the 163 deceased colon cancer cases without questionnaire data. Overall, there were 397 cases and 397 matched controls included in this study.
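The enrollment figures in this paragraph can be cross-checked with a few lines of arithmetic. The sketch below uses only the counts stated above; the variable names are illustrative.

```python
# Counts taken from the enrollment paragraph above.
eligible_surviving_cases = 431
participating_cases = 234
contacted_controls = 560
participating_controls = 222
controls_with_matched_case = 215
deceased_cases = 163

case_rate = participating_cases / eligible_surviving_cases
control_rate = participating_controls / contacted_controls

# Participating cases left without a participating control.
cases_needing_repository_control = participating_cases - controls_with_matched_case

total_cases = participating_cases + deceased_cases
total_controls = (
    controls_with_matched_case
    + cases_needing_repository_control
    + deceased_cases
)

print(f"case participation: {case_rate:.0%}")
print(f"control participation: {control_rate:.0%}")
print(f"cases matched to repository-only controls: {cases_needing_repository_control}")
print(f"cases / controls: {total_cases} / {total_controls}")
```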
Data collection
Epidemiologic survey data.
For surviving cases and matched controls who participated in the study, self-administered questionnaires were utilized to collect epidemiologic data, which included information on demographic variables, personal habits, lifestyle characteristics, medical history, personal feelings and general health, family histories of cancer and colorectal polyps, reproductive and contraceptive histories (for women only), and dietary intake. When applicable, questions on exposures at the time before first blood draw for military duty (time 1) and in the year before cancer diagnosis for cases or reference date for controls (time 2) were asked. The reference date corresponds to the month in which the matched case was diagnosed. The mean length of time between the two time periods was 10.4 years, which did not significantly differ between cases and controls.
Medical records and surveillance data.
AFHSC surveillance data were collected for all case–control matched pairs who were included in serum analysis. We requested data on demographics, military occupation specialty, casualty events, diagnosed medical conditions, medical procedures performed, and vaccinations.
Cancer registry data.
The Automated Central Tumor Registry (ACTUR) of the Armed Forces collects data from medical records of cancer patients diagnosed or treated at military treatment facilities. The data contain information on demographic and tumor characteristics, diagnostic procedures, treatment, and vital status. Quality assurance guidelines established by the North American Association of Central Cancer Registries for state registries were used to edit the data. Out of 397 (234 surviving and 163 deceased cases) colon cancer patients identified using medical records, 264 were confirmed with colon cancer in ACTUR data (66.2%). The absence of ACTUR data is most likely due to diagnosis or treatment outside military treatment facilities, incomplete ACTUR data, or case participants mistaking their rectal cancer for colon cancer.
Serum samples.
Serum samples were obtained from the DoDSR. Since 1988, samples have been collected for HIV testing from active-duty and reserve personnel upon entry to the U.S. military and, on average, every one to two years thereafter. We requested a maximum of four serum samples from each subject (n = 794), for a total of 2,752 serum samples. According to standard procedures, blood specimens are drawn and shipped immediately at room temperature overnight to laboratories for HIV testing. After the samples are tested, they are frozen, shipped to DoDSR, and stored in −30°C walk-in freezers. All procedures are generally completed within 48 hours, and samples are never thawed after they are frozen. Once retrieved, the samples were shipped on dry ice to our laboratory and immediately stored at −80°C.
Laboratory measurements.
MALDI-TOF MS profiling of weak cation exchange bead captured serum proteins was performed using a MB-WCX 100 Protein Profiling Kit (Bruker Daltonics). Serum proteins and peptides were eluted from the beads, concentrated, and spotted 1:2 with CHCA matrix (α-Cyano-4-hydroxycinnamic acid; 10 mg/mL in 50% acetonitrile and 2.5% trifluoroacetic acid) for MALDI-TOF profiling as per manufacturer's instructions and previous studies (18, 19). Ions in the mass range of 2,000 to 15,000 m/z were collected using an AutoFlex III MALDI-TOF/TOF mass spectrometer and spectra for each sample were visualized and exported using DataAnalysis 4.0 software from Bruker Daltonics. Mass spectra were then further processed and exported using Progenesis MALDI software (Nonlinear Dynamics). Total ion chromatogram (TIC) normalized peak heights were used for further analysis. Proteomic data were then separated into a 2/3 training set and 1/3 independent test set. The training set was used to create models (described below) that were qualified using the independent test set. This partitioning was performed randomly while maintaining matched pairs. Serum levels (ng/mL) of insulin-like growth factor-1 (IGF-1) and its binding protein, IGFBP-3, were determined using the commercial R&D Quantikine ELISA kit.
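Two of the data-handling steps above, normalization and the pair-preserving 2/3 and 1/3 partition, can be illustrated in a short sketch. This is not the Progenesis or Matlab workflow used in the study: the function names are hypothetical, and the normalization here divides by the sum of the supplied peak heights as a stand-in for the total ion signal.

```python
import random


def tic_normalize(peak_heights):
    """Scale one spectrum's peak heights so that they sum to 1."""
    total = sum(peak_heights)
    if total <= 0:
        raise ValueError("spectrum has no signal")
    return [h / total for h in peak_heights]


def split_matched_pairs(pair_ids, train_fraction=2 / 3, seed=0):
    """Randomly assign whole case-control pairs to training or test sets.

    Splitting on pair identifiers (rather than individual samples) keeps
    each case together with its matched control.
    """
    ids = sorted(set(pair_ids))
    rng = random.Random(seed)
    rng.shuffle(ids)
    n_train = round(len(ids) * train_fraction)
    return ids[:n_train], ids[n_train:]


normalized = tic_normalize([120.0, 30.0, 50.0])
train_pairs, test_pairs = split_matched_pairs(range(9))
print(normalized)
print(len(train_pairs), len(test_pairs))
```

Fixing the seed makes the partition reproducible, which matters when the held-out set is reused across several modeling steps.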
Data processing and analysis
We used a two-step procedure to identify discriminatory peaks using random forest (RF; 150 trees, min leaf = 1) feature selection and model optimization (20–22). The ensemble nature of an RF, together with its internal out-of-bag validation, generally yields good performance on external data across many data domains. An RF can also utilize categorical data, which was important in later steps when epidemiologic and medical claims data were included in classifier development.
During feature selection, we harnessed two properties of RFs: (i) out-of-bag (oob) error, and (ii) variable importance. Similar to the method proposed in ref. 23, we developed a two-step process to determine the most important variables in the training set. The first step was to construct a random forest of 150 trees using all m variables and then extract the out-of-bag variable importance (oobVI), which was used to rank the variables in order of decreasing importance. The second step was to grow a random forest with the first k ranked variables for each k from 1 to m and to calculate the oobError for each of the m forests. The k variables of the forest with the smallest error rate were selected.
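The two-step ranking procedure can be sketched independently of any particular random forest implementation. In the illustration below, `error_for` is a hypothetical callback standing in for the out-of-bag error of a forest grown on the given variables, and the importance scores and error values are made-up toy numbers, not study results.

```python
def select_by_ranked_error(importance, error_for):
    """Choose the top-k prefix of importance-ranked variables with lowest error.

    importance: dict mapping variable name -> importance score (step 1).
    error_for:  callable taking a list of variable names and returning an
                error estimate (stand-in for a forest's out-of-bag error).
    """
    # Step 1: rank variables by decreasing importance.
    ranked = sorted(importance, key=importance.get, reverse=True)

    # Step 2: evaluate nested models using the first k variables, k = 1..m.
    best_vars, best_err = None, float("inf")
    for k in range(1, len(ranked) + 1):
        err = error_for(ranked[:k])
        if err < best_err:  # strict '<' keeps the smaller model on ties
            best_vars, best_err = ranked[:k], err
    return best_vars, best_err


# Toy inputs for illustration only.
toy_importance = {"peak_a": 0.4, "peak_b": 0.3, "peak_c": 0.2, "peak_d": 0.1}
toy_errors = {1: 0.45, 2: 0.38, 3: 0.33, 4: 0.36}
chosen, chosen_err = select_by_ranked_error(
    toy_importance, lambda variables: toy_errors[len(variables)]
)
print(chosen, chosen_err)
```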
Model optimization was based on forward sequential feature selection with 10-fold cross-validation to generate models with minimum misclassification error and performance was qualified with the independent test set. An error rate <0.35 was considered acceptable for a model. All machine learning was performed using Matlab (v8.3.0.532; Mathworks).
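A generic version of forward sequential feature selection with k-fold partitioning is sketched below. It is an illustration under stated assumptions rather than the Matlab routine used here: `cv_error` is a hypothetical callback returning the cross-validated misclassification error for a feature set, and the error table is toy data.

```python
import random


def kfold_indices(n_samples, k=10, seed=0):
    """Shuffle sample indices and deal them into k roughly equal folds."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]


def forward_select(candidates, cv_error, max_error=0.35):
    """Greedily add the feature that most reduces the cross-validated error.

    Stops when no remaining feature lowers the error. Returns the selected
    features, their error, and whether that error is below max_error.
    """
    selected, best_err = [], float("inf")
    remaining = list(candidates)
    while remaining:
        # Score each candidate when added to the current feature set.
        err, feat = min((cv_error(selected + [f]), f) for f in remaining)
        if err >= best_err:
            break
        selected.append(feat)
        remaining.remove(feat)
        best_err = err
    return selected, best_err, best_err < max_error


# Toy error table for illustration only.
toy_cv = {
    ("a",): 0.40, ("b",): 0.45, ("c",): 0.48,
    ("a", "b"): 0.33, ("a", "c"): 0.41,
    ("a", "b", "c"): 0.34,
}
features, error, acceptable = forward_select(
    ["a", "b", "c"], lambda feats: toy_cv[tuple(sorted(feats))]
)
folds = kfold_indices(25, k=10)
print(features, error, acceptable)
```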
On the basis of the peaks selected during feature selection and model optimization using only the training set, models were next generated by including additional epidemiologic variables (demographics, lifestyle, medical history, as well as IGF-1 and IGFBP-3 biomarker concentrations), while keeping the test set held out. We first included epidemiologic variables from the Serum Repository (SR) only, which consisted of demographics, diagnosed medical conditions, medical procedures performed, and vaccinations. For the case–control pairs who completed the questionnaire, additional variables from the questionnaire were also used, which included race/ethnicity, highest education, service branch, marital status, religion, household income, number of people living in household, personal history of colon and rectal polyps, family history of cancer, physical activity, history of alcohol, history of smoking, deployment status, BMI (classified according to World Health Organization 2014), other health conditions, and treatments related to colon cancer. As questionnaire data were collected for two time periods, variables at corresponding time points were respectively used in the models. As a result, there were three sets of epidemiologic and surveillance variables used for modeling: (i) SR only, (ii) SR plus questionnaire information related to one year before diagnosis or reference date [SR+Q (diag)], and (iii) SR plus questionnaire information related to the time before first blood draw [SR+Q (bfbd)]. In all models, peaks selected during feature selection and model optimization were used as a base and additional epidemiologic and surveillance variables were added sequentially. Using this approach, sequential feature selection was used with 10-fold cross-validation to generate models with minimum misclassification error and performance was qualified with the independent test set.
Peptide identification
Next, we attempted to identify peptides of interest using high-resolution tandem mass spectrometry (Thermo Orbitrap Elite). To identify these peptides, we first selected serum samples with higher than average intensities of discriminatory peaks used in modeling and then generated a single case–control pooled sample. The pooled sample was injected into a C18 reversed phase nano-LC on a 75-μm × 15-cm capillary column packed in-house (YMC ODS-AQ 120Å S5; Waters Corporation) using a 120-minute linear gradient from 5% to 50% buffer B (97.8% acetonitrile, 2% HPLC H2O, 0.2% formic acid). Chromatography was carried out at 200 nL/minute using an Ultimate 3000 nanoflow system (Dionex) directly interfaced to an Orbitrap Elite tandem mass spectrometer (Thermo Fisher Scientific). Data-dependent acquisition was set to select the top 10 ions for MS/MS CID fragmentation at a normalized collision energy of 35%, with dynamic exclusion enabled using a repeat count of three, a repeat duration of 30 seconds, and an exclusion duration of 180 seconds. Acquired data files were converted to peak lists using Proteome Discoverer (v1.4), and the resulting Mascot generic format files were searched using the Mascot algorithm (v2.4.1). The database used was the UniProtKB 2015_05 release, comprising the SwissProt, SwissProt varsplic, and TrEMBL releases (167,678 entries); taxonomy Homo sapiens was specified. The decoy search parameter was specified within Mascot to calculate local FDRs. Searches were run with no enzyme specified and the following variable modifications: N-term pyroGlu, Met oxidation, and Asn/Gln deamidation, with a precursor tolerance of 20 ppm and a fragment ion tolerance of 0.8 Da. There were 107 protein families identified in the control pool at 4.65% local FDR and 102 identified protein families in the case pool at 5.17% local FDR.
Using the N-term acetyl variable modification, there were 96 protein families identified in the control pool at 0.00% local FDR and 104 identified protein families in the case pool at 2.13% local FDR. The results were mined to find confidently assigned peptide sequences that aligned with the peaks of interest identified during the MALDI-TOF profiling.
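To give a sense of scale for the precursor tolerance used in the database search, a parts-per-million tolerance can be converted to an absolute mass window as follows. This is a generic illustration of the ppm calculation, not Mascot's internal matching logic; the function names and the example mass of 3,000 are arbitrary choices.

```python
def ppm_window(mass, ppm=20.0):
    """Return the (low, high) mass window for a +/- ppm tolerance."""
    half_width = mass * ppm / 1e6
    return mass - half_width, mass + half_width


def within_ppm(observed, theoretical, ppm=20.0):
    """Return True if observed lies within +/- ppm of theoretical."""
    return abs(observed - theoretical) / theoretical * 1e6 <= ppm


low, high = ppm_window(3000.0)
print(f"20 ppm at mass 3000: {low:.2f} to {high:.2f}")
print(within_ppm(3000.05, 3000.0), within_ppm(3000.07, 3000.0))
```

At a mass of 3,000, a 20 ppm tolerance therefore corresponds to roughly ±0.06 mass units.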
Results
Table 1 shows the distribution of demographic characteristics by case–control status among individuals with serum samples (SR) and those with serum samples who also completed the questionnaire (SR+Q). For both the SR and SR+Q groups, cases and controls were similar in the distribution of matching variables (age, gender, race, and number of serum samples). Among participants in the SR+Q group, cases were more likely to be in the Army and Air Force and less likely to be in the Navy and Marines compared with controls. The proportions of Protestants and of subjects with unknown religion appeared higher in cases than in controls, while the proportions of those with no religion and of those of Jewish religion tended to be lower among cases. In addition, cases were more likely than controls to have a household income of $45,000–$59,999, but less likely to have an income of $60,000 or more. The two groups were similar in ethnicity and highest education received.
Characteristics | Serum repository (N = 794): cases, n (%) | Serum repository (N = 794): controls, n (%) | Serum repository + questionnaire (N = 430): cases, n (%) | Serum repository + questionnaire (N = 430): controls, n (%) |
---|---|---|---|---|
Ageᵃ, mean years (SD) | 39.47 (8.69) | 39.54 (8.70) | 41.12 (8.07) | 41.22 (8.09) |
Gender | ||||
Male | 331 (83.4) | 331 (83.4) | 177 (82.3) | 177 (82.3) |
Female | 66 (16.6) | 66 (16.6) | 38 (17.7) | 38 (17.7) |
Race | ||||
White | 276 (69.5) | 276 (69.5) | 162 (75.4) | 162 (75.4) |
Black | 84 (21.2) | 84 (21.2) | 33 (15.4) | 33 (15.4) |
Other/Unknown | 37 (9.3) | 37 (9.3) | 20 (9.3) | 20 (9.3) |
No. of serum samples | ||||
1 | 23 (5.8) | 23 (5.8) | 13 (6.1) | 13 (6.1) |
2 | 44 (11.1) | 44 (11.1) | 20 (9.3) | 20 (9.3) |
3 | 54 (13.6) | 55 (13.9) | 29 (13.5) | 29 (13.5) |
4 | 276 (69.5) | 275 (69.3) | 153 (71.2) | 153 (71.2) |
Vital status | ||||
Alive | 227 (57.1) | 397 (100) | 208 (96.8) | 215 (100) |
Deceased | 170 (42.9) | 0 | 7 (3.3) | 0 |
Ethnicity | ||||
Hispanic | — | — | 11 (5.1) | 7 (3.3) |
Not Hispanic | — | — | 204 (94.9) | 208 (96.7) |
Unknown | — | — | 0 | 0 |
Highest educationᵃ | ||||
High school or less | — | — | 19 (8.8) | 22 (10.2) |
Some college, technical, vocational | — | — | 82 (38.2) | 73 (34.0) |
College | — | — | 47 (21.9) | 49 (22.8) |
Graduate, professional | — | — | 67 (31.2) | 70 (32.6) |
Unknown | — | — | 0 | 1 (0.5) |
Service branchᵃ | ||||
Army | — | — | 68 (31.6) | 49 (22.8) |
Navy | — | — | 64 (29.8) | 108 (50.2) |
Air Force | — | — | 55 (25.6) | 24 (11.2) |
Marines | — | — | 16 (7.4) | 31 (14.4) |
Other/Unknown | — | — | 12 (5.6) | 3 (1.4) |
Religion | ||||
None | — | — | 31 (14.4) | 40 (18.6) |
Protestant | — | — | 122 (56.7) | 107 (49.8) |
Jewish | — | — | 2 (0.9) | 4 (1.9) |
Catholic | — | — | 51 (23.7) | 57 (26.5) |
Other | — | — | 5 (2.3) | 5 (2.3) |
Unknown | — | — | 4 (1.8) | 2 (0.9) |
Household incomeᵃ | ||||
<$29,999 | — | — | 16 (7.4) | 12 (5.6) |
$30,000–$44,999 | — | — | 36 (16.7) | 38 (17.7) |
$45,000–$59,999 | — | — | 60 (27.9) | 45 (20.9) |
>$60,000 | — | — | 102 (47.4) | 119 (55.4) |
Unknown | — | — | 1 (0.5) | 1 (0.5) |
People in householdᵃ | ||||
1 | — | — | 41 (19.1) | 43 (20.0) |
2 | — | — | 44 (20.5) | 43 (20.0) |
3 | — | — | 54 (25.1) | 45 (20.9) |
4 | — | — | 48 (22.3) | 52 (24.2) |
>4 | — | — | 27 (12.6) | 32 (14.9) |
Unknown | — | — | 1 (0.5) | 0 |
ᵃ At diagnosis for cases and corresponding reference date for controls.
Table 2 presents the m/z values of proteomic peaks identified from feature selection and model optimization for each time interval. After optimization, the only error rate lower than the threshold level of 0.35 was observed for one year prior to colon cancer diagnosis. The peaks identified had m/z values of 3,119.32, 2,886.67, 2,939.23, and 5,078.81 and a sensitivity of 69% and specificity of 67%.
Time intervalᵃ (y) | Feature selection: protein peaks (m/z) | Optimization: protein peaks (m/z) | Sensitivity/specificity | Error rate |
---|---|---|---|---|
1 | 3,288.69, 2,818.53, 3,119.32, 2,886.67, 2,835.08, 3,257.54, 2,939.23, 5,078.81, 4,227.07, 5,621.97, 3,579.74 | 3,119.32, 2,886.67, 2,939.23, 5,078.81 | 0.691/0.673 | 0.318 |
2 | 4,247.51, 1,999.89, 2,754.29, 6,186.56, 5,427.29 | 4,247.51, 2,754.29, 6,186.56, 5,427.29 | 0.652/0.370 | 0.489 |
3 | 6,004.53, 2,285.10 | 6,004.53, 2,285.10 | 0.452/0.548 | 0.500 |
4–5 | 3,682.93, 2,522.61, 2,506.06, 2,909.06, 3,523.29, 4,247.51, 3,642.04, 2,080.68, 2,133.24 | 2,909.06, 4,247.51, 2,080.68 | 0.308/0.590 | 0.551 |
6–8 | 3,650.80, 2,818.53, 4,185.21, 4,339.98, 3,443.46, 3,086.22, 2,273.42, 5,671.62, 4,518.12, 2,313.33, 2,990.83, 2,583.94, 3,097.90, 7,186.26, 5,471.09, 2,430.14, 3,257.54, 3,241.97 | 2,990.83 | 0.500/0.344 | 0.578 |
8+ | 4,128.75, 2,293.86, 3,147.55, 4,339.98, 5,804.00, 4,185.21, 7,562.00, 2,835.08, 4,247.51, 6,785.21, 3,916.55, 5,015.53, 6,186.56 | 3,147.55, 4,185.21, 7,562.00, 6,186.56 | 0.612/0.510 | 0.439 |
NOTE: Unique samples from matched case-control pairs with case confirmation in ACTUR were used.
ᵃ Time before cancer diagnosis (cases) or reference date (controls).
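The error rates in Table 2 appear to equal one minus the mean of sensitivity and specificity, which is what the overall misclassification rate reduces to when cases and controls are equal in number, as in a 1:1 matched design. This relationship is inferred from the tabulated values rather than stated as a formula in the methods; the sketch below checks it and applies the 0.35 acceptance threshold.

```python
# (sensitivity, specificity, reported error rate) per time interval, from Table 2.
TABLE2 = {
    "1": (0.691, 0.673, 0.318),
    "2": (0.652, 0.370, 0.489),
    "3": (0.452, 0.548, 0.500),
    "4-5": (0.308, 0.590, 0.551),
    "6-8": (0.500, 0.344, 0.578),
    "8+": (0.612, 0.510, 0.439),
}


def balanced_error(sensitivity, specificity):
    """Misclassification rate when cases and controls are equally frequent."""
    return 1 - (sensitivity + specificity) / 2


for interval, (sens, spec, reported) in TABLE2.items():
    computed = balanced_error(sens, spec)
    print(f"{interval:>3}: computed {computed:.3f}, reported {reported:.3f}")

# Intervals whose reported error rate is below the 0.35 threshold.
acceptable = [k for k, (_, _, err) in TABLE2.items() if err < 0.35]
print(acceptable)
```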
Epidemiologic factors were then assessed in addition to the final identified peaks from time interval 1 (Table 3). For the SR group, all factors selected into the model were vaccination related. However, sensitivity and specificity remained similar to those of the model without these factors. When questionnaire information at the collection of the first blood sample was considered in addition to factors from medical records, histories of other cancers and tonsillectomy were retained in the final model, which improved the specificity from 67% to 76%. When questionnaire data at colon cancer diagnosis were assessed, smoking at diagnosis was retained in the final model and increased the sensitivity from 69% to 76%.
Model | Error rate | Sensitivity/specificity | Final variablesᵃ |
---|---|---|---|
SR | 0.33 | 0.691/0.655 | 3,119.316895; 2,886.669922; 2,939.234619; 5,078.806641; Adenovirus vaccine; Anthrax vaccine; Influenza H1N1 vaccine |
SR + Q (bfbd) | 0.28 | 0.680/0.760 | 3,119.316895; 2,886.669922; 2,939.234619; 5,078.806641; Other cancer; Tonsillectomy |
SR + Q (diag) | 0.28 | 0.760/0.680 | 3,119.316895; 2,886.669922; 2,939.234619; 5,078.806641; Smoking at diagnosis |
Abbreviations: bfbd, before first blood draw; diag, before diagnosis; Q, questionnaire; SR, serum repository.
ᵃ Peaks included in each model are from time period 1 in Table 2.
Table 4 shows the identification of peptides using serum samples with higher intensities of the identified peaks. Three of the four peaks (2,886.670, 2,939.235, and 3,119.317 m/z) were identified; two (2,886.67 and 3,119.32 m/z) were histone acetyltransferases and one (2,939.23 m/z) was a transporting ATPase subunit.
Peak (m/z) | UniProt ID | Description | Mcalc | Δ Mass | Sequence |
---|---|---|---|---|---|
2,886.670 | Q92794 | Histone acetyltransferase (KAT6A) | 2,886.386 | 0.72 | PSAVAMQAGPRALAVQRGMNMGVNLMPT + 3 Deamidated (NQ); Oxidation (M) |
2,939.235 | Q13733 | Sodium/potassium-transporting ATPase subunit alpha-4 (ATP1A4) | 2,937.5243 | 0.71 | LRTELRPGETLNVNFLLRMDRAHE + Acetyl (N-term); Oxidation (M) |
3,119.317 | Q7Z6C1 | Histone acetyltransferase p300 (EP300) | 3,118.3464 | 0.03 | QQGSPQMGGQTGLRGPQPLKMGMMNNPNP + 4 Deamidated (NQ); 4 Oxidation (M) |
5,078.807 | — | — | — | — | — |
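The Δ Mass column in Table 4 is approximately reproduced if Mcalc is taken to be the neutral peptide mass and each MALDI peak is taken to be a singly protonated ion. Both points are assumptions made for this sketch, not statements from the table; under them, the computed differences agree with the reported values to within about 0.01.

```python
PROTON_MASS = 1.00728  # mass of a proton in Da

# (observed peak m/z, Mcalc, reported delta mass) from Table 4.
TABLE4 = [
    (2886.670, 2886.386, 0.72),
    (2939.235, 2937.5243, 0.71),
    (3119.317, 3118.3464, 0.03),
]


def delta_mass(observed_mz, neutral_mass, charge=1):
    """Absolute difference between observed and theoretical m/z.

    Assumes the ion carries `charge` added protons ([M + zH]^z+).
    """
    theoretical_mz = (neutral_mass + charge * PROTON_MASS) / charge
    return abs(observed_mz - theoretical_mz)


for observed, m_calc, reported in TABLE4:
    computed = delta_mass(observed, m_calc)
    print(f"{observed:.3f}: computed {computed:.2f}, reported {reported:.2f}")
```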
Discussion
Using serial prediagnostic serum samples and collected epidemiologic information, this study identified proteomic profiles that have the potential to discriminate colon cancer patients from randomly selected matched controls within the military. Proteomic peaks (2,886.67, 2,939.23, 3,119.32, and 5,078.81 m/z) were identified one year prior to colon cancer diagnosis with a sensitivity of 69% and a specificity of 67%. When epidemiologic information was also considered, factors at colon cancer diagnosis reduced the error rate and increased the sensitivity, while factors at first blood sample collection were found to increase specificity.
Of the four proteomic peaks, three were identified as peptides derived from KAT6A, ATP1A4, and EP300. Sodium/potassium–transporting ATPase subunit alpha-4 (ATP1A4) is encoded by the ATP1A4 gene and catalyzes the hydrolysis of ATP coupled to the movement of sodium and potassium ions across the plasma membrane (24). In addition, histone acetyltransferase KAT6A, encoded by the KAT6A gene, is a component of the MOZ/MORF complex and acetylates residues on histones H3 and H4 (24). The KAT6A gene has been found to play a role in breast carcinogenesis (25), while the MOZ complex has been implicated in the inhibition of cellular senescence in mouse embryonic fibroblasts (26). The EP300 protein (E1A binding protein P300) is encoded by the tumor suppressor gene EP300 and is also involved in histone acetyltransferase activity and transcription regulation through chromatin remodeling (24). One study found that frameshift mutations in EP300 occurred in gastric and colorectal cancers, where EP300 expression was lost in 12%–24% of patients (27).
Identification of proteomic markers in samples collected in the year prior to colon cancer diagnosis, but not in those collected earlier, is theoretically reasonable. Conceptually, the closer the collection date of a prediagnostic sample to colon cancer diagnosis, the higher the likelihood of identifying a cancer-related protein. To the best of our knowledge, there have been no proteomic studies on colon cancer detection using serial prediagnostic serum samples. In a study on ovarian cancer (28), prediagnostic serum samples from ovarian cancer patients and age-matched controls were analyzed using the CA125 immunoassay and SELDI-TOF-MS. The study showed that CA125 was elevated in 40 of 65 (61.5%) serum samples collected less than one year prior to cancer diagnosis, but in only 1 of 50 (2%) samples collected more than one year prior to cancer diagnosis. Although this study was not on colon cancer, its results suggest that protein biomarkers may be more detectable in the time period closer to cancer diagnosis.
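As a quick arithmetic check of the proportions quoted from the ovarian cancer study (28), the short Python sketch below recomputes both percentages from the counts given above. The function name is ours and is used for illustration only.

```python
def percent_elevated(elevated: int, total: int) -> float:
    """Percentage of samples with elevated CA125, rounded to one decimal place."""
    return round(100 * elevated / total, 1)

# Counts as reported in the text for the ovarian cancer study (28).
within_one_year = percent_elevated(40, 65)     # collected < 1 year before diagnosis
more_than_one_year = percent_elevated(1, 50)   # collected > 1 year before diagnosis

print(within_one_year, more_than_one_year)
```

The two values correspond to the 61.5% and 2% figures cited in the paragraph above.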
As the only proteomic study on colon cancer utilizing prediagnostic samples with both biologic and epidemiologic data, this study had several strengths. First, the study is unique in its use of serial prediagnostic samples to identify the earliest time at which discriminatory proteomic profiles might occur and whether the profiles vary by time prior to colon cancer diagnosis. Our results suggest the importance of measuring serial samples, providing a basis for further research on early detection and prediction of colon cancer. Second, this study recruited an appropriate and comparable control group. In many previous studies, controls were drawn from hospital-based convenience samples and were not clearly defined. Thus, they might not have come from the same target population as cases (29). As a result, biomarkers related to noncancer conditions might have been more prevalent among cancer patients and ascertained as cancer biomarkers. In addition, ambiguity about whether comparison groups were comparable might have limited the accuracy and reliability of results. In our study, the controls were from the same target population (active-duty members) and were selected randomly based on certain matching criteria. Therefore, they were theoretically comparable with cases, except on colon cancer status. Third, in previous studies, sample collection, storage, storage time, and processing might have differed between controls and cases, which might have influenced study results (15, 28, 30). In our study, prediagnostic serum samples were collected and processed by the DoDSR with standardized procedures, without reference to case–control status. We also matched cases and controls by time at sample collection (and therefore by storage time). This can preclude the possibility that identified biomarkers result from differences between cancer patients and controls in sample collection, storage, and processing.
Fourth, our study included various demographic, medical, and epidemiologic data, which are not often collected and used in analysis. As protein levels are likely affected by various factors (31), proteomic detection/prediction of cancer should include additional demographic and epidemiologic information. While the effects of these factors are not clear at the present time, our findings suggest the importance of considering epidemiologic factors for improving early detection or prediction of colon cancer with identified and validated proteomic profiles.
While our study has several strengths, there are also limitations. First, we cannot exclude the possibility of protein degradation. While serum samples were frozen within 48 hours of sample collection, they were shipped at room temperature and stored in walk-in freezers at −30°C rather than −80°C. Although some studies showed that (i) the levels of high-abundance proteins, to which small proteins (e.g., cancer proteins) are bound and thus protected (32–34), remained similar after exposure to room temperature for 48 hours or several months (35–38) and (ii) low-abundance proteins were relatively stable after storage at −20°C to −40°C for many years (39–41), protein degradation was likely. We applied quality control procedures to exclude samples with poor quality to reduce the potential effects of degradation. However, we cannot exclude the possibility of degradation, particularly because we found peaks in the year prior to diagnosis but not in earlier time intervals. Nevertheless, the identified proteomic differences between cases and controls show our capability to detect proteomic changes. Second, some proteomic biomarkers with extremely low abundance might not be detected using our measurement methods. This may be particularly true for prediagnostic samples collected many years prior to cancer diagnosis, when cancer cells are scarce. As the detection limit of conventional mass spectrometry analysis is typically above 50 pg/mL, biomarkers at lower concentrations cannot be detected (29). Third, processing and analysis of high-throughput data are still in their infancy, with many technical challenges (42–44). Differences in preanalytical data processing, algorithms, and model-building approaches can lead to low reproducibility of identified biomarkers.
Finally, available analytical tools such as machine learning do not have built-in capabilities to analyze more complex data such as matched samples, samples from different time periods, and multiple samples within the same time period from the same person, all of which occurred in our study.
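One common way to work around part of this limitation is to split data by matched set rather than by individual sample, so that a case, its matched controls, and all of their serial samples fall on the same side of the training/validation divide. The Python sketch below illustrates that idea only. It is not the procedure used in this study, and the `matched_set` field, the function name, and the toy records are all hypothetical.

```python
import random
from collections import defaultdict

def split_by_matched_set(samples, train_fraction=2/3, seed=0):
    """Assign whole matched sets (hypothetical 'matched_set' key) to
    training or validation so that related samples are never separated."""
    groups = defaultdict(list)
    for sample in samples:
        groups[sample["matched_set"]].append(sample)

    # Shuffle the set identifiers, not the individual samples.
    set_ids = sorted(groups)
    random.Random(seed).shuffle(set_ids)
    n_train = round(len(set_ids) * train_fraction)

    train = [s for gid in set_ids[:n_train] for s in groups[gid]]
    validation = [s for gid in set_ids[n_train:] for s in groups[gid]]
    return train, validation

# Toy records (illustrative only): 6 matched sets, each with one case and
# one control, and two serial blood draws per person.
toy = [
    {"matched_set": m, "person": f"{m}-{role}", "draw": d}
    for m in range(6) for role in ("case", "control") for d in (1, 2)
]
train, validation = split_by_matched_set(toy)
```

The default 2/3 to 1/3 ratio mirrors the training and validation proportions described in the Methods. A split of this kind keeps matched and repeated samples together, but it does not by itself model the correlation among them.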
Using the unique resource of serial prediagnostic serum samples stored at the DoD Serum Repository, this nested case–control study identified potential proteomic biomarkers for early detection and prediction of colon cancer, although the sensitivities and specificities were not particularly high. Four peptide masses were found to be useful in discriminating colon cancer cases from controls, and as we have assigned putative identifications to all but one mass, future assay development is feasible using either SRM mass spectrometry–based assays or immunosorbent assays. This investigation also showed the possible importance of including epidemiologic factors in discriminant and prediction models. The processing and analysis of high-throughput data, especially those with multiple samples not only from different time periods but also within the same period, are challenging and warrant continued and unremitting efforts in further analysis.
Disclosure of Potential Conflicts of Interest
No potential conflicts of interest were disclosed.
Disclaimer
The opinions and assertions expressed in this article represent the private views of the authors and do not reflect the official views of the U.S. Departments of the Army and Navy, the Uniformed Services University of the Health Sciences, the Department of Defense, the National Cancer Institute, or the U.S. government. Nothing in the presentation implies any Federal/Department of Defense endorsement.
Authors' Contributions
Conception and design: T.-C. Kao, K. Zhu
Development of methodology: S. Shao, B.A. Neely, T.-C. Kao, E.E. Jones, R.R. Drake, K. Zhu
Acquisition of data (provided animals, acquired and managed patients, provided facilities, etc.): J. Eckhaus, J. Bourgeois, J. Brooks, E.E. Jones, R.R. Drake, K. Zhu
Analysis and interpretation of data (e.g., statistical analysis, biostatistics, computational analysis): S. Shao, B.A. Neely, T.-C. Kao, J. Brooks, R.R. Drake, K. Zhu
Writing, review, and/or revision of the manuscript: S. Shao, B.A. Neely, T.-C. Kao, R.R. Drake, K. Zhu
Administrative, technical, or material support (i.e., reporting or organizing data, constructing databases): S. Shao, T.-C. Kao, J. Bourgeois
Study supervision: T.-C. Kao, K. Zhu
Acknowledgments
The project team thanks the following institutes for their contributions to or support for the project: Uniformed Services University of the Health Sciences, Armed Forces Health Surveillance Center, DoD Automated Tumor Registry, Henry M. Jackson Foundation, Defense Manpower Data Center, Former Armed Forces Institute of Pathology, Walter Reed National Military Medical Center, and Tricare Management Activity. We would also like to thank Dr. I-Jen Pan, Ms. Kanchana Perera, and Dr. Maciek Sasinowski for their contributions to the early stages of this project.
Grant Support
This study was supported by grant no. 5R01CA118707 from the National Cancer Institute (to K. Zhu).
The costs of publication of this article were defrayed in part by the payment of page charges. This article must therefore be hereby marked advertisement in accordance with 18 U.S.C. Section 1734 solely to indicate this fact.