A noninvasive blood test that could reliably detect early colorectal cancer or large adenomas would provide an important advance in colon cancer screening. The purpose of this study was to determine whether a serum proteomics assay could discriminate between persons with and without a large (≥1 cm) colon adenoma. To avoid problems of “bias” that have affected many studies about molecular markers for diagnosis, specimens were obtained from a previously conducted study of colorectal cancer etiology in which bloods had been collected before the presence or absence of neoplasm had been determined by colonoscopy, helping to assure that biases related to differences in sample collection and handling would be avoided. Mass spectra of 65 unblinded serum samples were acquired using a nanoelectrospray ionization source on a QSTAR-XL mass spectrometer. Classification patterns were developed using the ProteomeQuest® algorithm, performing measurements twice on each specimen, and then applied to a blinded validation set of 70 specimens. After removing 33 specimens that had discordant results, the “test group” comprised 37 specimens that had never been used in training. Although in the primary analysis, no discrimination was found, a single post hoc analysis, done after hemolyzed specimens had been removed, showed a sensitivity of 78%, a specificity of 53%, and an accuracy of 63% (95% confidence interval, 53-72%). The results of this study, although preliminary, suggest that further study of serum proteomics, in a larger number of appropriate specimens, could be useful. They also highlight the importance of understanding sources of “noise” and “bias” in studies of proteomics assays. (Cancer Epidemiol Biomarkers Prev 2008;17(8):2188–93)
Background and Purpose
In the United States, colorectal cancer is responsible for ∼150,000 cancers and 75,000 deaths per year (1). The large majority of colorectal cancers are thought to arise from adenomatous colon polyps; large adenomas (≥1 cm) are thought to become clinical cancer at a rate of roughly 1% per year (2) and, along with “early” and curable colorectal cancer, constitute a major target for screening.
A noninvasive test for colorectal cancer would be very useful clinically. Among the screening tests currently recommended, colonoscopy and sigmoidoscopy are invasive, require laxative preparation, and may incur risks of bleeding, perforation, or complications of conscious sedation. Fecal occult blood testing is noninvasive but has very limited sensitivity and must be done yearly or every-other-year, involving a collection process that may be bothersome and lead to low compliance among some people. There is an urgent need for a noninvasive procedure to identify patients with adenoma or carcinoma. One promising emerging technology is the use of serum profiling for the detection of cancers.
No markers for colon adenoma or carcinoma have been well-demonstrated, although a number of preliminary reports have suggested that serum-based signals may be associated with these growths (3-11). Profiling of serum using surface-enhanced laser desorption/ionization, followed by applying artificial neural network and support vector machine analysis, was used to identify patterns of markers that differentiated carcinoma, adenoma, and normal healthy people (8, 12); however, it is not clear that results were assessed in subjects totally independent of those used in “training” to rule out the possibility of overfitting (13).
Although potential serum markers for colorectal cancer have been studied by a number of groups, as noted above, the interpretation of such studies may be substantially limited by threats to validity. “Overfitting,” a problem caused by chance, can occur when patterns or a list of analytes is derived from a large number of candidates that is “fit” to a small number of subjects. Demonstrating that overfitting did not occur can be done by assessing the model derived in the “training set” using subjects in a “validation set” that is totally independent of those used in training (13). Bias can occur when systematic differences among the compared groups account for the “discrimination” found (14).
This study was designed to determine if serum could be used to discriminate between people with and without large adenomatous colon polyps. To achieve the goal of reducing or eliminating the possibility of bias (14), the population used was one undergoing screening colonoscopy in which bloods were drawn before the procedure and before the “true state” was known. This feature of the study design helps prevent bias that could occur from differential handling of specimens. To help achieve the goal of avoiding chance as the explanation for results, total independence of the “validation” set was maintained throughout the entire study.
Bias, the most serious problem in nonexperimental clinical research (14), can come from many sources including the study population (e.g., if there are age, ethnic, or gender differences between the adenoma and adenoma-free subjects), the metabolic status of the subjects prior to serum collection (e.g., fed, fasted, intestines evacuated), the way the bloods are collected (e.g., tube type, site of collection, timing relative to an invasive procedure), the way the serum is processed (e.g., length of time for clotting, temperature, clot removal) and stored (e.g., time to freezer, freezer temperature, freeze/thaw history), and the way the samples are analyzed (e.g., order in which the samples are analyzed, days on which different sample types are analyzed). In an ideal study, all these factors are controlled and variations are accounted for in a balanced experimental design; such control rarely happens in retrospectively collected sample collections, and it may be logistically difficult to achieve in a prospective study if the disease under study has a low incidence.
This study uses a set of already collected sera that, because they were collected before the diagnosis was known, were believed to have considerably less bias in their collection and handling than samples in many other studies. The clinical question in this study (whether a serum proteomics approach can discriminate persons with and without large adenomatous polyps of the colon) was chosen in part because of the availability of a rigorously collected set of specimens; however, the question is also clinically very relevant because large adenomas (≥1 cm) are important precursors of colorectal cancer and constitute important targets that clinicians would like to discover and remove.
Materials and Methods
The overall strategy was to use specimens from a group of subjects who had been colonoscopically screened for colon neoplasia and in whom blood samples had been drawn in a standardized manner before the colonoscopy procedure, so that bloods would be collected and handled in a uniform and “blinded” manner.
All aspects of the study population, their blood sample quality, selection, collection, and processing to serum were under the direction of University of North Carolina investigators (C. Martin, J. Galanko, and T. Keku). This included the association of pathology with the samples, the creation of sample IDs, and the selection of those samples to use in model development as opposed to blinded validation. The key to the blinded validation set was known only to University of North Carolina investigators (C. Martin and J. Galanko) and has not, even postanalysis, been unblinded to other participants, ensuring the complete independence of the data analysis and the scoring of the blinded samples. The specimens came from two large cross-sectional studies, the Diet and Health Studies III and IV, conducted by the same research team between 1998 and 2002 at the University of North Carolina Hospital, a large referral center in Chapel Hill, NC (15, 16). Both studies were designed to assess neoplasia etiology in relation to lifestyle and biological risk factors, as measured by questionnaires, blood, and biopsies. Both studies were approved by the Committee for the Protection of the Rights of Human Subjects at the University of North Carolina School of Medicine, and all participants gave written informed consent.
Participants were recruited from consecutive patients who underwent screening colonoscopy at the University of North Carolina Hospital during the recruitment periods. Eligible patients were 30 years of age or older, sufficiently proficient cognitively in the English language to complete a telephone risk factor interview, and had no known history of familial polyposis, colitis, previous colonic resection, previous colon cancer, or adenoma. Age, ethnicity, and gender were recorded along with a reference number anonymized to the researchers. Bloods were drawn prior to initiation of the procedure and were transported to a laboratory for processing of the serum without knowledge of the colonoscopy result. Polyps were removed at the time of colonoscopy by board-certified gastroenterologists or supervised gastroenterology trainees and were sent to a central laboratory for histologic interpretation. A single pathologist reviewed all slides and classified polyps as adenomatous, hyperplastic, or other (e.g., lymphoid nodules, inflammatory, no pathologic diagnosis). The anonymized records were then annotated, retrospectively, with the outcomes of all patients—normal, healthy; normal with hyperplastic; small adenoma +/− hyperplastic; medium adenoma +/− hyperplastic; and large adenoma +/− hyperplastic. For the purposes of this study, large adenomas were those ≥1 cm in diameter as estimated by the colonoscopist. Participants with multiple adenomas were categorized according to the largest adenoma. Participants whose colonoscopy procedure did not achieve complete visualization of the colon to the cecum, or who had unsatisfactory preparation, were excluded.
In both studies, prior to the day of the examination, all patients followed the same regimen which involved a 24-h fast and bowel cleansing using the laxative Go-Lytely, a proprietary mixture of polyethylene glycol and electrolytes (sodium sulfate, sodium bicarbonate, sodium chloride, and potassium chloride), or Phospho-Soda (Fleet), a sodium phosphate saline solution, according to a standard protocol. On the day of the procedure, prior to administration of medication, an i.v. catheter was inserted into the patient's arm and 10 mL of blood was immediately withdrawn using a royal blue top mineral-free vacutainer and stored temporarily in a refrigerator at 4°C in the clinical gastroenterology unit to allow clotting. Tubes were then transported to the lab in an adjacent building for serum separation within 2 to 6 h. Diagnoses (e.g., “normal” or “adenoma”) were not noted on the specimen label, and all personnel handling the specimens were unaware of the diagnosis. Specimens from patients seen late in the day were processed the next morning after specimens had been stored at 4°C overnight. Because of this broad range of times from collection to freezing, it is possible that “noise” (that would obscure or degrade signals from the adenoma) could be introduced into specimens, thus preventing the detection of a signal or difference (17, 18). Tubes were centrifuged for 5 min at 2,000 rpm using a fixed-angle Adams compact clinical centrifuge (Becton Dickinson) with a standard rotor, to separate serum that was then collected in 3.5-mL cryogenic vials that were labeled, placed in freezer boxes, and stored at −80°C. The time from spinning to freezing was not strictly controlled but was generally done within 20 min. All samples were aliquotted at a single time so that there was only a single freeze/thaw cycle prior to thawing for the current analysis. 
To prepare aliquots for analysis, vials were thawed on ice and then vortexed to mix contents thoroughly, before 250 μL aliquots were withdrawn and placed in sterile 1.5 mL Eppendorf tubes and stored at −80°C until they were shipped.
Prior to the start of any analysis, investigators at the University of North Carolina-Chapel Hill randomly assigned the relevant sera of the two kinds of subjects (large adenoma +/− hyperplastic; normal, no hyperplasia) into two groups. In the model development group, the sample identity of large adenoma versus normal was provided; in the blinded validation group, only an anonymous identifier was provided. The intent of the random selection to the training and validation groups was to minimize possible biases arising from unequal distribution of factors, such as age, sex, smoking status, and sera separation/processing times, which might contribute to analytic variability. Once assigned, the samples were shipped on dry ice to Correlogic Systems, Inc., for mass spectroscopy and data analysis.
Following model development and selection of the best model from the development phase, spectra from the blinded validation set were classified using that model, and the classification results were sent to the holder of the blinding key at University of North Carolina for scoring. The subsequent statistical analysis of the results and interpretation of the statistical significance of the classification were done solely by J. Galanko and C. Martin. A point estimate and 95% confidence interval for overall accuracy (true positives plus true negatives divided by the total number of predictions) was calculated to assess discrimination.
Analytic Mass Spectrometry
Analysis of the serum by mass spectrometry was done as a service by an independent research organization following detailed protocols established and provided by Correlogic Systems. Briefly, serum samples, stored in a −80°C freezer prior to use, were thawed at room temperature for 30 min and then mixed gently by vortex to ensure a complete suspension. For dilution, 2 μL of the serum was pipetted into 1.5 mL tubes containing 498 μL of mobile phase consisting of 50% (v/v) acetonitrile (Burdick & Jackson), 0.2% (v/v) formic acid (Suprapur; EM Science), to give a final 250-fold dilution. Samples were then mixed well and held at 4°C, overnight. Prior to analysis, samples were then centrifuged at 13,000 × g for 15 min at 4°C, and 150 μL of supernatant was transferred into individual wells of a prewashed 96-well microtiter plate. Duplicates of each serum preparation were placed in adjacent wells. The plates were then covered to prevent evaporation with a heat-sensitive film (ABgene) and sealed for 4 s using a thermo-sealer (ABgene). Prior to use, the 96-well sample plates (NUNC 267245, 0.5 mL polypropylene) were washed twice with deionized water, twice with mobile phase, and finally air-dried in an inverted orientation. For spectral analysis, sera were distributed evenly across each 96-well autosampler plate so that spectra representing the adenoma, nonadenoma, and blinded samples were acquired in a positionally and temporally independent manner.
Mass spectra were acquired using an ABI QSTAR-XL mass spectrometer, with an Advion Nanomate 100 automated nanoelectrospray system. Tuning and calibration of the spectrometer were done using 7.5 × 10⁻⁶ mol/L of CsI and 1 × 10⁻⁶ mol/L of sex pheromone inhibitor iPD1 Octapeptide (ALILTLVS), according to the manufacturer's recommendation. Samples were held at room temperature for analysis. The spray pressure on the source was set to 0.6 psi and the voltage to 1.6 kV. Five microliters of sample was picked up with a 1 μL air gap and sprayed for 70 s. Contact closure started 5 s after spray initiation, when a stable spray had been established. The spectra were acquired in positive time of flight-mass spectrometry mode from 500 to 1,400 m/z, using 30 cycles of 2 s each with multi-channel acquisition on. Duplicate spectra were acquired for all samples.
Deriving Discriminating Patterns in the Training Data Set
Analysis of the raw mass spectral data and the generation of potential classification models were done by Correlogic Systems using methods previously established. Prior to modeling, all spectra were aligned by linear binning at 100 ppm over the range 500 to 1,100 m/z.
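The binning step can be illustrated with a minimal sketch. The text does not say how the bin edges were constructed, so the code below assumes one common reading of a 100 ppm bin width: each bin spans 100 ppm of its own lower edge. The function names `ppm_bin_edges` and `bin_spectrum` are hypothetical and are not part of the Correlogic software.

```python
import bisect
import math


def ppm_bin_edges(mz_min=500.0, mz_max=1100.0, ppm=100.0):
    """Bin edges in which each bin is `ppm` parts per million wide
    relative to its lower edge (an assumed reading of '100 ppm')."""
    step = 1.0 + ppm * 1e-6
    n_bins = math.ceil(math.log(mz_max / mz_min) / math.log(step))
    return [mz_min * step ** k for k in range(n_bins + 1)]


def bin_spectrum(mz, intensity, edges):
    """Sum peak intensities per bin; peaks outside the edge range are dropped."""
    binned = [0.0] * (len(edges) - 1)
    for m, i in zip(mz, intensity):
        if edges[0] <= m < edges[-1]:
            # bisect finds the bin whose lower edge is <= m
            binned[bisect.bisect_right(edges, m) - 1] += i
    return binned
```

Placing every spectrum on the same edges gives all samples a common feature vector, which is what the classifiers described next require.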
Three methods of data analysis were conducted and compared. Two were traditional classification schemes—k-nearest neighbor (kNN) and OneR (One Rule); the third was a nonlinear pattern recognition algorithm, ProteomeQuest® (Correlogic Systems). All methods were done using a 10-fold cross-validation strategy which holds out 10% of the model development set as a validation set. The results reported are the mean validation performances and standard deviations.
OneR is a simple classification algorithm that generates a one-level decision tree able to identify simple, yet accurate, classification rules that have been shown to be only slightly less accurate than state-of-the-art learning schemes. The strength of OneR is that it attempts to classify samples by identifying multiple cutoffs for a single feature. The level of classification obtained by this approach represents the extent to which a single feature can classify and provides a useful benchmark that any classification using multiple features must exceed. In contrast, both kNN and ProteomeQuest® use multiple features to classify.
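The idea of a one-level rule built on a single feature can be sketched as follows. This is a simplified illustration, not the implementation used in the study: it assumes equal-frequency cutoffs (the published OneR algorithm chooses its cutoffs differently), and the names `fit_one_r` and `predict_one_r` are hypothetical.

```python
from collections import Counter


def majority(labels, default):
    """Most common label, or `default` when the bin is empty."""
    return Counter(labels).most_common(1)[0][0] if labels else default


def fit_one_r(X, y, n_bins=4):
    """Pick the single feature whose binned values best predict y.

    Returns (feature_index, cutoffs, bin_labels, training_errors).
    """
    overall = majority(y, None)
    best = None
    for j in range(len(X[0])):
        column = [row[j] for row in X]
        ordered = sorted(column)
        # Assumed discretization: n_bins - 1 equal-frequency cutoffs.
        cutoffs = [ordered[len(ordered) * k // n_bins] for k in range(1, n_bins)]
        bins = [sum(v >= c for c in cutoffs) for v in column]
        bin_labels = [
            majority([lab for b, lab in zip(bins, y) if b == k], overall)
            for k in range(n_bins)
        ]
        errors = sum(bin_labels[b] != lab for b, lab in zip(bins, y))
        if best is None or errors < best[3]:
            best = (j, cutoffs, bin_labels, errors)
    return best


def predict_one_r(model, row):
    """Classify one sample using only the selected feature."""
    feature, cutoffs, bin_labels, _ = model
    return bin_labels[sum(row[feature] >= c for c in cutoffs)]
```

The default of four bins corresponds to three cutoffs on one feature. Because only one feature is ever consulted, the error rate of such a rule serves as the benchmark that a multi-feature classifier is expected to beat.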
To derive the best model using the training set provided, a set of classification rules or models was developed using 65 unblinded samples: 37 large adenomas (with or without a hyperplastic polyp) and 28 normals without a hyperplastic polyp. The “best discriminating” model was chosen and then used to classify the blinded validation set consisting of 20 large adenoma samples and 50 normals. Before any analyses were done, it was recognized that the overall sample size was suboptimal and did not meet the usual recommendations of Correlogic Systems, which recommended a minimum of 100 samples each for the normal and diseased subjects for model building in a proof of principle study. The reason for proceeding, though, was the potential advantage of the rigorous control of collection and handling of the specimens that were available.
Modeling Using ProteomeQuest®
Modeling was done using the ProteomeQuest® (Correlogic Systems) algorithm, which has been previously described (19, 20). Briefly, the algorithm uses an iterative procedure combining lead cluster mapping and a genetic algorithm to identify combinations of m/z features, the relative intensity ratios of which define a particular state. The resulting model is a centroid map. Each centroid is associated with a given state (diseased or normal) and is surrounded by a defined decision boundary. The centroid and decision boundary define a node. To score unknown samples, the intensities of the features used to form the map are extracted from the spectrum of interest and are plotted relative to each other on the map. The classification of the unknown is then defined by the identity of the node that the data points fall within.
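The scoring step described above can be sketched in a few lines. This is only an illustration of classification by centroid and decision boundary, not the proprietary ProteomeQuest® algorithm: it assumes a spherical boundary with Euclidean distance and feature values already on a common scale, and the function name `classify_by_nodes` is hypothetical.

```python
import math


def classify_by_nodes(point, nodes):
    """Return the label of the closest node whose boundary contains `point`.

    `nodes` is a list of (centroid, radius, label) tuples. A point that
    falls inside no node is returned as None (unclassified).
    """
    best_label, best_dist = None, math.inf
    for centroid, radius, label in nodes:
        d = math.dist(point, centroid)  # Euclidean distance (assumed metric)
        if d <= radius and d < best_dist:
            best_label, best_dist = label, d
    return best_label


# Hypothetical two-feature map with one node per state.
nodes = [((0.2, 0.2), 0.1, "normal"), ((0.8, 0.8), 0.1, "adenoma")]
```

In this sketch a sample at (0.25, 0.2) lies inside the "normal" node, whereas a sample at (0.5, 0.5) lies inside neither node and is left unclassified.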
Modeling using a strategy such as ProteomeQuest® is especially difficult when only small numbers of samples are available to build models, because it becomes increasingly difficult to differentiate truly meaningful patterns from “accidental” ones. To guard against accidental patterns, a 10-fold cross-validation strategy was implemented, in which the model building set was divided into 10 unique, nonoverlapping subsets. Duplicate spectra were always kept together so that the spectra from any individual appeared only in a single subset. These subsets were then assembled into 10 unique groups, in each of which nine subsets were combined for model building and the remaining subset was held out for model validation. Then, using a single combination of modeling variables, a model was built for each group. The resulting 10 models were designated as a “cross-validation group” of models, the combined performance of which should more accurately reflect the performance of those variable settings than any single model built with those variables. Each sample was scored as 1 (large adenoma) or 0 (normal) using each of the 10 models in the set. The 10 scores were then summed to generate a final classification of each sample as positive (sum >5), negative (sum <5), or indeterminate (sum = 5, a tie). Scoring was done twice for each sample, once for each of the two duplicate spectra from that sample. If discordant results were obtained from duplicate spectra, that specimen was categorized as indeterminate.
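Two mechanical parts of this procedure, keeping duplicate spectra in the same subset and summing the 10 model scores, can be sketched as below. The random assignment scheme, the seed, and the function names `assign_folds` and `vote` are assumptions made for illustration; the text does not describe how subsets were actually drawn.

```python
import random


def assign_folds(subject_ids, n_folds=10, seed=0):
    """Assign each spectrum to a fold so that all spectra from the same
    subject share one fold. `subject_ids` has one entry per spectrum."""
    subjects = sorted(set(subject_ids))
    random.Random(seed).shuffle(subjects)
    fold_of = {s: i % n_folds for i, s in enumerate(subjects)}
    return [fold_of[s] for s in subject_ids]


def vote(scores):
    """Combine 0/1 scores from the models of a cross-validation group.

    With 10 models this reproduces the rule in the text:
    sum > 5 positive, sum < 5 negative, sum = 5 indeterminate.
    """
    total, half = sum(scores), len(scores) / 2
    if total > half:
        return "positive"
    if total < half:
        return "negative"
    return "indeterminate"
```

Splitting by subject rather than by spectrum matters because duplicate spectra are nearly identical; if one copy were used for model building and the other for validation, the held-out performance would be inflated.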
After making a set of models, the final model that would be used to score the blinded validation set was selected by assessing the performance characteristics of each cross-validation group of models; this assessment was done by calculating the means and standard errors of three variables of discrimination (accuracy, sensitivity, and specificity) across the 10 models in each cross-validation group. The cross-validation group model having the best discrimination in the model development set (model 10CVF9M75G100) consisted of 9 m/z features; this was the model used to classify the blinded samples.
Results of Model-Building: the Training Set
OneR Results. The best model generated by OneR identified three cutoff values for a single feature and yielded a mean validation accuracy of 61.2 ± 7.4%, a sensitivity of 66.1 ± 19.1%, and a specificity of 54.2 ± 20.5%, which is not a significant improvement over random assignment.
kNN Results. The best model by kNN was a 9-nearest neighbor model with a mean validation accuracy of 57.3 ± 17.0%, a sensitivity of 66.1 ± 21.4%, and a specificity of 46.3 ± 29.9%, which is not a significant improvement over random assignment.
ProteomeQuest® Results. The ProteomeQuest® model selected in the development phase had an accuracy of 63.0 ± 11.0%, a sensitivity of 72.0 ± 16.7%, and a specificity of 52.0 ± 22.6%. Although these results superficially seem better than those of either the OneR or kNN methods, the large standard deviations show that there is no significant classification power in this model.
Results of Model-Testing: the Validation Set
The ProteomeQuest® model was selected to predict the adenoma status for the 70 blinded validation samples. Because duplicate spectra were acquired for each sample, there were four possible outcomes: (a) concordant positive, (b) concordant negative, (c) discordant, or (d) indeterminate as a result of one or both acquisitions for a given sample failing due to spray blockage or technical problems.
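The four outcomes can be written as a small decision function. This is a sketch under stated assumptions: a failed acquisition is represented as `None`, a tied vote on either spectrum is treated like a failed acquisition, and the name `duplicate_outcome` is hypothetical.

```python
from collections import Counter


def duplicate_outcome(first, second):
    """Combine the calls from two duplicate spectra of one specimen.

    Each call is 'positive', 'negative', or None when the acquisition
    failed; anything else (such as a tied vote) is also treated as
    indeterminate here, which is an assumption.
    """
    valid = {"positive", "negative"}
    if first not in valid or second not in valid:
        return "indeterminate"
    if first != second:
        return "discordant"
    return "concordant " + first


# Hypothetical calls for four specimens, tallied by outcome.
pairs = [("positive", "positive"), ("negative", "negative"),
         ("positive", "negative"), ("positive", None)]
tally = Counter(duplicate_outcome(a, b) for a, b in pairs)
```

Only specimens in the two concordant categories contribute to the accuracy estimates reported below.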
Only 37 of 70 samples had concordant positive or concordant negative results. These concordant results are shown in Table 1. Predicted status was compared with the known adenoma status from colonoscopy to assess the discriminatory ability of the model. Model accuracy was 51% and was not sufficiently different from 50% to reject the null hypothesis of no discrimination.
Table 1. Model prediction versus colonoscopy status

| Model prediction | Colonoscopy status: adenoma | Colonoscopy status: no adenoma | Total |
|---|---|---|---|
| Adenoma | 10 | 16 | 26 |
| No adenoma | 2 | 9 | 11 |
| Total | 12 | 25 | 37 |
NOTE: Sensitivity, 10/12 = 83%; specificity, 9/25 = 36%; accuracy, 51% (95% confidence interval, 38-65%).
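The point estimates in the note follow directly from the four cell counts implied by its fractions (10 of 12 adenomas called positive, 9 of 25 nonadenomas called negative). The short sketch below recomputes them using the accuracy definition given in Materials and Methods; the function name `diagnostic_metrics` is hypothetical, and the confidence intervals are not recomputed because the text does not state which interval method was used.

```python
def diagnostic_metrics(tp, fn, tn, fp):
    """Sensitivity, specificity, and overall accuracy from a 2 x 2 table."""
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "accuracy": (tp + tn) / (tp + fn + tn + fp),
    }


# Counts implied by the note: 10/12 adenomas and 9/25 nonadenomas correct.
primary = diagnostic_metrics(tp=10, fn=2, tn=9, fp=16)
for name, value in primary.items():
    print(f"{name}: {value:.0%}")
```

The accuracy is (10 + 9)/37, which rounds to the 51% reported.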
Results of Model-Testing: Post Hoc Analysis of the Validation Set
One concern raised at Correlogic during the preparation of samples was the discoloration of a number of sera, which indicated that significant hemolysis had occurred during collection in the clinical laboratory. To address this concern, which might be reflected as noise in the spectra, one post hoc analysis was done after the primary prespecified analysis (shown in Table 1) had been completed. To conduct the post hoc analysis, an independent technician identified hemolyzed specimens in the validation set (while still blinded regarding status as adenoma or normal); these were then removed from the validation set in the post hoc analysis. The samples remaining in the now-smaller “validation set” remained totally blinded to the Correlogic investigators during this post hoc analysis. The remaining samples were then classified by University of North Carolina-Chapel Hill investigators. In this post hoc analysis of unhemolyzed specimens only, a modest but statistically significant degree of discrimination is shown (Table 2). Model accuracy was 63%. The 95% confidence interval excludes the null value (50%), permitting rejection of the null hypothesis of no discrimination.
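Table 2. Model prediction versus colonoscopy status, unhemolyzed specimens only

| Model prediction | Colonoscopy status: adenoma | Colonoscopy status: no adenoma | Total |
|---|---|---|---|
| Adenoma | 7 | 7 | 14 |
| No adenoma | 2 | 8 | 10 |
| Total | 9 | 15 | 24 |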
NOTE: Sensitivity, 7/9 = 78% (95% confidence interval, 40-96%); specificity, 8/15 = 53% (95% confidence interval, 27-78%); accuracy, 63% (95% confidence interval, 53-72%).
Discussion
This study used a rigorously collected set of specimens to assess whether molecular signals in serum can be used to discriminate between persons with adenomas and those with normal colons. This study did not find, in the primary analysis, that serum proteomics could discriminate between persons with large adenomas and normal persons; however, the single post hoc analysis (done to reduce possible noise from hemolyzed specimens) did suggest discrimination. This finding of discrimination, if shown in other high-quality specimens, could be clinically important.
Several reasons may explain failure to find discrimination. First, of course, may be that, biologically, there is no “signal” in serum that distinguishes people with or without an adenoma.
Second, the very small number of subjects limited the ability of any approach to find a signal, even if it was there. In other words, the study was so small that, in the cross-validation strategy used to minimize overfitting, a relatively low level of classification was found in the training set. In this setting, it simply was not expected that much discrimination would be “confirmed” in the validation set. This expectation was borne out in the blinded validation in the primary analysis, although discrimination was suggested in the one post hoc analysis (see Table 2).
Third, it is possible that noise in the samples overwhelmed any signal that may have been present. Although the sample set displays no apparent systematic biases—that could lead to discrimination due to nonadenoma causes—a weakness of the study was that, at multiple points, specimens may have been handled in ways that were suboptimal for preserving the signal in serum and thus may have caused noise. The presence of hemolyzed samples indicates that this is a clear possibility. This situation might be understandable in the sense that, in the original studies, the preservation of uncharacterized proteomic signals in serum was not a priority. Possible important sources of noise in this study include how samples were handled (the time window from collection to spinning was sometimes long, although the time from spinning to freezing was consistent, ∼20 min with little variation). Another concern was instrument performance, in that in the blinded sample set only 37 of 70 samples produced a reproducible spray. Of the 33 other samples, 10 had discordant sprays, but 23 failed to spray appropriately. This problem probably occurred because of the use of a different spray nozzle for each sample when using the Advion nano-ESI chip; more recent experience has shown much better consistency when using a more traditional ESI source using the same nozzle for each spray.
The negative results of this study should not be considered “conclusively” negative, for the reasons discussed above. Furthermore, the single post hoc analysis done suggests that there might be a signal in the serum that distinguishes subjects with large adenomas from those without. The results of this study also highlight the importance of understanding sources of noise in a proteomics study, in addition to understanding and addressing sources of “bias”.
Disclosure of Potential Conflicts of Interest
No potential conflicts of interest were disclosed.
Grant support: National Cancer Institute (2-R01-CA044684), the Epidemiology of Rectal Mucosal Proliferation; National Institute of Diabetes and Digestive and Kidney Diseases (5P30DK034987), the Center for Gastrointestinal Biology and Disease; and a Population Sciences Research Award from the UNC Lineberger Comprehensive Cancer Center.