Abstract
Background: Ovarian cancer diagnosis is problematic because the disease is typically asymptomatic, especially at the early stages of progression and/or recurrence. We report here the integration of a new mass spectrometric technology with a novel support vector machine computational method for use in cancer diagnostics, and describe the application of the method to ovarian cancer.
Methods: We used a high-throughput ambient ionization technique for mass spectrometry (direct analysis in real-time mass spectrometry) to profile relative metabolite levels in sera from 44 women diagnosed with serous papillary ovarian cancer (stages I-IV) and 50 healthy women or women with benign conditions. The profiles were input to a customized functional support vector machine–based machine-learning algorithm for diagnostic classification. Performance was evaluated through a 64-30 split validation test and with a stringent series of leave-one-out cross-validations.
Results: The assay distinguished between the cancer and control groups with an unprecedented 99% to 100% accuracy (100% sensitivity and 100% specificity by the 64-30 split validation test; 100% sensitivity and 98% specificity by leave-one-out cross-validations).
Conclusion: The method has significant clinical potential as a cancer diagnostic tool. Because of the extremely low prevalence of ovarian cancer in the general population (∼0.04%), extensive prospective testing will be required to evaluate the test's potential utility in general screening applications. However, more immediate applications might be as a diagnostic tool in higher-risk groups or to monitor cancer recurrence after therapeutic treatment.
Impact: The ability to accurately and inexpensively diagnose ovarian cancer will have a significant positive effect on ovarian cancer treatment and outcome. Cancer Epidemiol Biomarkers Prev; 19(9); 2262–71. ©2010 AACR.
Introduction
Ovarian cancer (OC) is the most lethal of the gynecologic cancers and is the fifth leading cause of all cancer-related deaths among women (1). Although the 5-year survival rate for women diagnosed with the disease early in its progression is >90%, the survival rate for patients diagnosed at later stages is only ∼20% (2). The main challenge with OC is that it typically arises and progresses initially without well-defined clinical symptoms (3). Thus, successful diagnosis plays a central role in deciding appropriate therapy and improving patient prognosis.
Most OC blood tests in current clinical practice monitor level changes of a single molecule that has been shown to be elevated (or lowered) in a significant number of diseased patients. Although these tests are often not definitive per se, they may be of significant predictive value when combined with other procedures. However, single molecule-based OC diagnostic assays have had only limited diagnostic power. More recent research has focused on tests based on panels of biomarkers. For example, a recently developed test that looks at six serum proteins has been shown to be of significant diagnostic value in high ovarian cancer risk groups (e.g., BRCA1-positive patients; refs. 4, 5).
We report here on the coupling of a high-throughput ambient ionization technique for mass spectrometry (MS) with machine-learning approaches for the metabolomic classification of sera from ovarian cancer and control patients. This technique, known as direct analysis in real-time (DART; ref. 6), is one of the members of the rapidly growing family of open-air (ambient) ionization methods for MS (7) that also includes desorption electrospray ionization (8). We view this as a successful step towards the development of an accurate new approach for the diagnosis of ovarian and other cancers. In this DART MS test, a stream of excited metastables is used to desorb and chemically ionize a dried drop of derivatized serum (Fig. 1). A typical DART MS profile displays a multitude of signals corresponding to metabolites rapidly desorbed and ionized in a time-dependent fashion (Fig. 1C, x). These profiles are then used as input for a customized functional support vector machine (fSVM)–based machine-learning algorithm for the classification of serum samples. The assay distinguished between the cancer and control groups with 99% to 100% accuracy (100% sensitivity and 98% to 100% specificity) under two different data evaluation approaches.
Diagram of study design and workflow showing metabolomic investigation of serum samples for detection of ovarian cancer by DART-TOF MS. A, serum sample preparation: (i) protein precipitation, centrifugation, and separation of the metabolite-containing supernatant followed by (ii) evaporation of solvent to generate a metabolite-containing pellet. This pellet is then subjected to (iii) derivatization to increase the volatility of polar metabolites. B, schematic of the DART-TOF MS equipped with a custom-built sample arm: (iv) glow discharge compartment, (v) gas heater, and (vi) ionization region where the sample-carrying capillary is placed. Differentially pumped atmospheric pressure interface (vii) to transport ions towards the mass analyzer. Radiofrequency ion guide (viii) where ions are collisionally cooled prior to entering the orthogonal TOF mass analyzer (ix). C, typical data are acquired in a time-resolved fashion [x, three-dimensional contour plots of single runs corresponding to an ovarian cancer patient (top) and a control (bottom)]. The region of the time-resolved signal with best signal-to-noise ratio was averaged yielding profile mass spectra (xi) reflecting metabolic fingerprints. D, machine-learning techniques such as SVMs are used for building a multivariate classifier (xii, objects in original variable space; xiii, objects in classifier space).
Materials and Methods
Sample collection
Serum samples were obtained from the Ovarian Cancer Institute laboratory at Georgia Tech after approval by the Institutional Review Board from Northside Hospital and Georgia Institute of Technology (Atlanta, GA; Table 1). All donors were required to fast and to avoid medicine and alcohol for 12 hours prior to sampling (certain medications were allowed; for instance, diabetics were allowed insulin). Following informed consent by donors, 5 mL of whole blood was collected by venipuncture into evacuated blood collection tubes that contained no anticoagulant (blood was taken prior to the administration of anesthesia, immediately preceding surgery). Within 1 hour of venipuncture, serum was collected and 200-μL aliquots of each sample were stored in 1.5 mL microtubes at −80°C until ready to use.
Table 1. Patients analyzed in this study
Patient ID | Ovarian histopathology | Stage/grade | Age at surgery | Menopause status | Endometriotic cysts present? |
---|---|---|---|---|---|
242 | Papillary serous carcinoma | IIIc/3 | 63 | Postmenopausal | No |
281 | Papillary serous carcinoma | III/1 | 66 | Postmenopausal | No |
454 | Papillary serous carcinoma | III/3 | 72 | Postmenopausal | No |
458 | Papillary serous carcinoma | IIIc/3 | 59 | Postmenopausal | No |
472 | Papillary serous carcinoma | IIIc/2-3 | 49 | Postmenopausal | No |
473 | Papillary serous carcinoma | IIIc/3 | 48 | Perimenopausal | No |
491 | Papillary serous carcinoma | I/ | 74 | Postmenopausal | No |
495 | Papillary serous carcinoma | IIIb/3 | 43 | Premenopausal | No |
512 | Papillary serous carcinoma | IIIb/2-3 | 59 | Postmenopausal | No |
517 | Papillary serous carcinoma | Ia/3 | 59 | Postmenopausal | No |
526 | Papillary serous carcinoma | IIIc/2-3 | 49 | Postmenopausal | No |
528 | Papillary serous carcinoma | IIIc/3 | 66 | Postmenopausal | No |
529 | Papillary serous carcinoma | IIIc | 67 | Postmenopausal | No |
533 | Papillary serous carcinoma | III/1 | 43 | Premenopausal | No |
537 | Papillary serous carcinoma | IIIa/2-3 | 64 | Postmenopausal | No |
542 | Papillary serous carcinoma | IV/3 | 61 | Postmenopausal | No |
551 | Papillary serous carcinoma | IIIc/IV/3 | 59 | Postmenopausal | No |
559 | Papillary serous carcinoma | IV/3 | 49 | Perimenopausal | No |
588 | Papillary serous carcinoma | IIIc/2-3 | 71 | Postmenopausal | No |
589 | Papillary serous carcinoma | IIIc/IV/3 | 46 | Premenopausal | No |
606 | Papillary serous carcinoma | IIIa/3 | 54 | Premenopausal | No |
617 | Papillary serous carcinoma | IIIc/2-3 | 64 | Postmenopausal | No |
620 | Papillary serous carcinoma | III/IV/3 | 62 | Postmenopausal | No |
632 | Papillary serous carcinoma | IIIb/3 | 65 | Postmenopausal | No |
643 | Papillary serous carcinoma | IIIb/2 | 59 | Postmenopausal | No |
644 | Papillary serous carcinoma | IIIb/1-2 | 46 | Postmenopausal | No |
647 | Papillary serous carcinoma | IIIb-c/3 | 68 | Postmenopausal | No |
651 | Papillary serous carcinoma | IIIb-c/3 | 46 | Perimenopausal | No |
655 | Papillary serous carcinoma | III/IV/3 | 75 | Postmenopausal | No |
659 | Papillary serous carcinoma | IIIc/IV/3 | 78 | Postmenopausal | No |
678 | Papillary serous carcinoma | IV/3 | 59 | Postmenopausal | No |
688 | Papillary serous carcinoma | IIIc/3 | 59 | Postmenopausal | No |
694 | Papillary serous carcinoma | IIIb/3 | 70 | Postmenopausal | No |
704 | Papillary serous carcinoma | III/IV/3 | 75 | Postmenopausal | No |
717 | Papillary serous carcinoma | IIIb/3 | 64 | Postmenopausal | No |
721 | Papillary serous carcinoma | IIIc/1 | 58 | Postmenopausal | No |
756 | Papillary serous carcinoma | IIIc/2 | 59 | Postmenopausal | No |
782 | Papillary serous carcinoma | IIIc/3 | 59 | Postmenopausal | No |
787 | Papillary serous carcinoma | IIIc/3 | 72 | Postmenopausal | No |
821 | Papillary serous carcinoma | IIIc/1-2 | 58 | Postmenopausal | No |
831 | Papillary serous carcinoma | IIIc/ | 69 | Postmenopausal | No |
864 | Papillary serous carcinoma | IIIc/3 | 60 | Postmenopausal | No |
876 | Papillary serous carcinoma | IIa/1 | 63 | Postmenopausal | No |
5010 | Papillary serous carcinoma | IIIc/1 | 59 | Postmenopausal | No |
440 | Within normal limits | N/A | 50 | Perimenopausal | No |
504 | Within normal limits | N/A | 48 | Premenopausal | No |
523 | Serous cystadenoma | N/A | 32 | Premenopausal | No |
534 | Within normal limits | N/A | 72 | Postmenopausal | No |
540 | Within normal limits | N/A | 59 | Postmenopausal | No |
541 | Within normal limits | N/A | 41 | Perimenopausal | No |
544 | Within normal limits | N/A | 49 | Perimenopausal | No |
552 | Within normal limits | N/A | 41 | Premenopausal | No |
612 | Within normal limits | N/A | 48 | Premenopausal | No |
614 | Within normal limits | N/A | 44 | Premenopausal | No |
615 | Within normal limits | N/A | 42 | Perimenopausal | No |
623 | Simple cyst | N/A | 54 | Postmenopausal | No |
627 | Within normal limits | N/A | 59 | Postmenopausal | No |
636 | Within normal limits | N/A | 71 | Postmenopausal | Yes |
650 | Cystic corpus luteum | N/A | 47 | Postmenopausal | No |
677 | Within normal limits | N/A | 68 | Postmenopausal | No |
691 | Within normal limits | N/A | 70 | Postmenopausal | No |
693 | Simple cyst | N/A | 60 | Postmenopausal | No |
697 | Within normal limits | N/A | 51 | Premenopausal | Yes |
698 | Functional cyst | N/A | 49 | Perimenopausal | No |
703 | Simple cyst | N/A | 42 | Premenopausal | No |
719 | Within normal limits | N/A | 55 | Perimenopausal | No |
733 | Within normal limits | N/A | 37 | Postmenopausal | No |
736 | Within normal limits | N/A | 45 | Premenopausal | No |
737 | Within normal limits | N/A | 41 | Perimenopausal | No |
740 | Functional cyst | N/A | 37 | Premenopausal | No |
749 | Simple cyst/cystic follicles | N/A | 56 | Perimenopausal | No |
750 | Serous cystadenoma | N/A | 41 | Postmenopausal | No |
751 | Within normal limits | N/A | 60 | Postmenopausal | No |
752 | Within normal limits | N/A | 74 | Postmenopausal | No |
755 | Within normal limits | N/A | 75 | Postmenopausal | No |
757 | Within normal limits | N/A | 84 | Postmenopausal | No |
759 | Within normal limits | N/A | 52 | Postmenopausal | No |
763 | Hemorrhagic cyst | N/A | 45 | Premenopausal | No |
765 | Within normal limits | N/A | 84 | Postmenopausal | No |
766 | Within normal limits | N/A | 36 | Premenopausal | No |
783 | Within normal limits | N/A | 52 | Premenopausal | No |
790 | Cystic follicles | N/A | 39 | Premenopausal | No |
796 | Within normal limits | N/A | 44 | Premenopausal | No |
808 | Within normal limits | N/A | 35 | Premenopausal | No |
828 | Simple cyst | N/A | 59 | Postmenopausal | No |
829 | Simple cyst | N/A | 33 | Postmenopausal | No |
838 | Within normal limits | N/A | 51 | Perimenopausal | No |
839 | Simple cyst | N/A | 79 | Postmenopausal | Yes |
842 | Within normal limits | N/A | 70 | Postmenopausal | Yes |
846 | Hemorrhagic corpus luteum | N/A | 51 | Perimenopausal | No |
848 | Within normal limits | N/A | 70 | Postmenopausal | No |
NHS1 | Healthy serum donor | N/A | 36 | Premenopausal | No |
NHS4 | Healthy serum donor | N/A | 34 | Premenopausal | No |
NHS10 | Healthy serum donor | N/A | 37 | Premenopausal | No |
Sample preparation
Prior to analysis, 200 μL of each serum sample were thawed on ice and mixed with 1 mL of freshly prepared, chilled (−18°C), and degassed 2:1 (v/v) acetone/isopropanol mixture. The mixture was vortexed and placed in a freezer at −18°C overnight to precipitate proteins, followed by centrifugation at 13,000 × g for 5 minutes. The supernatant was transferred to a new centrifuge tube, and the solvent was evaporated in a speed vacuum. The solid residue was redissolved in 25 μL of anhydrous pyridine (EMD Chemicals) and shaken for 1 hour at room temperature for complete dissolution. Fifty microliters of N-trimethylsilyl-N-methyltrifluoroacetamide (Alfa Aesar) containing 0.1% trimethylchlorosilane (Alfa Aesar) were added to the sample in a N2-purged glove box. The mixture was then incubated at 50°C in an inert N2 atmosphere for half an hour, resulting in tri-trimethylsilane derivatization of amide, amine, carboxyl, and hydroxyl groups. The final derivatized mixture was subjected to DART MS analysis.
DART-TOF MS
An in-depth characterization of the analytical figures of merit of the DART MS approach used here has been recently reported (9), and therefore, the method is only briefly presented. Serum mass spectrometric analysis was done using a DART ion source (IonSense, Inc.) coupled to a JEOL AccuTOF orthogonal time-of-flight (TOF) MS (JEOL, Inc., Japan). Prior to DART MS analysis, 0.5 μL of derivatized serum solution was pipette-deposited onto the glass end of the Dip-tip applicator (IonSense) coupled to the sampling arm, a 1.2-minute data acquisition run was started, and the sample was allowed to air-dry for 0.65 minutes. The sampling arm was then rapidly switched so that the dried sample was exposed to the ionizing zone of the DART ion source. After 0.9 minutes in the acquisition run (0.25-minute sampling time), the sample was removed, and a new Dip-tip was placed on the sample holder while the remaining 0.3 minutes of the run was completed. Each sample was run in triplicate.
The DART ion source was operated in positive ion mode with a helium gas flow rate of 3.0 L/min heated to 200°C. The glass tip-end was positioned 1.5 mm below the MS inlet. The discharge needle voltage of the DART source was set to +3,600 V, and the perforated and grid electrode voltages were set to +150 and +250 V, respectively. Accurate mass spectra were acquired in the range of m/z 60 to 1,000 with a spectral recording interval of 1.0 seconds and an RF ion guide peak voltage of 1,200 V. The settings for the TOF MS were as follows: ring lens, +8 V; orifice 1, +40 V; orifice 2, +6 V; orifice 1 temperature, 80°C; and detector voltage, −2,800 V. Mass drift compensation was done after analysis of each sample using a 0.20 mmol/L polyethylene glycol 600 standard (PEG 600, Fluka Chemical Corp.) in methanol. The measured resolving power of the TOF MS was 6,000 at full-width, half-maximum, with observed mass accuracies in the range of 2 to 20 ppm, depending on the signal-to-noise ratio of the particular peak investigated. A repeatability of 4.1% to 4.5% was obtained for the total ion signal using a manual sampling arm.
Data preprocessing
All profile mass spectra were obtained by time-averaging of the total ion chronogram between 0.73 and 0.76 minutes after each injection, which corresponds to the part of the time-varying signal that is conducive to the maximum number of analytes detected and identified with good sensitivity (9). Following DART-TOF MS data collection and mass drift compensation, the background spectrum was subtracted and profile spectral data were exported in JEOL-DX format and converted to a comma-separated format prior to importing in MATLAB 7.6.0 (R2008a, MathWorks). The resulting data were normalized to a relative intensity scale and re-sampled to a total of 20,000 features between m/z 60 and 990 using the msresample function in the Matlab Bioinformatics Toolbox (10). The three replicate DART spectra were then averaged. The original data set containing the DART-MS data can be downloaded (11).
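The normalization, resampling, and replicate-averaging steps can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' MATLAB code: linear interpolation stands in for `msresample` (which also applies anti-alias filtering), "relative intensity" is taken to mean scaling the largest peak to 100, and all function names are hypothetical.

```python
def normalize(intensities):
    """Scale a spectrum so its largest peak equals 100 (assumed relative scale)."""
    peak = max(intensities)
    return [100.0 * v / peak for v in intensities] if peak > 0 else list(intensities)


def resample(mz, intensities, n_points, lo, hi):
    """Linearly interpolate a spectrum onto n_points evenly spaced m/z values."""
    step = (hi - lo) / (n_points - 1)
    grid = [lo + k * step for k in range(n_points)]
    out, j = [], 0
    for g in grid:
        # Advance to the input interval [mz[j], mz[j + 1]] that contains g.
        while j < len(mz) - 2 and mz[j + 1] < g:
            j += 1
        x0, x1 = mz[j], mz[j + 1]
        y0, y1 = intensities[j], intensities[j + 1]
        t = min(max((g - x0) / (x1 - x0), 0.0), 1.0)  # clamp outside the input range
        out.append(y0 + t * (y1 - y0))
    return grid, out


def average_replicates(spectra):
    """Average replicate spectra that share one m/z grid, point by point."""
    return [sum(vals) / len(vals) for vals in zip(*spectra)]
```

In the study's setting, `resample` would be called with `n_points=20000`, `lo=60`, and `hi=990`, and `average_replicates` would receive the three replicate spectra of one serum sample.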
Data analysis
Evaluation framework.
The prediction performance of the data set was evaluated through a 64-30 split validation without feature selection, and with different feature selection methods. Following this approach, further evaluation of the classifier performance was done through leave-one-out cross-validation (LOOCV). In each case, the chosen feature selection method was applied only to the training data set, and then the prediction performance of the selected feature subset on the test data set was measured.
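The key point of this framework, that feature selection sees only the training data in each round, can be sketched as a generic leave-one-out loop. The code below is an illustrative Python sketch with hypothetical names. The filter (`top_mean_difference`) and classifier (nearest centroid) are deliberately simple stand-ins, not the ANOVA filter or the fSVM used in the study.

```python
def loocv(samples, labels, select_features, fit, predict):
    """Leave-one-out CV where feature selection sees only the training fold."""
    predictions = []
    for held_out in range(len(samples)):
        train_x = [s for i, s in enumerate(samples) if i != held_out]
        train_y = [y for i, y in enumerate(labels) if i != held_out]
        # Select features from the training fold only, so the held-out
        # sample cannot influence which features are kept.
        keep = select_features(train_x, train_y)
        model = fit([[row[k] for k in keep] for row in train_x], train_y)
        predictions.append(predict(model, [samples[held_out][k] for k in keep]))
    return predictions


def top_mean_difference(train_x, train_y, n_keep=1):
    """Toy filter (stand-in): keep features whose class means differ the most."""
    def class_mean(k, cls):
        vals = [row[k] for row, y in zip(train_x, train_y) if y == cls]
        return sum(vals) / len(vals)
    n_features = len(train_x[0])
    gaps = [abs(class_mean(k, 1) - class_mean(k, 0)) for k in range(n_features)]
    return sorted(range(n_features), key=lambda k: -gaps[k])[:n_keep]


def fit_centroids(train_x, train_y):
    """Toy classifier (stand-in): store the mean vector of each class."""
    centroids = {}
    for cls in set(train_y):
        rows = [row for row, y in zip(train_x, train_y) if y == cls]
        centroids[cls] = [sum(col) / len(rows) for col in zip(*rows)]
    return centroids


def predict_centroid(centroids, row):
    """Assign the class whose centroid is closest in squared distance."""
    def dist(c):
        return sum((a - b) ** 2 for a, b in zip(row, c))
    return min(centroids, key=lambda cls: dist(centroids[cls]))
```

Running selection on the full data set before splitting would leak information from the held-out sample into the model, which is why the selection call sits inside the loop.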
SVM (12) analysis of averaged DART mass spectra was done using libSVM (13). fSVM analysis was done using the functional data analysis package (14) and libSVM. Partial least squares discriminant analysis was done using the PLS Toolbox (version 4.1, Eigenvector Research). We implemented ANOVA, recursive feature elimination (15), and Weston's (16) feature selection methods in Matlab 7.6.0. Mangasarian's L1-norm SVM (17, 18) was also implemented in Matlab.
fSVM.
In some application domains such as chemometrics, it is well known that the shape of a spectrum is sometimes more important than its actual mean value. Therefore, it is beneficial to view the intensities as functions of m/z values and to perform functional classification. The goal of functional classification (19) is to predict the label $y$ of a functional data instance $X$ given training data $\{(X_i, y_i)\}_{i=1}^{M}$, where each input functional data instance $X_i$ is a random variable that takes values in an infinite-dimensional Hilbert space $H$, the space of functions.
In practice, the functions that describe the input data instances $X_1, \ldots, X_M$ are never perfectly known. Often, $N$ discretization points $t_1, \ldots, t_N \in \mathbb{R}$ have been chosen, and each functional data instance $X_i$ is described by a vector $(X_i(t_1), \ldots, X_i(t_N)) \in \mathbb{R}^N$. Sometimes, the functional data instances are badly sampled, and the number and location of discretization points differ between instances. The usual solution in this context is to construct an approximation (such as B-spline interpolation) of each instance $X_i$ from its observed values, and then to sample the reconstructed functional data uniformly (20). A simple solution is therefore to apply the standard SVM to the vector representation of the functional data. It has been suggested recently, however, that classification accuracy could be improved by designing SVMs specifically for functional classification through the introduction of functional transformations and functional kernels (21, 22):
(1) Apply a functional transformation, the projection $P_{V_N}$, to each instance $X_i$ as $x_i = P_{V_N}(X_i)$, with $X_i$ approximated by $\sum_{k=1}^{N} x_{ik}\Psi_k$, where $\{\Psi_k\}_{k \geq 1}$ is a complete orthonormal basis of the functional space $H$.
(2) Build a standard SVM on the coefficients $x_i \in \mathbb{R}^N$ for all $i = 1, \ldots, M$.
This procedure is equivalent to working with a functional kernel $K_N(X_i, X_j)$, defined as $K_N(X_i, X_j) = K(P_{V_N}(X_i), P_{V_N}(X_j))$, where $P_{V_N}$ denotes the projection onto the $N$-dimensional subspace $V_N \subset H$ spanned by $\{\Psi_k\}_{k=1}^{N}$, and $K$ denotes any standard SVM kernel.
Good candidates for the basis functions include the Fourier basis and wavelet basis. If the functional data are known to be nonstationary, a wavelet basis might yield better results than the Fourier basis (20). Other suitable choices include B-spline bases, that generally perform well in practice (21).
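The projection-then-kernel idea can be illustrated with a short Python sketch. It assumes a discrete cosine basis, a Fourier-type basis chosen here only because it is orthonormal and easy to write without external libraries. The study's own functional representation uses B-splines, so this is an illustration of the general construction, not of the authors' implementation. All function names are hypothetical.

```python
import math


def cosine_basis(n_basis, n_points):
    """First n_basis vectors of the orthonormal discrete cosine (DCT-II) basis."""
    basis = []
    for k in range(n_basis):
        scale = math.sqrt((1.0 if k == 0 else 2.0) / n_points)
        basis.append([scale * math.cos(math.pi * (t + 0.5) * k / n_points)
                      for t in range(n_points)])
    return basis


def project(curve, basis):
    """Coefficients of a sampled curve on the given orthonormal basis."""
    return [sum(c * b for c, b in zip(curve, vec)) for vec in basis]


def linear_kernel(u, v):
    """Standard linear SVM kernel: the dot product."""
    return sum(a * b for a, b in zip(u, v))


def functional_kernel(curve_a, curve_b, basis, kernel):
    """Apply a standard kernel to the projection coefficients of two curves."""
    return kernel(project(curve_a, basis), project(curve_b, basis))
```

With fewer basis vectors than sample points, the projection discards the high-frequency part of each spectrum, so the kernel compares smoothed shapes. With a full basis, the linear functional kernel reduces to the ordinary dot product of the sampled curves.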
Feature selection.
The ANOVA is one of the most commonly used filter-based feature selection methods in bioinformatics. It helps to identify the features that highlight differences between groups (23, 24). Let the data set $S$ contain $c$ classes (groups) and $n$ data instances, with $n_i$ instances from each class $c_i$, and let $X_{ij}$ ($i = 1, \ldots, c$; $j = 1, \ldots, n_i$) be a random sample of size $n_i$ from a population with mean $\mu_i$. ANOVA is used to test the null hypothesis $H_0\colon \mu_1 = \mu_2 = \cdots = \mu_c$ through the F-test $f = \frac{SS_B/(c-1)}{SS_W/(n-c)}$, where $SS_B = \sum_{i=1}^{c} n_i(\bar{x}_i - \bar{x})^2$ is the interclass sum of squares, $SS_W = \sum_{i=1}^{c}\sum_{j=1}^{n_i}(x_{ij} - \bar{x}_i)^2$ is the total intraclass sum of squares, $\bar{x}_i$ and $\bar{x}$ are estimates of the class and overall sample means, respectively, and $x_{ij}$ is an observation (sample) from class $c_i$. If the null hypothesis is rejected [$f > F_{c-1,n-c}(\alpha)$, the upper $100\alpha$th percentile of the F distribution with $c-1$ and $n-c$ degrees of freedom], this implies that the groups of data samples differ significantly.
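As a concrete illustration of the F statistic, the following Python sketch computes it for a single feature from the per-class observations. The function name is hypothetical. Turning the statistic into a P value or comparing it with the critical value requires the F distribution, which is not in the Python standard library and is therefore left out here.

```python
def anova_f(groups):
    """One-way ANOVA F statistic for one feature.

    groups: one list of observations per class.
    Assumes at least two classes and nonzero within-class variation.
    """
    c = len(groups)
    n = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n
    means = [sum(g) / len(g) for g in groups]
    # Interclass sum of squares: spread of class means around the overall mean.
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    # Intraclass sum of squares: spread of observations around their class mean.
    ss_within = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g)
    return (ss_between / (c - 1)) / (ss_within / (n - c))
```

For example, `anova_f([[1, 2, 3], [4, 5, 6]])` gives an interclass sum of squares of 13.5 on 1 degree of freedom and an intraclass sum of squares of 4 on 4 degrees of freedom, so the statistic is 13.5. In a filter-based selection, this value would be computed independently for every m/z feature.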
Elemental formula determination and metabolic database matching
Features in the fSVM model using 1:7:20,000 subsampling were assigned elemental formulae and tentatively matched to metabolites by finding the closest mass spectral peak matching the model features in the 103 to 714 m/z range. This m/z range was chosen because it is fully covered by the TOF calibration function, thus providing the most reliable accurate masses. No attempt was made to match fSVM model features outside this range. Accurate masses were searched against a custom-built database containing 2,924 entries corresponding to elemental formulae of endogenous human metabolites in the Human Metabolome Database (25). Each entry was manually expanded to take into account the mono-trimethylsilane, di-trimethylsilane, and/or tri-trimethylsilane derivatives. Entries for families of compounds not reacting with the N-trimethylsilyl-N-methyltrifluoroacetamide/trimethylchlorosilane reagent mixture were not expanded. Matching of database records to experimental DART MS data was done using the SearchFromList application, part of the Mass Spec Tools suite of programs (ChemSW), with a tolerance of 10 mmu to obtain candidate elemental formulae. If no matches were found, the next closest match within 20 mmu was selected.
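The two-stage tolerance rule (10 mmu first, then the closest entry within 20 mmu) can be sketched as follows. This is a hypothetical Python illustration, not the SearchFromList application. The small dictionary holds neutral monoisotopic masses of three common metabolites purely as toy data, whereas the authors' database contains 2,924 entries expanded for trimethylsilane derivatives.

```python
def match_mass(observed, database, tight=0.010, loose=0.020):
    """Return (candidate formulas, tolerance used) for an accurate mass in Da.

    database maps elemental formula -> reference mass and must not be empty.
    """
    # First pass: every database entry within the tight window (10 mmu).
    hits = [f for f, m in database.items() if abs(m - observed) <= tight]
    if hits:
        return hits, tight
    # Fallback: the single closest entry, accepted only within 20 mmu.
    formula, mass = min(database.items(), key=lambda fm: abs(fm[1] - observed))
    if abs(mass - observed) <= loose:
        return [formula], loose
    return [], None


# Toy reference list (illustrative only): glycine, alanine, glucose.
TOY_DATABASE = {"C2H5NO2": 75.0320, "C3H7NO2": 89.0477, "C6H12O6": 180.0634}
```

A feature that falls outside both windows returns no candidate, which mirrors leaving that feature unassigned.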
Results
Metabolic profiles can distinguish between ovarian cancer and control samples
Serum samples were obtained from 44 women diagnosed with serous papillary ovarian cancer (stages I-IV) and 50 healthy women or women with benign conditions (e.g., serous, simple, or follicular cysts; Table 1) and subjected in triplicate to DART MS profiling.
We have previously tested a customized SVM algorithm for the classification of metabolic profiles obtained by using liquid chromatography-MS (26). In this study, the classification procedure builds on our previous work and can be briefly described as follows: (a) the data are collapsed along the desorption time dimension by using the average value within the time range of interest for all mass spectral m/z values (“features”); (b) the resulting feature vector is smoothed using B-splines (12, 27) to create the functional representation; (c) the vector of spline coefficients is then used by the SVM (17). To deal with the very large number of features (20,000 m/z values per serum sample run), the data were subjected to a variety of data analysis methods including SVM (the above described nonlinear fSVM as well as the standard linear SVM; ref. 28), and partial least squares discriminant analysis (29, 30). Classification was done either with all mass spectral features, or with feature subsets selected by simple subsampling (15, 16).
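The subsampling notation used elsewhere in this article, 1:7:20,000, is read here as MATLAB colon notation (start:step:stop), that is, every seventh feature starting from the first. That reading is an assumption, since the article does not spell it out. Under it, a short Python sketch with a hypothetical function name gives the retained feature indices:

```python
def subsample_indices(start, step, stop):
    """Zero-based indices equivalent to the 1-based MATLAB range start:step:stop."""
    return list(range(start - 1, stop, step))


# Every 7th of the 20,000 resampled m/z features, starting with the first.
kept = subsample_indices(1, 7, 20000)
```

Under this reading, the subsampled model retains 2,858 of the 20,000 features, ending exactly on the last feature.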
We evaluated the efficacy of our classifiers by two separate approaches. In both approaches, a “training set” of samples was used to build the predictive model, and a separate and independent set of samples (“test set”) was used to evaluate the predictive/discriminatory power of the model and to identify the model parameters that best avoid overfitting or underfitting. A summary of the results obtained with fSVMs is presented in Table 2. In the first approach, we used a training set of 64 patients selected at random, with the 30 remaining patients treated as an independent test set (64-30 split validation). In this case, the fSVM with linear kernels applied to a feature subset selected by one-way ANOVA (P ≤ 0.05) achieved 100% accuracy (100% sensitivity and 100% specificity; Table 2A; Fig. 2A).
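A random 64-30 split of the 94 samples can be sketched as below. The function name and the fixed seed are illustrative assumptions. The article does not state how the random selection was generated or whether it was stratified by class.

```python
import random


def split_train_test(ids, n_train, seed):
    """Shuffle sample IDs reproducibly and split them into train and test lists."""
    shuffled = list(ids)
    random.Random(seed).shuffle(shuffled)
    return shuffled[:n_train], shuffled[n_train:]


# 94 samples in total: 64 for training, the remaining 30 held out for testing.
train_ids, test_ids = split_train_test(range(94), n_train=64, seed=0)
```

Fixing the seed makes the split reproducible, so the same held-out samples are used whenever the evaluation is repeated.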
Results from the analysis of DART MS ovarian cancer data using fSVMs
Panel | Classifier type | Feature selection method | No. of features | SENS (%) | SPEC (%) | ACC (%) |
---|---|---|---|---|---|---|
(A) | fSVM | One-way ANOVA | 3,017 | 100 | 100 | 100 |
(B) | fSVM | One-way ANOVA | 4,390 | 100 | 98 | 98.9 |
NOTE: ANOVA feature selection in combination with fSVM was first applied to the training data set, and the test set was then predicted using the selected feature subset: (A) 64-30 split validation, (B) LOOCV evaluation. The sensitivity (SENS), specificity (SPEC), and accuracy (ACC) were determined as follows: SENS = TP/(TP + FN); SPEC = TN/(TN + FP); ACC = (TP + TN)/(TP + FN + TN + FP), where TP is the number of true positives, FN of false negatives, TN of true negatives, and FP of false positives.
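The definitions in the table note can be checked with a short calculation. The confusion-matrix counts used below are not reported directly; they are inferred here from the cohort size (44 cancer, 50 control) and the LOOCV percentages in row (B), so they should be read as an assumption.

```python
def diagnostic_metrics(tp, fn, tn, fp):
    """Return (sensitivity, specificity, accuracy) from confusion-matrix counts."""
    sens = tp / (tp + fn)
    spec = tn / (tn + fp)
    acc = (tp + tn) / (tp + fn + tn + fp)
    return sens, spec, acc

# Inferred counts (assumption): all 44 cancers detected, 1 of 50 controls misclassified
sens, spec, acc = diagnostic_metrics(tp=44, fn=0, tn=49, fp=1)
print(f"{sens:.1%} {spec:.1%} {acc:.1%}")  # 100.0% 98.0% 98.9%
```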
Classifier visualization using linear kernel fSVMs. A, visualization of the data set following 64-30 split validation with ANOVA feature selection (P ≤ 0.05). B, visualization of one iteration of LOOCV using 1:7:20,000 subsampling. The X-axis is the optimal weight vector of the fSVM model (red triangles with black edges correspond to ovarian cancer patients in the training set, green circles with black edges to controls in the training set, larger red triangles without borders are cancer patients in the test set, and the green circles without borders are the control samples in the test set).
Although the above approach is considered the “gold standard” in evaluating diagnostic tests (4), LOOCV is a more rigorous approach to model parameter estimation due to its maximal usage of the data for training (31, 32). During LOOCV, each training set consisted of all patient samples except for one “left out” sample that is tested. In this way, each one of the patient samples is sequentially treated as an unknown, classified by the model as “cancer” or “control” in a blind fashion, and the accuracy of each classification evaluated. While validating models by LOOCV, feature selection was done independently on 94 different 94-1 (n-1) split validations. Only one 94-1 split validation resulted in a misclassification giving an overall accuracy of 98.9% (100% sensitivity and 98% specificity; Table 2B; Fig. 2B). For comparison purposes, standard SVMs and partial least squares discrimination analyses were also used to classify the data. The corresponding results are presented in Supplementary Table S1. All classification and feature selection methods showed high accuracy owing to the inherent discriminative power of the data, but the best performance was obtained using fSVM.
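The key point of the LOOCV procedure, that feature selection is repeated inside every fold using only the n − 1 training samples, can be sketched as follows. This is an illustration under stated assumptions rather than the study's implementation: a mean-difference ranking stands in for the one-way ANOVA filter, a nearest-centroid rule stands in for the fSVM, and the toy data and all function names are hypothetical.

```python
from statistics import fmean

def select_features(X, y, k):
    """Rank features by absolute difference of class means (a simple stand-in
    for the one-way ANOVA filter) and return the indices of the top k."""
    scores = []
    for j in range(len(X[0])):
        pos = [row[j] for row, label in zip(X, y) if label == 1]
        neg = [row[j] for row, label in zip(X, y) if label == 0]
        scores.append((abs(fmean(pos) - fmean(neg)), j))
    return [j for _, j in sorted(scores, reverse=True)[:k]]

def nearest_centroid_predict(X_train, y_train, x, cols):
    """Stand-in classifier: assign x to the class whose centroid is closer."""
    def centroid(label):
        rows = [r for r, l in zip(X_train, y_train) if l == label]
        return [fmean([r[j] for r in rows]) for j in cols]
    def dist2(c):
        return sum((x[j] - cj) ** 2 for j, cj in zip(cols, c))
    return 1 if dist2(centroid(1)) < dist2(centroid(0)) else 0

def loocv(X, y, k):
    """Leave each sample out once; select features on the remaining n-1 only."""
    predictions = []
    for i in range(len(X)):
        X_train = X[:i] + X[i + 1:]
        y_train = y[:i] + y[i + 1:]
        cols = select_features(X_train, y_train, k)  # held-out sample is never seen
        predictions.append(nearest_centroid_predict(X_train, y_train, X[i], cols))
    return predictions

# Toy data (hypothetical): feature 0 separates the classes, feature 1 is noise
X = [[5.0, 0.3], [5.2, 0.9], [4.9, 0.1], [1.0, 0.8], [1.1, 0.2], [0.9, 0.5]]
y = [1, 1, 1, 0, 0, 0]
preds = loocv(X, y, k=1)
accuracy = sum(p == t for p, t in zip(preds, y)) / len(y)
print(preds, accuracy)
```

Selecting features inside the loop, rather than once on the full data set, keeps the held-out sample from influencing which features the model sees.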
Pathway enrichment analysis and metabolic network building
The MetaCore 5.2 (GeneGO) software suite was used for metabolic network analysis. One hundred and fifty-three estimated elemental formulae (Supplementary Table S2) obtained by DART MS accurate mass measurements of differentiating spectral features were assigned to 385 network objects by the metabolic network analysis software, of which 299 represented unique endogenous metabolites or xenobiotic compounds (33). Metabolic compounds assigned to these elemental formulae by MetaCore were mapped onto GeneGO canonical metabolic pathways that were ranked according to their relevance to the input set using P values calculated based on hypergeometric distribution. These differentiating compounds mapped onto 25 pathways (34) with P < 0.01 (Supplementary Fig. S1), suggesting differences between cancer and noncancer groups in amine, amino acid, eicosanoid, and TTP metabolisms. Suggested differences in the metabolisms of carbohydrates and androgens/estrogens have lower confidence because the relevance of corresponding pathways was determined from ambiguously identified metabolites (e.g., several different hexoses corresponding with elemental formula C6H12O6) and were not further examined.
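For readers unfamiliar with hypergeometric enrichment, the following sketch shows one common form of the calculation: the upper-tail probability of seeing at least the observed overlap by chance. The exact variant used by MetaCore is not described here, and the function name and the numbers in the example are hypothetical.

```python
from math import comb

def hypergeom_enrichment_p(N, K, n, k):
    """P(X >= k) for a hypergeometric X.

    N: objects in the background set
    K: background objects that belong to the pathway
    n: objects in the input set
    k: input-set objects that fall in the pathway
    """
    upper = min(K, n)
    total = comb(N, n)
    tail = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return tail / total

# Hypothetical example: background of 20, pathway of 5, input of 4, overlap of 2
p = hypergeom_enrichment_p(N=20, K=5, n=4, k=2)
print(round(p, 4))
```

A small P value indicates that the overlap between the input set and the pathway is larger than expected from random draws, which is the basis for ranking pathways by relevance.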
Discussion
Potential biological significance of metabolic changes in ovarian cancer
A considerable proportion of the differentiating metabolites identified during the development of our assay represents components of the histamine pathway (Supplementary Figs. S2 and S5). Serum histamine levels have also been reported to be altered in breast cancer (35). Histamine is known to serve as a receptor-dependent growth factor in some colon, gastric, breast cancer, and melanoma cell lines and to inhibit lymphocyte responsiveness via proliferation and activation of T lymphocyte suppressor cells (36). In addition, the relationship of histamine with the metabolism of nitric oxide, polyamines, and angiogenesis is an emerging area of interest in cancer biology (37). The overrepresentation of members of the histamine pathway in our metabolic panel suggests that these species might also be of functional importance in ovarian cancer.
Other pathways overrepresented in our data set suggest that the metabolism of several amino acids (e.g., glycine) involved in the de novo synthesis of purine nucleotides is also altered in ovarian cancer. Glycine, serine, and sarcosine were all tentatively identified as differentiating metabolites in our study, and these metabolites are components of overrepresented canonical pathways of alanine, serine, cysteine, threonine, and glycine metabolisms (Supplementary Figs. S3 and S5). Several amino acids from these pathways have been identified in an earlier MS-based metabolic profile of ovarian cancer tissues (38). Sarcosine, the N-methyl derivative of glycine, is elevated in invasive prostate cancer cell lines and in the tumors and urine of patients with metastatic prostate cancer (39). Also consistent with our findings, the levels of these amino acids have all been previously reported to be elevated in the sera of patients with colorectal (40), lung, and breast (41) cancers.
A number of other tentatively identified metabolites (e.g., dopamine, tyramine, 5-hydroxykynurenamine, and 1,2-dehydrosalsolinol) that are differentially expressed in the sera of ovarian cancer relative to control patients are all products of decarboxylation of their precursor amino acids catalyzed by aromatic L-amino acid decarboxylase (DDC). This enzyme and its metabolic products have previously been shown to be elevated in neuroendocrine neoplastic tissues (carcinoid, small cell lung cancer; ref. 41). We have recently reported that DDC is overexpressed in ovarian cancer (42). Networks built from our metabolic data set using dopamine, tyramine, 5-hydroxykynurenamine, and 1,2-dehydrosalsolinol and their precursors (Supplementary Figs. S4 and S5) are consistent with the finding that DDC (and its metabolic products) is (are) differentially expressed in ovarian cancer.
The utility of metabolic profiling as a diagnostic test for ovarian cancer
Previous efforts to discover more accurate biomarkers of ovarian cancer using MS have generally focused on large biopolymers, such as proteins (43). However, finding and validating biomarkers of this kind has been plagued by the fact that the serum proteome is extremely complex, comprising ∼2 × 10⁶ protein species with a dynamic range spanning 10 orders of magnitude (44). This inherent complexity, combined with current limitations in the proteomic analytical toolbox, could result in the convolution of biomarker variability with nonbiological sources of variance. Comprising ∼2,500 molecules with molecular weights of <1,000 Da, the known components of the serum metabolome could readily be distinguished from the serum proteome and more thoroughly investigated (45). As biological studies using more sensitive analytical tools with higher peak capacities improve our understanding of the serum metabolome, the number of detected and identified metabolites is expected to progressively increase, enriching the biological significance of discriminating spectral features useful in diagnostics.
MS analysis of serum samples typically employs chromatographic separation. This step is usually time-consuming and could result in increased costs and memory effects, which we believe was one of the confounding factors in our previous liquid chromatography-MS study (26). Our DART method circumvents chromatographic separation, making use of direct ionization without a matrix in a noncontact fashion. This decreases cross-contamination between experiments, enabling a better detection of differences between disease and control groups. Moreover, DART is able to ionize a broad range of metabolites with varying polarities (46), allowing simultaneous interrogation of multiple chemical species at minimal cost.
By combining the DART-TOF MS with a customized fSVM classification algorithm, we were able to distinguish sera from cancer patients and controls with 100% accuracy as estimated by the 64-30 split validation test, as well as 99% accuracy using the more stringent LOOCV test (100% sensitivity and 98% specificity). In this study, the use of high-resolution TOF MS was necessary for metabolite identification purposes, but the spectral data were later down-sampled for machine-learning purposes, suggesting that approaches similar to the one presented here, but based on low-resolution MS data acquisition, might also be conducive to high discriminatory power.
There is a general consensus among the ovarian cancer community that to be of clinical significance, a diagnostic test for ovarian cancer must have a minimum positive predictive value of ∼10% (47). Because the prevalence of ovarian cancer in the general population is low (∼0.04%), the accuracy of any potential screening test to be used in the general population must be extremely high (∼100%; ref. 3). Although our results indicate that our approach has great potential as a diagnostic tool of clinical significance, more extensive testing will be required to define its use in screening applications. Other, more immediate clinical applications of our assay may be in those subpopulations of women in which the prevalence of ovarian cancer is known to be relatively high. For example, the estimated incidence of ovarian cancer in women aged 20 and over with two first-degree relatives with ovarian cancer is 0.266% (48). Using incidence to approximate prevalence (49), we estimate a clinically significant 12% positive predictive value for our assay in this subpopulation, assuming the more stringent LOOCV values of 100% sensitivity and 98% specificity. Women 20 years of age and over who test positive for BRCA1 or BRCA2 are reported to have an incidence of ovarian cancer as high as 0.683% (50). For this group of women, our assay would have an estimated positive predictive value of 26%, well above the minimum value (∼10%) for a test to be considered of clinical significance.
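The positive predictive values quoted above follow from Bayes' rule applied to sensitivity, specificity, and prevalence. The sketch below reproduces the arithmetic using the LOOCV values and the two incidence figures given in the text; the function name is an illustrative choice.

```python
def positive_predictive_value(sensitivity, specificity, prevalence):
    """PPV = true positives / (true positives + false positives), per unit population."""
    true_pos = sensitivity * prevalence
    false_pos = (1.0 - specificity) * (1.0 - prevalence)
    return true_pos / (true_pos + false_pos)

# Incidence figures from the text, used as a proxy for prevalence
groups = [
    ("two first-degree relatives", 0.00266),
    ("BRCA1/BRCA2 positive", 0.00683),
]
for label, prevalence in groups:
    ppv = positive_predictive_value(1.0, 0.98, prevalence)
    print(f"{label}: {ppv:.0%}")
# two first-degree relatives: 12%
# BRCA1/BRCA2 positive: 26%
```

Because the false-positive term scales with (1 − prevalence), even a 2% false-positive rate dominates the denominator when prevalence is a fraction of a percent, which is why the estimate is so sensitive to the population tested.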
The results presented here show the potential application of our method as an ovarian cancer diagnostic of significant clinical value. In addition, if future studies establish that metabolic profiles of different cancers and other diseases are sufficiently distinct, our method might have the added advantage that it could be used to rapidly and inexpensively test for multiple diseases from a small serum sample.
Disclosure of Potential Conflicts of Interest
No potential conflicts of interest were disclosed.
Acknowledgments
Grant Support: Georgia Research Alliance/VentureLabs program (F.M. Fernández and J.F. McDonald), a Blanchard Professorship (F.M. Fernández), the Deborah Nash Harris Endowment Fund (J.F. McDonald), Northside Hospital (Atlanta; J.F. McDonald and B.B. Benigno), The Ovarian Cycle Fund (J.F. McDonald and B.B. Benigno), and the Robinson Family Foundation Fund (J.F. McDonald).
The costs of publication of this article were defrayed in part by the payment of page charges. This article must therefore be hereby marked advertisement in accordance with 18 U.S.C. Section 1734 solely to indicate this fact.