Abstract
Purpose: Scoring proliferation through Ki67 immunohistochemistry is an important component in predicting response to chemotherapy in patients with breast cancer. However, recent studies have cast doubt on the reliability of “visual” Ki67 scoring in the multicenter setting, particularly in the lower, yet clinically important, proliferation range. Therefore, accurate and standardized Ki67 scoring is pivotal both in routine diagnostics and in larger multicenter studies.
Experimental Design: We validated a novel fully automated Ki67 scoring approach that relies on only minimal a priori knowledge on cell properties and requires no training data for calibration. We applied our approach to 1,082 breast cancer samples from the neoadjuvant GeparTrio trial and compared the performance of automated and manual Ki67 scoring.
Results: The three groups of autoKi67 as defined by low (≤15%), medium (15.1%–35%), and high (>35%) automated scores showed pCR rates of 5.8%, 16.9%, and 29.5%, respectively. AutoKi67 was significantly linked to prognosis with overall and progression-free survival P values P(OS) < 0.0001 and P(PFS) < 0.0002, compared with P(OS) < 0.0005 and P(PFS) < 0.0001 for manual Ki67 scoring. Moreover, automated Ki67 scoring was an independent prognosticator in the multivariate analysis with P(OS) = 0.002, P(PFS) = 0.009 (autoKi67) versus P(OS) = 0.007, P(PFS) = 0.004 (manual Ki67).
Conclusions: The computer-assisted Ki67 scoring approach presented here offers a standardized means of tumor cell proliferation assessment in breast cancer that correlated with clinical endpoints and is deployable in routine diagnostics. It may thus help to solve recently reported reliability concerns in Ki67 diagnostics. Clin Cancer Res; 21(16); 3651–7. ©2014 AACR.
Clinical decision-making in breast cancer increasingly depends on the quantitative evaluation of molecular and immunohistochemical markers, such as Ki67, for tumor proliferation assessment. However, recent studies have shown a limited reliability of the established manual Ki67 scoring in breast cancer diagnostics in a multicenter setting. As a solution, we present a novel automated Ki67 scoring system, which we validate using data from the prospective neoadjuvant GeparTrio trial. Our approach is readily deployable in routine diagnostics and may thus help solve the recently reported reliability issues in Ki67 scoring.
Introduction
The prognostic and predictive utility of the proliferation marker Ki67 (1) has been studied extensively in breast cancer (2–8). Unlike other immunohistochemical markers, such as ER and PR, Ki67 requires a more precise quantification to stratify patients into different risk groups, which is usually done manually in routine diagnostics. However, Ki67 scoring by visual inspection has been shown to suffer from considerable inter- and intraobserver variability (9, 10), which is likely due in part to the fact that only a limited number of cells are evaluated in routine diagnostics. In this context, automated image analysis may help to streamline and standardize Ki67 scoring, for which several approaches have been proposed in recent years by academic groups (11–13) and companies. However, the previous approaches have not been validated using a prospective–retrospective trial design, which is considered necessary to reach a sufficient level of evidence to determine clinical utility when validation studies rely on archived tissue samples (14). Here, we present a novel, robust, and easy-to-use Ki67 scoring approach that follows the International Ki67 in Breast Cancer Working Group guidelines (9) and is based on a cell detection method and image analysis framework we recently developed (15, 16). We validate our approach by applying it to 1,082 breast cancer cases from the prospective neoadjuvant GeparTrio trial (17, 18).
Patients and Methods
Patient characteristics and treatment in the GeparTrio breast cancer cohort
Overall, 2,090 patients with breast cancer who had received no prior treatment had been recruited into the GeparTrio breast cancer cohort. A total of 2,072 patients started chemotherapy with TAC (docetaxel–doxorubicin–cyclophosphamide); tissue specimens from 1,166 of these patients were available here. For detailed patient characteristics, see Supplementary Table S1.
Specimen characteristics
Formalin-fixed, paraffin-embedded core-needle biopsies of breast tumors acquired in a routine diagnostic setting were used in this study. Tissue specimens were fixed at room temperature in 10% neutral-buffered formalin immediately after the biopsies were acquired (maximum time to fixation 3 minutes). Time of fixation was 16 to 72 hours. Subsequently, specimens were paraffin-embedded and further processed on the same or the following day.
Assay methods
Immunohistochemical staining and image acquisition.
A total of 1,166 digital images were taken from Ki67 stained whole core biopsies using a conventional light microscope (Leica DMRB and Olympus BX51) and microscope camera (both JVC KY-F75U). Ki67 immunohistochemistry was performed using clone MIB-1, 1:50, DAKO on an autostainer (Ventana), using the standard protocol “xt ultraview dab v3.” After visual inspection by an experienced pathologist, 84 images were considered not suitable for automated image analysis because of poor specimen or image quality.
Automated image analysis
The automated image analysis approach for computer-assisted Ki67 scoring we present here is based on a cell segmentation method developed in (16), an implementation of which is available in the open source image analysis framework CognitionMaster (15). To extend the approach to Ki67 immunohistochemistry assessment, the following modifications were necessary: While the original cell detection imposed very loose restrictions on cell nucleus sizes, we improved breast cancer cell detection by estimating the range of nucleus sizes in a presegmentation (the 0.95 percentile of the area of convex objects is used as target size). Moreover, we replaced the Otsu method (19), previously used to determine the threshold that distinguishes cell nuclei from image background, because it is based on the concept that an image is essentially separable into two intensity classes. However, typical histologic images are characterized by the presence of more than two intensity classes, such as no tissue, tissue but no cell nuclei, Ki67-negative cell nuclei, and Ki67-positive cell nuclei, each showing a certain hematoxylin intensity (Fig. 1 and Supplementary Fig. S1). In the modified cell nucleus detection method presented here, all objects with a minimum hematoxylin intensity and a positive intensity gradient (objects with higher hematoxylin intensity compared with the adjacent objects) are selected initially. Subsequently, the histograms for the (temporary) image background and foreground are computed (Fig. 1 and Supplementary Fig. S1C). Then, objects that fit better into the image background are iteratively removed from the foreground. Finally, pixels that were not detected by the original method but can be associated with a cell nucleus via the background and foreground histograms are set to “foreground” and are then grouped via connected component analysis.
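The iterative foreground refinement described above may be illustrated with the following minimal sketch. It is not the CognitionMaster implementation: the function names, the 16-bin histogram, the add-one smoothing, and the log-likelihood comparison used to decide whether an object "fits better" into the background are all assumptions made for illustration, and the final pixel reassignment and connected component steps are omitted.

```python
import math

BINS = 16        # assumed histogram resolution
MAX_VALUE = 255  # assumed 8-bit hematoxylin intensity scale

def bin_index(value):
    """Map an intensity value to its histogram bin."""
    return min(int(value * BINS / (MAX_VALUE + 1)), BINS - 1)

def histogram(pixels):
    """Normalized histogram with add-one smoothing (avoids log(0))."""
    counts = [1.0] * BINS
    for p in pixels:
        counts[bin_index(p)] += 1.0
    total = sum(counts)
    return [c / total for c in counts]

def log_likelihood(pixels, hist):
    """Log-likelihood of an object's pixels under a histogram."""
    return sum(math.log(hist[bin_index(p)]) for p in pixels)

def refine_foreground(candidates, background_pixels, max_iter=50):
    """Iteratively move candidate objects that look like background
    out of the foreground, recomputing both histograms each round."""
    foreground = {obj_id: list(px) for obj_id, px in candidates.items()}
    background = list(background_pixels)
    for _ in range(max_iter):
        fg_hist = histogram([p for px in foreground.values() for p in px])
        bg_hist = histogram(background)
        to_remove = [obj_id for obj_id, px in foreground.items()
                     if log_likelihood(px, bg_hist) > log_likelihood(px, fg_hist)]
        if not to_remove:
            break  # converged: every remaining object fits the foreground better
        for obj_id in to_remove:
            background.extend(foreground.pop(obj_id))
    return foreground

# Hypothetical data: objects A and B are dark nuclei, C resembles background.
candidates = {
    "A": [200, 210, 205, 220],
    "B": [190, 200, 215, 198],
    "C": [30, 35, 28, 40],
}
background = [20, 25, 30, 35, 40, 22, 31, 29, 33, 27, 38, 24]
kept = refine_foreground(candidates, background)
print(sorted(kept))
```

In this toy example, object C is moved to the background in the first iteration, after which the foreground no longer changes and the loop terminates.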
To classify cell nuclei as Ki67-negative or -positive, the immunohistochemical stain and counterstain are first separated through color deconvolution (20). Subsequently, the optimal threshold is computed by testing all possible thresholds (1–255). The F1 score (the harmonic mean of precision and recall) and a plausibility score are used to evaluate each threshold, and the threshold with the best combination of F1 score and plausibility is finally used. For a more detailed description of the threshold finding method, please see Supplementary Materials and Methods online. Finally, a cell nucleus is classified as Ki67 positive if at least 5% of the nucleus area is above the threshold. Another important aspect of Ki67 scoring is the classification of cell nuclei into tumor and nontumor. Here, we use the MaximumDistanceToBorder feature available in the CognitionMaster framework to distinguish between tumor and stromal cells. To this end, the 0.95 percentile of all cell nuclei is used to define the target value typical of tumor cells, and objects below 33% of the target value are marked as nontumor. Furthermore, objects that have a value smaller than 50% of the target value and an aspect ratio lower than 0.5 are also marked as nontumor. Please note that this routine does not classify all cell nuclei into tumor and nontumor correctly. However, regions can be interactively refined by the pathologist to exclude problematic regions if necessary and, more importantly, our automated Ki67 scoring validation was successful without any such manual intervention.
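The classification rules stated above (5% of nucleus area above threshold; 33% and 50% of the target value; aspect ratio below 0.5) can be sketched as follows. This is an illustrative sketch only: the dictionary layout, the function names, and the nearest-rank percentile are assumptions, the threshold is taken as given rather than derived from the F1/plausibility search described in the Supplementary Materials, and `max_dist` merely stands in for the MaximumDistanceToBorder feature.

```python
import math

def is_ki67_positive(stain_pixels, threshold, min_fraction=0.05):
    """Positive if at least 5% of the nucleus area exceeds the threshold."""
    above = sum(1 for p in stain_pixels if p > threshold)
    return above >= min_fraction * len(stain_pixels)

def percentile(values, q):
    """Nearest-rank percentile (an assumed definition)."""
    ordered = sorted(values)
    rank = max(1, math.ceil(q * len(ordered)))
    return ordered[rank - 1]

def is_nontumor(max_dist, aspect_ratio, target):
    """Apply the two size/shape rules relative to the target value."""
    if max_dist < 0.33 * target:
        return True
    return max_dist < 0.50 * target and aspect_ratio < 0.5

def ki67_score(nuclei, threshold):
    """Percentage of Ki67-positive nuclei among nuclei kept as tumor."""
    target = percentile([n["max_dist"] for n in nuclei], 0.95)
    tumor = [n for n in nuclei
             if not is_nontumor(n["max_dist"], n["aspect"], target)]
    if not tumor:
        return None
    positive = sum(is_ki67_positive(n["stain"], threshold) for n in tumor)
    return 100.0 * positive / len(tumor)

# Hypothetical nuclei: stain holds per-pixel Ki67 stain intensities.
nuclei = [
    {"max_dist": 10, "aspect": 0.9, "stain": [200] * 10},         # tumor, positive
    {"max_dist": 9,  "aspect": 0.8, "stain": [10] * 10},          # tumor, negative
    {"max_dist": 10, "aspect": 0.9, "stain": [10] * 19 + [200]},  # exactly 5% above
    {"max_dist": 2,  "aspect": 0.9, "stain": [200] * 10},         # too small: nontumor
    {"max_dist": 4,  "aspect": 0.3, "stain": [200] * 10},         # small and elongated
    {"max_dist": 4,  "aspect": 0.9, "stain": [10] * 10},          # tumor, negative
]
score = ki67_score(nuclei, threshold=128)
print(score)  # 50.0
```

Two of the four nuclei retained as tumor are positive in this example, giving a score of 50%; the two excluded nuclei would otherwise have inflated the score.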
Manual and automated Ki67 scoring
For each tumor, manual Ki67 scoring was performed by one of three pathologists (central counting), who examined the complete slide of the core biopsy and selected one area for exact counting. In cases with a heterogeneous distribution of Ki67, an area with the highest expression (the so-called hot spot) was selected. The selected area was documented in a microphotograph, and this microphotograph was used both for exact manual counting of 200 adjacent invasive tumor cells (compare ref. 3) and for automated image analysis. Even weak nuclear staining was regarded as sufficient for Ki67 positivity. For the automated scoring, a software tool based on the image analysis method described in the previous section was implemented using the CognitionMaster platform; it may be used in single- or multicase mode. Multicase batch processing was performed for 1,082 out of 1,166 Ki67-stained immunohistologic images (84 images were excluded due to poor image quality). These images were identical to those used for manual scoring and had been acquired by pathologists at 200× magnification and a resolution of 1,240 × 800 pixels using a standard microscope and microscope camera set-up. The software loads the image files sequentially and saves visual and numeric result representations automatically. Processing time was 1,502 minutes for all 1,082 images, with an average processing time of 1 minute 23 seconds per image. All processing shown here was performed on a standard PC running Microsoft Windows 7.
Study design
GeparTrio breast cancer study cohort.
The GeparTrio trial (NCT00544765) is a prospective randomized phase III study comparing neoadjuvant vinorelbine–capecitabine versus docetaxel–doxorubicin–cyclophosphamide chemotherapy in early nonresponsive breast cancer. Overall, 2,090 patients were centrally confirmed, and 2,072 patients started chemotherapy with TAC. For a total of 1,166 patients, Ki67 immunohistochemistry was assessed in core biopsies. Clinical endpoints were pathologic complete response (pCR, defined as no residual histologic evidence of tumor after chemotherapy at the time of surgery), overall survival (OS, defined as survival time after diagnosis), and disease-free survival (DFS, defined as time interval from therapy to disease recurrence). The multivariate analysis included the variables age, stage, nodal status, histology, grade, hormone receptor status, and HER2 status. Sample size was defined by the original GeparTrio trial; here, all patients were included for whom tissue samples were available. Clinical data on pCR were available for 1,166 patients. Clinical data on DFS and OS were available for 1,101 patients. Patients had given prior consent for tissue collection and biomarker analysis. Clinical follow-up data have been updated since the previous report on the GeparTrio cohort in (3). For the Consort statement, see Fig. 2; for further details, see the supplement.
Statistical analysis methods and Ki67 cutoff selection.
A statistical analysis plan was prepared prior to the start of this study, and the pathologists performing the scoring analysis were blinded to the clinical data. Clinical data were available only to the collaborators at the German Breast Group, to whom the scoring results were sent for further analysis. Time-to-event outcome parameters were estimated using the Kaplan-Meier product-limit method, and treatment groups were compared by log-rank test. Cox proportional hazards models were used to calculate hazard ratios (HR), including 95% confidence intervals (CI). All statistical tests were two-sided by default. A χ2 test for trends was used for pCR rates. Multivariate analysis adjusted for age (below vs. above 40 years), cT1–3 versus cT4, cN0 versus cN1, lobular histology versus others, G1–2 versus G3, HER2, and HR status. For consistency and to ensure comparability, we chose the same cut-off points that were previously used for the manual analysis (3), where they had been determined in a systematic way using the Cutoff Finder algorithm (21) based on the results for pCR, DFS, and OS as well as results from previous studies, including the St. Gallen recommendations (4, 22, 23). Cases were grouped as ≤15% Ki67-positive tumor cells (low), 15.01% to 35% (intermediate), and >35% (high). Statistical analysis, in particular survival analysis, was performed using the SPSS Statistics software package (Version 21, IBM). See Supplementary Data for the detailed statistical analysis plan (SAP).
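For illustration, the grouping by the cut-off points given above and the Kaplan-Meier product-limit estimate can be sketched in a few lines. This sketch is not the SPSS analysis used in the study; the function names and the follow-up data are hypothetical, and confidence intervals and the log-rank test are omitted.

```python
def ki67_group(score_percent):
    """Assign a Ki67 score (in %) to the low/intermediate/high group."""
    if score_percent <= 15:
        return "low"
    if score_percent <= 35:
        return "intermediate"
    return "high"

def kaplan_meier(times, events):
    """Product-limit estimate of the survival function.

    times:  follow-up time per patient
    events: 1 if the event was observed, 0 if the patient was censored
    Returns a list of (time, survival probability) at each event time.
    """
    data = sorted(zip(times, events))
    at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        observed = 0
        leaving = 0
        # Collect all patients whose follow-up ends at time t.
        while i < len(data) and data[i][0] == t:
            observed += data[i][1]
            leaving += 1
            i += 1
        if observed:
            survival *= 1.0 - observed / at_risk
            curve.append((t, survival))
        at_risk -= leaving  # events and censored patients leave the risk set
    return curve

# Hypothetical follow-up data (time in years); the second patient is censored.
curve = kaplan_meier([1, 2, 3, 4], [1, 0, 1, 1])
print(curve)  # [(1, 0.75), (3, 0.375), (4, 0.0)]
print(ki67_group(15.0), ki67_group(15.01), ki67_group(35.0), ki67_group(35.1))
```

Note that the censored patient at year 2 does not produce a step in the curve but still reduces the number at risk for the following event.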
Results
Baseline clinical data
The clinicopathologic characteristics of the study cohort are shown in Supplementary Table S1 (see also refs. 17, 18). The median follow-up was 82.4 months for DFS and 91.0 months for OS. For the flow of patients through the study, see Consort statement in Fig. 2.
An exemplary visualization of the automated Ki67 scoring results is shown in Fig. 3.
Correlation between manual and automated Ki67 scores and intraclass correlation
We performed computer-assisted automated Ki67 scoring based on digitized images, for which manual Ki67 scores were determined previously. The results show a correlation between manual and automated Ki67 scores of r(Pearson) = 0.89 (Fig. 4). The intraclass correlation between automated and manual scores was r(ICC) = 0.93 (P < 0.001).
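As a hedged illustration of the two agreement measures reported above, the following sketch computes the Pearson correlation and an intraclass correlation for paired scores. The ICC variant used in the study is not stated here; the sketch assumes the two-way random-effects, absolute-agreement, single-measurement form, often written ICC(2,1). The function names and score values are hypothetical.

```python
import math

def pearson_r(x, y):
    """Pearson product-moment correlation of two equally long sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def icc_2_1(x, y):
    """ICC(2,1) for two raters (assumed variant), via two-way ANOVA."""
    n, k = len(x), 2
    grand = (sum(x) + sum(y)) / (n * k)
    row_means = [(a + b) / k for a, b in zip(x, y)]
    col_means = [sum(x) / n, sum(y) / n]
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)   # between cases
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)   # between raters
    ss_total = sum((v - grand) ** 2 for v in list(x) + list(y))
    ss_error = ss_total - ss_rows - ss_cols
    ms_rows = ss_rows / (n - 1)
    ms_cols = ss_cols / (k - 1)
    ms_error = ss_error / ((n - 1) * (k - 1))
    return (ms_rows - ms_error) / (
        ms_rows + (k - 1) * ms_error + k * (ms_cols - ms_error) / n
    )

# Hypothetical paired Ki67 scores (%), not data from the study.
manual = [5, 12, 20, 33, 48, 60]
auto = [7, 10, 24, 30, 50, 57]
r = pearson_r(manual, auto)
icc = icc_2_1(manual, auto)
print(round(r, 3), round(icc, 3))
```

The two measures answer different questions: the Pearson coefficient is unaffected by a constant offset between methods, whereas an absolute-agreement ICC is lowered by it, which is why both are informative when comparing scoring methods.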
Comparison of tumors with low, intermediate, and high Ki67 expression for pCR, DFS, and OS by univariate and multivariate analyses
The pCR rates of the groups with low, intermediate, and high Ki67 were 5.9%, 15.6%, and 31.6% (manual: 3.3%, 13.0%, and 29.6%; P < 0.001, χ2 test for trends). The ORs of the three groups were 1.00, 2.95, and 7.36 (manual: 1.0, 4.4, and 12.4; P < 0.001, logistic regression). These findings remained statistically significant after adjusting for clinicopathologic parameters (P < 0.001, logistic regression).
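The relation between the pCR rates and the unadjusted odds ratios can be checked with a short calculation. The sketch below recomputes the ORs from the rounded rates quoted above, with the low-Ki67 group as reference; because the study's ORs come from logistic regression on the underlying counts, small rounding differences are to be expected. The function names are illustrative.

```python
def odds(rate):
    """Convert a response rate (0..1) to odds."""
    return rate / (1.0 - rate)

def odds_ratios(rates):
    """Odds ratio of each group relative to the first (reference) group."""
    reference = odds(rates[0])
    return [odds(r) / reference for r in rates]

# pCR rates for low, intermediate, and high Ki67 as quoted in the text.
auto_or = odds_ratios([0.059, 0.156, 0.316])
manual_or = odds_ratios([0.033, 0.130, 0.296])

print([round(v, 2) for v in auto_or])    # close to the reported 1.00, 2.95, 7.36
print([round(v, 1) for v in manual_or])  # close to the reported 1.0, 4.4, 12.4
```

The recomputed values agree with the reported ORs to within rounding, which also shows why the manual ORs are larger: the manual low-Ki67 reference group has a lower pCR rate (3.3% versus 5.9%).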
For autoKi67, DFS and OS were also significant on univariate (DFS: P = 0.00011; OS: P = 0.00005; Fig. 5) and multivariate analysis (DFS: P < 0.01; OS: P = 0.002) for the three groups. Corresponding significance levels for manual Ki67 scores in univariate analysis were P < 0.0001 (DFS) and P < 0.0005 (OS) and for multivariate testing P < 0.01 (DFS) and P < 0.01 (OS). HRs for autoKi67 low, intermediate, and high were HR(DFS) = 1, 1.6, 1.8 and HR(OS) = 1, 1.9, 2.2. Corresponding HRs for manual scoring for DFS were HR(DFS) = 1, 1.7, 1.9 and for OS HR(OS) = 1, 1.5, and 2.2, respectively (see Table 1 for confidence intervals).
All cases (N = 1,082) | Auto: DFS (95% CI) | Auto: HR (DFS) | Auto: OS (95% CI) | Auto: HR (OS) | Manual: DFS (95% CI) | Manual: HR (DFS) | Manual: OS (95% CI) | Manual: HR (OS) |
---|---|---|---|---|---|---|---|---|
Ki67 <15% | 7.2 (6.9–7.4) | 1 | 7.9 (7.7–8.1) | 1 | 7.3 (7.0–7.6) | 1 | 7.9 (7.7–8.1) | 1 |
Ki67 15%–35% | 6.3 (6.0–6.7) | 1.6 (1.2–2.1) | 7.1 (6.7–7.4) | 1.9 (1.3–2.7) | 6.4 (6.1–6.7) | 1.7 (1.3–2.3) | 7.4 (7.1–7.6) | 1.5 (1.0–2.3) |
Ki67 >35% | 6.4 (6.0–6.8) | 1.8 (1.3–2.4) | 7.2 (6.8–7.6) | 2.2 (1.5–3.3) | 6.4 (6.1–6.8) | 1.9 (1.4–2.6) | 7.2 (6.8–7.5) | 2.2 (1.5–3.3) |
P (univariate) | <0.001 | <0.001 | <0.001 | <0.001 | <0.001 | <0.001 | <0.001 | <0.001 |
P (multivariate) | — | <0.01 | — | <0.002 | — | <0.01 | — | <0.007 |
Discussion
In this study, we present a novel computer-assisted Ki67 scoring method for breast cancer. We validate the approach in a large cohort of breast carcinomas from a prospective neoadjuvant clinical trial and compare its performance to manual scoring results presented earlier (3). While several automated Ki67 evaluation approaches have been proposed (e.g., refs. 11–13), our approach is the first whose performance is compared with thorough manual scoring and evaluated using tissue samples from a large prospective clinical trial. Because manual and automated Ki67 scoring were performed on tumors from over 1,000 patients, a broad variety of breast cancer samples is covered, demonstrating the robust performance of our approach across morphologies.
Unlike existing methods, our automated Ki67 scoring is designed so that no training, calibration, or user interaction is needed. The approach by Mohammed and colleagues (13), for instance, requires observer-specified (fixed) intensity thresholds to classify nuclear Ki67 staining, while Konsti and colleagues (12) use a similar method that additionally requires the user to define intensity thresholds to classify hematoxylin staining into cell nuclei and background. However, parameters defined prior to analysis are very susceptible to the morphologic variability typical of many tumors and to technical variation, and therefore require constant adjustment. Other approaches for automated immunohistology evaluation, such as those described or applied in (11, 24, 25), rely on training of certain cell and tissue features, a procedure that is usually time consuming and often requires retraining of the method before analyzing new images. Our approach is capable of providing reliable results without any training, calibration, or interaction performed by the user, because we combine very generic assumptions on cell features with an automated adaptive threshold search to detect Ki67-positive and Ki67-negative cells. This minimizes detection biases due to variation of morphologic features or staining variation between cases. Selecting or excluding certain tissue regions or cells is optional, but was not used in this study.
Despite the described technical advantages over previous approaches, it is neither the intention nor within the scope of this study to show that our approach performs better than other automated scoring systems. The aim of this study was to present and validate an easy-to-use automated Ki67 scoring method in a prospective–retrospective clinical study approach using samples from a large prospective clinical trial. This validation, which to our knowledge has not been performed with any other automated Ki67 analysis system, is, in our opinion, essential to show clinical utility and to promote reliable quantitative histologic diagnostics.
Interestingly, the overall results are relatively similar between manual and automated approaches, which might appear surprising given the high variability of manual Ki67 scoring reported previously (10). However, this is likely due to the fact that the manual scoring results were obtained by using a standardized, stringent monocentric evaluation with exact counting of cells rather than semiquantitative estimations. Therefore, automated Ki67 scoring can be expected to perform more robustly than manual approaches in routine diagnostics, where considerable intra- and interobserver variability has been observed for manual scoring (9). But even if manual scoring performed similarly in routine settings, our automated approach would have several advantages over manual scoring: Whereas exact manual counting of about 500 cells takes between about 6 and 10 minutes, our approach requires about 30 to 40 seconds to score Ki67 in a typical routine diagnostic field of view. Similarly, the automated system is capable of evaluating the complete cohort of over 1,000 cases within a day, which is not feasible manually.
As a caveat, although the international Ki67 consortium has shown that interobserver variation is one main reason for the limited reproducibility of Ki67 scoring (9), it is certainly not the only cause, and automated methods will not be the ultimate solution. The selection of representative tumor regions, for instance, particularly in tumors showing a heterogeneous Ki67 expression with “hot spots,” and strong variability in staining protocols will need to be addressed separately. However, the reduction of observer-related variation is an important step, and automated methods are also necessary to study questions such as tumor heterogeneity, because large amounts of image data need to be processed in a standardized manner to address them.
Regarding the broader use of our approach in routine diagnostics, recent implementation at other sites also appears to be confirming the robustness of our method, and another systematic study addressing automated Ki67 scoring performance under routine diagnostic conditions, including the analysis of samples processed in different laboratories, is currently under way.
To summarize, our approach allows for a robust and standardized fully automated Ki67 scoring that may therefore contribute to a solution to the widely criticized variations observed in current manual evaluation that have fundamentally put the utility of Ki67 as a prognostic and predictive biomarker in breast cancer diagnostics into question.
Disclosure of Potential Conflicts of Interest
No potential conflicts of interest were disclosed.
Authors' Contributions
Conception and design: F. Klauschen, B. Gerber, J.-U. Blohmer, M. Dietel, C. Denkert, G. von Minckwitz
Development of methodology: F. Klauschen, S. Wienert, B. Gerber, C. Denkert
Acquisition of data (provided animals, acquired and managed patients, provided facilities, etc.): W.D. Schmitt, S. Loibl, B. Gerber, J.-U. Blohmer, J. Huober, E. Erbstößer, K. Mehta, C. Denkert, G. von Minckwitz
Analysis and interpretation of data (e.g., statistical analysis, biostatistics, computational analysis): F. Klauschen, S. Wienert, B. Gerber, J. Huober, K. Mehta, B. Lederer, C. Denkert, G. von Minckwitz
Writing, review, and/or revision of the manuscript: F. Klauschen, S. Wienert, W.D. Schmitt, S. Loibl, B. Gerber, J.-U. Blohmer, J. Huober, E. Erbstößer, B. Lederer, M. Dietel, C. Denkert, G. von Minckwitz
Administrative, technical, or material support (i.e., reporting or organizing data, constructing databases): F. Klauschen, S. Wienert, S. Loibl, J.-U. Blohmer, T. Rüdiger, B. Lederer, G. von Minckwitz
Study supervision: F. Klauschen, M. Dietel
The costs of publication of this article were defrayed in part by the payment of page charges. This article must therefore be hereby marked advertisement in accordance with 18 U.S.C. Section 1734 solely to indicate this fact.