Purpose: Recent small-sized genomic studies on the identification of breast cancer bioprofiles have led to profoundly heterogeneous results. Thus, we sought to identify distinct tumor profiles with possible clinical relevance based on clusters of immunohistochemical molecular markers measured on a large, single-institution case series.

Experimental Design: Tumor biological profiles were explored on 633 archival tissue samples analyzed by immunohistochemistry. Five validated markers were considered, i.e., estrogen receptors (ER), progesterone receptors (PR), Ki-67/MIB1 as a proliferation marker, HER2/NEU, and p53 in their original scale of measurement. The results obtained were analyzed by three different clustering algorithms. Four different indices were then used to select the different profiles (number of clusters).

Results: The best classification was obtained by creating four clusters. Notably, three clusters were identified according to low, intermediate, and high ER/PR levels. A further subdivision into two biologically distinct subtypes was determined by the presence/absence of HER2/NEU and of p53. As expected, the cluster with high ER/PR levels was characterized by a much better prognosis and response to hormone therapy compared to that with the lowest ER/PR values. Notably, the cluster characterized by high HER2/NEU levels showed intermediate prognosis, but a rather poor response to hormone therapy.

Conclusions: Our results show the possibility of profiling breast cancers by means of traditional markers, and have novel clinical implications on the definition of the prognosis of cancer patients. These findings support the existence of a tumor subtype that responds poorly to hormone therapy, characterized by HER2/NEU overexpression.

Breast cancer patients with apparently similar clinical and pathologic features can experience rather different disease dynamics or response to adjuvant therapies. This prognostic heterogeneity is considerable, and suggests a corresponding heterogeneity of the underlying biological variables. Hence, it should be possible to identify reliable, novel prognostic/predictive markers via proteomic or transcriptomic phenotyping, or by genetic analysis. Thus, molecular phenotyping might play an important role as an adjunct to classical clinical/pathologic staging procedures. Recent work focused on the quantitative identification of biological profiles of tumors from transcriptomic data. Transcriptomic profiles were then correlated with clinical behavior, according to the hypothesis that specific profiles could identify breast cancer subtypes (1-3). Cluster analysis (4) was applied as a statistical method capable of splitting data into subgroups (clusters) based on the relationships between the subjects and the measured variables. In breast cancer studies, cluster analyses have been conducted almost exclusively by hierarchic techniques. However, only a fraction of cases have clearly distinct features, whereas others show a less clear-cut profile, which makes assignment to one group or another rather difficult (cluster overlap). In such a situation, a parallel application of different clustering techniques might provide alternative or more efficient solutions and might help in assessing the overall reliability and biological relevance of this strategy of analysis.

It should also be noted that very few breast cancer profiling studies have been based on large case series (5, 6). This problem is particularly critical in genetic/transcriptomic analysis studies, which are typically characterized by small sample size (7-9). The very large number of analyzed variables critically adds to these difficulties.

The aim of this work was the profiling of breast cancers using a small number of molecular markers of biological and clinical importance. Different clustering techniques were used to improve the reliability of the conclusions reached and to assess the overall value of this strategy. To this purpose, data on a series of primary infiltrating breast cancers collected from 1983 to 1992 by the Pathology Department of the University of Ferrara were analyzed.

Case data. Clinical and pathologic information was collected from a consecutive series of 922 patients who underwent surgery for primary infiltrating breast cancer between 1983 and 1992 at the University of Ferrara. The study analyzed biomarkers measured on preserved tissue samples, with no diagnostic aims and no risk communication to the patient. Moreover, only a retrospective collection of follow-up information was done. Therefore, no explicit consent from the ethical committee was required.

Data on patient age, pathologic tumor size, histologic type, pathologic stage, and number of metastatic axillary lymph nodes were collected, as well as immunohistologic determinations of estrogen receptor (ER) status, progesterone receptors (PR) status, Ki-67/MIB-1 proliferation index (Ki-67), HER2/NEU (NEU), and p53 levels. The analysis was done on 633 cases for which complete information on all pathobiological variables was present (Table 1). The percentage of expression values of ER, PR, and NEU tended to distribute around the following values: 0%, 10%, 25%, 50%, 75%, and 100%, and were consequently discretized on these values. Percentages of Ki-67- and p53-expressing cells were analyzed without discretization, although they are reported in categories for convenience.
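The discretization step described above can be sketched as follows. This is only an illustration: the text does not state the exact assignment rule, so snapping each measurement to the nearest reference level (with ties going to the lower level) is an assumption, and `LEVELS` and `discretize` are hypothetical names.

```python
# Hypothetical helper: snap a measured percentage of expressing cells to the
# nearest of the reference levels quoted in the text (0, 10, 25, 50, 75, 100).
# ASSUMPTION: nearest-level assignment; the paper does not give the exact rule.
LEVELS = (0, 10, 25, 50, 75, 100)


def discretize(percent, levels=LEVELS):
    """Return the level closest to `percent`; ties go to the lower level."""
    if not 0 <= percent <= 100:
        raise ValueError("percentage must lie in [0, 100]")
    # Sort key: distance first, then the level itself to break ties downward.
    return min(levels, key=lambda level: (abs(level - percent), level))


if __name__ == "__main__":
    print([discretize(p) for p in (3, 8, 40, 62.5, 90)])
```

Under this assumed rule, a value such as 62.5%, which is equidistant from 50 and 75, is assigned to 50.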

Table 1.

Distribution in 633 cases of primary invasive breast cancer of the clinical and pathologic variables, of the discretized biologic variables ER, PR, NEU, and of the categorized (for MCA and histograms) variables, Ki-67 and p53

Variable    Frequency (%)
Age (y)  
    ≤40 50 (7.9) 
    41-50 125 (19.8) 
    51-55 73 (11.5) 
    56-70 268 (42.3) 
    >70 117 (18.5) 
Histologic type  
    Ductal 483 (76.3) 
    Lobular 93 (14.7) 
    Medullary 8 (1.3) 
    Special 49 (7.7) 
Pathologic stage  
    I 390 (61.6) 
    II 186 (29.4) 
    III 8 (1.3) 
    IV 49 (7.7) 
Number of metastatic lymph nodes  
    N− 341 (53.9) 
    1-3 177 (28.0) 
    4-9 59 (9.3) 
    >9 56 (8.8) 
ERs  
    0 116 (18.3) 
    10 36 (5.7) 
    25 80 (12.6) 
    50 131 (20.7) 
    75 184 (29.1) 
    100 86 (13.6) 
PgR  
    0 182 (28.8) 
    10 79 (12.5) 
    25 65 (10.3) 
    50 76 (12.0) 
    75 108 (17.0) 
    100 123 (19.4) 
HER2/NEU  
    0 328 (51.8) 
    10 132 (20.9) 
    25 45 (7.1) 
    50 26 (4.1) 
    75 78 (12.3) 
    100 24 (3.8) 
Ki-67/MIB-1 proliferation index (Ki-67)  
    q1 (0-2.5) 129 (20.4) 
    q2 (2.5-5.75) 125 (19.7) 
    q3 (5.75-13) 143 (22.6) 
    q4 (13-30) 114 (18.0) 
    q5 (30-90.8) 122 (19.3) 
p53  
    0 (0-0.9) 293 (46.3) 
    10 (0.9-10) 175 (27.7) 
    75 (10-75) 78 (12.3) 
    100 (75-100) 87 (13.7) 

NOTE: The ductal histotype includes mixed histologic types. The special histologic type includes tubular, mucinous, papillary, and cribriform histotypes. The categorization here by age and number of metastatic lymph nodes is that used in the MCA.

Immunohistochemistry. The H222 (ER-ICA Abbott, Abbott Laboratories, Chicago, IL) and 6F11 antibodies (NeoMarkers, Fremont, CA) were used to reveal the ER with equivalent results. The KD68 (PR-ICA Abbott) and PR-1A6 (NeoMarkers) antibodies were used to reveal the PR. Staining for ER and PR was done on cryostat sections (ER-ICA or PR-ICA kits, following the manufacturer's instructions) or on paraffin-embedded sections using the streptavidin-biotin-peroxidase method (Biogenex, San Ramon, CA). Ki-67 was determined on cryostat sections with Ki-67 (Dako, Glostrup, Denmark) and on permanent section with MIB1 (Biomeda, Foster City, CA). HER2/NEU was revealed by Ab-1 (Zymed Lab, Inc., San Francisco, CA) and p53 was revealed with DO7 (NeoMarkers). Immunohistochemical procedures were done with an automatic immunostaining device (Ventana Medical System, Tucson, AZ) and Ventana Kits (Strasbourg, France).

Immunostaining for ER, PR, Ki-67, and p53 was quantified with a Computerized Image Analysis System (CAS 200, Becton Dickinson, San Jose, CA; Fig. 1), as previously described (5, 10, 11).

Fig. 1.

Computerized image analysis system workflow. Description of the area delimitation (A), masking for nuclei (B), thresholding in the masked area (C), pixel counting and averaging (D).


Only cancer cells with distinct nuclear immunostaining for ER, PR, Ki-67, and p53 were recorded as positive. Cancer cells were considered positive for HER2/NEU when they showed distinct plasma membrane immunoreactivity. Marker expression was quantified as the percentage of tumor cells, as an input variable for cluster analysis. Preliminary analysis showed that measurements of the positive nuclear area (PNA) and quantitative immunocytochemical scores (QIC score = PNA × percentage of positive stain / 10) provided similar results (ref. 12 and unpublished results).

The percentage of stained nuclei for ER, PR, Ki-67, or p53 was calculated as the proportion of the stained area versus the total nuclear area. The percentage of stained nuclei was quantified using quantitative ER/PR/p53 analysis or cell proliferation index (Ki-67/MIB-1) software programs, respectively. Measurements were taken from 25 randomly selected microscopic fields (40× objective) and 40,000 μm² of nuclear area in each tumor section, and average values were calculated. An additional threshold of at least 2,000 measured nuclei was applied to proliferation index estimates.

Statistical methods. Clustering algorithms were adopted for grouping tumors with similar biological characteristics. A hierarchic agglomerative algorithm with Ward's generalized criterion and two nonhierarchical techniques (the K-Means and K-Medoids algorithms) were applied (13, 14).
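To make the nonhierarchical approach concrete, the following is a minimal sketch of a K-Medoids-style procedure. It is not the S-Plus routine used in the study: it uses a simple alternating assign/update scheme rather than the full partitioning-around-medoids swap search, the Manhattan distance is an assumption, and the six five-marker profiles (ER, PR, Ki-67, NEU, p53) are invented for illustration.

```python
import random


def manhattan(a, b):
    """Manhattan distance between two marker profiles (an assumed metric)."""
    return sum(abs(x - y) for x, y in zip(a, b))


def assign(points, medoids, dist):
    """Label each point with the index of its nearest medoid."""
    return [min(range(len(medoids)), key=lambda j: dist(p, points[medoids[j]]))
            for p in points]


def k_medoids(points, k, dist=manhattan, max_iter=100, seed=0):
    """Alternating K-Medoids sketch; returns (labels, medoid indices)."""
    rng = random.Random(seed)
    medoids = rng.sample(range(len(points)), k)
    for _ in range(max_iter):
        labels = assign(points, medoids, dist)
        new_medoids = []
        for j in range(k):
            members = [i for i, lab in enumerate(labels) if lab == j]
            if not members:  # empty cluster: keep the previous medoid
                new_medoids.append(medoids[j])
                continue
            # The new medoid is the member with the smallest total distance
            # to all other members of its cluster.
            new_medoids.append(min(
                members,
                key=lambda c: sum(dist(points[c], points[i]) for i in members)))
        if new_medoids == medoids:
            break
        medoids = new_medoids
    return assign(points, medoids, dist), medoids


if __name__ == "__main__":
    # Invented profiles: (ER, PR, Ki-67, NEU, p53), all in percent.
    profiles = [
        (100, 100, 2, 0, 0), (75, 100, 5, 0, 0), (100, 75, 3, 10, 0),
        (0, 0, 40, 75, 80), (0, 10, 30, 100, 90), (10, 0, 50, 75, 100),
    ]
    print(k_medoids(profiles, k=2))
```

Because each cluster is represented by an actual case (the medoid) rather than an average, the result is less sensitive to outlying values than K-Means, which is one common reason for running both.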

Following Tibshirani et al. (15), four indices, which could be applied to both hierarchical and nonhierarchical algorithms, were used in order to determine the optimal number of groups. Namely, the CH, KL, H, and GAP indices were used according to Calinski and Harabasz (16), Krzanowski and Lai (17), Hartigan (18), and Tibshirani et al. (15), respectively. Each algorithm was used to create between 1 and 20 groups. The values of the four indices were then calculated for each subdivision. When an index suggested the same number of clusters for different algorithms, the κ statistic (19) was used to assess agreement between the classifications produced by these different algorithms (i.e., to assess whether the clusters created by different algorithms contained the same tumors).
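As an illustration of two of the quantities named above, the sketch below computes the Calinski-Harabasz (CH) index for a given partition and Cohen's κ between two classifications. It is a plain-Python illustration, not the S-Plus or SAS code used in the study; the function names are hypothetical, and the κ function assumes that cluster labels from the two algorithms have already been matched (cluster numbering is otherwise arbitrary).

```python
def calinski_harabasz(points, labels):
    """CH index: [B / (k - 1)] / [W / (n - k)] for a labelled partition.

    B is the between-cluster and W the within-cluster sum of squares.
    Raises ZeroDivisionError if every cluster has zero spread (W == 0).
    """
    n, dims = len(points), len(points[0])
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    k = len(clusters)
    if k < 2 or k >= n:
        raise ValueError("CH index needs 2 <= k < n clusters")
    grand = [sum(p[d] for p in points) / n for d in range(dims)]
    between = within = 0.0
    for members in clusters.values():
        centroid = [sum(p[d] for p in members) / len(members)
                    for d in range(dims)]
        between += len(members) * sum(
            (centroid[d] - grand[d]) ** 2 for d in range(dims))
        within += sum((p[d] - centroid[d]) ** 2
                      for p in members for d in range(dims))
    return (between / (k - 1)) / (within / (n - k))


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa between two classifications of the same cases.

    ASSUMPTION: labels are already aligned across the two classifications.
    """
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    expected = sum((labels_a.count(c) / n) * (labels_b.count(c) / n)
                   for c in set(labels_a) | set(labels_b))
    return (observed - expected) / (1 - expected)


if __name__ == "__main__":
    pts = [(0,), (2,), (10,), (12,)]
    print(calinski_harabasz(pts, [0, 0, 1, 1]))
    print(cohen_kappa([1, 1, 2, 2], [1, 1, 2, 1]))
```

In practice the CH index would be evaluated for each candidate number of clusters and the partition with the largest value retained, whereas κ compares two partitions that have the same number of clusters.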

The frequency histograms of the biomarkers in each cluster were compared with the corresponding histogram in the whole sample. This was used to explore specific biomarker distributions across the clusters.

Multiple correspondence analysis (MCA) was used to better visualize the resulting cluster bioprofiles (20, 21). MCA can be applied both to categorical and continuous variables. For the latter, MCA has the advantage of implying neither linearity nor specific distributions. MCA allows us to visualize the association between markers and clusters on two-dimensional plots. The convenience of using two-dimensional plots is at the expense of the loss of a certain amount of information on the association patterns. To quantify the information retained in a given two-dimensional plot, the “fraction of explained information” was used (this corresponds to the percentage of the total variability that is accounted for by the two axes of the plot). The amount of information explained was calculated following Benzécri (22).
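For readers unfamiliar with the "fraction of explained information", the sketch below shows the correction commonly attributed to Benzécri for MCA: only axes whose principal inertia exceeds 1/Q (Q being the number of active variables) are retained, each is rescaled, and percentages are taken relative to the sum of the rescaled values. That the study used exactly this form is an assumption; the function name and the example eigenvalues are hypothetical.

```python
def benzecri_explained(eigenvalues, n_variables):
    """Fractions of corrected inertia per retained MCA axis.

    `eigenvalues` are principal inertias of the indicator matrix and
    `n_variables` is Q, the number of active categorical variables.
    Corrected value: ((Q / (Q - 1)) * (lambda - 1 / Q)) ** 2 for lambda > 1/Q.
    """
    q = n_variables
    if q < 2:
        raise ValueError("at least two active variables are required")
    threshold = 1.0 / q
    corrected = [((q / (q - 1)) * (lam - threshold)) ** 2
                 for lam in eigenvalues if lam > threshold]
    total = sum(corrected)
    return [c / total for c in corrected]


if __name__ == "__main__":
    # Invented eigenvalues for a five-variable analysis (Q = 5).
    fractions = benzecri_explained([0.6, 0.4, 0.15, 0.1], n_variables=5)
    print(fractions, sum(fractions[:2]))
```

The information retained by a two-dimensional plot is then the sum of the first two fractions.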

The five tumor markers under study (ER, PR, Ki-67, NEU, and p53) were used to generate the MCA plot (active information). The plot position of the categories of the active variables and the knowledge of which categories most contributed to the construction of the MCA plot were used to interpret the result obtained. Points close to each other in a plot correspond to associated marker categories and clusters. When points are close to the center of a spherical pattern, the variables are considered noncorrelated.

The number of metastatic lymph nodes, age, histology, pathologic stage, and the cluster classifications were only plotted on the existing MCA plane without modifying it (passive information), for a subsequent study of the relationships of the clinical and pathologic classifications with the biological characterization of the tumors. If a passive variable was not associated with the active variables used for the construction of the MCA plot, the categories of the passive variable should not be considered for the interpretation of the MCA results.

A test for the independence of passive variables from the active ones was used [Valeur test; ref. (21), page 123]. This test also permits us to evaluate the separation among the clusters based on the biological profile of the patients. For the sake of clarity, if the absolute value of the test statistic is >2, this should be considered significant at an approximately conventional 5% level. The variables Ki-67 and p53 were categorized for MCA and for univariate analysis with histograms, according to Table 1.

To evaluate the disease dynamics of patients identified by single marker values and cluster groups, event-free survival (EFS) probability was considered separately for node-negative cases without adjuvant therapy and for patients who received only hormone therapy. To this aim, EFS curves were estimated with the Kaplan-Meier method and, for single markers, compared with the log-rank test. Because the cluster groups were inferred from the sample data, no formal statistical test was adopted to compare the estimated curves for each group; instead, relative risks at 5 years of follow-up were used to quantify their separation. Cluster and survival analyses were done with S-Plus; κ statistic values were calculated with SAS V8; MCA was done with SPAD 3 (21).
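The survival quantities mentioned above can be illustrated with a minimal Kaplan-Meier sketch. The text does not state how the relative risks were derived, so computing them as the ratio of estimated event probabilities (1 − EFS) at 5 years is an assumption; all function names and the toy follow-up data are hypothetical.

```python
def kaplan_meier(times, events):
    """Kaplan-Meier estimate as a list of (time, survival) steps.

    `events` holds 1 for an observed event and 0 for a censored case.
    """
    data = sorted(zip(times, events))
    at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        n_events = n_removed = 0
        # Group all cases sharing the same follow-up time.
        while i < len(data) and data[i][0] == t:
            n_events += data[i][1]
            n_removed += 1
            i += 1
        if n_events:
            survival *= 1 - n_events / at_risk
            curve.append((t, survival))
        at_risk -= n_removed
    return curve


def survival_at(curve, t):
    """Estimated event-free probability at time `t` (step function)."""
    current = 1.0
    for time, survival in curve:
        if time > t:
            break
        current = survival
    return current


def relative_risk(efs_group, efs_reference):
    """ASSUMPTION: ratio of event probabilities, (1 - EFS) / (1 - EFS_ref)."""
    return (1 - efs_group) / (1 - efs_reference)


if __name__ == "__main__":
    curve = kaplan_meier([1, 2, 3, 4, 5], [1, 0, 1, 0, 1])
    print(curve, survival_at(curve, 3.5))
    print(relative_risk(0.25, 0.80))
```

Under this assumed definition, a group with a 5-year EFS of about 25% compared with a reference group at about 80% has a relative risk of roughly 3.75.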

The distributions of clinical/pathologic and immunohistologic variables of the 633 cases analyzed are shown in Table 1. A high frequency of postmenopausal women was observed, as expected in a consecutive case series, with 18.5% of elderly patients (>70 years old). A small fraction of cases (11.5%) was at perimenopausal age (51-55 years old). Ductal carcinomas were the most represented (76.3%), as expected, whereas lobular carcinomas were present in ∼15%. Tumors were mainly pathologic T stage I (61.6%) and stage II (29.4%). Macroscopic axillary lymph node involvement was apparent in 46.1% of the cases, in good agreement with expectations.

Extensive work was done on the comparison between analysis on frozen and formaldehyde-fixed, paraffin-embedded sections (ref. 12 and unpublished results).

Excellent agreement between the two methods was shown, allowing an interchangeable use of the two technologies and a comprehensive analysis of the corresponding data. Immunostaining of all markers was quantified with a Computerized Image Analysis System. The percentage of positive-stained nuclei was calculated as the proportion of the positively stained area versus the total nuclear area. Measurements were the average of 25 randomly selected microscopic fields in each tumor section. We verified that at ≥15 optical fields, the measurements of the positive nuclear area reached a low SD and a stable coefficient of variation (ref. 12 and references therein). An additional threshold of at least 2,000 measured nuclei was applied to proliferation index estimates. This protocol allowed us to obtain reliable numerical values for average expression that were used for clustering purposes. Our strategy for expression measurements included thresholding for staining intensity (10, 11). This was a fixed value for each single marker across all sections. Pixel counting above threshold was subsequently done. Thus, the single measurement that was taken condensed information on expression levels and on the fraction of expressing cells in a single numerical value. The latter is, in principle, equivalent to the “total amount of a given antigen” in a tumor section. This was a distinct advantage for the chosen clustering strategy.

As a preliminary step, the prognostic effect of individual markers was evaluated by conventional univariate analysis on dichotomous ER, PR, Ki-67, HER2, and p53 values. ER was shown to be a significant predictor of response to hormone therapy in treated patients; Ki-67 proved to be a significant prognostic factor in untreated, node-negative patients (cutoff, 13%; P = 0.0398). HER2 was a significant prognostic factor both for untreated and for treated patients, resulting in a better prognosis at lower values of the marker. Lower values of p53 were associated with a better prognosis in p53-expressing untreated patients, although they did not reach conventional statistical significance (P = 0.183). The results obtained are, thus, very much in line with data from current literature.

Immunohistochemical data obtained as described were clustered using the three distinct clustering approaches. Notably, the CH index indicated three major clusters as the optimal solution for all three clustering algorithms. The κ statistic showed high overall concordance between cases classified in each of the clusters generated by each clustering algorithm (Table 2).

Table 2.

Value of the κ statistic for the evaluation of the concordance of the three groups created with the three algorithms

Comparison    Concordance (κ)
K-Medoids vs. K-Means 0.97 
K-Medoids vs. Ward 0.83 
K-Means vs. Ward 0.86 

On the other hand, the GAP, KL, and H indices indicated different numbers of clusters depending on the algorithm (Table 3). Solutions with three or four clusters arose most frequently. The hierarchical algorithm seemed to be the least effective algorithm according to the different indices. A dendrogram generated by this analytic procedure is presented in Fig. 2. An optimal cut-point for the tree is not evident.

Table 3.

Optimum number of clusters estimated by each index (absolute maximum afforded by the CH index and first maximum afforded by the other indices) for each of the three clustering algorithms (K-Means, K-Medoids, Ward)

Algorithm    Gap    CH    KL    H
K-Means 
K-Medoids 
Ward 14 
Fig. 2.

Dendrogram generated by the hierarchical algorithm with Ward criterion, using the AGNES routine in the S-Plus software.


The solutions with three clusters generated by K-Means and four clusters generated by K-Medoids were compared. Clusters 1 and 3 of the two solutions essentially overlap. Cluster 4 generated by K-Medoids includes 75 cases of cluster 2 generated by K-Means. Only 10 cases are not concordant. The histograms for the five biological markers in the total sample and in each of the four clusters created by K-Medoids are shown in Fig. 3.

Fig. 3.

Histograms of ERs (A), PRs (B), proliferation index Ki-67 MIB-1 (C), HER2/NEU (D) and oncosuppressor gene p53 (E). The histograms on the top of each panel report the percentage of patients for each of the discretized values of the considered tumor marker. The four histograms in the bottom relate to the distribution in each of the four clusters created by the K-Medoids algorithm.


Multiple correspondence analysis. The information explained by the first two axes amounted to 89%. Therefore, the two-dimensional plots are expected to be effective representations of the associations displayed. The MCA plot positions of the biological marker categories (active information) and of the K-Means and K-Medoids cluster classification (passive information) were overlaid in Fig. 4, to highlight the patterns of association.

Fig. 4.

MCA plot showing the projections of the five discretized biological markers (active information) and of age of patients, tumor histology, pT stage, number of metastatic lymph nodes and the two cluster classifications “three groups/k-means” and “four groups/k-medoids” (passive information). ER, ER0-ER10-ER25-ER50-ER75-ER100; PR, PR0-PR10-PR25-PR50-PR75-PR100; NEU, NEU0-NEU10-NEU25-NEU50-NEU75-NEU100; Ki-67, Ki-67q1-Ki-67q2-Ki-67q3-Ki-67q4-Ki-67q5; p53, p53 0-p53 10-p53 75-p53 100. Age (age, <41 for patients <41; age, 41-50; age, 51-55; age, 56-70; age, ≥71); tumor histology, ductal (which includes mixed cases), lobular, medullar, and special (which includes tubular, mucinous, papillary, and cribriform); pT stage (PT1a, PT1b, PT1c, PT2, PT3, and PT4); number of metastatic lymph nodes (N- for node-negative patients, N1-3, N4-9, N > 9); four groups/k-medoids (K-Medoids 1, K-Medoids 2, K-Medoids 3, K-Medoids 4); three groups/k-means (K-Means 1, K-Means 2, K-Means 3).


As for the contribution of the categories to the construction of the MCA axes, separation along the first axis of the MCA was provided by the categories of ER and PR absent, high NEU, Ki-67, and p53 (right side of the graph), and intermediate to high ER/PR values (left side of the graph). The second MCA axis mainly separated the highest values of ER and PR from low PR, Ki-67, and absent p53.

ER and PR steroid receptors showed a similar pattern, as expected. On the other hand, low ER values seemed associated with the highest NEU, p53, and Ki-67 values, whereas absent PR was mainly associated with high NEU expression. The highest ER/PR values were isolated on the bottom left part of the graph (cluster 1; K-Means 1 and K-Medoids 1), whereas intermediate ER/PR values were reported on the top left quadrant (cluster 2; K-Means 2 and K-Medoids 2). Null p53 and low Ki-67 and NEU also seemed to be associated with intermediate ER/PR values. Clusters 1 and 2 were therefore associated with less aggressive tumor features. Cluster 3 (K-Means 3 and K-Medoids 3) was mainly associated with low ER, high p53, and intermediate to high NEU, whereas cluster 4 (K-Medoids only) seemed mainly associated with low PR and intermediate/high NEU, the most aggressive bioprofile of the clustered tumors.

The three clusters generated by K-Means followed a pattern similar to that of ER/PR values (low, intermediate, and high) from left to right of the graph. Cluster 2 was the nearest to the center of the MCA plot, thus showing less distinguishable characteristics with respect to the total sample. The four clusters created by K-Medoids showed an additional subdivision of the tumors according to HER2/NEU and p53 values. The “Valeur test” statistics corresponding to the K-Medoids classification are reported in Table 4.

Table 4.

Valeur test statistic for assessing the independence between the cluster variable K-Medoids and the first and second axes of the MCA plot

Cluster    First axis    Second axis
K-Medoids, cluster 1 −14.0 −8.1 
K-Medoids, cluster 2 −2.2 7.9 
K-Medoids, cluster 3 15.2 −3.5 
K-Medoids, cluster 4 7.8 4.6 

The four clusters by K-Medoids seem significantly associated with the biological variables that determine the first and the second axes of the MCA plot. Notably, the second cluster is the least significant with respect to the first axis. Categories of clinical/pathologic variables (age, histology, pT stage, and number of metastatic lymph nodes) are also reported in Fig. 4 as passive information.

Several categories of such variables project close to the origin of the axes and far from the identified clusters, consistent with a weak association with the bioprofiles. On the other hand, the number of metastatic lymph nodes tended to increase from left to right along the horizontal axis, according to the more aggressive features of tumors in cluster 3 (K-Means 3 and K-Medoids 3) and cluster 4 (K-Medoids 4) versus those in clusters 1 and 2 (K-Means 1 and 2 and K-Medoids 1 and 2). Similarly, young ages and high pT values seemed associated with clusters 3 and 4 (the behavior of pT3 was less interpretable, due to the small sample size). In an opposite pattern, Special and lobular histotypes were more frequent in clusters 1 and 2 versus clusters 3 and 4.

The strong overlap between the classifications obtained with K-Means and K-Medoids supported the reliability of the two shared clusters, and the likelihood of the split of cluster 2 by K-Medoids supported the identification of two additional clusters. Hence, the solution with four clusters created by K-Medoids was used as a basis to evaluate relevant clinical outcomes [prognosis of node-negative patients without adjuvant therapy (263 patients, Fig. 5A) and response to hormone therapy (169 patients, Fig. 5B)].

Fig. 5.

EFS and relative risks for the four groups of K-Medoids. EFS for 263 node-negative patients without hormone therapy. The Kaplan-Meier curves are based on the four clusters created by the K-Medoids algorithm (A). EFS for 169 patients treated with hormone therapy. The Kaplan-Meier curves are based on the four clusters created by the K-Medoids algorithm (B).


The distribution histograms of marker values in each of the four clusters for the selected groups of patients used in survival analysis show a pattern very similar to the one reported in Fig. 3. The EFS curves of node-negative cases in clusters 1 and 2 essentially overlapped (Fig. 5A). The corresponding estimated EFS at the 5-year follow-up was about 80% in both cases. The worst EFS at 5 years of follow-up (60%) was shown by cases in cluster 3. The cases in cluster 4 showed an intermediate prognostic pattern.

As to the response to hormone therapy (Fig. 5B), the profile characterized by high ER/PR levels (cluster 1) had the best EFS at 5 years of follow-up (∼80%), consistent with a good response to hormone therapy, whereas the profile characterized by high HER2/NEU levels (cluster 4) had the worst EFS (∼25%). Of interest, this profile showed a markedly poorer response to hormonal treatment compared with cases in cluster 3, who showed even lower ER levels, together with higher proliferation index and p53 expression. The relative risks of relapse at 5 years of follow-up for clusters 2, 3, and 4 versus cluster 1 are listed in Table 5.

Table 5.

Relative risks for patients in clusters 2, 3, and 4 versus patients in cluster 1

Comparison    Relative risks for patients N− without therapy    Relative risks for patients with hormone therapy
Cluster 2 vs. cluster 1 0.82 1.64 
Cluster 3 vs. cluster 1 1.77 2.27 
Cluster 4 vs. cluster 1 1.47 3.70 

Cancer development critically depends on the accumulation of multiple, driving genetic changes (23, 24). Correspondingly, specific genetic alterations have been correlated to specific stages of development of a cancer, and have been shown to appear sequentially in the course of tumor progression (24). As a consequence, it is unlikely that the analysis of isolated genetic alterations (or of the corresponding proteins) will provide adequate indications on the biological nature (aggressiveness) of a tumor. On the other hand, the analysis of clusters of molecular markers might provide appropriate means for the analysis of the biological characteristics of a tumor, including response to therapy. In particular, it may contribute to the dissection of the prognostic/predictive heterogeneity of cancers currently categorized as similar by conventional clinical and pathologic staging.

Cluster analysis is a powerful multivariate technique that allows us to investigate whether subgroups with homogeneous features can be identified in a given sample of tumors. However, it should be treated with caution, as any clustering algorithm could lead to a trivial and/or fictitious grouping of tumors if used without proper care. In the present work, three different clustering algorithms and four different indices were adopted to assess the feasibility of grouping profiles of a large, consecutive, single-institution series of breast cancers. Of interest, a minimum of three tumor clusters was proposed by all of the algorithms used, and comparison of K-Medoids and K-Means results supports the introduction of at least one additional cluster. This number is consistent with the results of recent molecular cluster studies (25, 26), and differs in part from earlier articles, which showed a tendency to identify only two profiles using hierarchical algorithms (5, 8, 9).

The results of our cluster analyses highlight well-known clinical/pathologic cancer profiles along tumor progression pathways. In particular, the K-Medoids 4-groups solution seems to correspond well to models of tumor progression that go from hormone-sensitive, minimal-change lesions (clusters 1 and 2) to more advanced tumors (clusters 3 and 4), characterized by higher proliferative rate and by more frequent oncogene/tumor suppressor alterations. Among other “classical” pathologic prognostic factors, age at diagnosis shows a trend from less aggressive lesions (clusters 1 and 2; patients mostly older than 55) to more aggressive ones (clusters 3 and 4; younger women). Special histotypes, associated with cluster 1, confirm their hormone sensitivity and overall low proliferative and oncogene expression rates. Lobular cancers often express lower PRs with respect to “special” types; accordingly, they are more represented in cluster 2. Finally, medullary carcinomas are centered close to cluster 3, confirming their high proliferative rate and down-regulation of hormone receptors.

Metastatic progression, in terms of number of metastatic lymph nodes, did not seem to strongly influence cluster profiles, although patients in clusters 3 and 4 tended to have a higher number of metastatic lymph nodes than patients in clusters 1 and 2. Explicit measurements of the metastatic propensity of primary tumors were beyond the scope of this work. However, we note that the results obtained are consistent with models in which transformed cells possess a diffuse metastatic ability at early stages of tumor development (27). On the other hand, our findings do not support models in which metastatic development depends on the progressive accumulation of favorable mutations, leading to the emergence of metastatic cells only at late stages of tumor development (28).

The MCA patterns highlighted above were closely consistent with those observed in previous analyses of independent case series (29–31). In particular, a similar distribution of ER- and PR-expressing cases was observed for patients without axillary lymph node involvement, in spite of the use of biochemical measurement methods (30, 31) instead of immunohistochemistry. Moreover, low, intermediate, and high levels of p53 were shown to be associated with intermediate, high, and low ER and PR, respectively (30). Consistent trends of other markers (Ki-67 and HER2/NEU) were also recorded, although in the context of less refined dichotomous classifications (29). Because cluster profiles are clearly associated with the above patterns in independent studies, such profiles are expected to be of widespread significance for breast cancer classification, both in terms of biological characteristics and of response to hormonal therapy.

Recent transcriptomic studies have proposed subdividing breast tumors into luminal and basal subtypes, according to their ER levels (rich and poor, respectively; refs. 1, 2). Our findings are consistent with this classification. However, in our hands, the ER-positive tumors were further subdivided into clusters 1 and 2 according to different PR levels (high for cluster 1 and medium/low for cluster 2). Of interest, heterogeneity among ER-positive tumors was also shown by genomic (32) and immunohistochemical (9) analyses, although on small samples of cases. It is noteworthy that the separation of clusters 1 and 2 in terms of ER/PR can be observed only if markers are evaluated as a continuum or at least on ordinal scales, not when they are considered as dichotomous variables (positive versus negative). Therefore, avoiding the use of cutoff values allows us to identify bioprofiles that would otherwise have been hidden. Given current clinical practices, this is of clear relevance for future studies and for potential applications in clinical settings.
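
The effect of dichotomization on the separation of clusters 1 and 2 can be illustrated with a small sketch. The marker values and the 10% positivity cutoff below are hypothetical and chosen only for the example; they are not taken from this series.

```python
# Illustrative sketch only: invented (ER, PR) values and a hypothetical cutoff.

CUTOFF = 10.0  # assumed positivity threshold (% of stained cells)

def dichotomize(profile, cutoff=CUTOFF):
    """Recode each marker as 1 (positive) or 0 (negative)."""
    return tuple(1 if value >= cutoff else 0 for value in profile)

def euclidean(a, b):
    """Euclidean distance between two marker profiles."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

tumor_a = (90.0, 85.0)  # high ER, high PR (cluster 1-like)
tumor_b = (85.0, 20.0)  # high ER, lower PR (cluster 2-like)
tumor_c = (2.0, 1.0)    # receptor-poor

# On the original scale the two ER-positive tumors are clearly apart ...
continuous_ab = euclidean(tumor_a, tumor_b)
# ... but after dichotomization their profiles become identical.
binary_ab = euclidean(dichotomize(tumor_a), dichotomize(tumor_b))

print(continuous_ab, binary_ab, dichotomize(tumor_c))
```

Once both ER-positive tumors are recoded as (1, 1), no clustering algorithm can tell them apart, whereas on the original scale the difference in PR keeps them well separated.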

EFS followed distinct trends in the four groups. In particular, groups 1 and 2 had better EFS than groups 3 and 4 for both treated and untreated patients. A direct comparison of the curves for patients with different therapies is not appropriate because therapy was assigned according to different clinical/biological features to begin with. However, it is worth noting that group 3 showed the worst EFS among untreated patients, whereas group 4 had the worst performance among treated patients. These findings are consistent with a lack of response to tamoxifen for tumors with high expression of HER2/NEU (33). This result is particularly relevant because the lack of sensitivity to hormonal treatment cannot be attributed to low ER values only (34), as clusters 3 and 4 showed equally low ER expression. The differential EFS of cluster 3 versus cluster 4 in treated versus untreated patients is in agreement with the results reported in ref. 35, where “c-ErbB-2 status defined a group of patients with a poor prognosis among those usually considered to have good prognosis, such as patients with low p53 values.”

Of interest, Perou et al. (1) reported that ER-negative breast carcinomas encompass at least two biologically distinct tumor subtypes (basal-like and HER2/NEU-positive) “which may need to be treated as distinct diseases.” The existence of a subgroup of basal-like tumors with HER2/NEU overexpression was confirmed by others (36, 37). Notably, the profiles of cluster 3 (with NEU distribution close to that of the total sample) and cluster 4 (with prevalent high NEU values) correspond to this distinction. However, most transcriptomic studies are limited by their small sample size and by the low signal/noise ratio of the measurements (38). Thus, evidence for specific grouping and prognostic procedures coming from these studies should be treated with caution. Interestingly, three clusters were also found by immunohistochemical analysis of cytokeratin expression versus conventional markers on tissue microarrays (7). One of the clusters was characterized by HER2/NEU overexpression. The other two were distinguished by the expression of “basal” CK5/6 versus CK8/18, which were correlated with low and high ER levels, respectively. The HER2/NEU cluster was separated from that expressing CK5/6 and p53. These results are reminiscent of the separation between cluster 3 (p53 high) and cluster 4 (HER2/NEU high) in the present study. Furthermore, a recent tissue microarray study (25) proposed a three-cluster classification, in which at least clusters 1 and 3 seemed to correspond to the homologous ones of the present study.

The four-cluster solution of this report does not imply that the underlying number of tumor subtypes is truly four. Indeed, one might expect that by increasing the number of investigated markers and/or the precision of their measurement, finer subdivisions would emerge. Cluster 2 in this work indeed seems more heterogeneous compared with the other three. Thus, we expect that this group of tumors might be split into more homogeneous subgroups with further investigation.

Grant support: Supported in part by the Italian Ministero dell'Istruzione, dell'Università e della Ricerca, the European Union Network of Excellence “Biopattern” (FP6-2002-IST-1 no. 508803), and by the Italian Association for Cancer Research.

The costs of publication of this article were defrayed in part by the payment of page charges. This article must therefore be hereby marked advertisement in accordance with 18 U.S.C. Section 1734 solely to indicate this fact.

Note: F. Ambrogi and E. Biganzoli contributed equally to this work.

We thank an anonymous reviewer for the constructive criticism and useful suggestions and Dr. Massimo Pedriali who prepared Fig. 1.

1. Perou CM, Sørlie T, Eisen MB, et al. Molecular portraits of human breast tumors. Nature 2000;406:747–52.
2. van't Veer LJ, Dai H, van de Vijver MJ, et al. Gene expression profiling predicts clinical outcome of breast cancer. Nature 2002;415:530–6.
3. Iwao K, Matoba R, Ueno N, et al. Molecular classification of primary breast tumors possessing distinct prognostic properties. Hum Mol Genet 2002;15:199–206.
4. McLachlan GJ. Cluster analysis and related techniques in medical research. Stat Methods Med Res 1992;1:27–48.
5. Querzoli P, Ferretti S, Albonico G, et al. Application of quantitative analysis to biologic profile evaluation in breast cancer. Cancer 1995;76:2510–7.
6. Menard S, Casalini P, Tomasic G, et al. Pathobiologic identification of two distinct breast carcinoma subsets with diverging clinical behaviors. Breast Cancer Res Treat 1999;55:169–77.
7. Korsching E, Packeisen J, Agelopoulos K, et al. Cytogenetic alterations and cytokeratin expression patterns in breast cancer: integrating a new model of breast differentiation into cytogenetic pathways of breast carcinogenesis. Lab Invest 2002;82:1525–33.
8. Korsching E, Packeisen J, Helms MW, et al. Deciphering a subgroup of breast carcinomas with putative progression of grade during carcinogenesis revealed by comparative genomic hybridisation (CGH) and immunohistochemistry. Br J Cancer 2004;90:1422–8.
9. Yoshida N, Omoto Y, Inoue A, et al. Prediction of prognosis of estrogen receptor-positive breast cancer with combination of selected estrogen-regulated genes. Cancer Sci 2004;95:496–502.
10. Bacus S, Flowers JL, Press MF, Bacus JW, McCarty KS, Jr. The evaluation of estrogen receptor in primary breast carcinoma by computer-assisted image analysis. Am J Clin Pathol 1998;90:233–9.
11. Esteban JM, Battifora H, Warsi Z, Bailey A, Bacus S. Quantification of estrogen receptors on paraffin-embedded tumors by image analysis. Mod Pathol 1991;4:53–7.
12. Querzoli P, Albonico G, Ferretti S, et al. MIB-1 proliferative activity in invasive breast cancer measured by image analysis. J Clin Pathol 1996;49:926–30.
13. Kaufman L, Rousseeuw P. Finding groups in data. New York: Wiley; 1990.
14. S-Plus 2000 Guide to statistics. Seattle: Mathsoft; 1999.
15. Tibshirani R, Walther G, Hastie T. Estimating the number of clusters in a data set via the gap statistic. J R Stat Soc [Ser B] 2001;63:411–23.
16. Calinski RB, Harabasz J. A dendrite method for cluster analysis. Commun Stat 1974;3:1–27.
17. Krzanowski WJ, Lai YT. A criterion for determining the number of groups in a data set using sum of squares clustering. Biometrics 1985;44:23–34.
18. Hartigan J. Clustering algorithms. New York: Wiley; 1975.
19. Fleiss JL. Statistical methods for rates and proportions. New York: Wiley; 1981.
20. Greenacre MJ. Theory and applications of correspondence analysis. Academic Press; 1994.
21. Lebart L, Morineau A, Piron M. Statistique exploratoire multidimensionnelle. Paris: Dunod; 1995.
22. Benzécri JP. Sur le calcul des taux d'inertie dans l'analyse d'un questionnaire. Cah Anal Donnees 1979;4:377–8.
23. Hanahan D, Weinberg RA. The hallmarks of cancer. Cell 2000;100:57–70.
24. Vogelstein B, Kinzler KW. The multistep nature of cancer. Trends Genet 1993;9:138–41.
25. Makretsov NA, Huntsman DG, Nielsen TO, et al. Hierarchical clustering analysis of tissue microarray immunostaining data identifies prognostically significant groups of breast carcinoma. Clin Cancer Res 2004;10:6143–51.
26. Ahr A, Holtrich U, Solbach C, et al. Molecular classification of breast cancer patients by gene expression profiling. J Pathol 2001;195:312–20.
27. Gray JW. Evidence emerges for early metastasis and parallel evolution of primary and metastatic tumors. Cancer Cell 2003;4:4–6.
28. Hynes RO. Metastatic potential: generic predisposition of the primary tumor or rare, metastatic variants—or both? Cell 2003;113:821–3.
29. Gasparini G, Boracchi P, Bevilacqua P, Mezzetti M, Pozza F, Weidner N. A multiparametric study on the prognostic value of epidermal growth factor receptor in operable breast carcinoma. Breast Cancer Res Treat 1994;29:59–71.
30. Gion M, Boracchi P, Dittadi R, et al. Quantitative measurement of soluble cytokeratin fragments in tissue cytosol of 599 node negative breast cancer patients: a prognostic marker possibly associated with apoptosis. Breast Cancer Res Treat 2000;59:211–21.
31. Coradini D, Daidone MG, Boracchi P, et al. Time-dependent relevance of steroid receptors in breast cancer. J Clin Oncol 2000;18:2702–9.
32. Sorlie T, Perou CM, Tibshirani R, et al. Gene expression patterns of breast carcinomas distinguish tumor subclasses with clinical implications. Proc Natl Acad Sci U S A 2001;98:10869–74.
33. Hait WN. The prognostic and predictive values of ECD-HER-2. Clin Cancer Res 2001;7:2601–4.
34. Arpino G, Green SJ, Allred DC, et al. HER-2 amplification, HER-1 expression, and tamoxifen response in estrogen receptor-positive metastatic breast cancer: a southwest oncology group study. Clin Cancer Res 2004;10:5670–6.
35. Ferrero-Pous M, Hacene K, Bouchet C, Le Doussal V, Tubiana-Hulin M, Spyratos F. Relationship between c-erbB-2 and other tumor characteristics in breast cancer prognosis. Clin Cancer Res 2000;6:4745–54.
36. Sotiriou C, Neo SY, McShane LM, et al. Breast cancer classification and prognosis based on gene expression profiles from a population-based study. Proc Natl Acad Sci U S A 2003;100:10393–8.
37. Yu K, Lee CH, Tan PH, et al. A molecular signature of the Nottingham prognostic index in breast cancer. Cancer Res 2004;64:2962–8.
38. Simon R, Radmacher MD, Dobbin K, McShane LM. Pitfalls in the use of DNA microarray data for diagnostic and prognostic classification. J Natl Cancer Inst 2003;95:14–8.