Abstract
Scoring of immunohistochemistry (IHC) staining is often done by non-pathologists, especially in large-scale tissue microarray (TMA)-based studies. Studies on the validity and reproducibility of scoring results from non-pathologists are limited. Therefore, our main aim was to assess interobserver agreement between trained non-pathologists and an experienced histopathologist for three IHC markers with different subcellular localization (nucleus/membrane/cytoplasm).
Three non-pathologists were trained by a senior histopathologist in recognizing adenocarcinoma and in IHC scoring. Kappa statistics were used to analyze interobserver and intraobserver agreement for 6,249 TMA cores from a colorectal cancer series.
Interobserver agreement between non-pathologists (independently scored) and the histopathologist was “substantial” for the nuclear and membranous IHC markers (κrange = 0.67–0.75 and κrange = 0.61–0.69, respectively), and “moderate” for the cytoplasmic IHC marker (κrange = 0.43–0.57). Scores of the three non-pathologists were also combined into a “combination score” (if at least two non-pathologists independently assigned the same score to a core, this was the combination score). This increased agreement with the pathologist (κnuclear = 0.74; κmembranous = 0.73; κcytoplasmic = 0.57). Interobserver agreement between non-pathologists was “substantial” (κnuclear = 0.78; κmembranous = 0.72; κcytoplasmic = 0.61). Intraobserver agreement of non-pathologists was “substantial” to “almost perfect” (κnuclear,range = 0.83–0.87; κmembranous,range = 0.75–0.82; κcytoplasmic = 0.69). Overall, agreement was lowest for the cytoplasmic IHC marker.
This study shows that adequately trained non-pathologists are able to generate reproducible IHC scoring results that are similar to those of an experienced histopathologist. A combination score of at least two non-pathologists yielded optimal results.
Non-pathologists can generate reproducible IHC results after appropriate training, making analyses of large-scale molecular pathological epidemiology studies feasible within an acceptable time frame.
Introduction
The introduction of the tissue microarray (TMA) technology by Kononen and colleagues (1) in 1998 has enabled large-scale studies using archival formalin-fixed paraffin-embedded (FFPE) tissue blocks (2, 3). The TMA technology has the advantage that sampling of cores leaves the donor block relatively intact, allowing it to be sampled multiple times (3, 4). Furthermore, immunohistochemistry (IHC) on TMAs is cost effective and less time consuming than performing IHC on full tissue sections (2–6). In addition, a higher level of assay standardization can be achieved, improving reproducibility of results (3, 4, 6–8).
Several studies have shown a high degree of concordance between IHC results obtained from TMA sections and full sections when three 0.6 mm cores per case were used (9–13). Interestingly, a study by Gavrielides and colleagues (14) found slightly higher interobserver agreement for HER2 scoring on TMAs compared with full sections, suggesting a potential benefit of the restricted field of view.
Manual scoring of TMA sections can take a considerable amount of time if individual scores need to be provided for hundreds or thousands of cores (7, 15). Although automated image analysis has been proposed as a potential alternative to manual scoring, IHC markers that are present in both tumor cells and other cell populations are challenging to assess automatically (16).
Scoring of IHC-stained sections is often done by non-pathologists (17, 18). However, studies on the validity of results from non-pathologists are limited. Jaraj and colleagues (19) suggested that, after adequate training, non-pathologists are able to produce valid and reproducible IHC results for a cytoplasmic marker. However, it has been suggested that, apart from expert histopathologist knowledge, the agreement of IHC results between observers might also be affected by the subcellular localization of the marker of interest (nucleus/membrane/cytoplasm) (20). Few studies have investigated scoring agreement for markers with different subcellular localizations. One of these studies reported similar overall kappa values for scoring of staining in different subcellular compartments (21, 22), whereas another study reported considerably lower agreement for scoring of cytoplasmic immunostaining (23).
We hypothesized that there is good interobserver agreement between trained non-pathologists and pathologists for IHC scoring on TMAs, and that the interobserver agreement does not depend on the subcellular localization of the staining. Therefore, the aims of the current study were to (i) assess interobserver agreement between trained non-pathologists and an experienced pathologist, and (ii) assess agreement of three IHC markers with different subcellular localization (nucleus/membrane/cytoplasm).
Materials and Methods
Study population, tissue collection, and TMA construction
For TMA construction, tissue blocks from colorectal cancer resections of cases from the Netherlands Cohort Study (NLCS) were collected retrospectively from Dutch hospitals (24–26). Hematoxylin & eosin (H&E)-stained sections were reviewed and the area with the highest tumor density was identified. From this area, three 0.6-mm-diameter cores with tumor and three cores with normal epithelium were sampled per case for TMA construction (TMA-Grandmaster, 3DHISTEC). In total, 78 TMA blocks were constructed containing 7,963 tumor cores.
Ethical approval was obtained from Medical Ethical Committee MUMC, number METC 2019-1085.
Immunohistochemistry
Five-μm-thick serial sections were cut from all 78 TMA blocks and subjected to IHC using an automated immunostainer (DAKO Autostainer Link 48, Glostrup). TP53, GLUT1, and PTEN were chosen as markers to assess interobserver and intraobserver agreement in scoring nuclear, membranous, and cytoplasmic immunoreactivity, respectively, as these are established IHC markers routinely used in the clinical setting. Details of primary antibodies and staining protocols are shown in Table 1. Staining protocols for all markers were optimized to eliminate background and nonspecific staining. Sections were counterstained with Mayer's hematoxylin (VWR International B.V.), dehydrated, and mounted with a glass coverslip and xylene-based mounting medium (DPX, Sigma-Aldrich). All TMA sections were scanned at 40× magnification using an Aperio scanner (Leica Microsystems) at the University of Leeds (Leeds, UK) Scanning Facility.
| Antibody | Clone | Supplier (catalog number) | Antigen retrieval | Dilution | Incubation time | Visualization system | Chromogen |
|---|---|---|---|---|---|---|---|
| Pan-CK | AE1/AE3 | DAKO (GA05361-2) | PT high^a | RTU^b | 10 minutes | EnVision FLEX^c | DAB^e |
| TP53 | DO-7 | DAKO (M700101-2) | PT high^a | RTU^b | 20 minutes | EnVision FLEX^c | DAB^e |
| GLUT1 | — | Thermo Fisher Scientific (RB-9052-P) | PT low^d | 1:200 | 20 minutes | EnVision FLEX^c | DAB^e |
| PTEN | 6H2.1 | DAKO (M362729-2) | PT high^a | 1:100 | 20 minutes | EnVision FLEX^c | DAB^e |
^a High pH retrieval (K8004) for 20 minutes on the Dako PT link (Agilent Technologies).
^b RTU: ready-to-use.
^c EnVision FLEX Visualization Kit (K8008, DAKO).
^d Low pH retrieval (K8005) for 20 minutes on the Dako PT link (Agilent Technologies).
^e DAB: 3,3′-diaminobenzidine.
Quality control
Presence of adenocarcinoma was confirmed for every individual core by reviewing the H&E-stained TMA sections. When tumor identification was difficult because of poor tumor differentiation or a large number of inflammatory cells, pan-cytokeratin staining was used to identify tumor cells.
Immunohistochemical scoring
Three non-pathologists (G.E. Fazzi: histology technician; K. Offermans: PhD student; J.C.A. Jenniskens: PhD student) were trained by a senior histopathologist (H.I. Grabsch) in (i) recognizing adenocarcinoma on H&E-stained TMA sections; (ii) recognizing immunoreactivity and distinguishing between immunoreactivity in the nucleus, membrane, and cytoplasm; and (iii) scoring of two TMA sections (∼200 cores) for every immunostaining to ensure that the same criteria were used by all assessors.
After training, the three non-pathologists scored all tumor cores for TP53, GLUT1, and PTEN immunostainings. The scores from the three non-pathologists were combined into a “combination score”: if at least two non-pathologists independently assigned the same score to a core, this score became the combination score; if all non-pathologists assigned different scores, the core was categorized as “no agreement.” Because not all cores were scored by three non-pathologists for GLUT1 (Table 2), the combination score for the remaining cores was based on two non-pathologists. When comparing scores from pairs of trained non-pathologists with the score of the pathologist, the non-pathologists' scores were combined as described for the combination score of three non-pathologists.
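The combination score is thus a simple majority vote over assessor scores. A minimal sketch of this rule follows (Python; the function name and score labels are our illustration, as the study did not publish code):

```python
from collections import Counter
from typing import Optional

def combination_score(scores: list[str]) -> Optional[str]:
    """Majority-vote combination score as described above.

    `scores` holds the independent per-core scores of two or three
    non-pathologists. Returns the score assigned by at least two
    assessors, or None ("no agreement") when all scores differ.
    """
    score, count = Counter(scores).most_common(1)[0]
    return score if count >= 2 else None

# Two of three assessors agree on a core: that score is the combination score.
assert combination_score(["<=10%", "<=10%", "11-50%"]) == "<=10%"
# All three assessors differ: the core is categorized as "no agreement".
assert combination_score(["0%", "<=10%", "11-50%"]) is None
# With only two assessors (e.g., some GLUT1 cores), both must agree.
assert combination_score(["11-50%", "11-50%"]) == "11-50%"
```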
| Assessor | Experience | Nuclear (TP53) | Membranous (GLUT1) | Cytoplasmic (PTEN) | Intraobserver^b |
|---|---|---|---|---|---|
| 1 | NP | 100% | 25%^a | 100% | X |
| 2 | NP | 100% | 100% | 100% | 10% |
| 3 | NP | 100% | 100% | 100% | 10% |
| 4 | P | 10% | 10% | 10% | X |
Abbreviations: NP, non-pathologist; P, pathologist.
^a Assessor 1 left the project early because of an unforeseen work relocation.
^b Percentage of slides rescored per protein.
For evaluation of intraobserver agreement, two non-pathologists (assessors 2 and 3) scored a random 10% of TMA sections (range: 538–681 cores) per marker a second time after an interval of at least 5 months. These scores were only used to assess intraobserver agreement. To assess interobserver agreement between pathologist and non-pathologists, an experienced pathologist (I. Samarska) evaluated the same random 10% of TMA sections for every marker. The contribution of each assessor to the IHC scoring of the different markers is shown in Table 2.
TP53 positivity was defined as unequivocal strong nuclear staining and scored semiquantitatively as published previously (13, 27), with minor adaptations, as: (i) no positive tumor nuclei; (ii) ≤10% positive tumor nuclei; (iii) 11% to 50% positive tumor nuclei; (iv) 51% to 90% positive tumor nuclei; and (v) 91% to 100% positive tumor nuclei (Fig. 1A).
GLUT1 positivity was defined as any membranous (complete or incomplete) immunostaining of tumor cells, and scored as published previously (28, 29): (i) no tumor cells with membranous immunostaining; (ii) ≤10% tumor cells with membranous immunostaining; (iii) 11% to 50% tumor cells with membranous immunostaining; (iv) >50% tumor cells with membranous immunostaining (Fig. 1B).
PTEN scoring was performed as described previously (30), comparing cytoplasmic immunostaining intensity of the tumor cells with that of adjacent stromal cells. PTEN immunostaining was classified as: (i) negative (no PTEN staining in the tumor cells); (ii) weak (staining intensity in the tumor cells weaker than in the stromal cells); (iii) moderate (similar staining intensity in tumor and stromal cells); or (iv) strong (staining intensity in the tumor cells stronger than in the stromal cells), see Fig. 1C. In case of heterogeneous immunostaining, the region with the highest staining intensity prevailed.
Uninterpretable (e.g., folded cores) or missing cores were categorized as “uninterpretable” and excluded from analyses for all markers.
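To make the cutoffs above concrete, here is a sketch mapping an observed percentage of positive tumor cells to the TP53 and GLUT1 scoring categories (the function names and integer coding are our illustration, not part of the published protocols):

```python
def tp53_category(pct_positive_nuclei: float) -> int:
    """Map % positive tumor nuclei to the five TP53 categories (i)-(v)."""
    if pct_positive_nuclei == 0:
        return 1   # (i) no positive tumor nuclei
    if pct_positive_nuclei <= 10:
        return 2   # (ii) <=10% positive tumor nuclei
    if pct_positive_nuclei <= 50:
        return 3   # (iii) 11%-50% positive tumor nuclei
    if pct_positive_nuclei <= 90:
        return 4   # (iv) 51%-90% positive tumor nuclei
    return 5       # (v) 91%-100% positive tumor nuclei

def glut1_category(pct_membranous: float) -> int:
    """Map % tumor cells with membranous staining to the four GLUT1 categories."""
    if pct_membranous == 0:
        return 1   # (i) no membranous immunostaining
    if pct_membranous <= 10:
        return 2   # (ii) <=10% of tumor cells
    if pct_membranous <= 50:
        return 3   # (iii) 11%-50% of tumor cells
    return 4       # (iv) >50% of tumor cells

assert tp53_category(75) == 4 and glut1_category(75) == 4
```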
Statistical analysis
Interobserver and intraobserver agreement were assessed using all cores that passed quality control. Cohen's kappa was used to assess interobserver agreement between assessor pairs and intraobserver agreement within one assessor (31). Fleiss' kappa was used to assess interobserver agreement between more than two assessors (32). All kappa values were weighted (33), taking into account the magnitude of disagreement (e.g., ≤10% vs. >50% is a larger disagreement than ≤10% vs. 11%–50%): a weight of 0.5 was assigned to adjacent categories and a weight of zero to non-adjacent categories. Non-weighted Fleiss' kappa was used to assess variation in interobserver agreement between scoring categories. Kappa confidence intervals were calculated with the bootstrap method using 1,000 repetitions (34–36). The interpretation of kappa values is shown in Supplementary Table S1. Agreement between each pair of assessors was determined, as well as agreement between the combination score of two or three non-pathologists and the pathologist's score (for the latter, cores for which no agreement was reached were excluded from analyses). Data were analyzed using Stata (version 15.1, StataCorp).
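For illustration, a minimal sketch of Cohen's weighted kappa with this study's weighting scheme (1 for exact agreement, 0.5 for adjacent categories, 0 otherwise) and a percentile bootstrap CI with 1,000 resamples. The study itself used Stata; this Python/NumPy version is our approximation, not the authors' code:

```python
import numpy as np

def weighted_kappa(a, b, n_cat):
    """Cohen's weighted kappa for two raters.

    `a`, `b`: integer category codes (0..n_cat-1), one pair per core.
    Agreement weights: 1 on the diagonal, 0.5 for adjacent categories,
    0 for non-adjacent categories, as described in the text.
    """
    a, b = np.asarray(a), np.asarray(b)
    idx = np.arange(n_cat)
    dist = np.abs(idx[:, None] - idx[None, :])
    w = np.where(dist == 0, 1.0, np.where(dist == 1, 0.5, 0.0))
    # Observed joint distribution and chance-expected distribution.
    obs = np.zeros((n_cat, n_cat))
    np.add.at(obs, (a, b), 1)
    obs /= len(a)
    exp = np.outer(obs.sum(axis=1), obs.sum(axis=0))
    po, pe = (w * obs).sum(), (w * exp).sum()
    return (po - pe) / (1 - pe)

def bootstrap_ci(a, b, n_cat, reps=1000, seed=1):
    """95% percentile bootstrap CI for weighted kappa (1,000 resamples)."""
    rng = np.random.default_rng(seed)
    a, b = np.asarray(a), np.asarray(b)
    stats = []
    for _ in range(reps):
        i = rng.integers(0, len(a), len(a))  # resample cores with replacement
        stats.append(weighted_kappa(a[i], b[i], n_cat))
    return np.percentile(stats, [2.5, 97.5])
```

As a quick sanity check, identical score vectors give κ = 1, and two independent random score vectors give κ near 0.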
Results
In total, 78 TMA blocks containing 7,963 tumor cores were available. After quality control, 1,714 (21.5%) cores were excluded (464 missing cores; 1,135 cores lacking tumor tissue; 115 uninterpretable tissue cores), leaving 6,249 tumor cores for analyses. All cores were evaluated by at least two assessors (Table 2). Frequency distributions of scores assigned by all assessors for nuclear (TP53), membranous (GLUT1), and cytoplasmic (PTEN) immunoreactivity are shown in Supplementary Tables S2–S4.
Interobserver agreement
Non-pathologists versus pathologist
Weighted kappa values for interobserver agreement between non-pathologists and pathologist are shown in Table 3 (non-weighted kappa values in Supplementary Table S5). Kappa values of each individual non-pathologist with the pathologist showed “substantial” agreement for nuclear (κrange = 0.67–0.75) and membranous immunostainings (κrange = 0.61–0.69), and “moderate” agreement for cytoplasmic immunostaining (κrange = 0.43–0.57). The combination score of the three non-pathologists showed “substantial” agreement with the pathologist's score for nuclear (κ = 0.74) and membranous immunoreactivity (κ = 0.73), and “moderate” agreement for cytoplasmic immunoreactivity (κ = 0.57). The combination score of two non-pathologists showed similar agreement with the pathologist's score as the combination score of three non-pathologists (κnuclear,range = 0.75–0.81; κmembranous,range = 0.75–0.79; κcytoplasmic,range = 0.54–0.65). For the majority of scores (range, 90.3%–98.6%), the pathologist and non-pathologists assigned equal or adjacent scoring categories (Table 4).
| | Nuclear | Membranous | Cytoplasmic |
|---|---|---|---|
| | κ (95% CI) | κ (95% CI) | κ (95% CI) |
| NP vs. P^a | | | |
| 1 vs. 4 | 0.75 (0.72–0.79) | 0.61 (0.55–0.67)^f | 0.57 (0.53–0.61) |
| 2 vs. 4 | 0.67 (0.63–0.71) | 0.69 (0.65–0.73) | 0.43 (0.38–0.48) |
| 3 vs. 4 | 0.70 (0.67–0.74) | 0.69 (0.66–0.73) | 0.56 (0.52–0.60) |
| 1+2 vs. 4^b,c | 0.80 (0.77–0.84) | 0.77 (0.70–0.83)^f | 0.57 (0.51–0.62) |
| 1+3 vs. 4^b,c | 0.81 (0.77–0.84) | 0.79 (0.73–0.85)^f | 0.65 (0.60–0.70) |
| 2+3 vs. 4^b,c | 0.75 (0.72–0.79) | 0.75 (0.72–0.79) | 0.54 (0.50–0.60) |
| Combination score^c,d vs. 4 | 0.74 (0.71–0.78) | 0.73 (0.69–0.77) | 0.57 (0.52–0.61) |
| NP vs. NP^e | | | |
| 1 vs. 2 | 0.74 (0.73–0.75) | 0.69 (0.67–0.72)^f | 0.55 (0.54–0.57) |
| 1 vs. 3 | 0.79 (0.79–0.80) | 0.66 (0.64–0.69)^f | 0.64 (0.62–0.65) |
| 2 vs. 3 | 0.80 (0.79–0.81) | 0.81 (0.80–0.82) | 0.65 (0.64–0.67) |
| 1 vs. 2 vs. 3^g | 0.78 | 0.72^f | 0.61 |
Abbreviations: NP, non-pathologist; P, pathologist. Nuclear, TP53; membranous, GLUT1; cytoplasmic, PTEN.
^a Based on a random 10% of TMA sections (range, 538–681 cores).
^b Comparison of the combined score of two non-pathologists with the pathologist's score: if the two non-pathologists independently assigned the same score to a core, this was the combined score; if they assigned different scores, the core was categorized as “no agreement.”
^c Cores for which no agreement was reached between non-pathologists (combination score = “no agreement”) were excluded from analyses.
^d The combination score is based on all three non-pathologists' scores: if at least two assessors independently assigned the same score to a core, this was the combination score; if no two assessors assigned the same score, the core was categorized as “no agreement.”
^e Based on all cores (N = 6,249).
^f Assessor 1 left the project early because of an unforeseen work relocation; 1,457 cores were evaluated.
^g A confidence interval for the weighted kappa of more than two assessors could not be calculated using Stata.
| | Nuclear | | | | Membranous^a | | | | Cytoplasmic | | | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Difference in categories^b | 0 | 1 | 2 | 3/4 | 0 | 1 | 2 | 3 | 0 | 1 | 2 | 3 |
| Interobserver, NP vs. P^c | | | | | | | | | | | | |
| 1 vs. 4 | 57.9 | 32.4 | 8.0 | 1.8 | 62.5 | 29.4 | 6.0 | 2.2 | 54.1 | 42.1 | 3.9 | 0.0 |
| 2 vs. 4 | 51.4 | 41.0 | 6.7 | 0.9 | 70.5 | 24.5 | 4.0 | 1.0 | 63.0 | 33.9 | 2.7 | 0.5 |
| 3 vs. 4 | 61.7 | 32.0 | 5.1 | 1.2 | 70.6 | 25.2 | 3.1 | 1.0 | 62.1 | 36.5 | 1.4 | 0.0 |
| Combination vs. 4 | 69.8 | 27.1 | 2.9 | 0.2 | 73.7 | 22.7 | 3.0 | 0.7 | 64.6 | 34.8 | 0.7 | 0.0 |
| Interobserver, NP vs. NP^d | | | | | | | | | | | | |
| 1 vs. 2 | 72.0 | 24.2 | 3.5 | 0.3 | 68.3 | 29.2 | 2.1 | 0.4 | 65.4 | 34.0 | 0.6 | 0.0 |
| 1 vs. 3 | 76.3 | 22.2 | 1.4 | 0.1 | 65.5 | 30.9 | 3.2 | 0.4 | 73.2 | 26.5 | 0.3 | 0.0 |
| 2 vs. 3 | 76.3 | 22.4 | 1.2 | 0.1 | 81.4 | 17.2 | 1.3 | 0.1 | 76.0 | 23.8 | 0.2 | 0.0 |
| Intraobserver^c | | | | | | | | | | | | |
| 2 vs. 2 | 82.6 | 16.3 | 1.0 | 0.0 | 82.4 | 16.6 | 1.0 | 0.0 | 78.8 | 21.2 | 0.0 | 0.0 |
| 3 vs. 3 | 84.1 | 15.3 | 0.6 | 0.0 | 74.2 | 24.8 | 0.8 | 0.3 | 80.0 | 20.0 | 0.0 | 0.0 |
Values are percentages of scores.
Note: Uninterpretable cores were excluded.
Abbreviations: NP, non-pathologist; P, pathologist. Nuclear, TP53; membranous, GLUT1; cytoplasmic, PTEN.
^a Assessor 1 left the project early because of an unforeseen work relocation.
^b Difference between the categories assigned by the two assessors: 0 = same category assigned (no discrepancy); 1 = adjacent categories assigned (e.g., ≤10% positive and 11%–50% positive); 2 = assigned categories two apart (e.g., ≤10% positive and >50% positive); 3/4 = assigned categories three or four apart (e.g., negative and >50%).
^c Based on a random 10% of TMA sections.
^d Based on all TMA sections.
Supplementary Table S6 shows agreement per scoring category, expressed as non-weighted kappa values. Among the non-pathologists, the lowest and highest scoring categories showed higher agreement (κnuclear 0.83 and 0.79; κmembranous 0.68 and 0.82; κcytoplasmic 0.61 and 0.51, respectively) than the intermediate categories (κnuclear,range = 0.35–0.56; κmembranous,range = 0.45–0.53; κcytoplasmic,range = 0.49–0.53). When the pathologist was added as an assessor, agreement was again highest in the most extreme categories for nuclear and membranous stainings (κnuclear 0.86 and 0.67; κmembranous 0.74 and 0.76, respectively). For cytoplasmic staining, agreement was highest for the lowest scoring category and decreased with increasing scoring categories (κcategory0 = 0.60; κcategory1 = 0.53; κcategory2 = 0.37; κcategory3 = 0.32).
Non-pathologist versus non-pathologist
Interobserver agreement among non-pathologists is shown in Table 3 (non-weighted kappa values in Supplementary Table S5). Overall kappa values between all three non-pathologists were similar to those comparing the combination score and the pathologist's score (κnuclear 0.78 vs. 0.74; κmembranous 0.72 vs. 0.73; κcytoplasmic 0.61 vs. 0.57, respectively). Scores for nuclear and membranous immunoreactivity showed the highest kappa values among non-pathologists, with overall weighted kappas of 0.78 (κrange = 0.74–0.80) and 0.72 (κrange = 0.66–0.81), respectively. Agreement was lowest for cytoplasmic immunoreactivity, with an overall kappa of 0.61 (κrange = 0.55–0.65). In the majority of non-pathologists' scores (range, 96.2%–99.8%), equal or adjacent scoring categories were assigned (Table 4).
Intraobserver agreement of non-pathologists
Weighted intraobserver kappa values of two non-pathologists are shown in Table 5 (non-weighted kappa values in Supplementary Table S7). Intraobserver agreement was highest for scoring nuclear and membranous immunoreactivity, showing “almost perfect” agreement (κobserver2 = 0.83; κobserver3 = 0.87) and “substantial” to “almost perfect” agreement (κobserver2 = 0.82; κobserver3 = 0.75), respectively. Scoring of cytoplasmic immunoreactivity showed “substantial” agreement (κobserver2 = 0.69; κobserver3 = 0.69). In the majority of scores (range, 98.9%–100%), equal or adjacent categories were assigned at the first and second timepoints (Table 4).
| | Assessor 2 | Assessor 3 |
|---|---|---|
| | κ (95% CI) | κ (95% CI) |
| Nuclear | 0.83 (0.80–0.86) | 0.87 (0.84–0.90) |
| Membranous | 0.82 (0.79–0.85) | 0.75 (0.72–0.78) |
| Cytoplasmic | 0.69 (0.64–0.74) | 0.69 (0.64–0.74) |
Note: Nuclear, TP53; membranous, GLUT1; cytoplasmic, PTEN. CI, confidence interval.
Discussion
TMAs are increasingly used to analyze protein expression by IHC in large-scale studies (2, 3, 5, 37). Scoring is often done by non-pathologists (17, 18); however, only a few studies have reported on the validity and reproducibility of scoring results (38, 39). To the best of our knowledge, our study is one of the first to investigate agreement of TMA-based scoring of immunoreactivity in different subcellular localizations by non-pathologists. Our study showed that interobserver agreement between an experienced histopathologist and trained non-pathologists was “moderate” to “substantial.” Agreement with the pathologist's score did not increase further when a combination score from three instead of two trained non-pathologists was used.
Interobserver agreement non-pathologists versus pathologist
Our study demonstrates that non-pathologists can generate reproducible results. These results are in line with a previous study by Jaraj and colleagues (19), which reported comparable kappa values for interobserver agreement between pathologists and non-pathologists. Although it was not their main objective, two other studies also reported comparable interobserver agreement between pathologists and non-pathologists (22, 40). However, some of these studies reported weighted kappa values (19, 22) without stating which weights were assigned to adjacent scoring categories, precluding a direct comparison of their kappa values with ours.
Considering the subjectivity of immunoreactivity scoring, several studies recommended that scoring should be done by multiple assessors to improve interobserver agreement (39, 41, 42). Our study confirmed that combining scores from multiple non-pathologists into a combination score increased interobserver agreement with the pathologist's score. Combining scores of three non-pathologists instead of two did not change interobserver agreement with the pathologist, indicating that IHC scoring by two non-pathologists seems to be sufficient to yield reliable IHC results.
Immunoreactivity scoring in different subcellular localizations
A limited number of studies have investigated scoring agreement of immunoreactivity in different subcellular localizations, with inconsistent results (21–23). We showed that scoring of nuclear and membranous immunoreactivity generally leads to higher interobserver agreement than scoring of cytoplasmic immunoreactivity, consistent with the results of Bolton and colleagues (23). In contrast, two other studies did not find a difference in intraobserver and interobserver agreement when scoring nuclear, membranous, and cytoplasmic immunoreactivity (21, 22). These discrepant results might be explained by the use of different IHC scoring methods between studies.
The IHC markers selected for the current study were chosen to cover the range of subcellular localizations (nucleus/membrane/cytoplasm) relevant for scoring. With respect to subcellular localization, results for these markers should therefore generalize to other IHC stainings.
Interobserver agreement among non-pathologists
To date, few studies have reported interobserver agreement of IHC results among non-pathologists. In the current study, we found “substantial” to “almost perfect” agreement among trained non-pathologists, in line with previously published results on TMAs and whole tissue sections (17–19).
Intraobserver agreement of non-pathologists
IHC studies often report intraobserver kappa values as a measure of reproducibility. Our study shows that non-pathologists are able to generate reproducible IHC scores after appropriate training, in line with previous studies (17–19, 40). Interestingly, intraobserver kappa values of non-pathologists in the current study were similar to those previously reported for pathologists (23, 43). Across all three markers, disagreements were generally limited to one-category discordances (e.g., ≤10% vs. 11%–50%).
Limitations
Our study has some limitations. We have no information on intraobserver and interobserver agreement of pathologists, as this was beyond the scope of this article. Furthermore, the current study used TMA cores to assess interobserver and intraobserver agreement, and interobserver agreement has been reported to be higher for TMA cores than for whole tissue sections (14). It therefore remains to be clarified whether the agreement among non-pathologists, and between non-pathologists and pathologists, is similar for full tissue sections. However, the aim of this study was specifically to investigate IHC scoring on TMAs, because non-pathologists will mainly be involved in IHC scoring in large-scale studies using TMAs. Finally, we did not directly compare scoring performance between trained and untrained non-pathologists; we therefore cannot draw direct conclusions on the necessity of training, in particular whether similar results would have been obtained without training.
Recommendations
We propose some recommendations that could improve the comparability of IHC studies. First, it is important to report which weights were used for analyses of weighted kappa values; reporting both weighted and non-weighted kappa values would be of additional value. Second, the IHC scoring experience of the assessors should be stated clearly in the methods; if scoring was done by non-pathologists, their training should be reported. Third, our results showed that disagreements were mostly limited to one-category discordances, suggesting that less refined scoring protocols may improve agreement. This is in line with previous studies (44, 45) showing that agreement improved when scoring protocols with fewer categories were used. However, we acknowledge that the appropriate number of scoring categories depends on the novelty and clinical relevance of the biomarker being studied: scoring protocols for potential new biomarkers might require more categories than those for well-established biomarkers. Finally, we suggest that IHC scoring should be performed by at least two non-pathologists so that interobserver agreement among assessors can be assessed. Ideally, these non-pathologists are trained by an expert pathologist, and a certain percentage of samples (e.g., 10%) is double-scored by the pathologist to ensure scoring quality.
Conclusion
In this large study investigating interobserver and intraobserver agreement of TMA-based immunoreactivity scores between pathologists and non-pathologists, we have shown that non-pathologists can generate reproducible IHC scoring results that are similar to those of an experienced pathologist. A combination score of at least two non-pathologists yielded optimal results. Future studies are required to validate our findings and to examine the practical implications and impact of potential misclassification, by comparing effect estimates for established stain-outcome associations when using the pathologist's score versus the non-pathologists' combination score.
Authors' Disclosures
No disclosures were reported.
Authors' Contributions
J.C.A. Jenniskens: Conceptualization, formal analysis, investigation, writing–original draft. K. Offermans: Conceptualization, formal analysis, investigation, writing–original draft. I. Samarska: Conceptualization, investigation, writing–original draft. G.E. Fazzi: Investigation, writing–review and editing. C.C.J.M. Simons: Writing–review and editing. K.M. Smits: Writing–review and editing. L.J. Schouten: Writing–review and editing. M.P. Weijenberg: Writing–review and editing. P.A. van den Brandt: Conceptualization, resources, supervision, funding acquisition, writing–original draft, project administration. H.I. Grabsch: Conceptualization, supervision, writing–original draft.
Acknowledgments
The authors would like to thank the participants of the Netherlands Cohort Study (NLCS), the Netherlands Cancer Registry, and the Dutch Pathology Registry. They are grateful to Ron Alofs and Harry van Montfort for data management and programming assistance; to Jaleesa van der Meer, Edith van den Boezem, and Peter Moerkerk for TMA construction; and the University of Leeds (Leeds, UK) for scanning of all slides.
The Rainbow-TMA consortium was financially supported by BBMRI-NL, a Research Infrastructure financed by the Dutch government (NWO 184.021.007, to P.A. van den Brandt), and Maastricht University Medical Center, University Medical Center Utrecht, and Radboud University Medical Centre, the Netherlands. The authors would like to thank all investigators from the Rainbow-TMA consortium project group [P.A. van den Brandt, A. zur Hausen, H.I. Grabsch, M. van Engeland, L.J. Schouten, J. Beckervordersandforth (Maastricht University Medical Center+, Maastricht, the Netherlands); P.H.M. Peeters, P.J. van Diest, H.B. Bueno de Mesquita (University Medical Center Utrecht, Utrecht, the Netherlands); J. van Krieken, I. Nagtegaal, B. Siebers, B. Kiemeney (Radboud University Medical Center, Nijmegen, the Netherlands); F.J. van Kemenade, C. Steegers, D. Boomsma, G.A. Meijer (Amsterdam University Medical Center, locaties VUmc, the Netherlands); F.J. van Kemenade, B. Stricker (Erasmus University Medical Center, Rotterdam, the Netherlands); L. Overbeek, A. Gijsbers (PALGA, the Nationwide Histopathology and Cytopathology Data Network and Archive, Houten, the Netherlands)] and collaborating pathologists [among others: A. de Bruïne (VieCuri Medical Center, Venlo); J.C. Beckervordersandforth (Maastricht University Medical Center+, Maastricht); J. van Krieken, I. Nagtegaal (Radboud University Medical Center, Nijmegen); W. Timens (University Medical Center Groningen, Groningen); F.J. van Kemenade (Erasmus University Medical Center, Rotterdam); M.C.H. Hogenes (Laboratory for Pathology OostNederland, Hengelo); P.J. van Diest (University Medical Center Utrecht, Utrecht); R.E. Kibbelaar (Pathology Friesland, Leeuwarden); A.F. Hamel (Stichting Samenwerkende Ziekenhuizen Oost-Groningen, Winschoten); A.T.M.G. Tiebosch (Martini Hospital, Groningen); C. Meijers (Reinier de Graaf Gasthuis/S.S.D.Z., Delft); R. Natté (Haga Hospital Leyenburg, The Hague); G.A. Meijer (Amsterdam University Medical Center, locatie VUmc); J.J.T.H. Roelofs (Amsterdam University Medical Center, locatie AMC); R.F. Hoedemaeker (Pathology Laboratory Pathan, Rotterdam); S. Sastrowijoto (Orbis Medical Center, Sittard); M. Nap (Atrium Medical Center, Heerlen); H.T. Shirango (Deventer Hospital, Deventer); H. Doornewaard (Gelre Hospital, Apeldoorn); J.E. Boers (Isala Hospital, Zwolle); J.C. van der Linden (Jeroen Bosch Hospital, Den Bosch); G. Burger (Symbiant Pathology Center, Alkmaar); R.W. Rouse (Meander Medical Center, Amersfoort); P.C. de Bruin (St. Antonius Hospital, Nieuwegein); P. Drillenburg (Onze Lieve Vrouwe Gasthuis, Amsterdam); C. van Krimpen (Kennemer Gasthuis, Haarlem); J.F. Graadt van Roggen (Diaconessenhuis, Leiden); S.A.J. Loyson (Bronovo Hospital, The Hague); J.D. Rupa (Laurentius Hospital, Roermond); H. Kliffen (Maasstad Hospital, Rotterdam); H.M. Hazelbag (Medical Center Haaglanden, The Hague); K. Schelfout (Stichting Pathologisch en Cytologisch Laboratorium West-Brabant, Bergen op Zoom); J. Stavast (Laboratorium Klinische Pathologie Centraal Brabant, Tilburg); I. van Lijnschoten (PAMM Laboratory for Pathology and Medical Microbiology, Eindhoven); K. Duthoi (Amphia Hospital, Breda)].
This project was funded by The Dutch Cancer Society (KWF 11044, to P.A. van den Brandt).
The costs of publication of this article were defrayed in part by the payment of page charges. This article must therefore be hereby marked advertisement in accordance with 18 U.S.C. Section 1734 solely to indicate this fact.