Goal: To assess the reliability of a reference standard for an information
extraction task.
Setting: Twenty-four physician raters from two sites and two specialties judged whether clinical conditions were present based on reading chest radiograph reports.
Methods: Variance components, generalizability (reliability) coefficients, and the number of expert raters needed to generate a reliable reference standard were estimated.
Results: Per-rater reliability averaged across conditions was 0.80 (95% CI, 0.79-0.81). Reliability for the nine individual conditions varied from 0.67 to 0.97, with central line presence and pneumothorax the most reliable, and pleural effusion (excluding CHF) and pneumonia the least reliable. One to two raters were needed to achieve a reliability of 0.70, and six raters, on average, were required to achieve a reliability of 0.95. This was far more reliable than a previously published per-rater reliability of 0.19 for a more complex task. Differences between sites were attributable to changes to the condition definitions.
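As an illustration of how a per-rater reliability translates into the number of raters required, the sketch below applies a standard Spearman-Brown-type decision-study projection. This is an assumption for illustration only; the paper's own figures came from its variance-component analysis and are averaged over the nine conditions, so the projections from any single per-rater value need not match them exactly.

```python
# Illustrative projection (not the authors' code): reliability of a mean over
# n raters, and the smallest n reaching a target reliability.
import math

def reliability_of_mean(per_rater_reliability: float, n_raters: int) -> float:
    """Projected reliability of the averaged judgment of n_raters raters."""
    r = per_rater_reliability
    return (n_raters * r) / (1.0 + (n_raters - 1) * r)

def raters_needed(per_rater_reliability: float, target: float) -> int:
    """Smallest number of raters whose averaged judgment reaches `target`."""
    r = per_rater_reliability
    return math.ceil((target * (1.0 - r)) / (r * (1.0 - target)))

# Values taken from the Results: 0.80 averaged across conditions,
# 0.67 for the least reliable condition.
for r1 in (0.80, 0.67):
    print(f"per-rater reliability {r1:.2f}: "
          f"{raters_needed(r1, 0.70)} rater(s) for 0.70, "
          f"{raters_needed(r1, 0.95)} rater(s) for 0.95")
```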
Conclusion: In these evaluations, physician raters were able to judge very reliably the presence of clinical conditions based on text reports. Once the reliability of a specific rater is confirmed, it would be possible for that rater to create a reference standard reliable enough to assess aggregate measures on a system. Six raters would be needed to create a reference standard sufficient to assess a system on a case-by-case basis. These results should help evaluators design future information extraction studies for natural language processors and other knowledge-based systems.