A reliability study for evaluating information extraction from radiology reports

Citation
G. Hripcsak et al., A reliability study for evaluating information extraction from radiology reports, J AM MED IN, 6(2), 1999, pp. 143-150
Citations number
23
Subject Categories
Library & Information Science","General & Internal Medicine
Journal title
JOURNAL OF THE AMERICAN MEDICAL INFORMATICS ASSOCIATION
ISSN journal
1067-5027
Volume
6
Issue
2
Year of publication
1999
Pages
143 - 150
Database
ISI
SICI code
1067-5027(199903/04)6:2<143:ARSFEI>2.0.ZU;2-C
Abstract
Goal: To assess the reliability of a reference standard for an information extraction task.

Setting: Twenty-four physician raters from two sites and two specialties judged whether clinical conditions were present based on reading chest radiograph reports.

Methods: Variance components, generalizability (reliability) coefficients, and the number of expert raters needed to generate a reliable reference standard were estimated.

Results: Per-rater reliability averaged across conditions was 0.80 (95% CI, 0.79-0.81). Reliability for the nine individual conditions varied from 0.67 to 0.97, with central line presence and pneumothorax the most reliable, and pleural effusion (excluding CHF) and pneumonia the least reliable. One to two raters were needed to achieve a reliability of 0.70, and six raters, on average, were required to achieve a reliability of 0.95. This was far more reliable than a previously published per-rater reliability of 0.19 for a more complex task. Differences between sites were attributable to changes to the condition definitions.

Conclusion: In these evaluations, physician raters were able to judge very reliably the presence of clinical conditions based on text reports. Once the reliability of a specific rater is confirmed, it would be possible for that rater to create a reference standard reliable enough to assess aggregate measures on a system. Six raters would be needed to create a reference standard sufficient to assess a system on a case-by-case basis. These results should help evaluators design future information extraction studies for natural language processors and other knowledge-based systems.
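The relationship the abstract reports between per-rater reliability and the number of raters needed can be illustrated with the standard Spearman-Brown prophecy formula (the single-facet case of the generalizability analysis the authors describe). The sketch below is an assumption-labeled illustration, not the paper's actual variance-components computation; the function name is hypothetical.

```python
import math

def raters_needed(per_rater_reliability: float, target_reliability: float) -> int:
    """Smallest number of raters whose averaged judgments reach the target
    reliability, using the Spearman-Brown prophecy formula:
    k = target * (1 - rho) / (rho * (1 - target))."""
    rho, target = per_rater_reliability, target_reliability
    k = (target * (1 - rho)) / (rho * (1 - target))
    return math.ceil(k)

# Using the study's average per-rater reliability of 0.80:
print(raters_needed(0.80, 0.70))  # -> 1 (a single rater reaches 0.70)
print(raters_needed(0.80, 0.95))  # -> 5 (about five raters for 0.95)
```

With the averaged per-rater reliability of 0.80, this gives one rater for a target of 0.70 and about five for 0.95, close to the abstract's figures; the reported "six raters, on average" reflects averaging over individual conditions whose reliabilities ranged from 0.67 to 0.97.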