Evaluating natural language processing (NLP) systems in the clinical domain
is a difficult task that is important for advancing the field. A number of NLP systems that extract information from free-text clinical reports have been reported, but few of these systems have been evaluated. Those that were evaluated reported good performance measures, but the results were often weakened by ineffective evaluation methods. In this paper we describe a set of criteria aimed at improving the quality of NLP evaluation studies. We present an overview of NLP evaluations in the clinical domain and also discuss the Message Understanding Conferences (MUC) [1-4]. Although these conferences constitute a series of NLP evaluation studies performed outside the clinical domain, some of the results are relevant within medicine. In addition, we discuss a number of factors that contribute to the complexity inherent in the task of evaluating natural language systems.