Objective. The objective of this paper is to introduce, explain, and extend methods for comparing the performance of classification algorithms using error tallies obtained on properly sized, populated, and labeled data sets.

Methods. Two distinct contexts of classification are defined, involving "objects-by-inspection" and "objects-by-segmentation." In the former context, the total number of objects to be classified is unambiguously and self-evidently defined. In the latter, that total is troublesomely ambiguous. All five performance measures considered here are based on confusion matrices, tables of counts revealing the extent of an algorithm's "confusion" regarding the true classifications. A proper measure of classification-algorithm performance must meet four requirements and should obey six additional constraints.

Results. Four traditional measures of performance are critiqued in terms of the requirements and constraints. Each measure meets the requirements but fails to obey at least one of the constraints. A nontraditional measure of algorithm performance, the normalized mutual information (NMI), is therefore introduced. Based on the NMI, methods for comparing algorithm performance using confusion matrices are devised.

Conclusions. The five performance measures lead to similar inferences when comparing three QRS-detection algorithms on a large data set. The modified NMI is preferred, however, because it obeys each of the constraints and is the most conservative measure of performance.
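The abstract does not state how the NMI is computed, so the following is only a minimal sketch under stated assumptions, not the paper's (or its modified) NMI. It treats the confusion-matrix counts as joint relative frequencies of true class and assigned class, computes their mutual information in bits, and normalizes by the entropy of the true-class distribution. That normalization is an assumption; other normalizations (for example by joint entropy or by the geometric mean of the two marginal entropies) are also in use. The function names are hypothetical.

```python
import math


def entropy(counts):
    """Shannon entropy (bits) of a distribution given as raw counts."""
    n = sum(counts)
    return -sum((c / n) * math.log2(c / n) for c in counts if c > 0)


def mutual_information(confusion):
    """Mutual information (bits) between true and assigned classes.

    confusion[i][j] is the number of objects whose true class is i
    and which the algorithm assigned to class j.
    """
    n = sum(sum(row) for row in confusion)
    if n == 0:
        raise ValueError("confusion matrix is empty")
    row_tot = [sum(row) for row in confusion]        # true-class totals
    col_tot = [sum(col) for col in zip(*confusion)]  # assigned-class totals
    mi = 0.0
    for i, row in enumerate(confusion):
        for j, c in enumerate(row):
            if c > 0:  # cells with zero count contribute 0 (0 * log 0 := 0)
                # p_ij * log2(p_ij / (p_i * p_j)), written in counts
                mi += (c / n) * math.log2(c * n / (row_tot[i] * col_tot[j]))
    return mi


def normalized_mutual_information(confusion):
    """NMI, ASSUMED here to be I(true; assigned) / H(true)."""
    h_true = entropy([sum(row) for row in confusion])
    if h_true == 0:
        raise ValueError("only one true class present; NMI is undefined")
    return mutual_information(confusion) / h_true
```

With this normalization a perfect classifier scores 1 and an algorithm whose assignments are statistically independent of the truth scores 0. For instance, two algorithms run on the same labeled data could be ranked by calling `normalized_mutual_information` on each one's confusion matrix (the counts would come from the reader's own data; none are given here). The sketch presumes every cell of the matrix is well defined, as in the objects-by-inspection context; in the objects-by-segmentation context the total count, and hence the matrix itself, is ambiguous.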