In many real-world learning tasks it is expensive to acquire a sufficient number of labeled examples for training. This paper investigates methods for reducing annotation cost by sample selection. In this approach, the learning program examines many unlabeled examples during training and selects for labeling only those that are most informative at each stage. This avoids redundantly labeling examples that contribute little new information.
Our work builds on previous research on Query By Committee, extending the committee-based paradigm to the context of probabilistic classification. We describe a family of empirical methods for committee-based sample selection in probabilistic classification models, which evaluate the informativeness of an example by measuring the degree of disagreement between several model variants. These variants (the committee) are drawn randomly from a probability distribution conditioned on the training set labeled so far.
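The selection criterion described above can be sketched in a few lines. The following is an illustrative toy version, not the paper's implementation: it assumes a simple multinomial model in which `class_counts` are the labeled class counts relevant to one example, committee members are drawn from a Dirichlet posterior over those counts, and an example is selected when the members' predicted classes differ. The function names are hypothetical.

```python
import random

def draw_member(class_counts, alpha=1.0):
    """Sample class probabilities from a Dirichlet posterior over the
    labeled counts seen so far (via independent gamma draws)."""
    draws = [random.gammavariate(c + alpha, 1.0) for c in class_counts]
    total = sum(draws)
    return [d / total for d in draws]

def is_informative(class_counts, k=2):
    """Committee-based selection: draw k random model variants and select
    the example for labeling iff the variants disagree on its class."""
    votes = set()
    for _ in range(k):
        probs = draw_member(class_counts)
        votes.add(max(range(len(probs)), key=probs.__getitem__))
    return len(votes) > 1
```

With evenly split counts the sampled members frequently disagree, so the example is often selected; with one-sided counts they almost never do. Larger committees, or a graded disagreement measure such as vote entropy, generalize this binary agree/disagree test.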
The method was applied to the real-world natural language processing task of stochastic part-of-speech tagging. We find that all variants of the method achieve a significant reduction in annotation cost, although they differ in computational efficiency. In particular, the simplest variant, a two-member committee with no parameters to tune, gives excellent results. We also show that sample selection yields a significant reduction in the size of the model used by the tagger.