Committee-based sample selection for probabilistic classifiers

Citation
S. Argamon-Engelson and I. Dagan, Committee-based sample selection for probabilistic classifiers, J ARTIF I R, 11, 1999, pp. 335-360
Number of citations
34
Subject categories
AI Robotics and Automatic Control
Journal title
JOURNAL OF ARTIFICIAL INTELLIGENCE RESEARCH
Journal ISSN
1076-9757
Volume
11
Year of publication
1999
Pages
335 - 360
Database
ISI
SICI code
1076-9757(1999)11:<335:CSSFPC>2.0.ZU;2-D
Abstract
In many real-world learning tasks it is expensive to acquire a sufficient number of labeled examples for training. This paper investigates methods for reducing annotation cost by sample selection. In this approach, during training the learning program examines many unlabeled examples and selects for labeling only those that are most informative at each stage. This avoids redundantly labeling examples that contribute little new information. Our work follows on previous research on Query By Committee, and extends the committee-based paradigm to the context of probabilistic classification. We describe a family of empirical methods for committee-based sample selection in probabilistic classification models, which evaluate the informativeness of an example by measuring the degree of disagreement between several model variants. These variants (the committee) are drawn randomly from a probability distribution conditioned by the training set labeled so far. The method was applied to the real-world natural language processing task of stochastic part-of-speech tagging. We find that all variants of the method achieve a significant reduction in annotation cost, although their computational efficiency differs. In particular, the simplest variant, a two-member committee with no parameters to tune, gives excellent results. We also show that sample selection yields a significant reduction in the size of the model used by the tagger.
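The selection step described in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: it assumes a toy model in which each word has an independent multinomial distribution over tags with a Dirichlet posterior, and it uses a normalized vote entropy as one possible disagreement measure. The names `LABELS`, `member_vote`, `vote_entropy`, and `should_label`, as well as the `prior` and `threshold` parameters, are hypothetical.

```python
import math
import random
from collections import Counter

LABELS = ("N", "V")  # hypothetical two-tag label set, for illustration only


def member_vote(label_counts, word, rng, prior=1.0):
    """Draw one committee member's parameters for `word` and return its vote.

    Assumption: P(label | word) has a Dirichlet(counts + prior) posterior.
    Independent gamma draws are proportional to a Dirichlet sample, so the
    largest draw identifies the most probable label under the sampled model.
    """
    counts = label_counts.get(word, Counter())
    draws = [rng.gammavariate(counts[label] + prior, 1.0) for label in LABELS]
    return LABELS[draws.index(max(draws))]


def vote_entropy(votes):
    """Normalized entropy of the committee's votes (0 = agree, 1 = evenly split).

    Requires at least two votes.
    """
    k = len(votes)
    tally = Counter(votes)
    entropy = -sum((c / k) * math.log(c / k) for c in tally.values())
    return entropy / math.log(min(k, len(LABELS)))


def should_label(label_counts, word, rng, k=2, threshold=0.5):
    """Ask for a label only when k randomly drawn members disagree enough."""
    votes = [member_vote(label_counts, word, rng) for _ in range(k)]
    return vote_entropy(votes) > threshold


if __name__ == "__main__":
    rng = random.Random(0)
    # Hypothetical label counts collected from the examples labeled so far.
    labeled = {"run": Counter(N=3, V=2), "table": Counter(N=500)}
    for candidate in ("run", "table", "blorf"):
        print(candidate, should_label(labeled, candidate, rng))
```

With `k=2` the vote entropy is either 0 or 1, so the rule reduces to "label the example when the two members disagree", which needs no tuning beyond the prior. Words with many consistent labels produce nearly identical members and are rarely selected, while rare or ambiguous words are selected more often.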