We describe the results of extensive experiments using optimized
rule-based induction methods on large document collections. The goal of
these methods is to automatically discover classification patterns that
can be used for general document categorization or personalized
filtering of free text. Previous reports indicate that human-engineered
rule-based systems, requiring many man-years of development effort, have
been successfully built to "read" documents and assign topics to
them. We show that machine-generated decision rules appear comparable to
human performance, while using the identical rule-based representation.
In comparison with other machine-learning techniques, results on a
key benchmark from the Reuters collection show a large gain in
performance, from a previously reported 67% recall/precision breakeven point
to 80.5%. In the context of a very high-dimensional feature space,
several methodological alternatives are examined, including universal
versus local dictionaries, and binary versus frequency-related features.
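
As a rough illustration of the feature alternatives named above, the
following minimal sketch builds a universal dictionary (from all
documents) and a local dictionary (from one topic's documents only),
then encodes a document with binary or frequency features. The toy
corpus, the function names build_dictionary and featurize, and the
choice of raw term frequency as the selection criterion are all
assumptions made for illustration; they are not taken from the
experiments described here.

```python
from collections import Counter


def build_dictionary(docs, top_k):
    """Return the top_k most frequent tokens across docs (assumed criterion)."""
    counts = Counter(tok for doc in docs for tok in doc.lower().split())
    # Sort by descending count, then alphabetically, so ties are deterministic.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [tok for tok, _ in ranked[:top_k]]


def featurize(doc, dictionary, binary=True):
    """Encode doc over dictionary as presence flags or raw term counts."""
    counts = Counter(doc.lower().split())
    if binary:
        return [1 if counts[tok] > 0 else 0 for tok in dictionary]
    return [counts[tok] for tok in dictionary]


# Hypothetical toy corpus of (text, topic) pairs.
corpus = [
    ("wheat prices rise as wheat exports grow", "grain"),
    ("corn and wheat harvest forecast", "grain"),
    ("bank raises interest rates", "finance"),
    ("interest rates and bank profits", "finance"),
]

# Universal dictionary: built once from every document.
universal = build_dictionary([text for text, _ in corpus], top_k=5)

# Local dictionary: built only from documents of a single topic.
local_grain = build_dictionary(
    [text for text, label in corpus if label == "grain"], top_k=3
)

# The same document under binary and frequency encodings.
binary_vec = featurize(corpus[0][0], local_grain, binary=True)
freq_vec = featurize(corpus[0][0], local_grain, binary=False)
```

In this sketch the two encodings differ only where a dictionary term
occurs more than once in the document, and the local dictionary is
smaller and topic-specific by construction.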