We consider the problem of learning to perform information extraction in
domains where linguistic processing is problematic, such as Usenet posts,
email, and finger plan files. In place of syntactic and semantic
information, other sources of information can be used, such as term
frequency, typography, formatting, and mark-up. We describe four learning
approaches to this problem, each drawn from a different paradigm: a rote
learner, a term-space learner based on Naive Bayes, an approach using
grammatical induction, and a relational rule learner. Experiments on 14
information extraction problems defined over four diverse document
collections demonstrate the effectiveness of these approaches. Finally, we
describe a multistrategy approach which combines these learners and yields
performance competitive with or better than the best of them. This
technique is modular and flexible, and could find application in other
machine learning problems.