Learning models to classify rarely occurring target classes is an important
problem with applications in network intrusion detection, fraud detection,
or deviation detection in general. In this paper, we analyze our previously
proposed two-phase rule induction method in the context of learning complete
and precise signatures of rare classes. The key feature of our method is
that it separately conquers the objectives of achieving high recall and high
precision for the given target class. The first phase of the method aims
for high recall by inducing rules with high support and a reasonable level
of accuracy. The second phase then tries to improve the precision by learning
rules that remove false positives from the collection of records covered
by the first-phase rules. Existing sequential covering techniques try to
achieve high precision for each individual disjunct learned. In this paper,
we claim that such an approach is inadequate for rare classes because of two
problems: splintered false positives and error-prone small disjuncts. Motivated
by the strengths of our two-phase design, we construct various synthetic
data models to identify and analyze the situations in which two state-of-the-art
methods, RIPPER and C4.5rules, either fail to learn a model or learn
a very poor model. In all these situations, our two-phase approach learns
a model with significantly better recall and precision levels. We also present
a comparison of the three methods on a challenging real-life network intrusion
detection dataset. Our method is either significantly better than or comparable
to the best competitor in balancing recall
and precision.
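
The two-phase idea described above can be illustrated with a minimal Python sketch. It is not the authors' induction algorithm: the rules here are hand-written rather than learned, and the records, field names (failed_logins, duration), thresholds, and helper names (covered, predict, recall_precision) are all hypothetical. The sketch only shows how first-phase rules aim for recall and how second-phase rules then remove false positives among the records those rules cover.

```python
from typing import Callable, Dict, List, Tuple

Record = Dict[str, float]
Rule = Callable[[Record], bool]


def covered(rules: List[Rule], record: Record) -> bool:
    """Return True if any rule in the list fires on the record."""
    return any(rule(record) for rule in rules)


def predict(phase1: List[Rule], phase2: List[Rule], record: Record) -> bool:
    """Predict the target class when a first-phase rule covers the record
    and no second-phase rule flags it as a false positive."""
    return covered(phase1, record) and not covered(phase2, record)


def recall_precision(preds: List[bool], labels: List[bool]) -> Tuple[float, float]:
    """Compute (recall, precision) for the target class."""
    tp = sum(p and y for p, y in zip(preds, labels))
    fp = sum(p and not y for p, y in zip(preds, labels))
    fn = sum((not p) and y for p, y in zip(preds, labels))
    recall = tp / (tp + fn) if tp + fn else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    return recall, precision


# Hypothetical toy data: True marks the target class.
records: List[Record] = [
    {"failed_logins": 5, "duration": 2},
    {"failed_logins": 4, "duration": 1},
    {"failed_logins": 6, "duration": 30},
    {"failed_logins": 3, "duration": 40},
    {"failed_logins": 0, "duration": 5},
    {"failed_logins": 1, "duration": 3},
    {"failed_logins": 0, "duration": 50},
    {"failed_logins": 7, "duration": 3},
]
labels = [True, True, False, False, False, False, False, True]

# Phase 1 (assumed rule): high support, tolerates some false positives.
phase1: List[Rule] = [lambda r: r["failed_logins"] >= 3]
# Phase 2 (assumed rule): targets false positives among phase-1 covered records.
phase2: List[Rule] = [lambda r: r["duration"] > 20]

after_phase1 = recall_precision([covered(phase1, r) for r in records], labels)
after_phase2 = recall_precision([predict(phase1, phase2, r) for r in records], labels)

print("phase 1 only (recall, precision):", after_phase1)
print("both phases  (recall, precision):", after_phase2)
```

On this toy data the first-phase rule alone reaches recall 1.0 with precision 0.6, and adding the second-phase rule raises precision to 1.0 without losing recall. Real data would rarely separate this cleanly; the example is chosen only to make the division of labor between the two phases visible.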