Integrating a large number of Web information sources may significantly
increase the utility of the World-Wide Web. A promising approach to this
integration is a Web information mediator that provides seamless,
transparent access for clients. Information mediators need wrappers
to access a Web source as a structured database, but building wrappers by
hand is impractical. Previous work on wrapper induction is too restrictive
to handle a large number of Web pages that contain tuples with missing
attributes, multiple values, variant attribute permutations, exceptions, and
typos. This paper presents SoftMealy, a novel wrapper representation
formalism based on a finite-state transducer (FST) and contextual rules.
This approach can wrap a wide range of semistructured Web pages because an
FST can encode each distinct attribute permutation as a path. A
SoftMealy wrapper can be induced from a handful of labeled examples using
our generalization algorithm. We have implemented this approach in a
prototype system and tested it on real Web pages. The performance statistics
show that the sizes of the induced wrappers as well as the required training
effort are linear with regard to the structural variance of the test pages.
Our experiment also shows that the induced wrappers can generalize to unseen
pages. © 1998 Elsevier Science Ltd. All rights reserved.
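
To make the idea of encoding each attribute permutation as a path concrete, here is a minimal Python sketch of a toy transducer. It is an illustration under stated assumptions, not the SoftMealy implementation: the page layout (`<li> ... </li>` items with comma separators), the state names, the `EDGES` table, and the `extract` function are all hypothetical, and the contextual rules are written by hand here rather than induced from labeled examples.

```python
import re

SEP = r"^(,|</li>)$"  # separator tokens in the assumed toy page layout

# Edges of the toy transducer: (from_state, to_state, left_rule, right_rule).
# A transition fires at the boundary between two adjacent tokens when the
# left token matches left_rule and the right token matches right_rule.
# "skip" is a dummy state that absorbs separators between attributes.
EDGES = [
    ("start", "name",  r"^<li>$", r"^\w"),
    ("name",  "skip",  r"",       SEP),
    ("title", "skip",  r"",       SEP),
    ("email", "skip",  r"",       SEP),
    ("skip",  "email", r"^,$",    r"@"),
    ("skip",  "title", r"^,$",    r"^[A-Za-z]+$"),
]

def extract(text):
    """Walk the tokens, follow the first matching edge, collect attribute text."""
    tokens = text.split()
    state, fields = "start", {}
    for i, right in enumerate(tokens):
        left = tokens[i - 1] if i > 0 else ""
        for src, dst, left_rule, right_rule in EDGES:
            if src == state and re.search(left_rule, left) and re.search(right_rule, right):
                state = dst
                break
        # Tokens seen while in an attribute state belong to that attribute.
        if state not in ("start", "skip"):
            fields.setdefault(state, []).append(right)
    return {attr: " ".join(words) for attr, words in fields.items()}
```

In this sketch, an item listing name, title, email follows the path start, name, skip, title, skip, email, while an item that omits the title or swaps title and email simply follows a different path through the same edge table, so no separate wrapper is needed per layout.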