The World Wide Web is a vast source of information accessible to computers, but understandable only to humans. The goal of the research described here is to automatically create a computer-understandable knowledge base whose content mirrors that of the World Wide Web. Such a knowledge base would enable much more effective retrieval of Web information and promote new uses of the Web to support knowledge-based inference and problem solving. Our approach is to develop a trainable information extraction system that takes two inputs. The first is an ontology that defines the classes (e.g., company, person, employee, product) and relations (e.g., employed by, produced by) of interest when creating the knowledge base. The second is a set of training data consisting of labeled regions of hypertext that represent instances of these classes and relations. Given these inputs, the system learns to extract information from other pages and hyperlinks on the Web. This article describes our general approach, several machine learning algorithms for this task, and promising initial results with a prototype system that has created a knowledge base describing university people, courses, and research projects. © 2000 Elsevier Science B.V. All rights reserved.
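
To make the two inputs concrete, the following Python sketch models them as plain data structures. This is purely illustrative and not the authors' implementation; all type, field, and relation names here are assumptions chosen to match the examples given in the abstract.

from dataclasses import dataclass

@dataclass
class Ontology:
    """Input 1: the classes and relations the knowledge base should contain."""
    classes: set                 # class names, e.g. "company", "person"
    relations: dict              # relation name -> (domain class, range class)

@dataclass
class LabeledRegion:
    """Input 2: a region of hypertext labeled as an instance of a class."""
    url: str                     # page the region was taken from
    text: str                    # the labeled hypertext fragment
    label: str                   # one of the ontology's classes

# The ontology names the targets of extraction (hypothetical instance).
ontology = Ontology(
    classes={"company", "person", "employee", "product"},
    relations={
        "employed_by": ("employee", "company"),
        "produced_by": ("product", "company"),
    },
)

# Training data: labeled hypertext regions (hypothetical examples).
training_data = [
    LabeledRegion("http://example.com/jdoe", "Jane Doe, staff engineer", "employee"),
    LabeledRegion("http://example.com/acme", "Acme Corporation home page", "company"),
]

Given inputs of this shape, a learner would fit classifiers over page and hyperlink features and then apply them to unlabeled Web pages to populate the knowledge base with new class instances and relation tuples.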