D.R. Hardy and M.F. Schwartz, Customized Information Extraction as a Basis for Resource Discovery, ACM Transactions on Computer Systems, 14(2), 1996, pp. 171-199
Number of citations
39
Subject Categories
Computer Sciences; Computer Science Theory & Methods
Indexing file contents is a powerful means of helping users locate
documents, software, and other types of data among large repositories.
In environments that contain many different types of data, content
indexing requires type-specific processing to extract information
effectively. We present a model for type-specific, user-customizable
information extraction, and a system implementation called Essence.
This software structure allows users to associate specialized
extraction methods with ordinary files, providing the illusion of an
object-oriented file system that encapsulates indexing methods within
files. By exploiting the semantics of common file types, Essence
generates compact yet representative file summaries that can be used to
improve both browsing and indexing in resource discovery systems.
Essence can extract information from most of the types of files found
in common file systems, including files with nested structure (such as
compressed "tar" files). Essence interoperates with a number of
different search/index systems (such as WAIS and Glimpse), as part of
the Harvest system.
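
The idea of type-specific, user-customizable extraction with support for nested files can be illustrated with a minimal Python sketch. This is not Essence's actual code or interface: the names `register`, `identify`, `summarize`, and the three extractors are hypothetical, file types are guessed naively from the file name, and the summaries are deliberately simplistic.

```python
import io
import re
import tarfile
from typing import Callable, Dict, List

# Hypothetical registry mapping a file type to its extraction method.
# All names here are illustrative assumptions, not Essence's API.
Extractor = Callable[[str, bytes], List[str]]
EXTRACTORS: Dict[str, Extractor] = {}


def register(file_type: str):
    """Decorator that associates an extraction method with a file type."""
    def wrap(fn: Extractor) -> Extractor:
        EXTRACTORS[file_type] = fn
        return fn
    return wrap


def identify(name: str) -> str:
    """Guess the file type from the name (a naive stand-in for real type recognition)."""
    if name.endswith((".tar", ".tar.gz", ".tgz")):
        return "tar"
    if name.endswith(".py"):
        return "python"
    return "text"


@register("text")
def summarize_text(name: str, data: bytes) -> List[str]:
    # Summary for plain text: the first non-blank line.
    for line in data.decode("utf-8", errors="replace").splitlines():
        if line.strip():
            return [f"{name}: {line.strip()}"]
    return [f"{name}: (empty)"]


@register("python")
def summarize_python(name: str, data: bytes) -> List[str]:
    # Summary for Python source: names of top-level functions.
    text = data.decode("utf-8", errors="replace")
    defs = re.findall(r"^def\s+(\w+)", text, flags=re.MULTILINE)
    return [f"{name}: defines {', '.join(defs) or 'nothing'}"]


@register("tar")
def summarize_tar(name: str, data: bytes) -> List[str]:
    # Nested structure: unpack the archive (compressed or not) in memory
    # and summarize each regular member with its own type-specific method.
    out: List[str] = []
    with tarfile.open(fileobj=io.BytesIO(data), mode="r:*") as archive:
        for member in archive.getmembers():
            if member.isfile():
                inner = archive.extractfile(member).read()
                out.extend(summarize(f"{name}/{member.name}", inner))
    return out


def summarize(name: str, data: bytes) -> List[str]:
    """Dispatch to the extraction method registered for the file's type."""
    return EXTRACTORS[identify(name)](name, data)
```

In this sketch a user would customize extraction by registering a new function under a new type name and extending `identify` to recognize that type. The summary lines returned by `summarize` stand in for the compact file summaries that an indexer would consume.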