CUSTOMIZED INFORMATION EXTRACTION AS A BASIS FOR RESOURCE DISCOVERY

Citation
D.R. Hardy and M.F. Schwartz, CUSTOMIZED INFORMATION EXTRACTION AS A BASIS FOR RESOURCE DISCOVERY, ACM Transactions on Computer Systems, 14(2), 1996, pp. 171-199
Number of citations
39
Subject Categories
Computer Sciences; Computer Science Theory & Methods
Journal ISSN
0734-2071
Volume
14
Issue
2
Year of publication
1996
Pages
171 - 199
Database
ISI
SICI code
0734-2071(1996)14:2<171:CIEAAB>2.0.ZU;2-K
Abstract
Indexing file contents is a powerful means of helping users locate documents, software, and other types of data among large repositories. In environments that contain many different types of data, content indexing requires type-specific processing to extract information effectively. We present a model for type-specific, user-customizable information extraction, and a system implementation called Essence. This software structure allows users to associate specialized extraction methods with ordinary files, providing the illusion of an object-oriented file system that encapsulates indexing methods within files. By exploiting the semantics of common file types, Essence generates compact yet representative file summaries that can be used to improve both browsing and indexing in resource discovery systems. Essence can extract information from most of the types of files found in common file systems, including files with nested structure (such as compressed "tar" files). Essence interoperates with a number of different search/index systems (such as WAIS and Glimpse), as part of the Harvest system.
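
The model described in the abstract (extraction methods associated with file types, applied recursively to nested files) can be illustrated with a minimal Python sketch. This is not Essence's actual interface: the names EXTRACTORS, extractor, classify, summarize, and make_tar are hypothetical, file types are guessed from the file name alone, and a "summary" is reduced to a short list of lines.

```python
import io
import tarfile

# Hypothetical registry: file-type name -> extraction function.
EXTRACTORS = {}


def extractor(file_type):
    """Register a function as the extraction method for one file type."""
    def register(func):
        EXTRACTORS[file_type] = func
        return func
    return register


def classify(name):
    """Guess a file type from the name alone (a deliberate simplification)."""
    if name.endswith((".tar", ".tar.gz", ".tgz")):
        return "tar"
    if name.endswith(".py"):
        return "python"
    return "text"


def summarize(name, data):
    """Return a list of (name, summary_lines) pairs for one file."""
    return EXTRACTORS[classify(name)](name, data)


@extractor("text")
def extract_text(name, data):
    # Keep only the first line as a compact stand-in for the contents.
    lines = data.decode("utf-8", errors="replace").splitlines()
    return [(name, lines[:1])]


@extractor("python")
def extract_python(name, data):
    # Exploit type semantics: definition lines represent source code well.
    lines = data.decode("utf-8", errors="replace").splitlines()
    keep = [ln.strip() for ln in lines
            if ln.lstrip().startswith(("def ", "class "))]
    return [(name, keep)]


@extractor("tar")
def extract_tar(name, data):
    # Nested structure: unpack and summarize each member recursively.
    summaries = []
    with tarfile.open(fileobj=io.BytesIO(data)) as archive:
        for member in archive.getmembers():
            if member.isfile():
                inner = archive.extractfile(member).read()
                summaries.extend(summarize(f"{name}/{member.name}", inner))
    return summaries


def make_tar(files):
    """Build a gzip-compressed tar archive in memory (demo helper only)."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w:gz") as archive:
        for member_name, content in files.items():
            info = tarfile.TarInfo(member_name)
            info.size = len(content)
            archive.addfile(info, io.BytesIO(content))
    return buf.getvalue()
```

Calling summarize on an archive built with make_tar returns one summary per member, each produced by the extractor registered for that member's type. Registering a new function under a new type name is the only step needed to support another file type in this sketch.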