A PARALLEL COMPUTING APPROACH TO CREATING ENGINEERING CONCEPT SPACES FOR SEMANTIC RETRIEVAL - THE ILLINOIS DIGITAL LIBRARY INITIATIVE PROJECT

Citation
Hc. Chen et al., A PARALLEL COMPUTING APPROACH TO CREATING ENGINEERING CONCEPT SPACES FOR SEMANTIC RETRIEVAL - THE ILLINOIS DIGITAL LIBRARY INITIATIVE PROJECT, IEEE Transactions on Pattern Analysis and Machine Intelligence, 18(8), 1996, pp. 771-782
Number of citations
41
Subject Categories
Computer Sciences; Computer Science, Artificial Intelligence; Engineering, Electrical & Electronic
ISSN journal
0162-8828
Volume
18
Issue
8
Year of publication
1996
Pages
771 - 782
Database
ISI
SICI code
0162-8828(1996)18:8<771:APCATC>2.0.ZU;2-5
Abstract
This research presents preliminary results generated from the semantic retrieval research component of the Illinois Digital Library Initiative (DLI) project. Using a variation of the automatic thesaurus generation techniques, to which we refer as the concept space approach, we aimed to create graphs of domain-specific concepts (terms) and their weighted co-occurrence relationships for all major engineering domains. Merging these concept spaces and providing traversal paths across different concept spaces could potentially help alleviate the vocabulary (difference) problem evident in large-scale information retrieval. We have experimented previously with such a technique for a smaller molecular biology domain (Worm Community System, with 10+ MBs of document collection) with encouraging results. In order to address the scalability issue related to large-scale information retrieval and analysis for the current Illinois DLI project, we recently conducted experiments using the concept space approach on parallel supercomputers. Our test collection included 2+ GBs of computer science and electrical engineering abstracts extracted from the INSPEC database. The concept space approach called for extensive textual and statistical analysis (a form of knowledge discovery) based on automatic indexing and co-occurrence analysis algorithms, both previously tested in the biology domain. Initial testing results using a 512-node CM-5 and a 16-processor SGI Power Challenge were promising. Power Challenge was later selected to create a comprehensive computer engineering concept space of about 270,000 terms and 4,000,000+ links using 24.5 hours of CPU time. Our system evaluation involving 12 knowledgeable subjects revealed that the automatically created computer engineering concept space generated significantly higher concept recall than the human-generated INSPEC computer engineering thesaurus. However, the INSPEC thesaurus was more precise than the automatic concept space.
Our current work mainly involves creating concept spaces for other major engineering domains and developing robust graph matching and traversal algorithms for cross-domain, concept-based retrieval. Future work also will include generating individualized concept spaces for assisting user-specific concept-based information retrieval.
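
As a rough illustration of what a graph of terms with weighted co-occurrence links might look like, the following minimal Python sketch builds one from a toy collection. It is not the authors' algorithm: the abstract does not state the weighting formula, so the asymmetric weight used here (the share of a term's documents that also contain the other term) is an assumption, as are the names `build_concept_space` and `related_terms` and the sample documents.

```python
from collections import defaultdict
from itertools import combinations


def build_concept_space(docs):
    """Return {term: {neighbor: weight}} from a list of per-document term lists."""
    doc_freq = defaultdict(int)    # number of documents containing each term
    pair_freq = defaultdict(int)   # number of documents containing both terms
    for terms in docs:
        unique = sorted(set(terms))
        for term in unique:
            doc_freq[term] += 1
        for a, b in combinations(unique, 2):
            pair_freq[(a, b)] += 1

    space = defaultdict(dict)
    for (a, b), both in pair_freq.items():
        # Assumed weighting: the share of a's documents that also mention b.
        # The link a -> b can therefore differ from the link b -> a.
        space[a][b] = both / doc_freq[a]
        space[b][a] = both / doc_freq[b]
    return space


def related_terms(space, term, top_n=5):
    """Rank the neighbors of `term` by descending link weight."""
    neighbors = space.get(term, {})
    return sorted(neighbors.items(), key=lambda kv: (-kv[1], kv[0]))[:top_n]


# Hypothetical, already-indexed documents (one list of terms per abstract).
docs = [
    ["parallel computing", "supercomputer", "information retrieval"],
    ["information retrieval", "thesaurus", "indexing"],
    ["parallel computing", "supercomputer"],
]
space = build_concept_space(docs)
print(related_terms(space, "supercomputer"))
```

Looking up a term's strongest neighbors in this way is one plausible use of such a graph for suggesting alternative search vocabulary; every document pair is counted independently, which is the kind of workload that can be split across processors.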