Hc. Chen et al., A PARALLEL COMPUTING APPROACH TO CREATING ENGINEERING CONCEPT SPACES FOR SEMANTIC RETRIEVAL - THE ILLINOIS DIGITAL LIBRARY INITIATIVE PROJECT, IEEE Transactions on Pattern Analysis and Machine Intelligence, 18(8), 1996, pp. 771-782
This research presents preliminary results generated from the semantic
retrieval research component of the Illinois Digital Library Initiative
(DLI) project. Using a variation of automatic thesaurus generation
techniques, which we refer to as the concept space approach, we aimed
to create graphs of domain-specific concepts (terms) and their weighted
co-occurrence relationships for all major engineering domains. Merging
these concept spaces and providing traversal paths across different
concept spaces could potentially help alleviate the vocabulary
(difference) problem evident in large-scale information retrieval. We
have experimented previously with such a technique for a smaller
molecular biology domain (Worm Community System, with a document
collection of 10+ MB) with encouraging results. To address the scalability
issue related to large-scale information retrieval and analysis for
the current Illinois DLI project, we recently conducted experiments
using the concept space approach on parallel supercomputers. Our test
collection included 2+ GB of computer science and electrical engineering
abstracts extracted from the INSPEC database. The concept space
approach called for extensive textual and statistical analysis (a form
of knowledge discovery) based on automatic indexing and co-occurrence
analysis algorithms, both previously tested in the biology domain. Initial
testing results using a 512-node CM-5 and a 16-processor SGI Power
Challenge were promising. The Power Challenge was later selected to create a
comprehensive computer engineering concept space of about 270,000
terms and 4,000,000+ links using 24.5 hours of CPU time. Our system
evaluation involving 12 knowledgeable subjects revealed that the
automatically created computer engineering concept space generated significantly
higher concept recall than the human-generated INSPEC computer
engineering thesaurus. However, the INSPEC thesaurus was more precise
than the automatic concept space. Our current work mainly involves
creating concept spaces for other major engineering domains and
developing robust graph matching and traversal algorithms for
cross-domain, concept-based retrieval. Future work will also include
generating individualized concept spaces to assist user-specific,
concept-based information retrieval.
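
The abstract describes a concept space as a graph of terms joined by weighted co-occurrence links. The following is a minimal sketch of that idea, not the authors' algorithm: it assumes documents have already been reduced to lists of index terms, and it uses a simple asymmetric weight (documents containing both terms divided by documents containing the source term) as a stand-in for whatever weighting the paper actually applies. The names `build_concept_space`, `related_terms`, and `min_weight` are hypothetical.

```python
from collections import Counter
from itertools import combinations


def build_concept_space(docs, min_weight=0.0):
    """Return {term: {neighbor: weight}} from a list of per-document term lists.

    Assumption: weight(a -> b) = docs containing both a and b / docs containing a.
    This is an illustrative weighting, not the one used in the paper.
    """
    doc_freq = Counter()   # number of documents containing each term
    pair_freq = Counter()  # number of documents containing both terms of a pair
    for terms in docs:
        unique = sorted(set(terms))  # count each term once per document
        doc_freq.update(unique)
        pair_freq.update(combinations(unique, 2))

    graph = {term: {} for term in doc_freq}
    for (a, b), both in pair_freq.items():
        # Links are asymmetric: each direction is normalized by its source term.
        w_ab = both / doc_freq[a]
        w_ba = both / doc_freq[b]
        if w_ab > min_weight:
            graph[a][b] = w_ab
        if w_ba > min_weight:
            graph[b][a] = w_ba
    return graph


def related_terms(graph, term, top_n=5):
    """Rank the neighbors of `term` by descending link weight, ties by name."""
    neighbors = graph.get(term, {})
    return sorted(neighbors.items(), key=lambda kv: (-kv[1], kv[0]))[:top_n]


# Toy collection standing in for indexed abstracts (made-up data).
docs = [
    ["parallel computing", "supercomputer", "information retrieval"],
    ["information retrieval", "thesaurus", "indexing"],
    ["parallel computing", "supercomputer"],
    ["information retrieval", "indexing"],
]
space = build_concept_space(docs)
print(related_terms(space, "supercomputer"))
```

Because pair counting is quadratic in the number of terms per document, this step is the natural candidate for partitioning across processors on a large collection, which is consistent with the scalability concern raised in the abstract.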