The suffix-signature method for searching for phrases in text

Authors
Citation
M. Zhou et Fw. Tompa, The suffix-signature method for searching for phrases in text, INF SYST, 23(8), 1998, pp. 567-588
Citations number
33
Categorie Soggetti
Information Tecnology & Communication Systems
Journal title
INFORMATION SYSTEMS
ISSN journal
03064379 → ACNP
Volume
23
Issue
8
Year of publication
1998
Pages
567 - 588
Database
ISI
SICI code
0306-4379(199812)23:8<567:TSMFSF>2.0.ZU;2-W
Abstract
We present a new algorithm to find all occurrences of a given phrase based on the data structure known as a suffix array and using a corresponding arr ay of signatures. With this algorithm, matches to phrases of moderate lengt h can be found with expected search time of one disk access to the text and one disk access to its index. To achieve this performance for phrases of u p to five words in length requires an index having total size of approximat ely 120% of the size of the text. The algorithm guarantees a worst case sea rch performance of two disk accesses to the text per phrase search. Experim ents with actual data ranging in size from 6Mb to 550Mb and with actual que ry patterns derived from logs of searches on the World Wide Web show that t he approach is applicable in practice to a variety of texts and realistic p hrase searches. (C)1998 Elsevier Science Ltd. All rights reserved.