ITA
ENG

The suffix-signature method for searching for phrases in text

Authors

Zhou, M Tompa, FW

Citation

M. Zhou et Fw. Tompa, The suffix-signature method for searching for phrases in text, INF SYST, 23(8), 1998, pp. 567-588

Citations number

Categorie Soggetti

Information Tecnology & Communication Systems

Journal title

INFORMATION SYSTEMS

ISSN journal

03064379 → ACNP

Volume

Issue

Year of publication

1998

Pages

567 - 588

Database

ISI

SICI code

0306-4379(199812)23:8<567:TSMFSF>2.0.ZU;2-W

Abstract

We present a new algorithm to find all occurrences of a given phrase based on the data structure known as a suffix array and using a corresponding arr ay of signatures. With this algorithm, matches to phrases of moderate lengt h can be found with expected search time of one disk access to the text and one disk access to its index. To achieve this performance for phrases of u p to five words in length requires an index having total size of approximat ely 120% of the size of the text. The algorithm guarantees a worst case sea rch performance of two disk accesses to the text per phrase search. Experim ents with actual data ranging in size from 6Mb to 550Mb and with actual que ry patterns derived from logs of searches on the World Wide Web show that t he approach is applicable in practice to a variety of texts and realistic p hrase searches. (C)1998 Elsevier Science Ltd. All rights reserved.