We present a new algorithm to find all occurrences of a given phrase based
on the data structure known as a suffix array and using a corresponding arr
ay of signatures. With this algorithm, matches to phrases of moderate lengt
h can be found with expected search time of one disk access to the text and
one disk access to its index. To achieve this performance for phrases of u
p to five words in length requires an index having total size of approximat
ely 120% of the size of the text. The algorithm guarantees a worst case sea
rch performance of two disk accesses to the text per phrase search. Experim
ents with actual data ranging in size from 6Mb to 550Mb and with actual que
ry patterns derived from logs of searches on the World Wide Web show that t
he approach is applicable in practice to a variety of texts and realistic p
hrase searches. (C)1998 Elsevier Science Ltd. All rights reserved.