This study investigated whether document retrieval can be improved by dividing documents into smaller sub-documents, or passages, and incorporating the retrieval scores of these passages into the final retrieval score for the whole document. The documents were segmented by sliding a window of a given size across the document and extracting the words displayed each time the window stopped. A retrieval score was calculated for each extracted passage, and the highest score obtained by a passage of that size was taken as the document's passage-level score for that window size. A range of window sizes was tried.
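
The segmentation and scoring steps described above can be sketched roughly as follows. This is only an illustration under stated assumptions: the step size, the treatment of the document tail, and the toy `overlap_score` function are hypothetical choices made for the example and are not taken from the study, which used tf*idf weighting.

```python
from typing import Callable, List, Sequence


def sliding_windows(words: Sequence[str], size: int, step: int) -> List[List[str]]:
    """Return the word windows seen as a window of `size` words moves by `step`."""
    if size <= 0 or step <= 0:
        raise ValueError("size and step must be positive")
    if len(words) <= size:
        # A short document is treated as a single passage (assumption).
        return [list(words)]
    starts = list(range(0, len(words) - size + 1, step))
    # Assumption: add a final window so the tail of the document is covered.
    if starts[-1] != len(words) - size:
        starts.append(len(words) - size)
    return [list(words[s:s + size]) for s in starts]


def overlap_score(query: Sequence[str], passage: Sequence[str]) -> float:
    """Toy stand-in for a retrieval score: count passage tokens that are query terms."""
    terms = set(query)
    return float(sum(1 for w in passage if w in terms))


def passage_level_score(
    query: Sequence[str],
    words: Sequence[str],
    size: int,
    step: int,
    score: Callable[[Sequence[str], Sequence[str]], float] = overlap_score,
) -> float:
    """Highest score obtained by any passage of the given window size."""
    return max(score(query, p) for p in sliding_windows(words, size, step))
```

Calling `passage_level_score` once per window size gives one passage-level score per size for each document; any real scoring function with the same signature can be passed in place of `overlap_score`.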
The experimental results indicated that using a fixed window size of 50 words gave better results than other window sizes for the TREC-5 and TREC-6 test collections. This window size yielded a significant retrieval improvement of 24% compared to using the whole-document retrieval score (using the traditional tf*idf weighting scheme with cosine normalisation). However, combining this window score and the whole-document retrieval score did not yield a retrieval improvement.
Using a variable window size (ranging from 50 to 400 words) yielded a retrieval improvement of about 5% over using a fixed window size of 50. Different window sizes were found to work best for different queries. If the best window size to use for each query could be predicted accurately, a maximum retrieval improvement of 42% could be obtained.
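
The upper bound for perfect per-query prediction can be read as an oracle calculation: take the best window size for each query, then average. The sketch below shows that arithmetic only. The `Scores` layout, the function names, and the numbers in the usage note are hypothetical and are not results from the study.

```python
from statistics import mean
from typing import Dict, Tuple

# Hypothetical layout: query id -> window size -> effectiveness (e.g. average precision).
Scores = Dict[str, Dict[int, float]]


def best_fixed(scores: Scores) -> Tuple[int, float]:
    """Window size with the highest mean effectiveness over all queries."""
    sizes = sorted(next(iter(scores.values())))
    means = {s: mean(scores[q][s] for q in scores) for s in sizes}
    best = max(sizes, key=lambda s: means[s])
    return best, means[best]


def oracle_mean(scores: Scores) -> float:
    """Mean effectiveness if the best window size were chosen for every query."""
    return mean(max(by_size.values()) for by_size in scores.values())


def relative_gain(new: float, base: float) -> float:
    """Relative improvement of `new` over `base`, as a fraction."""
    return (new - base) / base
```

With made-up values such as `{"q1": {50: 0.4, 200: 0.2}, "q2": {50: 0.2, 200: 0.3}}`, the best fixed size is 50 with a mean of about 0.30, while the oracle mean is about 0.35, a relative gain of roughly 17%. The oracle can never fall below the best fixed size, which is why it serves as a ceiling for any window-size predictor.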
Subsequent work suggests that the usefulness of passage-level evidence in document retrieval depends on the weighting scheme and type of normalisation used in the retrieval method.