Aa. Schaffer et al., Improving the accuracy of PSI-BLAST protein database searches with composition-based statistics and other refinements, NUCL ACID R, 29(14), 2001, pp. 2994-3005
PSI-BLAST is an iterative program to search a database for proteins with
distant similarity to a query sequence. We investigated over a dozen
modifications to the methods used in PSI-BLAST, with the goal of improving
accuracy in finding true positive matches. To evaluate performance we used
a set of 103 queries for which the true positives in yeast had been
annotated by human experts, and a popular measure of retrieval accuracy
(ROC) that can be normalized to take on values between 0 (worst) and 1
(best). The modifications we consider novel improve the ROC score from
0.758 +/- 0.005 to 0.895 +/- 0.003. This does not include the benefits from
four modifications we included in the 'baseline' version, even though they
were not implemented in PSI-BLAST version 2.0. The improvement in accuracy
was confirmed on a small second test set. This test involved analyzing
three protein families with curated lists of true positives from the
non-redundant protein database. The modification that accounts for the
majority of the improvement is the use, for each database sequence, of a
position-specific scoring system tuned to that sequence's amino acid
composition. The use of composition-based statistics is particularly
beneficial for large-scale automated applications of PSI-BLAST.
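
The normalized ROC measure mentioned in the abstract can be illustrated with
a minimal sketch. It assumes the common truncated form (often written ROC_n):
for each of the first n false positives in the ranked hit list, count the
true positives ranked above it, sum the counts, and divide by n times the
total number of true positives. The abstract does not state which n or which
tie-handling the authors used, so the function name `roc_n`, its parameters,
and the padding rule for short hit lists are assumptions made for
illustration only.

```python
from typing import Sequence


def roc_n(ranked_labels: Sequence[bool], n: int, total_true: int) -> float:
    """Truncated, normalized ROC score in [0, 1] (illustrative sketch).

    ranked_labels: hits ordered best-first; True marks a true positive.
    n:             number of false positives to consider (assumed cutoff).
    total_true:    number of true positives that exist for the query.
    """
    if n <= 0 or total_true <= 0:
        raise ValueError("n and total_true must be positive")

    true_seen = 0   # true positives encountered so far
    false_seen = 0  # false positives encountered so far
    area = 0        # sum of true_seen at each of the first n false positives

    for is_true in ranked_labels:
        if is_true:
            true_seen += 1
        else:
            false_seen += 1
            area += true_seen
            if false_seen == n:
                break

    # Assumption: if the list holds fewer than n false positives, the
    # missing ones are treated as ranking below every listed hit.
    area += (n - false_seen) * true_seen

    return area / (n * total_true)


if __name__ == "__main__":
    hits = [True, True, False, True, False]
    print(roc_n(hits, n=2, total_true=3))
```

A ranking that places every true positive ahead of the first false positive
scores 1, and one that places n false positives ahead of every true positive
scores 0, matching the 0 (worst) to 1 (best) range described above.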