The Protein Annotators' Assistant (or PAA) (http://www.ebi.ac.uk/paa/) is a
software system which assists protein annotators in the task of assigning
functions to newly sequenced proteins. Working backward from SwissProt, a
database which describes known proteins, and a prior sequence similarity
search that returns a list of known proteins similar to a query, PAA suggests
keywords and phrases which may describe functions performed by the query. In
a preprocessing step, a database is built from the protein names that appear
in the SwissProt database, and against each protein are listed keywords and
phrases that are extracted from the corresponding text records. Common words,
either in general English usage or from the biological domain, are removed
as the phrases are assembled. This process is assisted by the use of
a simple stemming algorithm, which extends the list of stop-words (i.e.,
reject words), together with a list of accept-words. At runtime, the search
algorithm, invoked by a user via a Web interface, takes a list of protein
names and clusters the named proteins around keywords/phrases shared by
members of the list. The assumption is that if these proteins have a particular
keyword/phrase in common, and they are related to a query protein, then
the keyword/phrase may also describe the query. Overall, PAA employs a number
of techniques in a novel setting and is thus related to text
categorization, where multiple categories may be suggested, except that in this case
none of the categories are specified in advance.
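
The two stages described above (keyword extraction at preprocessing time, then clustering of similarity-search hits around shared keywords at runtime) can be illustrated with a minimal Python sketch. This is not PAA's implementation: the stop-word list, the suffix-stripping stemmer, the function names, the `min_support` threshold, and the three toy records are all assumptions made up for illustration, and the accept-word list is omitted for brevity.

```python
from collections import defaultdict

# Hypothetical stop-word list; PAA's actual lists are not given in the text.
STOP_WORDS = {"the", "of", "and", "in", "a", "by", "is", "protein", "involved"}


def crude_stem(word):
    """Strip a few common English suffixes (a stand-in for PAA's stemmer)."""
    for suffix in ("ing", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


# Stemming extends the stop list: any word whose stem matches is rejected.
STOP_STEMS = {crude_stem(w) for w in STOP_WORDS}


def extract_keywords(text):
    """Lower-case the text, stem each word, and drop stop-word stems."""
    words = (w.strip(".,;()") for w in text.lower().split())
    stems = (crude_stem(w) for w in words if w)
    return {s for s in stems if s not in STOP_STEMS}


def build_index(records):
    """Preprocessing: map each protein name to its keyword set."""
    return {name: extract_keywords(text) for name, text in records.items()}


def cluster_by_keyword(index, hits, min_support=2):
    """Runtime: group the hit proteins around the keywords they share."""
    clusters = defaultdict(set)
    for name in hits:
        for keyword in index.get(name, ()):
            clusters[keyword].add(name)
    shared = {k: v for k, v in clusters.items() if len(v) >= min_support}
    # Most widely shared keywords first, ties broken alphabetically.
    return sorted(shared.items(), key=lambda kv: (-len(kv[1]), kv[0]))


# Toy records (invented, not real SwissProt entries).
records = {
    "PROT_A": "ATP binding kinase involved in signalling",
    "PROT_B": "Serine kinase; binds ATP",
    "PROT_C": "Membrane transporter of zinc",
}
index = build_index(records)
suggestions = cluster_by_keyword(index, ["PROT_A", "PROT_B", "PROT_C"])
for keyword, proteins in suggestions:
    print(keyword, sorted(proteins))
```

In this sketch, each keyword that survives the threshold plays the role of a candidate description for the query, with the supporting proteins listed beside it. No categories are fixed in advance; they emerge from whatever terms the hits happen to share.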