Protein Annotators' Assistant: A novel application of information retrieval techniques

Authors
Citation
M.J. Wise, Protein Annotators' Assistant: A novel application of information retrieval techniques, J AM S INFO, 51(12), 2000, pp. 1131-1136
Number of citations
21
Subject categories
Library & Information Science
Journal title
JOURNAL OF THE AMERICAN SOCIETY FOR INFORMATION SCIENCE
ISSN journal
0002-8231
Volume
51
Issue
12
Year of publication
2000
Pages
1131 - 1136
Database
ISI
SICI code
0002-8231(200010)51:12<1131:PAAANA>2.0.ZU;2-I
Abstract
The Protein Annotators' Assistant (PAA) (http://www.ebi.ac.uk/paa/) is a software system that assists protein annotators in assigning functions to newly sequenced proteins. Working backward from SwissProt, a database that describes known proteins, and a prior sequence similarity search that returns a list of known proteins similar to a query, PAA suggests keywords and phrases that may describe functions performed by the query. In a preprocessing step, a database is built from the protein names that appear in the SwissProt database, and against each protein are listed keywords and phrases extracted from the corresponding text records. Common words, either in general English usage or from the biological domain, are removed as the phrases are assembled. This process is assisted by a simple stemming algorithm, which extends the list of stop-words (i.e., reject words), together with a list of accept-words. At runtime, the search algorithm, invoked by a user via a Web interface, takes a list of protein names and clusters the named proteins around keywords/phrases shared by members of the list. The assumption is that if these proteins have a particular keyword/phrase in common, and they are related to a query protein, then the keyword/phrase may also describe the query. Overall, PAA employs a number of IR techniques in a novel setting and is thus related to text categorization, where multiple categories may be suggested, except that in this case none of the categories are specified in advance.
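The two stages described in the abstract (keyword extraction with stop-words, stemming, and accept-words, followed by clustering of similar proteins around shared keywords) can be illustrated with a minimal Python sketch. This is not the PAA implementation: the stop-word and accept-word lists, the suffix-stripping stemmer, the function names, the `min_shared` threshold, and the restriction to single-word keywords (no phrases) are all assumptions made for illustration.

```python
from collections import defaultdict

# Hypothetical word lists; the actual PAA lists are not given in the abstract.
STOP_WORDS = {"the", "of", "and", "a", "in", "is", "to", "protein", "bind"}
ACCEPT_WORDS = {"binding"}  # assumed: kept even if the stem is a stop-word


def crude_stem(word):
    """Strip a few common suffixes (a stand-in for PAA's simple stemmer)."""
    for suffix in ("ing", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def extract_keywords(text):
    """Return single-word keywords left after stop-word filtering."""
    keywords = set()
    for raw in text.lower().split():
        word = raw.strip(".,;:()")
        if not word:
            continue
        if word in ACCEPT_WORDS:
            keywords.add(word)
        # Stemming extends the stop list: "proteins" is rejected via "protein".
        elif word not in STOP_WORDS and crude_stem(word) not in STOP_WORDS:
            keywords.add(word)
    return keywords


def build_index(records):
    """Preprocessing step: map each protein name to its keyword set."""
    return {name: extract_keywords(text) for name, text in records.items()}


def cluster_by_keyword(index, hit_names, min_shared=2):
    """Runtime step: group the listed proteins around keywords they share."""
    clusters = defaultdict(set)
    for name in hit_names:
        for keyword in index.get(name, ()):
            clusters[keyword].add(name)
    shared = {
        keyword: sorted(names)
        for keyword, names in clusters.items()
        if len(names) >= min_shared
    }
    # Keywords shared by the most proteins come first; ties sort alphabetically.
    return sorted(shared.items(), key=lambda item: (-len(item[1]), item[0]))
```

In use, `build_index` would be run once over the annotation text, and `cluster_by_keyword` would be called with the protein names returned by a sequence similarity search; the keywords at the top of the result are the candidate descriptions for the query protein.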