AUTOMATED PROTEIN-SEQUENCE DATABASE CLASSIFICATION - I - INTEGRATION OF COMPOSITIONAL SIMILARITY SEARCH, LOCAL SIMILARITY SEARCH, AND MULTIPLE SEQUENCE ALIGNMENT

Authors
J. Gracy, P. Argos
Citation
J. Gracy and P. Argos, AUTOMATED PROTEIN-SEQUENCE DATABASE CLASSIFICATION - I - INTEGRATION OF COMPOSITIONAL SIMILARITY SEARCH, LOCAL SIMILARITY SEARCH, AND MULTIPLE SEQUENCE ALIGNMENT, BIOINFORMATICS, 14(2), 1998, pp. 164-173
Citations number
35
Subject Categories
Computer Science Interdisciplinary Applications; Biology Miscellaneous; Biochemical Research Methods
Journal title
BIOINFORMATICS
ISSN journal
1367-4803
Volume
14
Issue
2
Year of publication
1998
Pages
164 - 173
Database
ISI
SICI code
1367-4803(1998)14:2<164:APDC-I>2.0.ZU;2-L
Abstract
Motivation: Genome sequencing projects require the periodic application of analysis tools that can classify and multiply align related protein sequence domains. Full automation of this task requires an efficient integration of similarity and alignment techniques.
Results: We have developed a fully automated process that classifies entire protein sequence databases, resulting in alignment of the homologous sequences. The successive steps of the procedure are based on compositional and local sequence similarity searches followed by multiple sequence alignments. Global similarities are detected from the pairwise comparison of amino acid and dipeptide compositions of each protein. After the elimination of all but one sequence from each detected cluster of closely related proteins, the remaining sequences are compiled in a suffix tree which is self-compared to detect local sequence similarities. Sets of proteins which share similar sequence segments are then weighted according to their closeness and multiply aligned using a fast hierarchical dynamic programming algorithm. Computational strategies were devised to minimize computer processing time and memory space requirements. The accuracy of the sequence classifications has been evaluated for 12 462 primary structures distributed over 341 known families. The percentage of sequences with missed or incorrect family assignments was 6.8% on the test set. This low error level is only twice that of the manually constructed PROSITE database (3.4%) and is substantially better than that found for the automatically built PRODOM database (34.9%).
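The compositional comparison mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not give the actual distance measure, so the Euclidean distance between relative frequencies used here is an assumption, and the function names (`composition`, `composition_distance`, `combined_distance`) are hypothetical.

```python
from collections import Counter
from math import sqrt


def composition(seq, k):
    """Relative frequencies of all overlapping k-mers in seq.

    k=1 gives the amino acid composition, k=2 the dipeptide composition.
    """
    n = len(seq) - k + 1
    if n <= 0:
        return {}
    counts = Counter(seq[i:i + k] for i in range(n))
    return {kmer: c / n for kmer, c in counts.items()}


def composition_distance(a, b, k):
    """Euclidean distance between the k-mer compositions of a and b.

    Assumption: the paper's actual measure is not stated in the abstract.
    """
    pa, pb = composition(a, k), composition(b, k)
    keys = set(pa) | set(pb)
    return sqrt(sum((pa.get(x, 0.0) - pb.get(x, 0.0)) ** 2 for x in keys))


def combined_distance(a, b):
    """Sum of amino acid (k=1) and dipeptide (k=2) distances.

    Assumption: equal weighting of the two terms is illustrative only.
    """
    return composition_distance(a, b, 1) + composition_distance(a, b, 2)
```

Under these assumptions, a small combined distance between two sequences would flag them as candidates for the same cluster of closely related proteins; identical sequences have a distance of zero.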