Clustering protein sequences-structure prediction by transitive homology

Authors

Bolten, E Schliep, A Schneckener, S Schomburg, D Schrader, R

Citation

E. Bolten et al., Clustering protein sequences-structure prediction by transitive homology, BIOINFORMAT, 17(10), 2001, pp. 935-941

Citations number

Subject categories

Multidisciplinary

Journal title

BIOINFORMATICS

ISSN journal

1367-4803

Volume

17

Issue

10

Year of publication

2001

Pages

935 - 941

Database

ISI

SICI code

1367-4803(200110)17:10<935:CPSPBT>2.0.ZU;2-L

Abstract

Motivation: It is widely believed that for two proteins A and B a sequence identity above some threshold implies structural similarity due to a common evolutionary ancestor. Since this is only a sufficient, but not a necessary, condition for structural similarity, the question remains what other criteria can be used to identify remote homologues.

Transitivity refers to the concept of deducing a structural similarity between proteins A and C from the existence of a third protein B, such that A and B as well as B and C are homologues, as ascertained if the sequence identity between A and B as well as that between B and C is above the aforementioned threshold. It is not fully understood if transitivity always holds and whether transitivity can be extended ad infinitum.

Results: We developed a graph-based clustering approach, where transitivity plays a crucial role. We determined all pair-wise similarities for the sequences in the SwissProt database using the Smith-Waterman local alignment algorithm. This data was transformed into a directed graph, where protein sequences constitute vertices. A directed edge was drawn from vertex A to vertex B if the sequences A and B showed similarity, scaled with respect to the self-similarity of A, above a fixed threshold. Transitivity was important in the clustering process, as intermediate sequences were used, limited though by the requirement of having directed paths in both directions between proteins linked over such sequences. The length dependency (implied by the self-similarity) of the scaling of the alignment scores appears to be an effective criterion to avoid clustering errors due to multi-domain proteins.

To deal with the resulting large graphs we have developed an efficient library. Methods include the novel graph-based clustering algorithm capable of handling multi-domain proteins and cluster comparison algorithms. Structural Classification of Proteins (SCOP) was used as an evaluation data set for our method, yielding a 24% improvement over pair-wise comparisons in terms of detecting remote homologues.
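
The graph construction described in the abstract can be illustrated with a small Python sketch. It is not the authors' library or their exact algorithm. It assumes that "directed paths in both directions" can be read as membership in the same strongly connected component, and it uses made-up alignment scores, a made-up threshold of 0.6, and hypothetical function names (`build_graph`, `strongly_connected_components`, `cluster`).

```python
def build_graph(scores, threshold):
    """Build a directed graph from pairwise alignment scores.

    scores maps (a, b) -> alignment score and must contain the
    self-score (a, a) for every sequence a. An edge a -> b is drawn
    when score(a, b) / score(a, a) exceeds the threshold.
    """
    graph = {}
    for (a, b), score in scores.items():
        graph.setdefault(a, set())
        graph.setdefault(b, set())
        if a != b and score / scores[(a, a)] > threshold:
            graph[a].add(b)
    return graph


def strongly_connected_components(graph):
    """Kosaraju's algorithm, written iteratively to avoid deep recursion."""
    # Pass 1: record vertices in order of depth-first completion.
    order, seen = [], set()
    for start in graph:
        if start in seen:
            continue
        seen.add(start)
        stack = [(start, iter(graph[start]))]
        while stack:
            node, neighbours = stack[-1]
            for nxt in neighbours:
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append((nxt, iter(graph[nxt])))
                    break
            else:
                order.append(node)
                stack.pop()

    # Reverse every edge.
    reverse = {v: set() for v in graph}
    for a, targets in graph.items():
        for b in targets:
            reverse[b].add(a)

    # Pass 2: sweep the reversed graph in decreasing completion order.
    components, assigned = [], set()
    for start in reversed(order):
        if start in assigned:
            continue
        assigned.add(start)
        component, todo = set(), [start]
        while todo:
            node = todo.pop()
            component.add(node)
            for prev in reverse[node]:
                if prev not in assigned:
                    assigned.add(prev)
                    todo.append(prev)
        components.append(component)
    return components


def cluster(scores, threshold):
    return strongly_connected_components(build_graph(scores, threshold))


# Illustrative scores only (not from the paper). A, B and D are
# single-domain sequences; M is a longer multi-domain sequence that
# contains a copy of A's domain, so its self-score is twice as large.
scores = {
    ("A", "A"): 100, ("B", "B"): 100, ("D", "D"): 100, ("M", "M"): 200,
    ("A", "B"): 70, ("B", "A"): 70,
    ("B", "D"): 65, ("D", "B"): 65,
    ("A", "D"): 40, ("D", "A"): 40,
    ("A", "M"): 100, ("M", "A"): 100,
}
clusters = cluster(scores, threshold=0.6)
print(sorted(sorted(c) for c in clusters))
```

In this toy input A and D fall below the threshold when compared directly, yet they end up in one cluster because B links them in both directions. M stays in a cluster of its own: the edge A -> M exists (ratio 1.0), but M -> A does not (ratio 0.5), which is the length-dependent effect the abstract credits with avoiding multi-domain clustering errors.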