Annotation transfer for genomics: Measuring functional divergence in multi-domain proteins

Citation
H. Hegyi and M. Gerstein, Annotation transfer for genomics: Measuring functional divergence in multi-domain proteins, GENOME RES, 11(10), 2001, pp. 1632-1640
Number of citations
28
Subject categories
Molecular Biology & Genetics
Journal title
GENOME RESEARCH
ISSN journal
1088-9051
Volume
11
Issue
10
Year of publication
2001
Pages
1632 - 1640
Database
ISI
SICI code
1088-9051(200110)11:10<1632:ATFGMF>2.0.ZU;2-P
Abstract
Annotation transfer is a principal process in genome annotation. It involves "transferring" structural and functional annotation to uncharacterized open reading frames (ORFs) in a newly completed genome from experimentally characterized proteins similar in sequence. To prevent errors in genome annotation, it is important that this process be robust and statistically well-characterized, especially with regard to how it depends on the degree of sequence similarity. Previously, we and others have analyzed annotation transfer in single-domain proteins. Multi-domain proteins, which make up the bulk of the ORFs in eukaryotic genomes, present more complex issues in functional conservation. Here we present a large-scale survey of annotation transfer in these proteins, using scop superfamilies to define domain folds and a thesaurus based on SWISS-PROT keywords to define functional categories. Our survey reveals that multi-domain proteins have significantly less functional conservation than single-domain ones, except when they share the exact same combination of domain folds. In particular, we find that for multi-domain proteins, approximate function can be accurately transferred with only 35% certainty for pairs of proteins sharing one structural superfamily. In contrast, this value is 67% for pairs of single-domain proteins sharing the same structural superfamily. On the other hand, if two multi-domain proteins contain the same combination of two structural superfamilies, the probability of their sharing the same function increases to 80%; in the case of complete coverage along the full length of both proteins, this value increases further to >90%. Moreover, we found that only 70 of the current total of 455 structural superfamilies are found in both single and multi-domain proteins and only 14 of these were associated with the same function in both categories of proteins.
We also investigated the degree to which function could be transferred between pairs of multi-domain proteins with respect to the degree of sequence similarity between them, finding that functional divergence at a given amount of sequence similarity is always about two-fold greater for pairs of multi-domain proteins (sharing similarity over a single domain) in comparison to pairs of single-domain ones, though the overall shape of the relationship is quite similar. Further information is available at http://partslist.org/func or http://bioinfo.mbb.yale.edu/partslist/func.