
Automatic rule generation for protein annotation with the C4.5 data mining algorithm applied on SWISS-PROT

Authors

Kretschmann, E Fleischmann, W Apweiler, R

Citation

E. Kretschmann et al., Automatic rule generation for protein annotation with the C4.5 data mining algorithm applied on SWISS-PROT, BIOINFORMAT, 17(10), 2001, pp. 920-926

Citations number

Subject Categories

Multidisciplinary

Journal title

BIOINFORMATICS

ISSN journal

1367-4803

Volume

17

Issue

10

Year of publication

2001

Pages

920 - 926

Database

ISI

SICI code

1367-4803(200110)17:10<920:ARGFPA>2.0.ZU;2-#

Abstract

Motivation: The gap between the amount of newly submitted protein data and reliable functional annotation in public databases is growing. Traditional manual annotation by literature curation and sequence analysis tools, without the use of automated annotation systems, is not able to keep up with the ever-increasing quantity of data that is submitted. Automated supplements to manually curated databases, such as TrEMBL or GenPept, cover raw data but provide only limited annotation. To improve this situation, automatic tools are needed that support manual annotation, automatically increase the amount of reliable information and help to detect inconsistencies in manually generated annotations.

Results: A standard data mining algorithm was successfully applied to gain knowledge about the Keyword annotation in SWISS-PROT. 11306 rules were generated, which are provided in a database and can be applied to yet unannotated protein sequences and viewed using a web browser. They rely on the taxonomy of the organism in which the protein was found and on signature matches of its sequence. The statistical evaluation of the generated rules by cross-validation suggests that by applying them to arbitrary proteins, 33% of their keyword annotation can be generated with an error rate of 1.5%. The coverage rate of the keyword annotation can be increased to 60% by tolerating a higher error rate of 5%.
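
The kind of rule described in the Results, and the two figures reported for it (coverage and error rate), can be illustrated with a minimal sketch. It does not implement C4.5 rule induction and is not the authors' code: the `Protein` and `Rule` classes, the signature identifiers, the keywords and the exact definitions of coverage and error rate used here are all assumptions made for illustration only.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Protein:
    taxonomy: frozenset    # lineage terms of the source organism
    signatures: frozenset  # signature matches found on the sequence
    keywords: frozenset    # curated keywords, used as ground truth


@dataclass(frozen=True)
class Rule:
    required_taxon: Optional[str]   # None means "any organism"
    required_signatures: frozenset  # all of these must match
    keyword: str                    # keyword predicted when the rule fires

    def fires(self, protein: Protein) -> bool:
        taxon_ok = (self.required_taxon is None
                    or self.required_taxon in protein.taxonomy)
        return taxon_ok and self.required_signatures <= protein.signatures


def evaluate(rules, proteins):
    """Return (coverage, error_rate) on annotated proteins.

    Assumed definitions: coverage is the share of curated keywords that
    the rules reproduce; error rate is the share of predicted keywords
    that are absent from the curated annotation.
    """
    total = sum(len(p.keywords) for p in proteins)
    correct = wrong = 0
    for p in proteins:
        predicted = {r.keyword for r in rules if r.fires(p)}
        correct += len(predicted & p.keywords)
        wrong += len(predicted - p.keywords)
    made = correct + wrong
    coverage = correct / total if total else 0.0
    error_rate = wrong / made if made else 0.0
    return coverage, error_rate


# Hypothetical rules and held-out proteins (identifiers are invented).
rules = [
    Rule(None, frozenset({"SIG_A"}), "Kinase"),
    Rule("Eukaryota", frozenset({"SIG_B"}), "Membrane"),
]
proteins = [
    Protein(frozenset({"Bacteria"}), frozenset({"SIG_A"}),
            frozenset({"Kinase", "Transferase"})),
    Protein(frozenset({"Eukaryota"}), frozenset({"SIG_A", "SIG_B"}),
            frozenset({"Kinase"})),
    Protein(frozenset({"Eukaryota"}), frozenset({"SIG_B"}),
            frozenset({"Membrane"})),
]

coverage, error_rate = evaluate(rules, proteins)
print(coverage, error_rate)  # 0.75 0.25
```

In a cross-validation setting, rules would be generated from one part of the annotated data and evaluated in this way on the held-out part. Keeping only the more reliable rules lowers the error rate at the cost of coverage, which is the trade-off the abstract reports (33% at 1.5% versus 60% at 5%).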