CLEANUP - A FAST COMPUTER-PROGRAM FOR REMOVING REDUNDANCIES FROM NUCLEOTIDE-SEQUENCE DATABASES

Authors

GRILLO G ATTIMONELLI M LIUNI S PESOLE G

Citation

G. Grillo et al., CLEANUP - A FAST COMPUTER-PROGRAM FOR REMOVING REDUNDANCIES FROM NUCLEOTIDE-SEQUENCE DATABASES, Computer applications in the biosciences, 12(1), 1996, pp. 1-8

Subject categories

Mathematical Methods, Biology & Medicine; Computer Sciences, Special Topics; Computer Science Interdisciplinary Applications; Biology Miscellaneous

Journal title

Computer applications in the biosciences

ISSN journal

02667061

Volume

12

Issue

1

Year of publication

1996

Pages

1 - 8

Database

ISI

SICI code

0266-7061(1996)12:1<1:C-AFCF>2.0.ZU;2-Y

Abstract

A key concept in comparing sequence collections is the issue of redundancy. The production of sequence collections free from redundancy is undoubtedly very useful, both in performing statistical analyses and accelerating extensive database searching on nucleotide sequences. Indeed, publicly available databases contain multiple entries of identical or almost identical sequences. Performing statistical analysis on such biased data makes the risk of assigning high significance to non-significant patterns very high. In order to carry out unbiased statistical analysis as well as more efficient database searching it is thus necessary to analyse sequence data that have been purged of redundancy. Given that an unambiguous definition of redundancy is impracticable for biological sequence data, in the present program a quantitative description of redundancy will be used, based on the measure of sequence similarity. A sequence is considered redundant if it shows a degree of similarity and overlapping with a longer sequence in the database greater than a threshold fixed by the user. In this paper we present a new algorithm based on an 'approximate string matching' procedure, which is able to determine the overall degree of similarity between each pair of sequences contained in a nucleotide sequence database and to generate automatically nucleotide sequence collections free from redundancies.
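
The redundancy criterion stated in the abstract (similarity and overlap with a longer sequence, each above a user-set threshold) can be illustrated with a minimal Python sketch. This is not the CLEANUP algorithm: the record does not describe the paper's approximate string matching procedure, so the standard library's `difflib.SequenceMatcher` stands in for it here. The function names, the two separate thresholds and their default values, and the exact definitions of "similarity" (matched bases divided by the aligned span of the shorter sequence) and "overlap" (aligned span divided by the length of the shorter sequence) are all assumptions made for illustration.

```python
from difflib import SequenceMatcher


def similarity_and_overlap(shorter, longer):
    """Return (similarity, overlap) of `shorter` against `longer`, both in [0, 1].

    Assumed definitions, not taken from the paper:
      similarity = matched bases / aligned span of `shorter`
      overlap    = aligned span of `shorter` / length of `shorter`
    """
    matcher = SequenceMatcher(None, shorter, longer, autojunk=False)
    # difflib always appends a zero-length sentinel block; drop it.
    blocks = [b for b in matcher.get_matching_blocks() if b.size > 0]
    if not blocks:
        return 0.0, 0.0
    matched = sum(b.size for b in blocks)
    # Region of `shorter` between the first and the last matching block.
    span = blocks[-1].a + blocks[-1].size - blocks[0].a
    return matched / span, span / len(shorter)


def remove_redundant(sequences, min_similarity=0.9, min_overlap=0.9):
    """Return the entries of `sequences` (name -> sequence) that are not redundant.

    Sequences are visited from longest to shortest, so each one is only
    compared with kept sequences of equal or greater length.
    """
    kept = {}
    for name, seq in sorted(sequences.items(), key=lambda kv: -len(kv[1])):
        redundant = False
        for longer in kept.values():
            similarity, overlap = similarity_and_overlap(seq, longer)
            if similarity > min_similarity and overlap > min_overlap:
                redundant = True
                break
        if not redundant:
            kept[name] = seq
    return kept


# Hypothetical toy data: an exact fragment, a fragment with one
# substitution, and an unrelated sequence.
sequences = {
    "full": "GATTACAGGCTTAACCGGTTAGCA",
    "fragment": "ACAGGCTTAACC",
    "variant": "ACAGGCGTAACC",
    "unrelated": "TTTTGGGGCCCC",
}
print(list(remove_redundant(sequences)))  # ['full', 'unrelated']
```

Raising `min_similarity` to 0.95 in this toy example keeps `variant` as well, since its single substitution gives a similarity of 11/12 under the definition assumed above. The sketch compares every sequence with every kept sequence, which is quadratic in the number of entries and would be far too slow for a real database.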