A NEW CHALLENGE FOR COMPRESSION ALGORITHMS - GENETIC SEQUENCES

Authors
S. Grumbach, F. Tahi
Citation
S. Grumbach and F. Tahi, A NEW CHALLENGE FOR COMPRESSION ALGORITHMS - GENETIC SEQUENCES, Information Processing & Management, 30(6), 1994, pp. 875-886
Number of citations
18
Subject Categories
Information Science & Library Science; Computer Science, Information Systems
Journal ISSN
0306-4573
Volume
30
Issue
6
Year of publication
1994
Pages
875 - 886
Database
ISI
SICI code
0306-4573(1994)30:6<875:ANCFCA>2.0.ZU;2-H
Abstract
Universal data compression algorithms fail to compress genetic sequences. This is due to the specificity of this particular kind of "text." We analyze in some detail the properties of the sequences that cause the failure of classical algorithms. We then present a lossless algorithm, biocompress-2, to compress the information contained in DNA and RNA sequences, based on the detection of regularities, such as the presence of palindromes. The algorithm combines substitutional and statistical methods and, to the best of our knowledge, leads to the highest compression of DNA. The results, although not satisfactory, give insight into the necessary correlation between compression and comprehension of genetic sequences.
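
The kind of regularity the abstract mentions can be illustrated with a minimal Python sketch. It is not the biocompress-2 algorithm, whose details are not given in this record. It assumes that a "palindrome" means a segment that is the reverse complement of an earlier segment of the same sequence, which is the usual sense in genetics; a substitutional coder could replace such a segment with a pointer to the earlier one. The function names and the `min_len` threshold are hypothetical, and the search is brute force for clarity.

```python
# Illustrative sketch only; NOT the biocompress-2 algorithm from the paper.
# Assumption: a "palindrome" is a segment equal to the reverse complement
# of an earlier segment in the same DNA sequence.

COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA string over A, C, G, T."""
    return "".join(COMPLEMENT[base] for base in reversed(seq))


def find_palindrome(seq: str, pos: int, min_len: int = 4):
    """Look for the longest segment starting at `pos` whose reverse
    complement occurs entirely before `pos`.

    Returns (start, length) of the earlier segment, or None if no
    match of at least `min_len` bases exists (hypothetical threshold).
    """
    # The match cannot be longer than the text before or after `pos`.
    max_len = min(pos, len(seq) - pos)
    for length in range(max_len, min_len - 1, -1):
        target = reverse_complement(seq[pos:pos + length])
        # str.find with an end bound only matches inside seq[0:pos].
        start = seq.find(target, 0, pos)
        if start != -1:
            return start, length
    return None
```

For example, in `"AAGCGCTT"` the last four bases `GCTT` are the reverse complement of the first four, `AAGC`, so `find_palindrome("AAGCGCTT", 4)` reports the earlier segment at position 0 with length 4. Whether encoding such a reference is cheaper than a plain encoding of the bases depends on the pointer cost, which this sketch does not model.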