A. J. Doig, Improving the efficiency of the genetic code by varying the codon length - the perfect genetic code, Journal of Theoretical Biology, 188(3), 1997, pp. 355-360
The function of DNA is to specify protein sequences. The four-base
"alphabet" used in nucleic acids is translated to the 20-amino-acid alphabet
of proteins (plus a stop signal) via the genetic code. The code is neither
overlapping nor punctuated; mRNA sequences are read in successive triplet
codons until a stop codon is reached. The true genetic code
uses three bases for every amino acid. The efficiency of the genetic
code can be significantly increased if the requirement for a fixed codon
length is dropped so that the more common amino acids have shorter
codon lengths and rare amino acids have longer codon lengths. More efficient
codes can be derived using the Shannon-Fano and Huffman coding
algorithms. The compression achieved using a Huffman code cannot be improved
upon by any code that assigns one fixed codon to each amino acid. I have used
these algorithms to derive efficient codes for representing protein sequences
using alphabets of both two and four bases. The length of DNA required to
specify the complete set of protein sequences
could be significantly shorter if translation used a variable codon
length. The restriction to a fixed codon length of three bases means that
it takes 42% more DNA than the minimum necessary, and the genetic
code is 70% efficient. One can think of many reasons why this maximally
efficient code has not evolved: there is very little redundancy, so almost
any mutation causes an amino acid change. Many mutations will be
potentially lethal frame-shift mutations, if the mutation leads to a
change in codon length. It would be more difficult for the machinery of
translation to cope with a variable codon length. Nevertheless, in
the strict and narrow sense of coding for protein sequences using the
minimum length of DNA possible, the Huffman code derived here is perfect.
(C) 1997 Academic Press Limited.
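
The variable-length idea summarised above can be illustrated with a minimal
sketch of Huffman coding over a four-base alphabet. This is not the author's
code or data: the function names (huffman_code, mean_length, entropy) and the
TOY_FREQS table are assumptions made for illustration, and the toy weights
are invented so that the arithmetic comes out exactly; they are not measured
amino acid frequencies.

```python
import heapq
import itertools
import math

# Invented toy weights (NOT measured amino acid frequencies): three "common"
# symbols at 1/4 each and four "rare" symbols at 1/16 each.
TOY_FREQS = {"L": 0.25, "S": 0.25, "A": 0.25,
             "W": 0.0625, "C": 0.0625, "M": 0.0625, "H": 0.0625}


def huffman_code(freqs, alphabet="ACGT"):
    """Build a prefix code over `alphabet`; returns {symbol: codeword}."""
    d = len(alphabet)
    tie = itertools.count()  # tie-breaker so the heap never compares dicts
    heap = [(w, next(tie), {s: ""}) for s, w in freqs.items()]
    # Pad with zero-weight dummy nodes so every merge joins exactly d nodes.
    while (len(heap) - 1) % (d - 1) != 0:
        heap.append((0.0, next(tie), {}))
    heapq.heapify(heap)
    while len(heap) > 1:
        merged, total = {}, 0.0
        for letter in alphabet:  # pop the d lightest nodes
            w, _, codes = heapq.heappop(heap)
            total += w
            for sym, word in codes.items():
                merged[sym] = letter + word  # prepend this branch's base
        heapq.heappush(heap, (total, next(tie), merged))
    return heap[0][2]


def mean_length(freqs, code):
    """Average codeword length in bases per symbol, weighted by frequency."""
    total = sum(freqs.values())
    return sum(w * len(code[s]) for s, w in freqs.items()) / total


def entropy(freqs, base):
    """Shannon entropy in units of log `base` (bases per symbol for base 4)."""
    total = sum(freqs.values())
    return -sum((w / total) * math.log(w / total, base)
                for w in freqs.values() if w > 0)


if __name__ == "__main__":
    code = huffman_code(TOY_FREQS)
    fixed = math.ceil(math.log(len(TOY_FREQS), 4))  # fixed-length codon size
    avg = mean_length(TOY_FREQS, code)
    print(code)
    print("fixed codon length:", fixed)
    print("Huffman mean codon length:", avg)
    print("efficiency of the fixed-length code:", avg / fixed)
```

In this toy case the common symbols receive one-base codons and the rare ones
two-base codons, so the mean length falls below the two bases a fixed-length
code would need. The same ratio underlies the figures quoted in the abstract:
a code that uses 42% more DNA than the minimum has an efficiency of
1/1.42, or about 70%.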