Kk. Amfoh et al., THE USE OF LOGISTIC MODELS FOR THE ANALYSIS OF CODON FREQUENCIES OF DNA SEQUENCES IN TERMS OF EXPLANATORY VARIABLES, Biometrics, 50(4), 1994, pp. 1054-1063
The development of the regressive logistic model applicable to the
analysis of codon frequencies of DNA sequences in terms of explanatory
variables is presented. A codon is a triplet of nucleotides that codes
for an amino acid, and may be considered as a trivariate response
(B_1, B_2, B_3), where B_i (i = 1, 2, 3) is a categorical random
variable with values A, C, G, T. The linear order of bases in the DNA
and the possible statistical dependence of the bases within a given
codon make the regressive logistic model a suitable tool for the
analysis of codon frequencies. A problem of structural zeros arises
from the fact that the stopping codons (terminators) do not code for
amino acids; this is solved by normalizing the likelihood function.
Codon frequencies may also depend on the function of the gene, and they
are known to differ between genes of the same genome. Differences also
occur between synonymous codons for the same amino acid. Thus, the use
of covariates that differ between synonymous codons, as well as
covariates that are constant within codons of the same amino acid, may
be useful in explaining the frequencies. As an illustration, the method
is applied to the human mitochondrial genome using the following
explanatory variables: (1) TSCORE, a measure of the number of
single-base mutations required for a given codon to become a
terminator; (2) AARISK, an indicator of a codon's ability to change, by
a single base substitution, into triplets coding for amino acids with
very different characteristics; (3) AVDIST, a measure of the typicality
of the amino acid coded for by the triplet. The results indicate that
models incorporating both dependency structure and covariates are to be
preferred to models comprising either covariates alone or dependency
structure alone.
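
The regressive factorisation and the normalisation over structural zeros described above can be illustrated with a minimal Python sketch. It is not the authors' implementation. The stop-codon set is an assumption taken from the vertebrate mitochondrial genetic code (the abstract does not list the terminators), the names regressive_prob, normalised_codon_probs and hamming_to_nearest_stop are hypothetical, and the Hamming-distance function is only one plausible reading of "number of single-base mutations required to become a terminator", not the paper's definition of TSCORE. The conditional base probabilities are supplied here as plain tables; in the paper they would presumably come from logistic regressions on the covariates.

```python
from itertools import product

BASES = "ACGT"
# Assumption: terminators of the vertebrate mitochondrial genetic code;
# the abstract itself does not enumerate them.
STOP_CODONS = {"TAA", "TAG", "AGA", "AGG"}

ALL_CODONS = ["".join(t) for t in product(BASES, repeat=3)]
SENSE_CODONS = [c for c in ALL_CODONS if c not in STOP_CODONS]


def regressive_prob(codon, p1, p2, p3):
    """Unnormalised P(B1) * P(B2 | B1) * P(B3 | B1, B2) for one codon."""
    b1, b2, b3 = codon
    return p1[b1] * p2[b1][b2] * p3[b1 + b2][b3]


def normalised_codon_probs(p1, p2, p3):
    """Renormalise over sense codons so that terminators get probability 0."""
    raw = {c: regressive_prob(c, p1, p2, p3) for c in SENSE_CODONS}
    total = sum(raw.values())
    return {c: v / total for c, v in raw.items()}


def hamming_to_nearest_stop(codon):
    """Hypothetical TSCORE-like covariate: fewest base changes to reach a stop."""
    return min(sum(a != b for a, b in zip(codon, s)) for s in STOP_CODONS)


# Toy input: uniform bases with no dependence between positions.
p1 = {b: 0.25 for b in BASES}
p2 = {a: {b: 0.25 for b in BASES} for a in BASES}
p3 = {a + b: {c: 0.25 for c in BASES} for a in BASES for b in BASES}
probs = normalised_codon_probs(p1, p2, p3)
```

With these toy inputs every sense codon receives the same probability after normalisation. Replacing p2 and p3 with tables that depend on the preceding bases introduces the within-codon dependency structure that the abstract compares against covariate-only models.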