B.V.B. Reddy and M.W. Pandit, A STATISTICAL ANALYTICAL APPROACH TO DECIPHER INFORMATION FROM BIOLOGICAL SEQUENCES - APPLICATION TO MURINE SPLICE-SITE ANALYSIS AND PREDICTION, Journal of Biomolecular Structure & Dynamics, 12(4), 1995, pp. 785-801
A simple statistical approach for the analysis of biological sequences,
such as splice-sites, promoter regions, helices and extended
structure-forming regions or any other sequence-dependent functional entities
in proteins, is presented. The approach has proved useful for
developing a method to predict such entities in newly available
sequences. We first search for invariant sequence features of each
functional entity in the experimentally available sequences and identify a
set of 'like' sequences with similar sequence features. In the next
step, concrete features of sequence entities in terms of occurrences of
smaller subsequences are identified at various positions; these are
used as a knowledge base to select potential functional entities from
the identified 'like' sequences. The third step consists of refinement
of this pattern learning, statistical improvement of the knowledge-base
weight matrices and, finally, its application to predict functional
entities in newly available sequences. Such an analysis is operationally
described for murine splice-site predictions. Regions comprising -30
to +30 nucleotides from the splice-junction at the murine splice sites
(donors and acceptors), reported earlier, were analyzed. Invariant
sequence-specific features in terms of monomer frequency average were
used to identify splice-site-like sequences in the EMBL murine DNA
sequence database. The frequencies of occurrence of mono-, di-, tri- and
tetranucleotides in the known splice-sites were compared with those in
the splice-site-like sequences, and the significant differences in
their occurrences were extracted as statistical knowledge coded in weight
matrices for the computer to identify potential splice-sites. The
algorithm was refined and a method was developed to predict potential
splice-sites in a given murine DNA; the analysis was also extended to
human DNA. The success rate of the method in predicting correct
splice-sites in these species is found to be 80% and 85%, respectively.
The major strength of this method lies in significantly reducing the number of false
positives that are normally picked up in such analyses.
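
The weight-matrix idea described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not give the weighting formula, so the sketch assumes that a weight is simply the frequency of a subsequence at a position in the known sites minus its frequency at the same position in the 'like' sequences. The function names and the toy sequences are hypothetical and chosen only for illustration.

```python
from collections import Counter


def positional_kmer_freqs(seqs, k):
    """Relative frequency of each k-mer at each start position.

    Assumes all sequences are aligned and of equal length.
    """
    n_pos = len(seqs[0]) - k + 1
    counts = [Counter() for _ in range(n_pos)]
    for s in seqs:
        for i in range(n_pos):
            counts[i][s[i:i + k]] += 1
    total = len(seqs)
    return [{kmer: c / total for kmer, c in pos.items()} for pos in counts]


def weight_matrix(known, like, k):
    """Per-position weights (assumed form): freq in known minus freq in 'like'."""
    freqs_known = positional_kmer_freqs(known, k)
    freqs_like = positional_kmer_freqs(like, k)
    weights = []
    for pk, pl in zip(freqs_known, freqs_like):
        kmers = set(pk) | set(pl)
        weights.append({m: pk.get(m, 0.0) - pl.get(m, 0.0) for m in kmers})
    return weights


def score(seq, weights, k):
    """Sum the weights of the k-mers observed at each position of seq."""
    return sum(w.get(seq[i:i + k], 0.0) for i, w in enumerate(weights))


# Toy, made-up aligned windows (not data from the paper).
known_sites = ["AGGTAA", "AGGTGA", "CGGTAA"]
like_seqs = ["AGCTAA", "TGGAGA", "CGATAA"]

mono_weights = weight_matrix(known_sites, like_seqs, k=1)
candidate_score = score("AGGTAA", mono_weights, k=1)
decoy_score = score("TGCAAA", mono_weights, k=1)
```

Under these assumptions a candidate window resembling the known sites scores higher than a decoy. The same functions can be called with k = 2, 3 or 4 to cover the di-, tri- and tetranucleotide cases mentioned above, and the scores from the different values of k could then be combined, although the abstract does not say how the authors combined them.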