A STATISTICAL ANALYTICAL APPROACH TO DECIPHER INFORMATION FROM BIOLOGICAL SEQUENCES - APPLICATION TO MURINE SPLICE-SITE ANALYSIS AND PREDICTION

Citation
Bvb. Reddy and Mw. Pandit, A STATISTICAL ANALYTICAL APPROACH TO DECIPHER INFORMATION FROM BIOLOGICAL SEQUENCES - APPLICATION TO MURINE SPLICE-SITE ANALYSIS AND PREDICTION, Journal of biomolecular structure & dynamics, 12(4), 1995, pp. 785-801
Number of citations
29
Subject Categories
Biophysics,Biology
ISSN journal
07391102
Volume
12
Issue
4
Year of publication
1995
Pages
785 - 801
Database
ISI
SICI code
0739-1102(1995)12:4<785:ASAATD>2.0.ZU;2-5
Abstract
A simple statistical approach for the analysis of biological sequences, such as splice sites, promoter regions, helices and extended-structure-forming regions, or any other sequence-dependent functional entities in proteins, is presented. The approach has proved useful for developing a method to predict such entities in newly available sequences. We first search for invariant sequence features of each functional entity in the experimentally available sequences and identify a set of 'like' sequences with similar sequence features. In the next step, concrete features of the sequence entities, in terms of occurrences of smaller subsequences at various positions, are identified and used as a knowledge base to select potential functional entities from the identified 'like' sequences. The third step consists of refinement of this pattern learning, statistical improvement of the knowledge-base weight matrices and, finally, its application to predict functional entities in newly available sequences. Such an analysis is operationally described for murine splice-site prediction. Regions comprising -30 to +30 nucleotides from the splice junction at the murine splice sites (donors and acceptors), reported earlier, were analyzed. Invariant sequence-specific features in terms of monomer frequency average were used to identify splice-site-like sequences in the EMBL murine DNA sequence database. The frequencies of occurrence of mono-, di-, tri- and tetranucleotides in the known splice sites were studied in comparison with the splice-site-like sequences; the significant differences in their occurrences were extracted as statistical knowledge coded in weight matrices for a computer to identify potential splice sites. The algorithm was refined and a method was developed to predict potential splice sites in a given murine DNA; the analysis was also extended to human DNA. The success rate of the method in predicting correct splice sites in these species is found to be 80% and 85%, respectively. The major strength of this method lies in significantly reducing the number of false positives that are normally picked up in such analyses.
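
The weight-matrix idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract gives no formula, so the log-odds weighting, the pseudocount, and the names `positional_counts`, `weight_matrix` and `score` are all assumptions made here for illustration. The sketch compares position-specific k-mer frequencies in known sites against splice-site-like sequences and scores a candidate by summing the resulting weights.

```python
import math
from collections import Counter


def positional_counts(seqs, k):
    """Count each k-mer at each start position across equal-length sequences."""
    length = len(seqs[0])
    counts = [Counter() for _ in range(length - k + 1)]
    for s in seqs:
        for i in range(length - k + 1):
            counts[i][s[i:i + k]] += 1
    return counts


def weight_matrix(true_sites, like_sites, k, pseudo=0.5):
    """Per-position k-mer weights: log ratio of smoothed frequencies.

    Assumption: log-odds with a pseudocount stands in for the paper's
    unspecified 'significant differences' criterion.
    """
    true_counts = positional_counts(true_sites, k)
    like_counts = positional_counts(like_sites, k)
    n_kmers = 4 ** k  # number of possible DNA k-mers
    weights = []
    for t, l in zip(true_counts, like_counts):
        w = {}
        for kmer in set(t) | set(l):
            f_true = (t[kmer] + pseudo) / (len(true_sites) + pseudo * n_kmers)
            f_like = (l[kmer] + pseudo) / (len(like_sites) + pseudo * n_kmers)
            w[kmer] = math.log(f_true / f_like)
        weights.append(w)
    return weights


def score(seq, weights, k):
    """Sum the weights of the k-mers observed in seq; unseen k-mers score 0."""
    return sum(w.get(seq[i:i + k], 0.0) for i, w in enumerate(weights))


# Toy, made-up sequences purely for demonstration (not data from the paper).
true_sites = ["AGGTAAGT", "CGGTAAGT", "AGGTGAGT"]
like_sites = ["ACGTTCCA", "TTGTACGA", "GAGTCCAT"]
dinuc_weights = weight_matrix(true_sites, like_sites, k=2)
```

In this toy setup a known site scores above zero and a look-alike scores below zero, so a threshold on the score would separate them; the same functions could be run for k = 1 to 4 to mirror the mono- to tetranucleotide analysis the abstract mentions.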