T.L. Bailey et al., AN ARTIFICIAL-INTELLIGENCE APPROACH TO MOTIF DISCOVERY IN PROTEIN SEQUENCES - APPLICATION TO STEROID DEHYDROGENASES, Journal of Steroid Biochemistry and Molecular Biology, 62(1), 1997, pp. 29-44
MEME (Multiple Expectation-maximization for Motif Elicitation) is a unique new software tool that uses artificial intelligence techniques to discover motifs shared by a set of protein sequences in a fully automated manner. This paper is the first detailed study of the use of MEME to analyse a large, biologically relevant set of sequences, and to evaluate the sensitivity and accuracy of MEME in identifying structurally important motifs. For this purpose, we chose the short-chain alcohol dehydrogenase superfamily because it is large and phylogenetically diverse, providing a test of how well MEME can work on sequences with low amino acid similarity. Moreover, this dataset contains enzymes of biological importance, and because several enzymes have known X-ray crystallographic structures, we can test the usefulness of MEME for structural analysis. The first six motifs from MEME map onto structurally important alpha-helices and beta-strands on Streptomyces hydrogenans 20 beta-hydroxysteroid dehydrogenase. We also describe MAST (Motif Alignment Search Tool), which conveniently uses output from MEME for searching databases such as SWISS-PROT and Genpept. MAST provides statistical measures that permit a rigorous evaluation of the significance of database searches with individual motifs or groups of motifs. A database search of Genpept90 by MAST with the log-odds matrix of the first six motifs obtained from MEME yields a bimodal output, demonstrating the selectivity of MAST. We show for the first time, using primary sequence analysis, that bacterial sugar epimerases are homologs of short-chain dehydrogenases. MEME and MAST will be increasingly useful as genome sequencing provides large datasets of phylogenetically divergent sequences of biomedical interest. (C) 1997 Elsevier Science Ltd.