We describe a completely automated approach to identifying local sequence motifs that transcend protein family boundaries. Cluster analysis is used to identify recurring patterns of variation at single positions and in short segments of contiguous positions in multiple sequence alignments for a non-redundant set of protein families. Parallel experiments on simulated data sets constructed with the overall residue frequencies of proteins but not the inter-residue correlations show that naturally occurring protein sequences are significantly more clustered than the corresponding random sequences for window lengths ranging from one to 13 contiguous positions. The patterns of variation at single positions are not in general surprising: chemically similar amino acids tend to be grouped together. More interesting patterns emerge as the window length increases. The patterns of variation for longer window lengths are in part recognizable patterns of hydrophobic and hydrophilic residues, and in part less obvious combinations. A particularly interesting class of patterns features highly conserved glycine residues. The patterns provide a means to abstract the information contained in multiple sequence alignments and may be useful for comparison of distantly related sequences or sequence families and for protein structure prediction. (C) 1995 Academic Press Limited
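
As an illustration of the kind of analysis summarized above, the following minimal sketch turns each window of contiguous alignment columns into a vector of residue frequencies and groups the windows by similarity. The abstract does not say which clustering algorithm, distance measure, or profile representation was used, so everything here is an assumption made for illustration: the function names (column_profile, window_profiles, kmeans) are hypothetical, gaps are simply ignored, and plain k-means with farthest-point initialization stands in for whatever procedure the authors applied.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def column_profile(column):
    """Residue frequency vector (20 entries) for one alignment column.

    Assumption: gap and unknown characters are ignored.
    """
    counts = Counter(c for c in column if c in AMINO_ACIDS)
    total = sum(counts.values())
    if total == 0:
        return [0.0] * len(AMINO_ACIDS)
    return [counts[a] / total for a in AMINO_ACIDS]


def window_profiles(alignment, width):
    """Concatenated column profiles for every window of `width` columns."""
    profiles = [column_profile(col) for col in zip(*alignment)]
    return [
        sum(profiles[i:i + width], [])
        for i in range(len(profiles) - width + 1)
    ]


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, iterations=20):
    """Plain k-means; an assumed stand-in for the paper's cluster analysis."""
    # Farthest-point initialization keeps the sketch deterministic.
    centers = [list(points[0])]
    while len(centers) < k:
        farthest = max(points, key=lambda p: min(sq_dist(p, c) for c in centers))
        centers.append(list(farthest))

    def assign():
        return [
            min(range(k), key=lambda j: sq_dist(p, centers[j]))
            for p in points
        ]

    for _ in range(iterations):
        labels = assign()
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # leave an empty cluster's center unchanged
                centers[j] = [sum(v) / len(members) for v in zip(*members)]
    return assign(), centers


# Toy alignment (hypothetical data): columns 0 and 2 are conserved glycine,
# columns 1 and 3 vary among the hydrophobic residues I, L and V.
toy_alignment = ["GIGL", "GLGV", "GVGI"]
single_position_labels, _ = kmeans(window_profiles(toy_alignment, 1), k=2)
```

On this toy input the two conserved glycine columns fall into one cluster and the two hydrophobic columns into the other. Passing a larger `width` to window_profiles gives the multi-position case. A control like the one described in the abstract could be approximated by running the same functions on sequences sampled independently from the overall residue frequencies.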