This report describes the application of a simple computational tool,
AAPAIR.TAB, for the systematic analysis of the cysteine-rich EGF, Sushi, and
Laminin motif/sequence families at the two-amino acid level. Automated
dipeptide frequency/bias analysis detects preferences in the distribution of
amino acids in established protein families by determining which "ordered
dipeptides" occur most frequently in comprehensive motif-specific sequence
data sets. Graphic display of the dipeptide frequency/bias data revealed
family-specific preferences for certain dipeptides but, more importantly,
detected a shared preference for the ordered dipeptides Gly-Tyr (GY) and
Gly-Phe (GF) in all three protein families. The dipeptide Asn-Gly (NG) also
exhibited high frequency and bias in the EGF and Sushi motif families,
whereas Asn-Thr (NT) was distinguished in the Laminin family. Evaluation of
the distribution of the dipeptides identified by frequency/bias analysis
subsequently revealed the highly restricted localization of the G(F/Y) and
N(G/T) sequence elements at two separate sites of extreme conservation in
the consensus sequence of all three sequence families. The similar
employment of the high-frequency/bias dipeptides in three distinct protein
sequence families was further correlated with the occurrence of these shared
molecular determinants at similar positions within the distinctive scaffolds
of three structurally divergent, but similarly employed, motif modules.
© 2001 Wiley-Liss, Inc.
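
The ordered-dipeptide counting described above can be illustrated with a minimal Python sketch. It is not the AAPAIR.TAB implementation, whose internals are not given here. The function names, the toy sequences, and in particular the definition of "bias" as an observed/expected ratio (expected from single-residue frequencies) are assumptions made for illustration only.

```python
from collections import Counter


def dipeptide_counts(sequences):
    """Count ordered dipeptides (adjacent residue pairs) across all sequences.

    Order matters: "NG" and "GN" are tallied as different dipeptides.
    """
    counts = Counter()
    for seq in sequences:
        seq = seq.upper()
        for i in range(len(seq) - 1):
            counts[seq[i:i + 2]] += 1
    return counts


def dipeptide_bias(sequences):
    """Observed/expected ratio for each dipeptide seen in the data.

    ASSUMPTION: "bias" is modelled here as the observed dipeptide frequency
    divided by the product of the two single-residue frequencies. The
    source does not state how AAPAIR.TAB defines bias.
    """
    pair_counts = dipeptide_counts(sequences)
    residue_counts = Counter("".join(s.upper() for s in sequences))
    total_pairs = sum(pair_counts.values())
    total_residues = sum(residue_counts.values())
    bias = {}
    for pair, n in pair_counts.items():
        expected = (residue_counts[pair[0]] / total_residues) * (
            residue_counts[pair[1]] / total_residues
        )
        bias[pair] = (n / total_pairs) / expected
    return bias


# Toy sequences invented for illustration; they are not real motif data.
toy = ["CNGYC", "CNGFC", "CGYNG"]
counts = dipeptide_counts(toy)
bias = dipeptide_bias(toy)

# Rank dipeptides by raw frequency, most frequent first.
for pair, n in counts.most_common(3):
    print(pair, n, round(bias[pair], 2))
```

Under this assumed definition, a ratio well above 1 marks a dipeptide that occurs more often than its constituent residues alone would predict, which is the kind of signal the abstract reports for GY, GF, NG, and NT.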