C. Landes et al., DOT-PLOT COMPARISONS BY MULTIVARIATE-ANALYSIS (DOCMA) - A TOOL FOR CLASSIFYING PROTEIN SEQUENCES, Computer applications in the biosciences, 9(2), 1993, pp. 191-196
A method aimed at classifying protein sequences without resorting to
pairwise alignment is presented. Called DOCMA (DOt-plot Comparisons by
Multivariate Analysis), it is based on a multivariate analysis of the
pairwise dot-plots between all the sequences in the set. The dot-plots
are first simplified by considering only the projections of the
'diagonal' segments of similarity onto the axes. From these projections
a data matrix is built, in which each column is representative of the
comparisons of one given sequence with all the other ones. This data
matrix is then transformed into a distance matrix by a chi-squared
analysis, from which the coordinates of the sequences in an orthonormal
Euclidean space are obtained. The sequences are finally classified by a
dynamic clustering procedure followed by a search for strong clusters.
Application of this method to protein families such as the globins, the
cytochromes c and the aminoacyl-tRNA synthetases shows that it is quite
effective in delineating subgroups that contain even distantly related
sequences.
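
The step from data matrix to Euclidean coordinates can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not give the exact formulas, so the code assumes the standard chi-squared distance between column profiles used in correspondence analysis, followed by classical multidimensional scaling. The function names chi2_distances and euclidean_coordinates and the toy matrix are hypothetical, and the dot-plot projection and clustering steps are not covered.

```python
import numpy as np


def chi2_distances(data):
    """Chi-squared distances between the columns of a non-negative matrix.

    Assumption: standard correspondence-analysis definition, with each
    column turned into a profile and each row weighted by 1 / row mass.
    Every row and every column must have a non-zero sum.
    """
    data = np.asarray(data, dtype=float)
    row_mass = data.sum(axis=1) / data.sum()      # relative weight of each row
    profiles = data / data.sum(axis=0)            # each column now sums to 1
    # Differences between every pair of column profiles: (rows, cols, cols).
    diff = profiles[:, :, None] - profiles[:, None, :]
    squared = (diff ** 2 / row_mass[:, None, None]).sum(axis=0)
    return np.sqrt(squared)


def euclidean_coordinates(dist, tol=1e-10):
    """Classical MDS: coordinates whose pairwise distances reproduce dist."""
    n = dist.shape[0]
    centering = np.eye(n) - np.ones((n, n)) / n
    gram = -0.5 * centering @ (dist ** 2) @ centering   # double-centred Gram matrix
    eigvals, eigvecs = np.linalg.eigh(gram)
    order = np.argsort(eigvals)[::-1]                   # largest eigenvalue first
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    keep = eigvals > tol                                # drop null directions
    return eigvecs[:, keep] * np.sqrt(eigvals[keep])


# Hypothetical toy matrix: 4 rows of projection counts, 3 sequences (columns).
toy = np.array([[5, 4, 0],
                [3, 4, 1],
                [0, 1, 6],
                [1, 0, 5]], dtype=float)

distances = chi2_distances(toy)
coords = euclidean_coordinates(distances)   # one row of coordinates per sequence
```

Because chi-squared distances are weighted Euclidean distances between profiles, the recovered coordinates reproduce the distance matrix up to rounding error, and any clustering procedure that works on Euclidean points can then be applied to the rows of coords.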