Many large proteins have evolved by internal duplication and many internal
sequence repeats correspond to functional and structural units. We have
developed an automatic algorithm, RADAR, for segmenting a query sequence into
repeats. The segmentation procedure has three steps: (i) repeat length is
determined by the spacing between suboptimal self-alignment traces; (ii)
repeat borders are optimized to yield a maximal integer number of repeats; and
(iii) distant repeats are validated by iterative profile alignment. The
method identifies short, composition-biased repeats as well as gapped
approximate repeats, and complex repeat architectures involving many
different types of repeats in the query sequence. No manual intervention and
no prior assumptions
on the number and length of repeats are required. Comparison to the Pfam-A
database indicates good coverage, accurate alignments, and reasonable repeat
borders. Screening the Swissprot database revealed 3,000 repeats not
annotated in existing domain databases. A number of these repeats had been
described in the literature, but most were novel. This illustrates how in times
when curated databases grapple with ever-increasing backlogs, automatic
(re)analysis of sequences provides an efficient way to capture this important
information. Proteins 2000;41:224-237. (C) 2000 Wiley-Liss, Inc.
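
The idea behind step (i), reading a repeat length off the spacing of
self-similarity signals, can be illustrated with a deliberately simplified
sketch. This is not the RADAR implementation: it assumes ungapped identity
counts along the off-diagonals of the self-comparison matrix as a stand-in
for suboptimal self-alignment traces, and the function names
(diagonal_scores, estimate_repeat_length) are hypothetical.

```python
def diagonal_scores(seq, min_offset=1):
    """Count identical residue pairs on each off-diagonal of the
    self-comparison matrix (offset d compares seq[i] with seq[i + d])."""
    n = len(seq)
    return {
        d: sum(1 for i in range(n - d) if seq[i] == seq[i + d])
        for d in range(min_offset, n)
    }


def estimate_repeat_length(seq, min_offset=2):
    """Return the offset with the most identities as a candidate repeat length.

    Simplifying assumption: repeats are ungapped and of equal length, so the
    strongest diagonal lies one repeat unit away from the main diagonal.
    Ties are broken in favour of the smaller offset.
    """
    scores = diagonal_scores(seq, min_offset)
    if not scores:
        raise ValueError("sequence is too short for the requested min_offset")
    return max(scores, key=lambda d: (scores[d], -d))


if __name__ == "__main__":
    # Three approximate copies of a nine-residue unit (toy data).
    query = "ACDEFGHIK" + "ACDEFGHLK" + "ACNEFGHIK"
    print(estimate_repeat_length(query))
```

A sketch like this handles neither gapped repeats nor mixed repeat types,
which is where the border optimization and profile alignment steps described
above come in.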