MULTIPLE SEQUENCE COMPARISON - A PEPTIDE MATCHING APPROACH

Citation
Mf. Sagot et al., MULTIPLE SEQUENCE COMPARISON - A PEPTIDE MATCHING APPROACH, Theoretical computer science, 180(1-2), 1997, pp. 115-137
Citations number
30
Categorie Soggetti
Computer Sciences","Computer Science Theory & Methods
ISSN journal
03043975
Volume
180
Issue
1-2
Year of publication
1997
Pages
115 - 137
Database
ISI
SICI code
0304-3975(1997)180:1-2<115:MSC-AP>2.0.ZU;2-O
Abstract
We present in this paper a peptide matching approach to the multiple c omparison of a set of protein sequences. This approach consists in loo king for all the words that are common to q of these sequences, where q is a parameter. The comparison between words is done by using as ref erence an object called a model, In the case of proteins, a model is a product of subsets of the alphabet Sigma of the amino acids. These su bsets belong to a cover of Sigma, that is, their union covers all of S igma. A word is said to be an instance of a model if it belongs to the model, A further flexibility is introduced in the comparison by allow ing for up to e errors in the comparison between a word and a model. T hese errors may concern gaps or substitutions not allowed by the cover . A word is said to be this time an occurrence of a model if the Leven shtein distance between it and an instance of the model is inferior or equal to e. This corresponds to what we call a Set-Levenshtein distan ce between the occurrences and the model itself. Two words are said to be similar if there is at least one model of which both are occurrenc es. In the special case where e = 0, the occurrences of a model are si mply its instances, If a model M has occurrences in at least q of the sequences of the set, M is said to occur in the set. The algorithm pre sented here is an efficient and exact way of looking for all the model s, of a fixed length k or of the greatest possible length k(max), that occur in a set of sequences. It is linear in the total length n of th e sequences and proportional to (e + 2)(2e + 1)k(e + 1)p(e + 1)g(k) wh ere k much less than n is a small value in all practical situations, p is the number of sets in the cover and g is related to the latter's d egree of nontransitivity. Models are closely related to what is called a consensus in the biocomputing area, and covers are a good way of re presenting complex relationships between the amino acids.