ITA
ENG

MULTIPLE SEQUENCE COMPARISON - A PEPTIDE MATCHING APPROACH

Authors

SAGOT MF VIARI A SOLDANO H

Citation

Mf. Sagot et al., MULTIPLE SEQUENCE COMPARISON - A PEPTIDE MATCHING APPROACH, Theoretical computer science, 180(1-2), 1997, pp. 115-137

Citations number

Categorie Soggetti

Computer Sciences","Computer Science Theory & Methods

Journal title

Theoretical computer science → ACNP

ISSN journal

03043975

Volume

180

Issue

1-2

Year of publication

1997

Pages

115 - 137

Database

ISI

SICI code

0304-3975(1997)180:1-2<115:MSC-AP>2.0.ZU;2-O

Abstract

We present in this paper a peptide matching approach to the multiple c omparison of a set of protein sequences. This approach consists in loo king for all the words that are common to q of these sequences, where q is a parameter. The comparison between words is done by using as ref erence an object called a model, In the case of proteins, a model is a product of subsets of the alphabet Sigma of the amino acids. These su bsets belong to a cover of Sigma, that is, their union covers all of S igma. A word is said to be an instance of a model if it belongs to the model, A further flexibility is introduced in the comparison by allow ing for up to e errors in the comparison between a word and a model. T hese errors may concern gaps or substitutions not allowed by the cover . A word is said to be this time an occurrence of a model if the Leven shtein distance between it and an instance of the model is inferior or equal to e. This corresponds to what we call a Set-Levenshtein distan ce between the occurrences and the model itself. Two words are said to be similar if there is at least one model of which both are occurrenc es. In the special case where e = 0, the occurrences of a model are si mply its instances, If a model M has occurrences in at least q of the sequences of the set, M is said to occur in the set. The algorithm pre sented here is an efficient and exact way of looking for all the model s, of a fixed length k or of the greatest possible length k(max), that occur in a set of sequences. It is linear in the total length n of th e sequences and proportional to (e + 2)(2e + 1)k(e + 1)p(e + 1)g(k) wh ere k much less than n is a small value in all practical situations, p is the number of sets in the cover and g is related to the latter's d egree of nontransitivity. Models are closely related to what is called a consensus in the biocomputing area, and covers are a good way of re presenting complex relationships between the amino acids.