Extensive simulations for longest common subsequences - Finite size scaling, a cavity solution, and configuration space properties

Authors
Citation
Jb. De Monvel, Extensive simulations for longest common subsequences - Finite size scaling, a cavity solution, and configuration space properties, EUR PHY J B, 7(2), 1999, pp. 293-308
Citations number
31
Categorie Soggetti
Apllied Physucs/Condensed Matter/Materiales Science
Journal title
EUROPEAN PHYSICAL JOURNAL B
ISSN journal
14346028 → ACNP
Volume
7
Issue
2
Year of publication
1999
Pages
293 - 308
Database
ISI
SICI code
1434-6028(199901)7:2<293:ESFLCS>2.0.ZU;2-N
Abstract
Given two strings X and Y of N and M characters respectively, the Longest C ommon Subsequence (LCS) Problem asks for the longest sequence of (non-conti guous) matches between X and Y. Using extensive Monte-Carlo simulations for this problem, we find a finite size scaling law of the form E(L-N)/(N) = g amma s + A(s)/(ln N root N) +... for the average LCS length of two random s trings of size N over S letters. We provide precise estimates of gamma s fo r 2 less than or equal to S less than or equal to 15. We consider also a re lated Bernoulli Matching model where the different entries of an N x M arra y are occupied with a match independently with probability 1/S. On the basi s of a cavity-like analysis we find that the length of a longest sequence o f matches in that case behaves as L-NM(B) similar to gamma(S)(B) (r)N where r = M/N and gamma(S)(B) (r) = (2 root rS - r - 1)/(S - 1). This formula ag rees very well with our numerical computations. It provides a very good app roximation for the Random String model, the approximation getting more accu rate as S increases. The question of the "universality class" of the LCS pr oblem is also considered. Our results for the Bernoulli matching model show very good agreement with the scaling predictions of [15] for Needleman-Wun sch sequence alignment. We find however that the variance of the LCS length has a scaling different from Var(L-N) approximate to N-2/3 in the Random S tring model, suggesting that long-ranged correlations among the matches are relevant in this model. We finally study the "ground state" properties of this problem. We find that the number N-LCS Of solutions typically grows ex ponentially with N. In other words, this system does not satisfy "Nernst's principle". This is also reflected at the level of the overlap between two LCSs chosen at random, which is found to be self averaging and to approach a definite value qs < 1 as N --> infinity.