String Editing and Longest Common Subsequences
The string editing problem for input strings x and y consists of transforming x into y by performing a series of weighted edit operations on x of overall minimum cost. An edit operation on x can be the deletion of a symbol from x, the insertion of a symbo
- PDF / 4,285,781 Bytes
- 38 Pages / 439.37 x 666.142 pts Page_size
- 9 Downloads / 203 Views
Summary. The string editing problem for input strings x and y consists of transforming x into y by performing aseries of weighted edit operations on x of overall minimum cost. An edit operation on x can be the deletion of a symbol from x, the insertion of a symbol in x or the substitution of a symbol of x with another symbol. String editing models a variety of problems arising in such diverse areas as text and speech processing, geology and, last but not least, molecular biology. Special cases of string editing include the longest common subsequence problem, local alignment and similarity searching in DNA and protein sequences, and approximate string searching. We describe serial and parallel algorithmic solutions for the problem and some of its basic variants.
1. Introduction Let x be astring of lxi = rn symbols from some alphabet E of cardinality s. We say that E is bounded when s is a constant independent of rn, unbounded otherwise. We consider three edit operations on x, namely, deletion of a symbol from x, insertion of a new symbol in x, and substitution of one of the symbols of x with another symbol from E. We assurne that each edit operation has an associated nonnegative real number representing the cost of that operation. More precisely, the cost of deleting from x an occurrence of symbol a is denoted by D( a), the cost of inserting some symbol a between any two consecutive positions of xis denoted by I(a) and the cost of substituting some occurrence of a in x with an occurrence of bis denoted by S(a, b). An edit script on x is any sequence S of viable edit operations on x, and the cost of S is the sum of all costs of the edit operations in S. For any given string x, there is always some edit script S that transforms x into any other string of E*, so that the only possible questions of interest revolve around the cost of S. Let x and y be two strings of respe~tive lengths lxi = rn and lyl = n ~ rn. The string editing problem for input strings x and y consists of finding an edit script SI of minimum cost that transforms x into y. The cost of S, is the edit distance from x to y. Edit distances where individual operations are assigned integer or unit costs occupy a special place. Such distances are often called Levenshtein distances, since they were introduced by Levenshtein [1966] in connection with error correcting codes. String editing finds applications in a broad variety of contexts, ranging from
G. Rozenberg et al. (eds.), Handbook of Formal Languages © Springer-Verlag Berlin Heidelberg 1997
362
A. Apostolico
text and speech processing to geology, from computer vision to molecular biology. It is not difficult to see that the general (i.e., with unbounded alphabet and unrestricted costs) problem of edit distance computation is solved by a serial algorithm in 8(mn) time and space, through dynamic programming. Due to widespread application of the problem, however, such a solution and a few basic variants were discovered and published in literat ure catering to such diverse disciplines (see, e.g., Sank
Data Loading...