Given a string T with length n and a string P with length m and an integer k, design an O(nk)-time algorithm to find a substring of T that matches with P with edit distance at most k.
Solution:
By Wagner–Fischer algorithm, we have a recurrence relation D(i, j) = min( D(i – 1, j) + 1, D(i, j – 1) + 1, D(i – 1, j – 1) + [T[i] != P[j]), where D(i, j) is the minimum edit distance between a suffix of T[1..i] and P[1..j]. However, the time complexity is O(mn). Since we don’t care the matching with distance more than k, we have a way to improve the performance.
Note that the diagonal of table D is never decreasing and an interval of diagonal with the same value means there is an exact between a substring of S and a substring of T. The idea is to compute the frontier of entries incrementally and skip same values. Let L(q) be the set of (i, j) that D(i, j) = q but D(i-1, j-1) != q. If (i, j) is in L(q), then we know that all D(i+t, j+t) have the same value, where t is no more than the length of the longest common prefix between S[i..n] and P[j..m]. Moreover, we know that (i+t+1, j+t+1) is in L(q+1). Hence, we can incrementally build L(1) … L(k). The longest common prefix can be computed in O(1) time by using a suffix tree. Since the size of any L(q) is at most n, the total time complexity is O(kn).
解法:
藉由Wagner–Fischer algorithm,我們可知一遞迴關係式D(i, j) = min( D(i – 1, j) + 1, D(i, j – 1) + 1, D(i – 1, j – 1) + [T[i] != P[j]),其中D(i, j)是T[1..i]的一個後綴與P[1..j]的最小edit distance。然而此演算法的時間複雜度為O(mn)。因為我們只需要考慮edit distance小於k的部份,我們可以加速此演算法。
表格D中的對角線是不會非遞增的,而且對角線上的一段連續同值的區間表示S與T的子字串完全符合。解法是基於計算計算出對角線的前沿,同時跳過連續值的區間。令L(q)為(i, j)的集合使得D(i, j) = q但是D(i-1, j-1) != q。如果(i, j)在L(q)之中,那麼我們知道對於所有小於S[i..n] and P[j..m]的最長相同前綴的t來說,D(i+t, j+t)會有一樣的值。此外我們知道(i+t+1, j+t+1)必在L(q+1)之中。因此我們可以循序的計算L(1) … L(k)。最長相同前綴可由suffix tree在O(1)時間內計算而得。因為L(q)的大小至多為n,總時間複雜度為O(kn)。
Reference: Stackoverflow
Gad M. Landau and Uzi Vishkin, “Fast string matching with k differences,” Journal of Computer and System Sciences Volume 37, Issue 1, August 1988, Pages 63–78