String Matching with k-difference

Given a string T with length n and a string P with length m and an integer k, design an O(nk)-time algorithm to find a substring of T that matches with P with edit distance at most k.


By Wagner–Fischer algorithm, we have a recurrence relation D(i, j) = min( D(i – 1, j) + 1, D(i, – 1) + 1, D(– 1, – 1) + [T[i] != P[j]), where D(i, j) is the minimum edit distance between a suffix of T[1..i] and P[1..j]. However, the time complexity is O(mn). Since we don’t care the matching with distance more than k, we have a way to improve the performance.

Note that the diagonal of table D is never decreasing and an interval of diagonal with the same value means there is an exact between a substring of S and a substring of T. The idea is to compute the frontier of entries incrementally and skip same values. Let L(q) be the set of (i, j) that D(i, j) = q but D(i-1, j-1) != q. If (i, j) is in L(q), then we know that all D(i+t, j+t) have the same value, where t is no more than the length of the longest common prefix between S[i..n] and P[j..m]. Moreover, we know that (i+t+1, j+t+1) is in L(q+1). Hence, we can incrementally build L(1) … L(k). The longest common prefix can be computed in O(1) time by using a suffix tree. Since the size of any L(q) is at most n, the total time complexity is O(kn).


藉由Wagner–Fischer algorithm,我們可知一遞迴關係式D(ij) = min( D(i – 1, j) + 1, D(i– 1) + 1, D(– 1, – 1) + [T[i] != P[j]),其中D(ij)是T[1..i]的一個後綴與P[1..j]的最小edit distance。然而此演算法的時間複雜度為O(mn)。因為我們只需要考慮edit distance小於k的部份,我們可以加速此演算法。

表格D中的對角線是不會非遞增的,而且對角線上的一段連續同值的區間表示ST的子字串完全符合。解法是基於計算計算出對角線的前沿,同時跳過連續值的區間。令L(q)為(ij)的集合使得D(ij) = q但是D(i-1, j-1) != q。如果(ij)在L(q)之中,那麼我們知道對於所有小於S[i..n] and P[j..m]的最長相同前綴的t來說,D(i+tj+t)會有一樣的值。此外我們知道(i+t+1, j+t+1)必在L(q+1)之中。因此我們可以循序的計算L(1) … L(k)。最長相同前綴可由suffix tree在O(1)時間內計算而得。因為L(q)的大小至多為n,總時間複雜度為O(kn)。

Reference: Stackoverflow

Gad M. Landau and Uzi Vishkin, “Fast string matching with k differences,” Journal of Computer and System Sciences Volume 37, Issue 1, August 1988, Pages 63–78

Square Factor

Given a string S with length n, design a O(n lg n) time algorithm to find a substring P, such that PP is a substring of S.


The solution is based on divide and conquer. Let S = uv, where the length of u and v differs at most one. We can recursively detect square factor in u and v, if u and v are square-free, then we can test whether there is a square factor centered at v and extends to u or vice versa. Here, we solve the case that the center is at v, the other case is symmetric. Let P(k) be the size of the longest prefix of v that is also a prefix of v[k..|v|]. Let S(k) be the size of the longest suffix of v[1..k] that is also a suffix of u. There exists a square factor centered at v if, and only if, P(k) + S(k) ≥ k for some k. The function of P and S can be computed in linear time. Hence, we have a recurrence relation of the time complexity: T(n) = 2T(n/2) + O(n). By master theorem, the time complexity is O(n lg n).

We can use the idea of equivalent classes to find any higher power of P factor in O(n lg n) time.

For a fixed size of alphabet, this problem can be solved in linear time by using f-factorization. This problem can be solved in linear time by using suffix tree without the assumption of the size of alphabet.


解法是基於d&c。令S = uv,u和v的長度至多差一。我們可以遞迴的判斷u和v,是否含有square factor。如果沒有,那麼我們可以判斷是不是存在有一個square factor其中心位於v然後延伸到u或是中心位於u延伸到v。我們在這邊只考慮中心位於v的情況,因為另一個情況是對稱的。令P(k)為v的最長前綴且同時為v[k..|v|]之前綴的長度。令S(k)為v[1..k]的最長後綴且同時為u之後綴的長度。如果有square factor且中心在v若且唯若存在k使得P(k) + S(k) ≥ k。函數PS可在線性時間內計算得出。因此時間複雜度滿足此遞迴關係式T(n) = 2T(n/2) + O(n)。利用master theorem可知時間複雜度為O(n lg n)。

Reference: Stackoverflow

Max Crochemore, “An optimal algorithm for computing the repetitions in a word,” Information Processing Letters Volume 12, Issue 5, 13 October 1981, Pages 244–250