# String Matching with k-difference

Given a string T with length n and a string P with length m and an integer k, design an O(nk)-time algorithm to find a substring of T that matches with P with edit distance at most k.

Solution:

By Wagner–Fischer algorithm, we have a recurrence relation D(i, j) = min( D(i – 1, j) + 1, D(i, – 1) + 1, D(– 1, – 1) + [T[i] != P[j]), where D(i, j) is the minimum edit distance between a suffix of T[1..i] and P[1..j]. However, the time complexity is O(mn). Since we don’t care the matching with distance more than k, we have a way to improve the performance.

Note that the diagonal of table D is never decreasing and an interval of diagonal with the same value means there is an exact between a substring of S and a substring of T. The idea is to compute the frontier of entries incrementally and skip same values. Let L(q) be the set of (i, j) that D(i, j) = q but D(i-1, j-1) != q. If (i, j) is in L(q), then we know that all D(i+t, j+t) have the same value, where t is no more than the length of the longest common prefix between S[i..n] and P[j..m]. Moreover, we know that (i+t+1, j+t+1) is in L(q+1). Hence, we can incrementally build L(1) … L(k). The longest common prefix can be computed in O(1) time by using a suffix tree. Since the size of any L(q) is at most n, the total time complexity is O(kn).

Reference: Stackoverflow

Gad M. Landau and Uzi Vishkin, “Fast string matching with k differences,” Journal of Computer and System Sciences Volume 37, Issue 1, August 1988, Pages 63–78

# Square Factor

Given a string S with length n, design a O(n lg n) time algorithm to find a substring P, such that PP is a substring of S.

Solution:

The solution is based on divide and conquer. Let S = uv, where the length of u and v differs at most one. We can recursively detect square factor in u and v, if u and v are square-free, then we can test whether there is a square factor centered at v and extends to u or vice versa. Here, we solve the case that the center is at v, the other case is symmetric. Let P(k) be the size of the longest prefix of v that is also a prefix of v[k..|v|]. Let S(k) be the size of the longest suffix of v[1..k] that is also a suffix of u. There exists a square factor centered at v if, and only if, P(k) + S(k) ≥ k for some k. The function of P and S can be computed in linear time. Hence, we have a recurrence relation of the time complexity: T(n) = 2T(n/2) + O(n). By master theorem, the time complexity is O(n lg n).

We can use the idea of equivalent classes to find any higher power of P factor in O(n lg n) time.

For a fixed size of alphabet, this problem can be solved in linear time by using f-factorization. This problem can be solved in linear time by using suffix tree without the assumption of the size of alphabet.

Reference: Stackoverflow

Max Crochemore, “An optimal algorithm for computing the repetitions in a word,” Information Processing Letters Volume 12, Issue 5, 13 October 1981, Pages 244–250