Given a string *T* with length *n* and a string *P* with length *m* and an integer *k*, design an O(*nk*)-time algorithm to find a substring of *T* that matches with *P* with edit distance at most *k*.

**Solution**:

By Wagner–Fischer algorithm, we have a recurrence relation *D*(*i*, *j*) = min( *D*(*i* – 1, *j*) + 1, *D*(*i*, *j *– 1) + 1, *D*(*i *– 1, *j *– 1) + [*T*[*i*] != *P*[*j*]), where *D*(*i*, *j*) is the minimum edit distance between a suffix of *T*[1..*i*] and *P*[1..*j*]. However, the time complexity is O(*mn*). Since we don’t care the matching with distance more than *k*, we have a way to improve the performance.

Note that the diagonal of table *D* is never decreasing and an interval of diagonal with the same value means there is an exact between a substring of *S* and a substring of *T*. The idea is to compute the frontier of entries incrementally and skip same values. Let *L*(*q*) be the set of (*i*, *j*) that *D*(*i*, *j*) = *q* but *D*(*i*-1, *j*-1) != *q*. If (*i*, *j*) is in *L*(*q*), then we know that all *D*(*i*+*t*, *j*+*t*) have the same value, where *t* is no more than the length of the longest common prefix between *S*[*i*..*n*] and *P*[*j*..*m*]. Moreover, we know that (*i*+*t*+1, *j*+*t*+1) is in *L*(*q*+1). Hence, we can incrementally build *L*(1) … *L*(*k*). The longest common prefix can be computed in O(1) time by using a suffix tree. Since the size of any *L*(*q*) is at most *n*, the total time complexity is O(*kn*).

**解法**：

藉由Wagner–Fischer algorithm，我們可知一遞迴關係式*D*(*i*, *j*) = min( *D*(*i* – 1, *j*) + 1, *D*(*i*, *j *– 1) + 1, *D*(*i *– 1, *j *– 1) + [*T*[*i*] != *P*[*j*])，其中*D*(*i*, *j*)是*T*[1..*i*]的一個後綴與*P*[1..*j*]的最小edit distance。然而此演算法的時間複雜度為O(*mn*)。因為我們只需要考慮edit distance小於*k*的部份，我們可以加速此演算法。

表格*D*中的對角線是不會非遞增的，而且對角線上的一段連續同值的區間表示*S*與*T*的子字串完全符合。解法是基於計算計算出對角線的前沿，同時跳過連續值的區間。令*L*(*q*)為(*i*, *j*)的集合使得*D*(*i*, *j*) = *q*但是*D*(*i*-1, *j*-1) != *q*。如果(*i*, *j*)在*L*(*q*)之中，那麼我們知道對於所有小於*S*[*i*..*n*] and *P*[*j*..*m*]的最長相同前綴的*t*來說，*D*(*i*+*t*, *j*+*t*)會有一樣的值。此外我們知道(*i*+*t*+1, *j*+*t*+1)必在*L*(*q*+1)之中。因此我們可以循序的計算*L*(1) … *L*(*k*)。最長相同前綴可由suffix tree在O(1)時間內計算而得。因為*L*(*q*)的大小至多為*n*，總時間複雜度為O(*kn*)。

Reference: Stackoverflow

Gad M. Landau and Uzi Vishkin, “Fast string matching with k differences,” Journal of Computer and System Sciences Volume 37, Issue 1, August 1988, Pages 63–78