Edit distance is a way of quantifying how dissimilar two strings are to one another by counting the minimum number of operations required to transform one string into the other.

The **Levenshtein distance** between two words is the minimum number of single-character edits (i.e. insertions, deletions or substitutions) required to change one word into the other. Each of these operations has unit cost.

For example, the Levenshtein distance between kitten and sitting is 3. A minimal edit script that transforms the former into the latter is:

* **k**itten -> **s**itten (substitution of s for k)
* sitt**e**n -> sitt**i**n (substitution of i for e)
* sittin -> sittin**g** (insertion of g at the end)

The edit distance problem has an **optimal substructure**. That means the problem can be broken down into smaller, simpler subproblems, which can be broken down into yet simpler subproblems, and so on, until the solution finally becomes trivial.

**Problem:** Convert string X[1..m] to Y[1..n] by performing edit operations on string X.

**Sub-problem:** Convert substring X[1..i] to Y[1..j] by performing edit operations on substring X.

**CASE 1:** We have reached the end of either substring

If substring X is empty, we insert all remaining characters of substring Y into X. The cost of this operation is equal to the number of characters left in substring Y.

('', 'ABC') -> ('ABC', 'ABC') (cost = 3)

If substring Y is empty, we delete all remaining characters of X to convert it into substring Y. The cost of this operation is equal to the number of characters left in substring X.

('ABC', '') -> ('', '') (cost = 3)

**CASE 2:** Last characters of substring X and substring Y are the same

If the last characters of substring X and substring Y match, nothing needs to be done. We simply recurse for the remaining substrings X[1..i-1] and Y[1..j-1]. As no edit operation is involved, the cost is 0.

('AC**C**', 'AB**C**') -> ('AC', 'AB') (cost = 0)

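**CASE 3:** Last characters of substring X and substring Y are different

If the last characters of substring X and substring Y are different, we return the minimum over the following three operations, each of which costs one edit:

3a. Insert the last character of Y into X. The size of Y reduces by 1 and the size of X remains the same. This accounts for X[1..i], Y[1..j-1] as we move in the target string, but not in the source string.

('AB**A**', 'AB**C**') -> ('ABA**C**', 'AB**C**') == ('ABA', 'AB') (using case 2)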

3b. Delete the last character of X. The size of X reduces by 1 and the size of Y remains the same. This accounts for X[1..i-1], Y[1..j] as we move in the source string, but not in the target string.

('AB**A**', 'AB**C**') -> ('AB', 'ABC')

3c. Substitute (replace) the current character of X with the current character of Y. The size of both X and Y reduces by 1. This accounts for X[1..i-1], Y[1..j-1] as we move in both the source string and the target string.

('AB**A**', 'AB**C**') -> ('ABC', 'ABC') == ('AB', 'AB') (using case 2)

This is essentially the same as case 2, where the last characters match and we move in both the source string and the target string, except that it costs one edit operation.

So we can define the problem recursively as –

$$
\text{dist}[i][j] =
\begin{cases}
\max(i, j) & \text{when } \min(i, j) = 0 \\
\text{dist}[i-1][j-1] & \text{when } X[i-1] = Y[j-1] \\
1 + \min \{\, \text{dist}[i-1][j],\ \text{dist}[i][j-1],\ \text{dist}[i-1][j-1] \,\} & \text{when } X[i-1] \ne Y[j-1]
\end{cases}
$$
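As a small worked instance of the recurrence, take X = AB and Y = AC (strings chosen here purely for illustration; they do not appear in the text above). The base cases give dist[i][0] = i and dist[0][j] = j, and the remaining cells follow as:

$$
\begin{aligned}
\text{dist}[1][1] &= \text{dist}[0][0] = 0 && (X[0] = Y[0] = \text{A}) \\
\text{dist}[1][2] &= 1 + \min\{\text{dist}[0][2],\ \text{dist}[1][1],\ \text{dist}[0][1]\} = 1 + \min\{2, 0, 1\} = 1 \\
\text{dist}[2][1] &= 1 + \min\{\text{dist}[1][1],\ \text{dist}[2][0],\ \text{dist}[1][0]\} = 1 + \min\{0, 2, 1\} = 1 \\
\text{dist}[2][2] &= 1 + \min\{\text{dist}[1][2],\ \text{dist}[2][1],\ \text{dist}[1][1]\} = 1 + \min\{1, 1, 0\} = 1
\end{aligned}
$$

The result dist[2][2] = 1 corresponds to the single substitution of C for B.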

**C++ implementation –**

```cpp
#include <algorithm>
#include <iostream>
#include <string>
using namespace std;

// Function to find the Levenshtein distance between strings X and Y.
// m and n are the number of characters of X and Y under consideration.
int dist(const string& X, int m, const string& Y, int n)
{
    // base case: empty strings (case 1)
    if (m == 0)
        return n;

    if (n == 0)
        return m;

    // if the last characters of the strings match (case 2), no edit is needed
    int cost = (X[m - 1] == Y[n - 1]) ? 0 : 1;

    return min(min(dist(X, m - 1, Y, n) + 1,     // deletion (case 3b)
                   dist(X, m, Y, n - 1) + 1),    // insertion (case 3a)
               dist(X, m - 1, Y, n - 1) + cost); // substitution (case 2 & 3c)
}

// main function
int main()
{
    string X = "kitten", Y = "sitting";

    cout << "The Levenshtein Distance is "
         << dist(X, X.length(), Y, Y.length());

    return 0;
}
```

**Output:**

The Levenshtein Distance is 3

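The time complexity of the above solution is exponential. Each call makes three recursive calls and the recursion can go as deep as m + n, which gives an upper bound of O(3^{m+n}), where m and n are the number of characters in the two strings. The auxiliary space is O(m + n) for the recursion call stack.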
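To get a feel for this growth, the recursion can be instrumented with a call counter. The following is only a sketch: `distCounted` and `countCalls` are hypothetical helper names introduced here for illustration, and the logic is otherwise the same recursion as above.

```cpp
#include <algorithm>
#include <string>

// Hypothetical helper (not part of the original solution): the same
// recursion as dist(), except that every invocation increments `calls`.
int distCounted(const std::string& X, int m, const std::string& Y, int n,
                long long& calls)
{
    ++calls;

    // base case: empty strings (case 1)
    if (m == 0) return n;
    if (n == 0) return m;

    int cost = (X[m - 1] == Y[n - 1]) ? 0 : 1;

    return std::min({distCounted(X, m - 1, Y, n, calls) + 1,        // deletion
                     distCounted(X, m, Y, n - 1, calls) + 1,        // insertion
                     distCounted(X, m - 1, Y, n - 1, calls) + cost  // substitution
                    });
}

// Hypothetical convenience wrapper: returns the total number of calls made
// while computing the distance between X and Y.
long long countCalls(const std::string& X, const std::string& Y)
{
    long long calls = 0;
    distCounted(X, X.length(), Y, Y.length(), calls);
    return calls;
}
```

For two strings of length 2, `countCalls` reports 19 calls, although there are only 9 distinct (m, n) pairs to solve. The count does not depend on the characters themselves, because all three branches are always explored.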

As seen above, the problem has an **optimal substructure**. The above solution also exhibits **overlapping subproblems**: if we draw the recursion tree of the solution, we can see that the same subproblems are computed again and again. Problems that have optimal substructure and overlapping subproblems can be solved using dynamic programming, in which subproblem solutions are memoized rather than recomputed. The memoized version follows the top-down approach, since we first break the problem into subproblems and then calculate and store their values. We can also solve this problem in a bottom-up manner. In the bottom-up approach, we solve the smaller subproblems first and then build the solutions to larger subproblems from them.
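A minimal sketch of the memoized (top-down) approach described above might look as follows. The names `distMemo` and `levenshteinMemo`, and the use of -1 as a "not yet computed" marker, are choices made for this illustration only.

```cpp
#include <algorithm>
#include <string>
#include <vector>

// Top-down helper (hypothetical name): memo[m][n] caches the distance between
// the first m characters of X and the first n characters of Y.
// A value of -1 means the subproblem has not been solved yet.
int distMemo(const std::string& X, int m, const std::string& Y, int n,
             std::vector<std::vector<int>>& memo)
{
    // base case: empty strings (case 1)
    if (m == 0) return n;
    if (n == 0) return m;

    // reuse a previously computed answer instead of recursing again
    if (memo[m][n] != -1) return memo[m][n];

    int cost = (X[m - 1] == Y[n - 1]) ? 0 : 1;

    memo[m][n] = std::min({distMemo(X, m - 1, Y, n, memo) + 1,        // deletion (case 3b)
                           distMemo(X, m, Y, n - 1, memo) + 1,        // insertion (case 3a)
                           distMemo(X, m - 1, Y, n - 1, memo) + cost  // case 2 & 3c
                          });
    return memo[m][n];
}

// Hypothetical wrapper that allocates the (m+1) x (n+1) cache.
int levenshteinMemo(const std::string& X, const std::string& Y)
{
    int m = X.length(), n = Y.length();
    std::vector<std::vector<int>> memo(m + 1, std::vector<int>(n + 1, -1));
    return distMemo(X, m, Y, n, memo);
}
```

Each (m, n) pair is now solved at most once, so the running time drops to O(mn) at the cost of O(mn) extra space for the cache.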

The invariant maintained throughout the algorithm is that we can transform the initial segment X[1..i] into Y[1..j] using a minimum of T[i,j] operations. At the end, the bottom-right element of the array contains the answer.

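For example, let X be kitten and Y be sitting. The Levenshtein distance between X and Y is 3. In the table below, the cell in row i and column j holds the Levenshtein distance between the first i characters of X and the first j characters of Y (row 0 and column 0 correspond to the empty prefix).

|   |   | s | i | t | t | i | n | g |
|---|---|---|---|---|---|---|---|---|
|   | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
| **k** | 1 | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
| **i** | 2 | 2 | 1 | 2 | 3 | 4 | 5 | 6 |
| **t** | 3 | 3 | 2 | 1 | 2 | 3 | 4 | 5 |
| **t** | 4 | 4 | 3 | 2 | 1 | 2 | 3 | 4 |
| **e** | 5 | 5 | 4 | 3 | 2 | 2 | 3 | 4 |
| **n** | 6 | 6 | 5 | 4 | 3 | 3 | 2 | 3 |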

**C++ implementation –**

```cpp
#include <algorithm>
#include <iostream>
#include <string>
#include <vector>
using namespace std;

// Function to find the Levenshtein distance between strings X and Y.
// m and n are the number of characters in X and Y respectively.
int dist(const string& X, int m, const string& Y, int n)
{
    // for all i and j, T[i][j] will hold the Levenshtein distance between
    // the first i characters of X and the first j characters of Y;
    // note that T has (m+1)*(n+1) values, all initialized to zero
    vector<vector<int>> T(m + 1, vector<int>(n + 1, 0));

    // source prefixes can be transformed into the empty string by
    // dropping all characters
    for (int i = 1; i <= m; i++)
        T[i][0] = i;                // (case 1)

    // target prefixes can be reached from the empty source prefix
    // by inserting every character
    for (int j = 1; j <= n; j++)
        T[0][j] = j;                // (case 1)

    // fill the lookup table in bottom-up manner
    for (int i = 1; i <= m; i++)
    {
        for (int j = 1; j <= n; j++)
        {
            // 0 if the characters match (case 2), otherwise 1 (case 3c)
            int substitutionCost = (X[i - 1] == Y[j - 1]) ? 0 : 1;

            T[i][j] = min(min(T[i - 1][j] + 1,      // deletion (case 3b)
                              T[i][j - 1] + 1),     // insertion (case 3a)
                          T[i - 1][j - 1] + substitutionCost); // replace (case 2 & 3c)
        }
    }

    return T[m][n];
}

// main function
int main()
{
    string X = "kitten", Y = "sitting";

    cout << "The Levenshtein Distance is "
         << dist(X, X.length(), Y, Y.length());

    return 0;
}
```

**Output:**

The Levenshtein Distance is 3

The time complexity of the above solution is O(mn), and the auxiliary space used by the program is also O(mn), where m and n are the number of characters in the two strings. It turns out that only two rows of the table (the previous row and the current row being calculated) are needed if one does not want to reconstruct the edited input strings.

**Exercise:** Modify the iterative version so that it uses only two matrix rows.
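One possible approach to the exercise is sketched below; try it yourself first. The function name `levenshteinTwoRows` and the row names `prev` and `curr` are assumptions made for this sketch, and other layouts (for example a single row plus one saved diagonal value) would work as well.

```cpp
#include <algorithm>
#include <string>
#include <utility>
#include <vector>

// Hypothetical two-row variant of the bottom-up solution.
// prev holds row i-1 of the table and curr holds row i.
int levenshteinTwoRows(const std::string& X, const std::string& Y)
{
    int m = X.length(), n = Y.length();
    std::vector<int> prev(n + 1), curr(n + 1);

    // row 0: the empty source prefix needs j insertions (case 1)
    for (int j = 0; j <= n; j++)
        prev[j] = j;

    for (int i = 1; i <= m; i++)
    {
        // column 0: deleting all i characters of the source prefix (case 1)
        curr[0] = i;

        for (int j = 1; j <= n; j++)
        {
            int substitutionCost = (X[i - 1] == Y[j - 1]) ? 0 : 1;

            curr[j] = std::min({prev[j] + 1,                    // deletion (case 3b)
                                curr[j - 1] + 1,                // insertion (case 3a)
                                prev[j - 1] + substitutionCost  // case 2 & 3c
                               });
        }

        // the current row becomes the previous row for the next iteration
        std::swap(prev, curr);
    }

    // after the final swap, prev holds the last computed row
    return prev[n];
}
```

This keeps the O(mn) running time while reducing the auxiliary space to O(n). The trade-off is that the full table is no longer available for tracing back the actual sequence of edits.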

**References:** https://en.wikipedia.org/wiki/Levenshtein_distance
