The Levenshtein distance (Edit distance) Problem

The Levenshtein distance (or Edit distance) is a way of quantifying how different two strings are from one another by counting the minimum number of operations required to transform one string into the other.

The Levenshtein distance between two words is the minimum number of single-character edits (i.e., insertions, deletions, or substitutions) required to change one word into the other. Each of these operations has a unit cost.

For example, the Levenshtein distance between kitten and sitting is 3. The minimal edit script that transforms the former into the latter is:

kitten —> sitten (substitution of s for k)
sitten —> sittin (substitution of i for e)
sittin —> sitting (insertion of g at the end)

Practice this problem

The Edit distance problem has optimal substructure. That means the problem can be broken down into smaller, simple “subproblems”, which can be broken down into yet simpler subproblems, and so on, until, finally, the solution becomes trivial.

Problem: Transform string X[1…m] into Y[1…n] by performing edit operations on string X.

Subproblem: Transform substring X[1…i] into Y[1…j] by performing edit operations on substring X.

Case 1: We have reached the end of either substring.

If substring X is empty, insert all remaining characters of substring Y into X. The cost of this operation is equal to the number of characters left in substring Y.

('', 'ABC') ——> ('ABC', 'ABC') (cost = 3)

If substring Y is empty, insert all remaining characters of substring X into Y. The cost of this operation is equal to the number of characters left in substring X.

('ABC', '') ——> ('', '') (cost = 3)

Case 2: The last characters of substring X and Y are the same.

If the last characters of substring X and substring Y matches, nothing needs to be done – simply recur for the remaining substring X[0…i-1], Y[0…j-1]. As no edit operation is involved, the cost will be 0.

('ACC', 'ABC') ——> ('AC', 'AB') (cost = 0)

Case 3: The last characters of substring X and Y are different.

If the last characters of substring X and Y are different, return the minimum of the following operations:

Insert the last character of Y into X. The size of Y reduces by 1, and X remains the same. This accounts for X[1…i], Y[1…j-1] as we move in on the target substring, but not in the source substring.

('ABA', 'ABC') ——> ('ABAC', 'ABC') == ('ABA', 'AB') (using case 2)
Delete the last character of X. The size of X reduces by 1, and Y remains the same. This accounts for X[1…i-1], Y[1…j] as we move in on the source string, but not in the target string.

('ABA', 'ABC') ——> ('AB', 'ABC')
Substitute (Replace) the current character of X by the current character of Y. The size of both X and Y reduces by 1. This accounts for X[1…i-1], Y[1…j-1] as we move in both the source and target string.

('ABA', 'ABC') —> ('ABC', 'ABC') == ('AB', 'AB') (using case 2)

It is basically the same as case 2, where the last two characters match, and we move in both the source and target string, except it costs an edit operation.

So, we can define the problem recursively as:

             | max(i, j)                            when min(i, j) = 0

dist[i][j] = | dist[i – 1][j – 1]                   when X[i-1] == Y[j-1]

             | 1 + minimum { dist[i – 1][j],        when X[i-1] != Y[j-1]
             |            dist[i][j – 1],
             |            dist[i – 1][j – 1] }

Following is the C++, Java, and Python implementation of the idea:

C++

#include <iostream>

using namespace std;

// Function to find Levenshtein distance between string `X` and `Y`.

// `m` and `n` is the total number of characters in `X` and `Y`, respectively

int dist(string X, int m, string Y, int n)

{

// base case: empty strings (case 1)

if (m == 0) {

return n;

}

if (n == 0) {

return m;

}

int cost;

// if the last characters of the strings match (case 2)

if (X[m - 1] == Y[n - 1]) {

cost = 0;

}

else {

cost = 1;

}

return min (min(dist(X, m - 1, Y, n) + 1, // deletion (case 3a))

dist(X, m, Y, n - 1) + 1), // insertion (case 3b))

dist(X, m - 1, Y, n - 1) + cost); // substitution (case 2 & 3c)

}

int main()

{

string X = "kitten", Y = "sitting";

cout << "The Levenshtein distance is " << dist(X, X.length(), Y, Y.length());

return 0;

}

Download Run Code

Output:

The Levenshtein distance is 3

Java

class Main

{

// Utility function to find the minimum of three numbers

public static int minimum(int a, int b, int c) {

return Integer.min(a, Integer.min(b, c));

}

// Function to find Levenshtein distance between string `X` and `Y`.

// `m` and `n` is the total number of characters in `X` and `Y`, respectively

public static int dist(String X, int m, String Y, int n)

{

// base case: empty strings (case 1)

if (m == 0) {

return n;

}

if (n == 0) {

return m;

}

// if the last characters of the strings match (case 2)

int cost = (X.charAt(m - 1) == Y.charAt(n - 1)) ? 0: 1;

return minimum(dist(X, m - 1, Y, n) + 1, // deletion (case 3a))

dist(X, m, Y, n - 1) + 1, // insertion (case 3b))

dist(X, m - 1, Y, n - 1) + cost); // substitution (case 2 & 3c)

}

public static void main(String[] args)

{

String X = "kitten", Y = "sitting";

System.out.println("The Levenshtein distance is " +

dist(X, X.length(), Y, Y.length()));

}

Download Run Code

Output:

The Levenshtein distance is 3

Python

# Function to find Levenshtein distance between string `X` and `Y`.

# `m` and `n` is the total number of characters in `X` and `Y`, respectively

def dist(X, m, Y, n):

# base case: empty strings (case 1)

if m == 0:

return n

if n == 0:

return m

# if the last characters of the strings match (case 2)

cost = 0 if (X[m - 1] == Y[n - 1]) else 1

return min(dist(X, m - 1, Y, n) + 1, # deletion (case 3a))

dist(X, m, Y, n - 1) + 1, # insertion (case 3b))

dist(X, m - 1, Y, n - 1) + cost) # substitution (case 2 + 3c)

if __name__ == '__main__':

X = 'kitten'

Y = 'sitting'

print('The Levenshtein distance is', dist(X, len(X), Y, len(Y)))

Download Run Code

Output:

The Levenshtein distance is 3

The time complexity of the above solution is exponential and occupies space in the call stack.

As seen above, the problem has optimal substructure. The above solution also exhibits overlapping subproblems. If we draw the solution’s recursion tree, we can see that the same subproblems are repeatedly computed. We know that problems with optimal substructure and overlapping subproblems can be solved using dynamic programming, in which subproblem solutions are memoized rather than computed repeatedly.

The memoized version follows the top-down approach since we first break the problem into subproblems and then calculate and store values. We can also solve this problem in a bottom-up manner. In the bottom-up approach, we solve smaller subproblems first, then solve larger subproblems from them.

The invariant maintained throughout the algorithm is that we can transform the initial segment X[1…i] into Y[1…j] using a minimum of T[i, j] operations. In the end, the bottom-right array element contains the answer.

For example, let X be ‘kitten’, and Y be ‘sitting’. The Levenshtein distance between X and Y is 3. The i'th row and j'th column in the table below show the Levenshtein distance of substring X[0…i-1] and Y[0…j-1].

Edit Distance Problem

Following is the C++, Java, and Python implementation of the idea:

C++

#include <iostream>

#include <cstring>

using namespace std;

// Function to find Levenshtein distance between string `X` and `Y`.

// `m` and `n` is the total number of characters in `X` and `Y`, respectively

int dist(string X, string Y)

{

int m = X.length();

int n = Y.length();

// For all pairs of `i` and `j`, `T[i, j]` will hold the Levenshtein distance

// between the first `i` characters of `X` and the first `j` characters of `Y`.

// Note that `T` holds `(m+1)×(n+1)` values.

int T[m + 1][n + 1];

// initialize `T` by all 0's

memset(T, 0, sizeof T);

// we can transform source prefixes into an empty string by

// dropping all characters

for (int i = 1; i <= m; i++) {

T[i][0] = i; // (case 1)

}

// we can reach target prefixes from empty source prefix

// by inserting every character

for (int j = 1; j <= n; j++) {

T[0][j] = j; // (case 1)

}

int substitutionCost;

// fill the lookup table in a bottom-up manner

for (int i = 1; i <= m; i++)

{

for (int j = 1; j <= n; j++)

{

if (X[i - 1] == Y[j - 1]) { // (case 2)

substitutionCost = 0; // (case 2)

}

else {

substitutionCost = 1; // (case 3c)

}

T[i][j] = min(min(T[i - 1][j] + 1, // deletion (case 3b)

T[i][j - 1] + 1), // insertion (case 3a)

T[i - 1][j - 1] + substitutionCost); // replace (case 2 & 3c)

}

return T[m][n];

}

int main()

{

string X = "kitten", Y = "sitting";

cout << "The Levenshtein distance is " << dist(X, Y);

return 0;

}

Download Run Code

Output:

The Levenshtein distance is 3

Java

class Main

{

// Utility function to find the minimum of three numbers

public static int minimum(int a, int b, int c) {

return Integer.min(a, Integer.min(b, c));

}

// Function to find Levenshtein distance between string `X` and `Y`.

public static int dist(String X, String Y)

{

// `m` and `n` is the total number of characters in `X` and `Y`, respectively

int m = X.length();

int n = Y.length();

// For all pairs of `i` and `j`, `T[i, j]` will hold the Levenshtein distance

// between the first `i` characters of `X` and the first `j` characters of `Y`.

// Note that `T` holds `(m+1)×(n+1)` values.

int[][] T = new int[m + 1][n + 1];

// we can transform source prefixes into an empty string by

// dropping all characters

for (int i = 1; i <= m; i++) {

T[i][0] = i; // (case 1)

}

// we can reach target prefixes from empty source prefix

// by inserting every character

for (int j = 1; j <= n; j++) {

T[0][j] = j; // (case 1)

}

int cost;

// fill the lookup table in a bottom-up manner

for (int i = 1; i <= m; i++)

{

for (int j = 1; j <= n; j++)

{

if (X.charAt(i-1) == Y.charAt(j-1)) { // (case 2)

cost = 0; // (case 2)

}

else {

cost = 1; // (case 3c)

}

T[i][j] = minimum(T[i - 1][j] + 1, // deletion (case 3b)

T[i][j - 1] + 1, // insertion (case 3a)

T[i - 1][j - 1] + cost); // replace (case 2 + 3c)

}

return T[m][n];

}

public static void main(String[] args)

{

String X = "kitten", Y = "sitting";

System.out.print("The Levenshtein distance is " + dist(X, Y));

}

Download Run Code

Output:

The Levenshtein distance is 3

Python

# Function to find Levenshtein distance between string `X` and `Y`.

def dist(X, Y):

# `m` and `n` is the total number of characters in `X` and `Y`, respectively

(m, n) = (len(X), len(Y))

# For all pairs of `i` and `j`, `T[i, j]` will hold the Levenshtein distance

# between the first `i` characters of `X` and the first `j` characters of `Y`.

# Note that `T` holds `(m+1)×(n+1)` values.

T = [[0 for x in range(n + 1)] for y in range(m + 1)]

# we can transform source prefixes into an empty string by

# dropping all characters

for i in range(1, m + 1):

T[i][0] = i # (case 1)

# we can reach target prefixes from empty source prefix

# by inserting every character

for j in range(1, n + 1):

T[0][j] = j # (case 1)

# fill the lookup table in a bottom-up manner

for i in range(1, m + 1):

for j in range(1, n + 1):

if X[i - 1] == Y[j - 1]: # (case 2)

cost = 0 # (case 2)

else:

cost = 1 # (case 3c)

T[i][j] = min(T[i - 1][j] + 1, # deletion (case 3b)

T[i][j - 1] + 1, # insertion (case 3a)

T[i - 1][j - 1] + cost) # replace (case 2 + 3c)

return T[m][n]

if __name__ == '__main__':

X = 'kitten'

Y = 'sitting'

print('The Levenshtein distance is', dist(X, Y))

Download Run Code

Output:

The Levenshtein distance is 3

The time complexity of the above solution is O(m.n) and requires O(m.n) extra space, where m is the length of the first string and n is the length of the second string. It turns out that only two rows of the table are needed for the construction if one does not want to reconstruct the edited input strings (the previous row and the current row being calculated).

Exercise: Modify iterative version to use only two matrix rows.

References: Levenshtein Distance – Wikipedia

The Levenshtein distance (Edit distance) Problem

C++

Java

Python

C++

Java

Python

Share this: