Saturday, August 29, 2015

Dynamic programming - The Longest Common Subsequence Problem

The longest common subsequence problem is a simple problem that illustrates dynamic programming.

Longest Common Subsequence Problem

The problem is given two strings A and B of length m and n respectively, determine the length of the longest subsequence that is common in A and B. For example, A = xyxzx and B = zxyzy, then the longest common subsequence of A and B is xyz and the length is 3.

Brute-force Method

One way to solve the problem is to enumerate all the \(2^m\) subsequences of A, and for each subsequence check if it is also a subsequence in B. The time complexity of this method is \(\Theta(n2^m)\).

Dynamic Programming Method

Another way to solve this problem is using dynamic programming. It is easy to observe the formula below. \(a_i\) is the character at index i in A. \(b_j\) is the character at index j in B. L[i, j] is the length of the longest common subsequence in the substring of A[0..i] and the substring of B[0..j]. 
  • If \(a_i = b_j\), L[i, j] = L[i - 1, j - 1] + 1.
  • If \(a_i \neq b_j\), L[i, j] = max{L[i, j - 1], L[i - 1, j]}.
For example, A = xyxzx, B = zxyzy. Consider substring of A[0..2] = xyx and substring of B[0..2] = zxy. The L[2, 2] = 2 and the longest common subsequence is xy. Then consider A[0..3] = xyxz and B[0..3]=zxyz. Because \(a_3\) = z == \(b_3\) = z, it is clear that the L[3, 3] = 3 = L[2, 2] + 1 and the longest common subsequence is xyz. Then let's consider  A[0..3] = xyxz and B[0..4] = zxyzy. Because \(a_3\) = z \(\neq\) \(b_4\) = y, the longest common subsequence should still be xyz and the L[3, 4] = L[3, 3] = 3.

We can use an (m + 1) \(\times\) (n + 1) table to store the values of L[i, j]. And we fill the table row by row using the above observation. Then L[m, n] has the length of the longest common subsequence in A and B. The pseudo code is described in LCS.

Algorithm 6 LCS
Input: Two strings A and B of length m and n respectively.
Output: The length of the longest common subsequence of A and B.
  1. for i = 0 to m
  2.     for j = 0 to n
  3.         if (i == 0 and j == 0) then L[i, j] = 0
  4.         else if (\(a_i\) == \(b_j\)) then L[i, j] = L[i - 1, j - 1] + 1
  5.         else L[i, j] = max{L[i, j - 1], L[i - 1, j]}
  6.         end if
  7.     end for
  8. end for
  9. return L[m, n]
A JAVA implementation of this algorithm can be found on GitHub LCS.

An Example

Running LCS algorithm with input A = "abcdeabcd" and B = "acebde" will generate the table below. The result is 5 and the longest common subsequence is "acebd".

       a  c  e  b  d  e
  [0, 0, 0, 0, 0, 0, 0]
a[0, 1, 1, 1, 1, 1, 1]
b[0, 1, 1, 1, 2, 2, 2]
c[0, 1, 2, 2, 2, 2, 2]
d[0, 1, 2, 2, 2, 3, 3]
e[0, 1, 2, 3, 3, 3, 4]
a[0, 1, 2, 3, 3, 3, 4]
b[0, 1, 2, 3, 4, 4, 4]
c[0, 1, 2, 3, 4, 4, 4]
d[0, 1, 2, 3, 4, 5, 5]


Time Complexity

Filling each entry in the table requires \(\Theta(1)\) time. The size of the table is \(\Theta(mn)\), so the time complexity of LCS algorithm is \(\Theta(mn)\), which is the total time of filling the whole table.

No comments:

Post a Comment