Finding Common Ground - The Longest Common Subsequence (LCS) Problem Explained
Hey everyone, and welcome back to the blog! In the world of computer science, especially when dealing with strings or sequences, we often encounter problems that involve finding commonalities. One such fundamental and widely studied problem is finding the Longest Common Subsequence (LCS) between two sequences.
It's not just an academic puzzle; understanding LCS has applications in bioinformatics (comparing DNA sequences), data comparison (like the diff utility), and more. The beauty of LCS lies in its elegant solution using Dynamic Programming. Today, let's unravel this problem, explore its solution through both iterative (tabulation) and recursive (memoization) DP approaches, and understand why it works.
The Problem: What's a Longest Common Subsequence?
Given two strings (or sequences), s1 and s2, the task is to find the length of the longest subsequence that is common to both of them.
Let's clarify what a subsequence is: A subsequence is a sequence that can be derived from another sequence by deleting zero or more elements without changing the order of the remaining elements.
For example:
- If
s1 = "AGGTAB" - And
s2 = "GXTXAYB"
A common subsequence could be "GTAB" (length 4). Notice that the characters G, T, A, B appear in that order in both s1 and s2, although not necessarily contiguously. "GTAB" is, in fact, the longest common subsequence for these two strings.
The problem we're solving is to find the length of this longest common subsequence.
Solving with Dynamic Programming: Two Flavors
The LCS problem exhibits optimal substructure and overlapping subproblems, the two hallmarks that make it suitable for Dynamic Programming (DP). This means we can break the problem down into smaller, related subproblems, solve them, and use their solutions to build up the solution to the original problem.
1. Tabulation (Iterative Bottom-Up Approach)
In the tabulation method, we build a 2D table (let's call it dp) from the bottom up, where dp[i][j] stores the length of the LCS of the first i characters of s1 (i.e., s1[0...i-1]) and the first j characters of s2 (i.e., s2[0...j-1]).
Pseudocode: Tabulation
// s1, s2: The input strings
function lcs_tabulation(s1, s2):
n = length(s1) // Length of the first string
m = length(s2) // Length of the second string
// Create a DP table of size (n+1) x (m+1)
// dp[i][j] will store the length of LCS of s1[0..i-1] and s2[0..j-1]
dp = array of size (n+1) x (m+1)
// Fill dp table in bottom-up manner
for i = 0 to n: // Iterate through lengths of prefix of s1
for j = 0 to m: // Iterate through lengths of prefix of s2
if i == 0 or j == 0:
// If either string (or prefix) is empty, LCS is 0
dp[i][j] = 0
else if s1[i-1] == s2[j-1]:
// If the last characters match, they are part of the LCS.
// Add 1 to the LCS of the strings without these last characters.
dp[i][j] = 1 + dp[i-1][j-1]
else:
// If the last characters don't match, the LCS is the maximum of:
// 1. LCS of s1[0..i-2] and s2[0..j-1] (excluding last char of s1)
// 2. LCS of s1[0..i-1] and s2[0..j-2] (excluding last char of s2)
dp[i][j] = max(dp[i-1][j], dp[i][j-1])
// dp[n][m] contains the length of LCS for s1[0..n-1] and s2[0..m-1]
return dp[n][m]Explanation of Tabulation:
- We create a 2D table
dpof size(n+1) x (m+1), wherenis the length ofs1andmis the length ofs2. - Initialization (Base Cases): The first row (
dp[0][j]) and the first column (dp[i][0]) are filled with 0s. This is because if one of the strings (or prefixes) is empty, their longest common subsequence has a length of 0. - Filling the Table: We iterate through the table, filling each cell dp[i][j] based on the characters s1[i-1] and s2[j-1] (using i-1 and j-1 because our DP table is 1-indexed relative to string prefixes, while strings are often 0-indexed).
- If
s1[i-1] == s2[j-1]: The current characters match! This character is part of the LCS. So, the length of the LCS fors1[0...i-1]ands2[0...j-1]is 1 + the length of the LCS fors1[0...i-2]ands2[0...j-2](which isdp[i-1][j-1]). - If
s1[i-1] != s2[j-1]: The current characters do not match. In this case, the LCS ofs1[0...i-1]ands2[0...j-1]must be the longer of two possibilities:- The LCS of
s1[0...i-2]ands2[0...j-1](i.e.,dp[i-1][j], effectively ignoring the current character ofs1). - The LCS of
s1[0...i-1]ands2[0...j-2](i.e.,dp[i][j-1], effectively ignoring the current character ofs2). We take the max of these two values.
- The LCS of
- If
- The final answer, the length of the LCS of the entire strings
s1ands2, will be stored indp[n][m].
2. Memoization (Recursive Top-Down Approach)
The memoization approach solves the problem recursively but stores the results of subproblems in a lookup table (memoization table) to avoid recomputing them.
Pseudocode: Memoization
// Main function to set up memoization
function lcs(s1, s2):
n = length(s1)
m = length(s2)
// Create a memoization table, initialized with a value like -1 (to indicate not computed)
memo = array of size (n+1) x (m+1) filled with -1
// Call the recursive helper function
// Note: passing n and m representing current lengths of s1 and s2 being considered
return lcs_memoization_helper(s1, s2, n, m, memo)
// Recursive helper function with memoization
function lcs_memoization_helper(s1, s2, i, j, memo):
// i: current length of prefix of s1 being considered (s1[0...i-1])
// j: current length of prefix of s2 being considered (s2[0...j-1])
// Base Case: If either string prefix is empty
if i == 0 or j == 0:
return 0
// If this subproblem's result is already computed, return it
if memo[i][j] != -1:
return memo[i][j]
// If the last characters of the current prefixes match
if s1[i-1] == s2[j-1]:
memo[i][j] = 1 + lcs_memoization_helper(s1, s2, i-1, j-1, memo)
else:
// If characters don't match, explore two possibilities and take the max
val1 = lcs_memoization_helper(s1, s2, i-1, j, memo) // Exclude char from s1
val2 = lcs_memoization_helper(s1, s2, i, j-1, memo) // Exclude char from s2
memo[i][j] = max(val1, val2)
return memo[i][j]Explanation of Memoization:
- The main
lcsfunction initializes a memoization tablememo(often filled with a sentinel value like -1 to indicate that a subproblem hasn't been solved yet). It then calls a recursive helper function. - The
lcs_memoization_helperfunction takes the current lengthsiandjof the prefixes ofs1ands2being considered. - Base Cases: If
iorjis 0 (meaning one of the prefixes is empty), the LCS is 0. - Memoization Check: Before any computation, it checks
memo[i][j]. If it's not -1, the result for this subproblem has already been computed, so it's returned directly. - Recursive Step:
- If
s1[i-1] == s2[j-1](current characters match), the LCS length is1 + LCSofs1up toi-1ands2up toj-1. This result is stored in memo[i][j]. - If they don't match, we recursively find the LCS by:
- Excluding the last character of
s1(call withi-1, j). - Excluding the last character of
s2(call withi, j-1). The maximum of these two recursive calls is the answer formemo[i][j].
- Excluding the last character of
- If
- The result stored in
memo[i][j]is returned.
Proof of Correctness (Intuition)
The correctness for both DP approaches stems from the optimal substructure of the LCS problem:
- Case 1: Last characters match (s1[i-1] == s2[j-1]): The LCS must include this character. The remaining problem is to find the LCS of the preceding substrings s1[0...i-2] and s2[0...j-2]. The length is 1 + LCS(s1_prefix_i-1, s2_prefix_j-1).
- Case 2: Last characters don't match (s1[i-1] != s2[j-1]): This matching pair cannot be part of the LCS. So, the LCS is either the LCS of s1[0...i-2] and s2[0...j-1] (if we discard s1[i-1]) OR the LCS of s1[0...i-1] and s2[0...j-2] (if we discard s2[j-1]). We take the maximum of these two.
Both tabulation (building solutions to subproblems iteratively) and memoization (solving subproblems recursively and storing results) correctly explore these conditions and ensure that each subproblem is solved only once.
Space & Time Complexity Analysis
-
Time Complexity: For both the tabulation and memoization approaches, the time complexity is O(n * m), where
nis the length ofs1andmis the length ofs2.- In tabulation, we fill an
(n+1) x (m+1)table, and each cell takes constant time to compute. - In memoization, there are
(n+1) x (m+1)distinct subproblems. Each subproblem is computed only once due to memoization, and each computation involves constant time work (comparisons, recursive calls to already computed or base cases).
- In tabulation, we fill an
-
Space Complexity:
- Tabulation Approach: Requires a 2D array
dpof size(n+1) x (m+1)to store the LCS lengths. Thus, the space complexity is O(n * m). (It's possible to optimize space to O(min(n,m)) if only the length is needed, by only keeping track of the previous row/column). - Memoization Approach: Requires a 2D array
memoof size(n+1) x (m+1)for storing results, and also space for the recursion call stack, which can go up toO(n+m)in the worst case. So, the space complexity is dominated by the memoization table, making it O(n * m).
- Tabulation Approach: Requires a 2D array
Key Takeaways
- The Longest Common Subsequence (LCS) problem is a classic dynamic programming problem aiming to find the length of the longest subsequence common to two given strings.
- It can be solved efficiently using either tabulation (iterative DP) or memoization (recursive DP).
- Both approaches rely on breaking the problem into overlapping subproblems and building up the solution.
- The time complexity for both standard DP solutions is O(n * m), and the space complexity is also O(n * m).
Understanding LCS and its DP solution is fundamental for tackling many sequence alignment and comparison problems in bioinformatics, text processing, and beyond!