I think I have a good enough grasp of the longest common subsequence (LCS) algorithm from this page, specifically this pseudocode implementation (m and n are the lengths of A and B):
    int lcs_length(char * A, char * B) {
        allocate storage for array L;
        for (i = m; i >= 0; i--)
            for (j = n; j >= 0; j--) {
                if (A[i] == '\0' || B[j] == '\0') L[i,j] = 0;
                else if (A[i] == B[j]) L[i,j] = 1 + L[i+1, j+1];
                else L[i,j] = max(L[i+1, j], L[i, j+1]);
            }
        return L[0,0];
    }
The L array is later backtracked to find the specific subsequence like so:
    sequence S = empty;
    i = 0;
    j = 0;
    while (i < m && j < n) {
        if (A[i] == B[j]) {
            add A[i] to end of S;
            i++; j++;
        }
        else if (L[i+1,j] >= L[i,j+1]) i++;
        else j++;
    }
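For reference, a direct JavaScript rendering of the two routines above would presumably look something like the sketch below. The function name `lcs` and the shape of the returned object are placeholders of mine, not taken from the page or from any existing implementation, and I am assuming plain strings as input.

```javascript
// Sketch: JavaScript rendering of the pseudocode above.
// `lcs` and the returned object shape are illustrative placeholders.
function lcs(a, b) {
  const m = a.length;
  const n = b.length;

  // L[i][j] = LCS length of the suffixes a.slice(i) and b.slice(j).
  // Row m and column n stay 0, which replaces the '\0' checks.
  const L = Array.from({ length: m + 1 }, () => new Array(n + 1).fill(0));
  for (let i = m - 1; i >= 0; i--) {
    for (let j = n - 1; j >= 0; j--) {
      L[i][j] = a[i] === b[j]
        ? 1 + L[i + 1][j + 1]
        : Math.max(L[i + 1][j], L[i][j + 1]);
    }
  }

  // Walk the table from the top-left corner to recover one subsequence.
  let sequence = "";
  const indices = []; // positions in `a` of the matched characters
  let i = 0;
  let j = 0;
  while (i < m && j < n) {
    if (a[i] === b[j]) {
      sequence += a[i];
      indices.push(i);
      i++;
      j++;
    } else if (L[i + 1][j] >= L[i][j + 1]) {
      i++;
    } else {
      j++;
    }
  }
  return { length: L[0][0], sequence, indices };
}
```

The `indices` array is there because I need the positions in the first string as well as the sequence itself.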
I have yet to rewrite this into JavaScript properly, but for now I know that the implementation at Rosetta Code works just fine. So, on to my questions:
1. How do I modify the algorithm to only return the longest common subsequence where the parts of the sequence are of a given minimum length?
For example, "thisisatest" and "thimplestesting" give "thistest", made up of the contiguous parts "thi", "s" and "test". Let's define 'limit' as the minimum number of contiguous characters a part must have to be added to the result. With a limit of 3 the result would be "thitest", and with a limit of 4 the result would be "test". For my uses I would like to get not only the length, but also the actual sequence and its indices in the first string. It doesn't matter whether that needs to be backtracked later or not.
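To make the definition of 'limit' concrete, here is a small sketch that takes an already computed alignment and drops every part shorter than the limit. The name `filterRuns` and the `[i, j]` pair format are made up for illustration, and I am assuming a part only counts as contiguous when its characters are adjacent in both strings. I realise that filtering after the fact is not guaranteed to give the longest valid result, which is why I am asking about modifying the algorithm itself.

```javascript
// Sketch only: `filterRuns` and the [i, j] pair format are hypothetical.
// Each pair is (index in the first string, index in the second string).
function filterRuns(a, pairs, limit) {
  const runs = [];
  let current = [];
  for (const [i, j] of pairs) {
    const last = current[current.length - 1];
    // A run continues only if the match is adjacent in BOTH strings.
    if (last && i === last[0] + 1 && j === last[1] + 1) {
      current.push([i, j]);
    } else {
      if (current.length > 0) runs.push(current);
      current = [[i, j]];
    }
  }
  if (current.length > 0) runs.push(current);

  // Keep only the runs that meet the minimum length.
  const kept = runs.filter((run) => run.length >= limit);
  return {
    sequence: kept.map((run) => run.map(([i]) => a[i]).join("")).join(""),
    indices: kept.flatMap((run) => run.map(([i]) => i)),
  };
}

const a = "thisisatest";
// One possible alignment of "thistest" against "thimplestesting".
const pairs = [[0, 0], [1, 1], [2, 2], [5, 7], [7, 8], [8, 9], [9, 10], [10, 11]];
console.log(filterRuns(a, pairs, 3).sequence); // "thitest"
console.log(filterRuns(a, pairs, 4).sequence); // "test"
```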
2. Would such a modification reduce the complexity or increase it?
From what I understand, analysing the entire suffix tree might be one way to find a subsequence that satisfies the limit. If that is correct, is it significantly more complex than the original algorithm?
3. Can the LCS algorithm, modified or not, be optimized given that the same source string is compared against a huge number of target strings?
Currently I'm just iterating through the target strings finding the LCS and selecting the string with the longest subsequence. Is there any significant preprocessing that could be done on the source string to reduce the time?
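For concreteness, the brute-force loop I am describing looks roughly like the sketch below. The names `lcsLength` and `bestMatch` are placeholders of mine (in practice I call the Rosetta Code implementation instead), and this version only computes lengths, so it gets away with two rows of the table.

```javascript
// Sketch of the current approach; `lcsLength` and `bestMatch` are placeholder names.
function lcsLength(a, b) {
  // Two rolling rows are enough when only the length is needed.
  let prev = new Array(b.length + 1).fill(0); // row i + 1 of the table
  for (let i = a.length - 1; i >= 0; i--) {
    const row = new Array(b.length + 1).fill(0); // row i of the table
    for (let j = b.length - 1; j >= 0; j--) {
      row[j] = a[i] === b[j]
        ? 1 + prev[j + 1]
        : Math.max(prev[j], row[j + 1]);
    }
    prev = row;
  }
  return prev[0];
}

function bestMatch(source, targets) {
  let best = { target: null, length: -1 };
  for (const target of targets) {
    // Every target costs a full O(|source| * |target|) table.
    const length = lcsLength(source, target);
    if (length > best.length) best = { target, length };
  }
  return best;
}
```

Nothing computed for one target is reused for the next, which is the part I am hoping preprocessing of the source string could improve.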
Answers to any of my questions are welcome, or just hints on where to research further. Thank you for your time! :)