I think I have a good enough grasp of the LCS algorithm from this page, specifically this pseudo-code implementation (m and n are the lengths of A and B):

int lcs_length(char * A, char * B) {
  allocate storage for array L;  /* (m+1) x (n+1); row m and column n cover the '\0' cases */
  for (i = m; i >= 0; i--)
    for (j = n; j >= 0; j--) {
      if (A[i] == '\0' || B[j] == '\0') L[i][j] = 0;
      else if (A[i] == B[j]) L[i][j] = 1 + L[i+1][j+1];
      else L[i][j] = max(L[i+1][j], L[i][j+1]);
    }
  return L[0][0];
}

The L array is later backtracked to find the specific subsequence like so:

sequence S = empty;
i = 0;
j = 0;
while (i < m && j < n) {
  if (A[i] == B[j]) {
    add A[i] to end of S;
    i++; j++;
  }
  else if (L[i+1][j] >= L[i][j+1]) i++;
  else j++;
}

I have yet to rewrite this into JavaScript, but for now I know that the implementation at Rosetta Code works just fine. So, on to my questions:
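
I expect a direct translation of the pseudo-code (table building plus backtracking, returning the sequence and its indices in the first string) to look roughly like this; this is only a sketch, not the verified Rosetta Code version:

```javascript
// Sketch of a direct translation of the pseudo-code above.
// L is (m+1) x (n+1): row m and column n play the role of the '\0' checks.
function lcs(a, b) {
  const m = a.length, n = b.length;
  const L = Array.from({ length: m + 1 }, () => new Array(n + 1).fill(0));
  for (let i = m - 1; i >= 0; i--) {
    for (let j = n - 1; j >= 0; j--) {
      if (a[i] === b[j]) L[i][j] = 1 + L[i + 1][j + 1];
      else L[i][j] = Math.max(L[i + 1][j], L[i][j + 1]);
    }
  }
  // Backtrack to recover one LCS and its indices in the first string.
  let sequence = '';
  const indices = [];
  let i = 0, j = 0;
  while (i < m && j < n) {
    if (a[i] === b[j]) {
      sequence += a[i];
      indices.push(i);
      i++; j++;
    } else if (L[i + 1][j] >= L[i][j + 1]) i++;
    else j++;
  }
  return { length: L[0][0], sequence, indices };
}
```

On my example below, `lcs("thisisatest", "thimplestesting")` gives length 8 and sequence "thistest".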

1. How do I modify the algorithm so that it only returns a longest common subsequence whose contiguous parts each meet a given minimum length?

For example, "thisisatest" and "thimplestesting" gives "thistest", made up of the contiguous parts "thi", "s" and "test". Let's define 'limit' as the minimum number of contiguous characters a part needs in order to be included in the result. With a limit of 3 the result would be "thitest", and with a limit of 4 it would be "test". For my purposes I would like to get not only the length but the actual sequence and its indices in the first string; it doesn't matter whether that requires backtracking later or not.
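
The only approach I can think of so far is to post-filter the runs of one LCS. A sketch (`lcsWithLimit` is just a name I made up) that reproduces the examples above, though filtering a single LCS after the fact is presumably not optimal in general:

```javascript
// Sketch: compute one LCS (standard DP + backtrack), then drop contiguous
// runs of matches shorter than `limit`. This reproduces the examples in the
// question, but it filters one particular LCS after the fact rather than
// searching for the best limit-respecting subsequence, so it may be
// suboptimal for other inputs.
function lcsWithLimit(a, b, limit) {
  const m = a.length, n = b.length;
  const L = Array.from({ length: m + 1 }, () => new Array(n + 1).fill(0));
  for (let i = m - 1; i >= 0; i--)
    for (let j = n - 1; j >= 0; j--)
      L[i][j] = a[i] === b[j]
        ? 1 + L[i + 1][j + 1]
        : Math.max(L[i + 1][j], L[i][j + 1]);

  // Backtrack, grouping matches that are contiguous in BOTH strings
  // into runs. Each run records its start index in `a` and its text.
  const runs = [];
  let i = 0, j = 0, prevI = -2, prevJ = -2;
  while (i < m && j < n) {
    if (a[i] === b[j]) {
      if (i === prevI + 1 && j === prevJ + 1) runs[runs.length - 1].text += a[i];
      else runs.push({ start: i, text: a[i] });
      prevI = i; prevJ = j;
      i++; j++;
    } else if (L[i + 1][j] >= L[i][j + 1]) i++;
    else j++;
  }

  // Keep only runs that meet the minimum length.
  const kept = runs.filter(r => r.text.length >= limit);
  return {
    sequence: kept.map(r => r.text).join(''),
    indices: kept.flatMap(r => [...r.text].map((_, k) => r.start + k)),
  };
}
```

With my example, `lcsWithLimit("thisisatest", "thimplestesting", 3)` gives "thitest" and limit 4 gives "test" with indices [7, 8, 9, 10].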

2. Would such a modification reduce the complexity or increase it?

From what I understand, analysing the entire suffix tree might be one way to find a subsequence that fits a limit? If that's correct, is it significantly more complex than the original algorithm?

3. Can the LCS algorithm, modified or not, be optimized given that the same source string is compared against a huge number of target strings?

Currently I'm just iterating through the target strings, finding the LCS of each, and selecting the string with the longest subsequence. Is there any significant preprocessing that could be done on the source string to reduce the total time?
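
For context, my current loop is essentially the following (`lcsLength` and `bestMatch` are just illustrative names; the real code uses the Rosetta Code implementation):

```javascript
// Minimal LCS length via the standard DP table, built back-to-front.
function lcsLength(a, b) {
  const m = a.length, n = b.length;
  const L = Array.from({ length: m + 1 }, () => new Array(n + 1).fill(0));
  for (let i = m - 1; i >= 0; i--)
    for (let j = n - 1; j >= 0; j--)
      L[i][j] = a[i] === b[j]
        ? 1 + L[i + 1][j + 1]
        : Math.max(L[i + 1][j], L[i][j + 1]);
  return L[0][0];
}

// Current approach: run the full O(m*n) LCS against every target and keep
// the best, i.e. O(|source| * total length of all targets) overall.
function bestMatch(source, targets) {
  let best = null, bestLen = -1;
  for (const t of targets) {
    const len = lcsLength(source, t);
    if (len > bestLen) { bestLen = len; best = t; }
  }
  return best;
}
```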

Answers to any of my questions are welcome, or just hints on where to research further. Thank you for your time! :)

  • 1) I believe it should just work not to compare single characters, but also the N characters around the position. Of course that works only for small constant limits, as it increases the complexity by that constant. – Bergi Mar 09 '17 at 20:01
  • 3) depends on how large the source strings is and how many target strings there are. In theory you could always compile the source strings into a (larger!) [finite state transducer](https://en.wikipedia.org/wiki/Finite-state_transducer) that can work on each target strings in linear time. – Bergi Mar 09 '17 at 20:04
  • @Bergi Thank you! Your solution to 1) is actually really great. I think that the limit is at most 3, so that could be it! Regarding 3), all strings are roughly between 20-300 but usually on the smaller side. The collection of target strings is as large as it can be before it becomes detrimental to performance. However, a single source string is only known before I need the best matching target string with the longest common subsequence. It doesn't seem that finite state works in my case. – electricmanamonkey Mar 09 '17 at 21:15
