
I have been tasked with finding an efficient algorithm, O(n*log(n)), that, given a set of k strings S = {s-1, s-2, s-3, ..., s-k}, finds for each pair of strings (s-i, s-j) the longest string T such that T is a suffix of s-i and a prefix of s-j, and likewise the longest such T for the reversed pair (s-j, s-i). Here n is the total length of all k strings (n = |s-1| + |s-2| + |s-3| + ... + |s-k|).
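To pin down what is being asked, here is a minimal reference sketch in Python. It is not the O(n*log(n)) suffix-array solution I am after: it uses the KMP prefix function on each pair separately, which costs O(|s-i| + |s-j|) per pair and therefore O(k*n) overall. The function names and the separator character `\x00` (assumed not to occur in any input string) are my own choices for illustration.

```python
def prefix_function(text):
    """KMP prefix function: pi[i] is the length of the longest proper
    prefix of text[:i + 1] that is also a suffix of text[:i + 1]."""
    pi = [0] * len(text)
    for i in range(1, len(text)):
        k = pi[i - 1]
        while k > 0 and text[i] != text[k]:
            k = pi[k - 1]
        if text[i] == text[k]:
            k += 1
        pi[i] = k
    return pi


def longest_overlap(s_i, s_j, sep="\x00"):
    """Longest T that is a suffix of s_i and a prefix of s_j."""
    if not s_i or not s_j:
        return ""
    # Assumption: sep occurs in neither string, so a match can never
    # extend across it and the result is at most min(|s_i|, |s_j|) long.
    combined = s_j + sep + s_i
    return s_j[: prefix_function(combined)[-1]]


def all_pair_overlaps(strings):
    """Map every ordered pair (i, j), i != j, to its longest overlap."""
    return {
        (i, j): longest_overlap(a, b)
        for i, a in enumerate(strings)
        for j, b in enumerate(strings)
        if i != j
    }
```

For example, `longest_overlap("abcde", "cdefg")` gives `"cde"`, while the reversed pair `longest_overlap("cdefg", "abcde")` gives the empty string.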

Any thoughts? A link to a solution would be fine as well. Thanks in advance!

  • The question is vague. The longest substring of a suffix of `Si` and a prefix of `Sj` is the whole of `Si + Sj`. Are we talking about the longest common/uncommon substring? – thebenman Oct 04 '17 at 06:37
  • What is the `n` in O(log n)? Did you mean `k`? The total number of characters in all the strings? Anyway, logarithmic time seems incredibly optimistic for an algorithm which produces k² result strings. Did you maybe mean O(n log n)? Please clarify. – rici Oct 04 '17 at 13:14
  • @thebenman : The goal is to find the longest substring which appears in string Si as a suffix AND in string Sj as a prefix. – SupposedlySleeping Oct 04 '17 at 16:44
  • @rici : I did indeed mean O(n*log(n)) time. My apologies. N represents the added lengths of all k strings. I will edit the question to reflect these clarifications. – SupposedlySleeping Oct 04 '17 at 16:46
  • The common prefix/suffix seems like the sort of problem which yields to a suffix tree. It's easy enough to construct a trie of all the strings in linear time, which allows ordered traversal; the suffix tree can also be created in linear time. – rici Oct 04 '17 at 18:52
  • I'm specifically looking for a solution which relies on Suffix Arrays. Any thoughts? – SupposedlySleeping Oct 04 '17 at 19:19

1 Answer


Algorithm 4.10 on page 61 of the book Algorithmic Aspects of Bioinformatics gives a method for computing the longest common substring of a set of given strings using suffix trees.

The article also explains how the longest common substring can be found in time linear in the size of the suffix tree, i.e. in O(n log n).
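Since the question asks about suffix arrays rather than suffix trees, here is a rough sketch of the same longest-common-substring idea on a suffix array, restricted to two strings. This is not Algorithm 4.10 from the book: it is a swapped-in illustration that sorts suffixes naively (so construction is far from linear), computes the LCP array with Kasai's algorithm, and takes the largest LCP between adjacent suffixes that come from different strings. The function names and the `\x00` separator (assumed absent from both inputs) are hypothetical.

```python
def suffix_array(text):
    # Naive construction: sort start positions by the suffix they begin.
    # Simple but slow; a real solution would use an O(n log n) builder.
    return sorted(range(len(text)), key=lambda i: text[i:])


def lcp_array(text, sa):
    """Kasai's algorithm: lcp[r] is the length of the common prefix of
    the suffixes starting at sa[r - 1] and sa[r] (lcp[0] is 0)."""
    n = len(text)
    rank = [0] * n
    for r, start in enumerate(sa):
        rank[start] = r
    lcp = [0] * n
    h = 0
    for i in range(n):
        if rank[i] > 0:
            j = sa[rank[i] - 1]
            while i + h < n and j + h < n and text[i + h] == text[j + h]:
                h += 1
            lcp[rank[i]] = h
            if h > 0:
                h -= 1
        else:
            h = 0
    return lcp


def longest_common_substring(a, b, sep="\x00"):
    # Assumption: sep occurs in neither a nor b, so no common prefix
    # of an a-suffix and a b-suffix can run across the boundary.
    text = a + sep + b
    sa = suffix_array(text)
    lcp = lcp_array(text, sa)
    best_len, best_start = 0, 0
    for r in range(1, len(text)):
        prev_in_a = sa[r - 1] < len(a)
        cur_in_a = sa[r] < len(a)
        # Only adjacent suffixes from different strings count.
        if prev_in_a != cur_in_a and lcp[r] > best_len:
            best_len, best_start = lcp[r], sa[r]
    return text[best_start:best_start + best_len]
```

Extending this to k strings usually means scanning the LCP array with a sliding window that covers suffixes from every string, which the sketch above does not attempt.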

BioGeek