
Can we use a factor oracle with suffix links (paper here) to compute the longest common substring of multiple strings? Here, a substring means any contiguous part of the original string. For example, "abc" is a substring of "ffabcgg", while "abg" is not.

I've found a way to compute the length of the longest common substring of two strings s1 and s2. It works by concatenating the two strings with a character that appears in neither, '$' for example. Then, for each prefix of the concatenated string s with length i >= |s1| + 2, we calculate its LRS (longest repeated suffix) length lrs[i] and sp[i] (the end position of the first occurrence of its LRS). Finally, the answer is

max{lrs[i]| i >= |s1| + 2 and sp[i] <= |s1|}

I've written a C++ program that implements this method with a factor oracle. It solves the problem within 200 ms on my laptop when |s1| + |s2| <= 200000.

s1 = 'ffabcgg'
s2 = 'gfbcge'
s  = s1 + '$' + s2
   = 'ffabcgg$gfbcge'

i:    1  2  3  4  5  6  7  8  9 10 11 12 13 14
s:    f  f  a  b  c  g  g  $  g  f  b  c  g  e
sp:   0  1  0  0  0  0  6  0  6  1  4  5  6  0
lrs:  0  1  0  0  0  0  1  0  1  1  1  2  3  0

ans = lrs[13] = 3
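To make the definitions of lrs and sp concrete, here is a minimal reference sketch in Python. It is not the factor-oracle method and not my C++ program: it computes both tables directly from their definitions by brute force, indexing prefixes by their length i (1-based), which is the convention the formula uses. The function names `lrs_table` and `lcs_length_two` are made up for this illustration.

```python
def lrs_table(s):
    """Brute-force lrs[i] and sp[i] for every prefix length i (1-based).

    lrs[i]: length of the longest suffix of s[:i] that also ends earlier in s[:i].
    sp[i]:  1-based end position of the first occurrence of that suffix (0 if none).
    """
    n = len(s)
    lrs = [0] * (n + 1)
    sp = [0] * (n + 1)
    for i in range(1, n + 1):
        prefix = s[:i]
        # Try the longest candidate suffix first, so the first hit is the LRS.
        for length in range(i - 1, 0, -1):
            suffix = prefix[i - length:]
            start = prefix.find(suffix)      # start of the first occurrence
            if start + length < i:           # it ends before position i: repeated
                lrs[i] = length
                sp[i] = start + length
                break
    return lrs, sp


def lcs_length_two(s1, s2, sep="$"):
    """Apply max{lrs[i] | i >= |s1| + 2 and sp[i] <= |s1|}; sep must occur in neither string."""
    s = s1 + sep + s2
    lrs, sp = lrs_table(s)
    return max(
        (lrs[i] for i in range(len(s1) + 2, len(s) + 1) if sp[i] <= len(s1)),
        default=0,
    )


print(lcs_length_two("ffabcgg", "gfbcge"))  # 3
```

This runs in roughly cubic time, so it is only useful for checking a fast implementation on small inputs.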

I know that both problems (two strings and multiple strings) can be solved efficiently with a suffix array or a suffix tree, but I wonder whether there is a method that uses a factor oracle instead. I am interested in this because the factor oracle is easy to construct (about 30 lines of C++, whereas a suffix array needs about 60 and a suffix tree about 150), and in my experience it runs faster than both.
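For readers unfamiliar with the structure, the basic online construction can be sketched as follows. This is a Python sketch of the standard construction as I understand it from the literature, not my C++ code, and it leaves out the lrs computation. The names `build_factor_oracle` and `accepts` are made up for this illustration. State i corresponds to the prefix of length i, and `link[i]` is the suffix link of state i.

```python
def build_factor_oracle(s):
    """Build the factor oracle of s; return (transitions, suffix links)."""
    n = len(s)
    trans = [{} for _ in range(n + 1)]   # trans[state][char] -> next state
    link = [-1] * (n + 1)                # suffix link; -1 for the initial state
    for i in range(1, n + 1):
        c = s[i - 1]
        trans[i - 1][c] = i              # internal transition along the string
        k = link[i - 1]
        # Walk the suffix links, adding external transitions where c is missing.
        while k > -1 and c not in trans[k]:
            trans[k][c] = i
            k = link[k]
        link[i] = 0 if k == -1 else trans[k][c]
    return trans, link


def accepts(trans, word):
    """True if the oracle can read word from the initial state."""
    state = 0
    for c in word:
        if c not in trans[state]:
            return False
        state = trans[state][c]
    return True


trans, link = build_factor_oracle("ffabcgg$gfbcge")
print(link[1:])  # [0, 1, 0, 0, 0, 0, 6, 0, 6, 1, 4, 5, 6, 0]
```

For this particular string, the suffix links printed above coincide with the sp row of the example. Keep in mind that the oracle accepts every factor of the string but may also accept some words that are not factors, so `accepts` alone is not an exact substring test.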

You can test your solution to the first problem on this online judge, and your solution to the second problem here.

Martijn Pieters
Ray
  • @jogojapan Thanks for your great patience! I should apologize for my poor english. – Ray Aug 15 '12 at 01:57
  • Not at all (and I am not a native speaker either). Anyway, I think it's great to have questions about factor oracles on SO! – jogojapan Aug 15 '12 at 02:00
  • @Ray: Can you/do you share your implementation of factor oracle construction anywhere? I am interested in this topic, and I am generally better at reading source code than formal papers :) – 500 - Internal Server Error Jan 21 '15 at 22:34
  • @500-InternalServerError You can find the code at https://gist.github.com/ZhanruiLiang/d50bf9f17b58916c8bc7 . However, the code is messy (programming-contest style) and the concepts are tricky, so you should read the paper to understand it. – Ray Jan 22 '15 at 10:14
  • Do you care about the factor oracle, or do you just want to solve the common substring problem? Because if you just care about the common substring problem, there's already a linear time algorithm: https://en.wikipedia.org/wiki/Longest_common_substring_problem – Craig Gidney Nov 04 '15 at 22:01
  • It sounds to me like MSA (multiple sequence alignment). – Grijesh Chauhan May 05 '19 at 20:25

1 Answer


Can we use a factor-oracle with suffix link (paper here) to compute the longest common substring of multiple strings?

I don't think the algorithm is a very good fit (it is designed to factor a single string) but you can use it by concatenating the original strings with a unique separator.

Given abcdefg, hijcdekl, and mncdop, find the longest common substring, cd:

# combine with unique joiners
>>> s = "abcdefg" + "1" + "hijcdekl" + "2" + "mncdop" 
>>> factor_oracle(s)
"cd"

As part of its linear time and space algorithm, the factor oracle quickly rediscovers the break points between the input strings during its search for common factors (the unique joiners provide an immediate cue to stop extending the best factor found so far).
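The snippet above does not define `factor_oracle`, so whatever implementation you plug in there needs checking. In particular, a factor that repeats inside the concatenation is not automatically common to every input, so it is worth confirming that a reported factor occurs in each original string. A brute-force reference makes that easy on small inputs. The sketch below is not a factor-oracle algorithm; `longest_common_substring` is a name I made up for this illustration.

```python
def longest_common_substring(strings):
    """Brute-force reference: longest substring present in every input string.

    Candidates are taken from the shortest string, longest first, so the
    first candidate found in all inputs is a longest common substring.
    """
    if not strings:
        return ""
    shortest = min(strings, key=len)
    for length in range(len(shortest), 0, -1):
        for start in range(len(shortest) - length + 1):
            candidate = shortest[start:start + length]
            if all(candidate in t for t in strings):
                return candidate
    return ""


print(longest_common_substring(["abcdefg", "hijcdekl", "mncdop"]))  # cd
```

It is far too slow for large inputs, but it gives a ground truth to compare an oracle-based result against.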

Raymond Hettinger