Input: two strings A and B.
Output: a set of repeated, non-overlapping substrings.
I have to find all the repeated strings, each of which has to occur in both(!) strings at least once. So for instance, let
A = "xyabcxeeeyabczeee" and B = "yxabcxabee".
Then a valid output would be {"abcx","ab","ee"} but not "eee", since it occurs only in string A.
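To make the example concrete, here is a minimal sketch that only checks the "occurs in both strings" condition for each candidate. The helper name `is_common_substring` is my own; it does not check the non-overlapping requirement.

```python
def is_common_substring(candidate, a, b):
    """True if candidate occurs at least once in a AND at least once in b."""
    return candidate in a and candidate in b

A = "xyabcxeeeyabczeee"
B = "yxabcxabee"

# Check the candidates from the example above.
for candidate in ["abcx", "ab", "ee", "eee"]:
    print(candidate, is_common_substring(candidate, A, B))
```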
I think this problem is closely related to the "supermaximal repeat" problem. Here are the definitions:
Maximal repeated pair: A pair of identical substrings alpha and beta in S such that extending alpha and beta in either direction would destroy the equality of the two strings. It is represented as a triplet (position1, position2, length).
Maximal repeat : “A substring of S that occurs in a maximal pair in S”. Example: abc in S = xabcyiiizabcqabcyrxar. Note: There can be numerous maximal repeated pairs, but there can be only a limited number of maximal repeats.
Supermaximal repeat: “A maximal repeat that never occurs as a substring of any other maximal repeat”. Example: abcy in S = xabcyiiizabcqabcyrxar.
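As a sanity check on these definitions, here is a brute-force sketch (far too slow for real inputs, and not the book's algorithm). The function names are my own, and I assume that a string boundary counts as a "different" neighbouring character.

```python
def occurrences(s, w):
    """All start positions of w in s (overlapping occurrences included)."""
    return [i for i in range(len(s) - len(w) + 1) if s.startswith(w, i)]

def is_maximal_repeat(s, w):
    """True if some pair of occurrences of w cannot be extended left or right."""
    occ = occurrences(s, w)
    n, m = len(s), len(w)
    for x in range(len(occ)):
        for y in range(x + 1, len(occ)):
            i, j = occ[x], occ[y]  # i < j
            # Assumption: hitting the start/end of s counts as a mismatch.
            left_differs = i == 0 or s[i - 1] != s[j - 1]
            right_differs = j + m == n or s[i + m] != s[j + m]
            if left_differs and right_differs:
                return True
    return False

def maximal_repeats(s):
    substrings = {s[i:j] for i in range(len(s)) for j in range(i + 1, len(s) + 1)}
    return {w for w in substrings if is_maximal_repeat(s, w)}

def supermaximal_repeats_bruteforce(s):
    """Maximal repeats that are not a substring of any other maximal repeat."""
    reps = maximal_repeats(s)
    return {w for w in reps if not any(w != v and w in v for v in reps)}

S = "xabcyiiizabcqabcyrxar"
print("abc" in maximal_repeats(S))                  # maximal repeat
print("abcy" in supermaximal_repeats_bruteforce(S)) # supermaximal repeat
```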
An algorithm for finding all supermaximal repeats is described in "Algorithms on Strings, Trees, and Sequences", but only for suffix trees.
It works by: 1.) finding all left-diverse nodes using DFS
For each position i in S, the character S(i-1) is called the left character of i. The left character of a leaf in T(S) is the left character of the suffix position represented by that leaf. An internal node v in T(S) is called left-diverse if at least two leaves in v’s subtree have different left characters.
2.) applying theorem 7.12.4 on those nodes:
A left-diverse internal node v represents a supermaximal repeat alpha if and only if all of v's children are leaves, and each has a distinct left character.
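For a single string, my understanding is that the two steps above can be mirrored on a suffix array plus LCP array: a node whose children are all leaves corresponds to a run of equal LCP values that is strictly larger than the LCP values on both sides, and the left characters are just the characters preceding each suffix. Below is a rough sketch under those assumptions. The suffix array is built naively by sorting (quadratic or worse), all names are my own, and position 0 is given `None` as a unique left character.

```python
def supermaximal_repeats(s):
    n = len(s)
    # Naive suffix array: fine for a demo, too slow for long inputs.
    sa = sorted(range(n), key=lambda i: s[i:])

    def common_prefix_len(i, j):
        k = 0
        while i + k < n and j + k < n and s[i + k] == s[j + k]:
            k += 1
        return k

    # lcp[k] = LCP of suffixes sa[k-1] and sa[k]; lcp[0] and lcp[n] are 0 sentinels.
    lcp = [0] * (n + 1)
    for k in range(1, n):
        lcp[k] = common_prefix_len(sa[k - 1], sa[k])

    result = set()
    k = 1
    while k < n:
        length = lcp[k]
        if length == 0:
            k += 1
            continue
        start, end = k - 1, k
        while end + 1 < n and lcp[end + 1] == length:
            end += 1
        # sa[start..end] share a prefix of `length`; it is a "children are all
        # leaves" node only if the LCP drops on both sides of the run.
        if lcp[start] < length and lcp[end + 1] < length:
            lefts = [s[p - 1] if p > 0 else None for p in sa[start:end + 1]]
            if len(set(lefts)) == len(lefts):  # pairwise distinct left characters
                result.add(s[sa[start]:sa[start] + length])
        k = end + 1
    return result

print(supermaximal_repeats("xabcyiiizabcqabcyrxar"))
```

On the example string from the definitions this reports abcy along with the shorter repeats xa, ii and r.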
Both strings A and B probably have to be concatenated (A first, then B), and when we check v's leaves in step two we also have to impose an additional constraint: among the leaves there has to be at least one suffix from A and at least one from B. This can be done by comparing each suffix position against the length of A. If position > length(A), then the suffix (and its left character) lies in B, otherwise in A.
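Here is a small sketch of that position test. I am assuming 0-based positions and a separator character `#` between A and B that occurs in neither string (the separator is my addition, to stop matches from spanning the boundary). The helper names are hypothetical.

```python
def source_of(pos, len_a):
    """Which string a 0-based position in A + SEP + B belongs to."""
    if pos < len_a:
        return "A"
    if pos == len_a:
        return None  # the separator itself
    return "B"

def occurs_in_both(positions, len_a):
    """True if the positions include at least one in A and one in B."""
    owners = {source_of(p, len_a) for p in positions}
    return "A" in owners and "B" in owners

def occurrences(text, pattern):
    return [i for i in range(len(text) - len(pattern) + 1)
            if text.startswith(pattern, i)]

A = "xyabcxeeeyabczeee"
B = "yxabcxabee"
combined = A + "#" + B  # assumption: '#' appears in neither A nor B

for w in ["abcx", "ab", "ee", "eee"]:
    print(w, occurs_in_both(occurrences(combined, w), len(A)))
```

In a suffix-array version, the same `occurs_in_both` test would presumably be applied to the suffix positions inside each candidate interval.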
Can you help me solve this problem with suffix + lcp arrays?