
Given a string s, find the longest double suffix in time complexity O(|s|).

Example: for string banana, the LDS is na. For abaabaa it's baa.
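
To pin down what "double suffix" means here, a brute-force reference may help. This is only a sketch for checking answers: it is quadratic, not the O(|s|) being asked for, and the function name is made up for illustration. It assumes the two copies must not overlap.

```python
def longest_double_suffix_bruteforce(s: str) -> str:
    """Return the longest w such that s ends with w + w ("" if there is none)."""
    n = len(s)
    # Try the largest half-length first; the two copies may not overlap,
    # so the half-length k can be at most n // 2.
    for k in range(n // 2, 0, -1):
        if s[n - k:] == s[n - 2 * k:n - k]:
            return s[n - k:]
    return ""


print(longest_double_suffix_bruteforce("banana"))
print(longest_double_suffix_bruteforce("abaabaa"))
```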

Obviously I thought about using a suffix tree, but I'm having trouble finding the double suffix in it.

Xtreme Joe
  • Is this your homework assignment? – aardrian Jul 20 '16 at 00:51
  • I think you can construct a z-array for the reverse of the string, then scan it looking for the largest element such that z[i] = i. See e.g. http://www.geeksforgeeks.org/z-algorithm-linear-time-pattern-searching-algorithm/ – Gene Jul 20 '16 at 01:08
  • @Gene - that's my (deleted) answer, but it doesn't work (at least not without some adaptations, so I deleted it...). Consider various lengths of single letter strings to see why... – Amit Jul 20 '16 at 04:47
  • @Amit Good point. But isn't it sufficient to just look for largest i s.t. z[i] >= i? The "extra matching tail" can just be ignored. – Gene Jul 20 '16 at 04:55
  • Apparently this is not as simple as it may look, and perhaps it cannot be done using a suffix tree in linear time complexity? – Xtreme Joe Jul 20 '16 at 05:17
  • @Gene do you have the answer? – Xtreme Joe Jul 20 '16 at 11:24
  • @XtremeJoe The algorithm I proposed seems to work fine with the modification Amit inspired. Construct a z-array for the reverse of the string, then scan it looking for the largest i such that z[i] >= i. Since z[i] is the length of the longest substring starting at i that matches a prefix of the string, every k with z[k] >= k is the length of a prefix that is immediately followed by a copy of itself. The largest such k must be the answer. The z-array can be constructed in linear time. Scanning it for the max double prefix is also O(n). If this doesn't work, I'd love to know why. – Gene Jul 21 '16 at 01:56
  • @Gene Does it handle the case: `LDS(anana)=na` not `ana` (overlapping) ? – Xtreme Joe Jul 21 '16 at 13:50
  • @XtremeJoe Yes. You should be able to trace this yourself. The z-array of the reversed string anana will be [5,0,3,0,1]. The largest i such that z[i] >= i is 2. This says that the half-length of the double suffix is 2. – Gene Jul 26 '16 at 15:52
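
The z-array approach from the comments above can be sketched roughly as follows. This is a minimal sketch rather than anyone's posted code: the function names are made up for illustration, and it uses the convention z[0] = len(t), which matches the [5,0,3,0,1] trace for anana.

```python
def z_array(t: str) -> list:
    """z[i] = length of the longest substring starting at i that matches a prefix of t."""
    n = len(t)
    z = [0] * n
    if n == 0:
        return z
    z[0] = n
    left, right = 0, 0  # [left, right) is the rightmost prefix match found so far
    for i in range(1, n):
        if i < right:
            # Reuse what is already known from inside the current match window.
            z[i] = min(right - i, z[i - left])
        # Extend the match by direct comparison.
        while i + z[i] < n and t[i + z[i]] == t[z[i]]:
            z[i] += 1
        if i + z[i] > right:
            left, right = i, i + z[i]
    return z


def longest_double_suffix(s: str) -> str:
    """Longest w such that s ends with w + w, via the z-array of the reversed string."""
    t = s[::-1]
    z = z_array(t)
    best = 0
    for i in range(1, len(t)):
        # z[i] >= i means the prefix t[0:i] is immediately followed by a copy of itself.
        if z[i] >= i:
            best = i
    return s[len(s) - best:] if best else ""


print(z_array("anana"))
print(longest_double_suffix("banana"))
```

Both the z-array construction and the final scan are single passes, so the whole thing stays linear in the length of the string.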

2 Answers


Reverse the string and build a sparse table P[i][j], where i ranges from 0 to log(n), j ranges from 0 to n-1, and n is the length of the string. P[i][j] is the rank of the substring of length 2^i starting at position j. So if P[i][j] = P[i][k], the first 2^i chars of the suffixes at indexes j and k are equal.

Now your problem reduces to finding the largest index i such that the Longest Common Prefix (LCP) of the suffix at index 0 (the start of the reversed string) and the suffix at index i satisfies LCP >= i. The LCP can be computed from the P array in O(log(n)) time by comparing the first 2^x chars of these two suffixes and gradually reducing x.

Total complexity is O(n * log(n) * log(n)). Here is the working C++ source code: https://ideone.com/aJCAYG
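
For readers who cannot open the link, here is a rough Python sketch of the same idea. It is not a transcription of the linked C++ code: the function names and the -1 padding for blocks that run past the end of the string are assumptions made for this illustration.

```python
def build_rank_table(t: str) -> list:
    """ranks[i][j] = rank of the block of length 2**i starting at j (padded past the end)."""
    n = len(t)
    ranks = [[ord(c) for c in t]]  # level 0: rank by single character
    length = 1
    while length < n:
        prev = ranks[-1]
        # A block of length 2*length is keyed by the ranks of its two halves;
        # -1 marks a second half that starts past the end of the string.
        keys = [(prev[j], prev[j + length] if j + length < n else -1)
                for j in range(n)]
        order = {key: r for r, key in enumerate(sorted(set(keys)))}
        ranks.append([order[key] for key in keys])
        length *= 2
    return ranks


def lcp(ranks: list, n: int, a: int, b: int) -> int:
    """Longest common prefix of the suffixes at a and b (a != b), in O(log n)."""
    result = 0
    for level in range(len(ranks) - 1, -1, -1):
        if a < n and b < n and ranks[level][a] == ranks[level][b]:
            step = 1 << level
            result += step
            a += step
            b += step
    return result


def longest_double_suffix_doubling(s: str) -> str:
    t = s[::-1]
    n = len(t)
    ranks = build_rank_table(t)
    best = 0
    for i in range(1, n // 2 + 1):
        if lcp(ranks, n, 0, i) >= i:
            best = i
    return s[n - best:] if best else ""


print(longest_double_suffix_doubling("abaabaa"))
```

Each level of the table costs one sort, and there are about log(n) levels, which is where the two log factors come from.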

Yerken

I think that Gene's solution is the simplest to implement, and since it relies on arrays rather than on a tree structure, it is likely more hardware-friendly as well.

But since you mentioned suffix trees, let's look into a solution based on suffix trees! I will assume that you use an end token to mark the end of the string(s) you insert in the tree. To illustrate this, here is a representation of the suffix tree built for your abaabaa example:

$ - ##
b a a - $ - ## // Longest double suffix: P is the first dash, N the second
        b a a $ - ## // N' is the dash
a - $ - ##
    a - $ - ##
        b a a $ - ##
    b a a - $ - ##
            b a a $ - ##

When N is a node in a suffix tree, we will denote |N| the length of the substring represented by N.

How can you characterize a "double suffix" in a suffix tree? Well it is a terminal node N with a parent that has a specific property: let P be the parent node of a double suffix, then:

  • P has a transition to the suffix node N that only contains the end token ($ above) of the string.
  • Let suffix be the substring represented by the node P with an appended end token (baa$ in your example). If we walk down the tree from P, using suffix, we end up in another suffix node N' (actually walking down the tree won't be needed).
  • The substring represented by the node P is the double suffix (baa in our case).
  • We have the equalities |N'| = 2.|P| + 1 and |N| = |P| + 1

Given that, you only have to iterate over suffix nodes and test this condition. You can be greedy if you iterate suffixes in decreasing order of length: the first match is necessarily the longest double suffix.

Note that we can stop our search after having inspected the suffix of length |S|/2, and that we only need to iterate over suffixes of odd length (do not forget that we add an end token to the string).

Complexity analysis

Building the suffix tree is O(|S|).
Let N' be a suffix node and N be the suffix node for the suffix of length (|N'|-1)/2 + 1. Assuming proper construction of the tree:

  • The suffixes can be stored in an array/vector in increasing order because the creation of the tree adds them in increasing order of length (at least with Ukkonen's algorithm).
  • Thus accessing the suffix of length k is O(1)
  • Accessing the substring represented by a node of the tree is O(1), in particular, this applies to P the parent node of N and N'
  • Finding out if the transition from P to N only contains the end token ($) is O(1)
  • Checking if |N'| = 2.|P| + 1 is indeed O(1)

Since we are iterating over the suffixes in decreasing order of length, we necessarily focus on the N' suffixes (the doubled suffix, i.e. baabaa$ in your example), so we just have to:

  • Get N the suffix node such that |N'| = 2.|N| - 1: O(1)
  • Get P the parent of the suffix node N: O(1)
  • Check that the transition from P to N contains only the end token $: O(1)

Proof: (We ignore the end token in the following proof)

The 3 steps above, if leading to a true evaluation, prove the existence of a suffix of length 2.|P| that starts with the substring represented by P, which is also a suffix. Since this substring is a suffix, the suffix of length 2.|P| necessarily ends with it and therefore is made of two occurrences of that substring QED.
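
Spelled out with indices, the last step of that argument can be written as below. The names $w$, $t$, $k$ and $n$ are introduced only for this sketch: $w$ is the substring represented by P, $k = |P|$, $n = |S|$, and $t$ is the suffix of length $2k$. The end token is ignored, as in the proof.

```latex
\begin{align*}
&\text{Let } w = S[n-k \,..\, n) \text{ and } t = S[n-2k \,..\, n).\\
&\text{Since } w \text{ is a suffix of } S:\quad t[k \,..\, 2k) = S[n-k \,..\, n) = w.\\
&\text{If in addition } t \text{ starts with } w:\quad t[0 \,..\, k) = w.\\
&\text{Hence } t = t[0 \,..\, k)\; t[k \,..\, 2k) = ww.
\end{align*}
```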

Since we will do this step for at most (|S|/2 + 1)/2 suffixes, each in O(1), the identification step is O(|S|) in the worst case.

The overall complexity is thus O(|S|).

Rerito