I think that Gene's solution is the simpler to implement and since it does not rely on an arborescent structures but on arrays, it is likely more hardware friendly as well.
But since you mentioned suffix trees, let's look into a solution based on suffix trees! I will assume that you use an end token to mark the end of the string(s) you insert in the tree. To illustrate this, here is a representation of the suffix tree built for your abaabaa
example:
$ - ##
b a a - $ - ## // Longest double suffix: P is the first dash, N the second
b a a $ - ## // N' is the dash
a - $ - ##
a - $ - ##
b a a $ - ##
b a a - $ - ##
b a a $ - ##
When N is a node in a suffix tree, we will denote |N| the length of the substring represented by N.
How can you characterize a "double suffix" in a suffix tree? Well it is a terminal node N with a parent that has a specific property: let P be the parent node of a double suffix, then:
- P has a transition to the suffix node N that only contains the end token (
$
above) of the string.
- Let suffix be the substring represented by the node P with an appended end token (
baa$
in your example). If we walk down the tree from P, using suffix, we end up in another suffix node N' (walking down the tree won't be actually needed)
- The substring represented by the node P is the double suffix (
baa
in our case).
- We have the equalities |N'| = 2.|P| + 1 and |N| = |P| + 1
Given that, you only have to iterate over suffix nodes and test this condition. You can be greedy if you iterate suffixes in decreasing order of length: the first match is necessarily the longest double suffix.
Note that we can stop our search after having inspected the suffix of length |S|/2 and only iterate over suffixes of odd length (do not forget we add an end token to the string)
Complexity analysis
Building the suffix tree is O(|S|)
.
Let N' be a suffix node and N
be the suffix node for the suffix of length (|N'|-1)/2 + 1. Assuming proper construction of the tree:
- The suffixes can be stored in an array/vector in increasing order because the creation of the tree adds them in increasing order of length (at least with the Ukkonen's algorithm).
- Thus accessing the suffix of length k is
O(1)
- Accessing the substring represented by a node of the tree is
O(1)
, in particular, this applies to P the parent node of N and N'
- Finding out if the transition from P to N only contains the end token (
$
) is O(1)
- Checking if |N'| = 2.|P| + 1 is indeed
O(1)
Since we are iterating over the suffix in decreasing order of length, we necessarily focus on the N'
suffixes (the doubled suffix, ie baabaa$
in your example), so we just have to:
- Get N the suffix node such that
|N'| = 2.|N| - 1
: O(1)
- Get P the parent of the suffix node
N
: O(1)
- Check that the transition from P to N contains only the end token
$
: O(1)
Proof: (We ignore the end token in the following proof)
The 3 steps above, if leading to a true evaluation, prove the existence of a suffix of length 2.|P| that starts with the substring represented by P, which is also a suffix. Since this substring is a suffix, the suffix of length 2.|P| necessarily ends with it and therefore is made of two occurrences of that substring QED.
Since we will do this step for at most (|S|/2 + 1)/2
suffixes, the identification step is therefore O(|S|)
in the worst case.
The overall complexity is thus O(|S|)
.