
We've got a string.

ABAEABABEABE

Now we have to check whether there exists a substring that is immediately followed by another substring exactly identical to it.

In this example: ABAEABABEABE
ABE is followed by ABE, so these are two identical adjacent substrings.

In this example:

AAB

The answer would simply be A, because A is followed by another A.

In this example:
ABCDEFGHIJKLMNO
There doesn't exist such a substring, so the answer would be NO.


I only managed to find an algorithm that runs in O(n^2). It computes hashes of all prefixes of the string. Then, for each letter, we simply expand and check all the words ending at that letter. There are n letters, and each one may need up to n expansions, so it's O(n^2). I believe there should be an O(n log n) algorithm for this problem.
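
For reference, the prefix-hash idea described above could look roughly like the following Python sketch. It is only one possible reading of the description: the function name `longest_tandem_repeat`, the constants `MOD` and `BASE`, and the choice to return the longest match are assumptions, not part of the question. A hash match is confirmed with a direct comparison to guard against collisions.

```python
MOD = (1 << 61) - 1   # assumed modulus for the polynomial hash
BASE = 911382323      # assumed base, any value below MOD would do


def longest_tandem_repeat(s):
    """Return the longest X such that XX occurs in s, or None if there is none."""
    n = len(s)
    prefix = [0] * (n + 1)   # prefix[i] = hash of s[:i]
    power = [1] * (n + 1)    # power[i] = BASE**i mod MOD
    for i, ch in enumerate(s):
        prefix[i + 1] = (prefix[i] * BASE + ord(ch) + 1) % MOD
        power[i + 1] = (power[i] * BASE) % MOD

    def sub_hash(lo, hi):
        # Hash of s[lo:hi], derived from the prefix hashes in O(1).
        return (prefix[hi] - prefix[lo] * power[hi - lo]) % MOD

    # O(n^2) (start, length) pairs, each checked in O(1) via hashes.
    for length in range(n // 2, 0, -1):
        for start in range(n - 2 * length + 1):
            mid = start + length
            if sub_hash(start, mid) == sub_hash(mid, mid + length):
                # Confirm the match directly to rule out a hash collision.
                if s[start:mid] == s[mid:mid + length]:
                    return s[start:mid]
    return None


print(longest_tandem_repeat("ABAEABABEABE"))
print(longest_tandem_repeat("AAB"))
print(longest_tandem_repeat("ABCDEFGHIJKLMNO"))
```

Scanning lengths from longest to shortest is a design choice; scanning from shortest to longest would answer the yes/no question just as well.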

Does anybody have a better idea?

user10101
Reiji Azuma

2 Answers


I guess you want the longest substring possible that follows this pattern.

The first thing to do is to build a suffix tree of the input string. Using Ukkonen's algorithm, this is O(n).

Now, how does the condition you provided translate to the suffix tree? First things first, you are looking for a repeated substring [1]. Repeated substrings appear as internal nodes of the suffix tree. The maximum number of nodes in a suffix tree built from an n-char string is 2n - 1.

You can build a max-heap containing these repeated substrings, keyed by their length (number of chars). You do NOT keep substrings longer than n/2 (see [1]). Building the heap is O(N), where N is the number of internal nodes of the suffix tree. For any suffix tree:

0 ≤ N ≤ n - 2

Now, you take the max out of the priority queue and process the internal node i you obtained:

  1. Let Si be the substring related to i, k = 0 and curnode = i
  2. While k < length(Si)
    1. If the key from i to a child of i is equal to Si[k], then k = k+1
    2. Else break the loop.
  3. If k == length(Si), then the substring is a match. Else, you proceed to the next substring.

Complexity summary

Let n be the length of the query string.

  • Building the suffix tree : O(n)
  • Building the Max-heap of repeated substrings: [3]
    • Identifying the repeated substrings (ie. internal nodes) and storing them in an array: O(n)
    • Heapify the array: O(n)
  • Finding the best match: O(n².log(n)) [2]

Hence the overall worst case complexity is the sum of the above and is O(n².log(n)).

Notes

I came up with the algorithm above myself, so it is suboptimal. If you are brave enough, you can go through this paper, which describes a linear-time algorithm! In any case, suffix trees are key to this problem, so I suggest you study them thoroughly.

[1]: Warning, repeated substrings may partially overlap!

[2]: Actually, the worst case complexity is better than this very naive upper bound but I don't know how to prove it (yet?!). For example, if there were n - 2 internal nodes, that would mean that the original string consists of n occurrences of the same character. In that case, the first substring we check is a match => it's O(n.log(n)).

[3]: If we replace the heap construction by a regular sort (O(n.log(n))), the final comparison step runs in O(n²) instead of O(n².log(n)). This brings the overall complexity to somewhere between O(n.log(n)) (due to the sorting step) and O(n²).

Community
Rerito
  • I think that time complexity is wrong. You need O(length(S_i)) to check whether the first heap entry i represents a tandem repeat (since it might be that the second copy of S_i appears elsewhere in the string, and the string that follows S_i matches it only for the first length(S_i)/2 characters). So you then need to pop the next entry from the heap, and try it. You might need to repeat this length(S_i)/2-1 times. – j_random_hacker Jan 28 '15 at 13:28
  • @j_random_hacker You're right, K.log(n) is the lower bound (where K is the length of the longest repeated substring (<= n/2)). Nonetheless it seems that the real complexity is still lower than the naive worst scenario (checking all repeated substrings in the heap => O(n².log(n))) – Rerito Jan 28 '15 at 13:36
  • I see you've edited, but there are still a couple of references to the original time bound in your answer. I agree that it's better than the O(n^3) brute force approach, but interestingly it's not as good as just running the Z algorithm (e.g. http://codeforces.com/blog/entry/3107) starting at each position in turn, which will take O(n^2) and doesn't need a suffix tree. – j_random_hacker Jan 28 '15 at 14:04
  • @j_random_hacker That must come from the fact that I use the suffix tree in a very suboptimal way. I may not be translating the substring properties we require well enough. Anyway, even though the lower bound is O(n.log(n)) and the upper bound O(n².log(n)), I think it's far better than this naive upper bound – Rerito Jan 28 '15 at 14:20
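
The Z-algorithm approach mentioned in the comments above (run it from each starting position, O(n^2) overall, no suffix tree) could be sketched in Python as follows. This is an illustration under assumptions, not code from the linked post: the names `z_function` and `first_tandem_repeat_z` and the `(start, length)` return format are made up here.

```python
def z_function(s):
    """z[i] = length of the longest common prefix of s and s[i:] (z[0] is left at 0)."""
    n = len(s)
    z = [0] * n
    left = right = 0   # [left, right) is the rightmost match window found so far
    for i in range(1, n):
        if i < right:
            z[i] = min(right - i, z[i - left])
        while i + z[i] < n and s[z[i]] == s[i + z[i]]:
            z[i] += 1
        if i + z[i] > right:
            left, right = i, i + z[i]
    return z


def first_tandem_repeat_z(s):
    """Return (start, length) with s[start:start+length] repeated right after itself, or None."""
    n = len(s)
    for start in range(n):
        suffix = s[start:]
        z = z_function(suffix)   # O(len(suffix)) per starting position
        for length in range(1, len(suffix) // 2 + 1):
            # z[length] >= length means suffix[:length] == suffix[length:2*length].
            if z[length] >= length:
                return (start, length)
    return None


print(first_tandem_repeat_z("ABAEABABEABE"))
print(first_tandem_repeat_z("ABCDEFGHIJKLMNO"))
```

This version stops at the first match it meets while scanning start positions from left to right, so it does not necessarily report the longest repeat.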

This problem can be solved with the divide-and-conquer Main-Lorentz algorithm:
Michael Main, Richard J. Lorentz. An O (n log n) Algorithm for Finding All Repetitions in a String [1982]

Edit: algorithm description and C++ implementation in Russian (might be translated with Chrome browser)

There also exists a linear-time algorithm (I don't know of any practical implementations).
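
To make the divide-and-conquer idea more concrete, here is a hedged Python sketch that answers only the yes/no question asked here (is there any tandem repeat?) and returns one witness. It is not taken from the paper or from the linked implementation: the names `find_square`, `_find`, `_get` and `z_function`, the integer separator `-1`, and the `(start, half_length)` return format are assumptions. The string is split in half, each half is searched recursively, and repeats crossing the middle are detected with Z-functions, which gives O(n log n) overall.

```python
def z_function(s):
    """z[i] = length of the longest common prefix of s and s[i:] (z[0] is left at 0)."""
    n = len(s)
    z = [0] * n
    left = right = 0
    for i in range(1, n):
        if i < right:
            z[i] = min(right - i, z[i - left])
        while i + z[i] < n and s[z[i]] == s[i + z[i]]:
            z[i] += 1
        if i + z[i] > right:
            left, right = i, i + z[i]
    return z


def _get(z, i):
    # Z-values outside the array count as 0.
    return z[i] if 0 <= i < len(z) else 0


def _find(s, shift):
    n = len(s)
    if n < 2:
        return None
    nu = n // 2
    nv = n - nu
    u, v = s[:nu], s[nu:]
    # Repeats lying entirely inside one half are found recursively.
    found = _find(u, shift)
    if found is None:
        found = _find(v, shift + nu)
    if found is not None:
        return found
    ru, rv = u[::-1], v[::-1]
    z1 = z_function(ru)
    z2 = z_function(v + [-1] + u)    # -1 is a separator that never equals a code point
    z3 = z_function(ru + [-1] + rv)
    z4 = z_function(v)
    for cntr in range(n):
        if cntr < nu:
            # Second copy starts inside u; half-length is fixed by cntr.
            half = nu - cntr
            k1 = _get(z1, nu - cntr)        # match length going left from cntr
            k2 = _get(z2, nv + 1 + cntr)    # match length going right from cntr
            start = cntr - k1
        else:
            # Second copy starts inside v.
            half = cntr - nu + 1
            k1 = _get(z3, n - (cntr - nu))  # match length going left from the middle
            k2 = _get(z4, cntr - nu + 1)    # match length going right from the middle
            start = nu - k1
        if k1 + k2 >= half:
            return (shift + start, half)
    return None


def find_square(s):
    """Return (start, half_length) of some tandem repeat in s, or None."""
    return _find([ord(c) for c in s], 0)


print(find_square("ABAEABABEABE"))
print(find_square("ABCDEFGHIJKLMNO"))
```

The returned pair `(start, half_length)` means `s[start:start+half_length]` is immediately followed by an identical copy; which repeat is reported depends on the recursion order, so it is a witness rather than the longest one.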

MBo
  • Longest common prefix does not guarantee it will be one after the other. – amit Jan 28 '15 at 09:52
  • Also - finding all repetitions in a string does not guarantee "one after the other", and I am skeptical about `O(nlogn)` time, since there are O(n^2) repetitions in "aaaaaaaa....aa", it must be dependent on the size of the output as well - and that might be an overkill to find **all** repetitions in this case. – amit Jan 28 '15 at 09:57
  • @amit OK, removed that variant – MBo Jan 28 '15 at 09:58
  • @amit 'Crochemore triplets' allow to code all tandem repetitions with nlogn such triplets (using repetition counts) – MBo Jan 28 '15 at 10:02
  • Ok, but if you have only the count - and not the repetitions themselves - how would you find out if they appear one after the other? I am not saying it cannot be done nor this is not the right direction - but the solution as is, is lacking details. – amit Jan 28 '15 at 10:04