
Given a (modified/broken) suffix tree that stores on each edge the start and end indices of the edge's substring, but not the substring itself, i.e. a suffix tree that looks like this: [image not available]

This tree represents the string "banana" over the alphabet {a, b, n}.

I'm looking for an algorithm that finds the string such a tree represents; for the example above, it should find "banana". I would like to do that in O(|string|) time, where |string| is the length of the string being reconstructed. It can be assumed that:

The size of the alphabet is constant, and every string starts at index 1.

avim
wannabe programmer

1 Answer

  1. Let's start with a polynomial-time solution:

    • Let's divide all characters in the string into equivalence classes (two positions are in the same class iff they hold the same character).

    • We already know the last character: it is a special $ symbol.

    • Induction hypothesis: let's assume that we have properly divided all characters of the suffix of length k into classes of equivalence. We can do it properly for the suffix of length k + 1, too.

    • Proof: let's iterate over all suffixes of length i <- 1...k and check whether the longest common prefix of the suffix of length k + 1 and the suffix of length i is non-empty. It is non-empty iff the lowest common ancestor of the corresponding leaves is not the root of the tree. If we find such a suffix, we know that its first letter is equal to the first letter of the current suffix, so we can add the first letter of the suffix of length k + 1 to that suffix's equivalence class. Otherwise, it gets an equivalence class of its own.

    • When all characters are divided into equivalence classes, we just need to assign a unique symbol to each class. If we need to maintain the correct lexicographical order, we can check which class comes first by looking at the order of the edges that leave the root.

    • The time complexity is O(n^3): there are n suffixes, we iterate over O(n) other suffixes for each of them, and we compute each lca in O(n) (I assume that we use a naive algorithm here). So far, so good.
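The polynomial procedure above could be sketched roughly as follows. This is only an illustration under assumptions that are mine, not the question's: a node is encoded as a list of `(start, end, child)` edges with 1-based inclusive indices, a leaf is an empty list, the string ends with a unique `$`, and the names `classes_naive` and `ABA` are made up for the example.

```python
# Hypothetical encoding of the suffix tree of "aba$" (indices 1..4).
ABA = [
    (4, 4, []),                  # "$"
    (1, 1, [                     # "a"
        (4, 4, []),              # "a$"
        (2, 4, []),              # "aba$"
    ]),
    (2, 4, []),                  # "ba$"
]

def classes_naive(root):
    """Return one class id per position (left to right), using naive LCAs."""
    parent, depth = {0: None}, {0: 0}    # node id -> parent id / string depth
    leaf_by_len = {}                     # suffix length -> leaf id
    todo, next_id = [(root, 0)], 1
    while todo:
        node, nid = todo.pop()
        if not node and nid != 0:
            leaf_by_len[depth[nid]] = nid
        for start, end, child in node:
            parent[next_id] = nid
            depth[next_id] = depth[nid] + end - start + 1
            todo.append((child, next_id))
            next_id += 1

    def lca(u, v):
        # Naive LCA: collect the ancestors of u, then climb from v.
        seen = set()
        while u is not None:
            seen.add(u)
            u = parent[u]
        while v not in seen:
            v = parent[v]
        return v

    n = len(leaf_by_len)                 # one leaf per suffix
    cls, fresh = {}, 0                   # suffix length -> class id
    for k in range(1, n + 1):            # shortest suffix first
        for i in range(1, k):
            if lca(leaf_by_len[k], leaf_by_len[i]) != 0:   # 0 is the root
                cls[k] = cls[i]
                break
        else:
            cls[k] = fresh               # no shorter suffix shares the letter
            fresh += 1
    # Position p (1-based) is the first letter of the suffix of length n - p + 1.
    return [cls[n - p + 1] for p in range(1, n + 1)]
```

For `ABA` the two `a` positions end up in the same class and `$` gets the class created first.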

  2. Now let's use several observations to get a linear solution:

    • We don't really need the lca. We just need to check that it is not the root. Thus, we can divide all leaves into equivalence classes based on their ancestor that is an immediate child of the root. It can be done in linear time using a depth-first search. The longest common prefix of two suffixes is non-empty iff they are in the same class.

    • We don't actually need to check all shorter suffixes. We only need to check the closest one to the left and to the right in depth-first search order. Finding the closest smaller number to the left and to the right of a given one is a standard problem, and it has a linear solution with a stack.

    • That's it: we check at most two other suffixes for the given one and each check is O(1). We have a linear solution now.
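Here is a minimal sketch of the grouping-by-root-child idea, under assumptions of my own: a node is a list of `(start, end, child)` edges with 1-based inclusive indices, a leaf is an empty list, and the string ends with a unique `$` (so the example is "banana$" with `$` at index 7). The names `reconstruct`, `BANANA` and `symbols` are hypothetical. It relies on the fact that a leaf at string depth d corresponds to the suffix starting at position n - d + 1, so each leaf directly tells us where its root child's letter goes.

```python
# Hypothetical encoding of the suffix tree of "banana$" (indices 1..7).
BANANA = [
    (7, 7, []),                      # "$"
    (2, 2, [                         # "a"
        (7, 7, []),                  # "a$"
        (3, 4, [                     # "ana"
            (7, 7, []),              # "ana$"
            (5, 7, []),              # "anana$"
        ]),
    ]),
    (1, 7, []),                      # "banana$"
    (3, 4, [                         # "na"
        (7, 7, []),                  # "na$"
        (5, 7, []),                  # "nana$"
    ]),
]

def reconstruct(root, symbols):
    """symbols[j] is the letter assigned to the j-th edge leaving the root."""
    leaves = []                      # (string depth, class index) per leaf
    for cls, (start, end, child) in enumerate(root):
        # Iterative DFS over the subtree below this child of the root.
        stack = [(child, end - start + 1)]
        while stack:
            node, depth = stack.pop()
            if not node:
                leaves.append((depth, cls))
            for s, e, c in node:
                stack.append((c, depth + e - s + 1))
    n = max(depth for depth, _ in leaves)    # the deepest leaf is the whole string
    text = [None] * n
    for depth, cls in leaves:
        # The suffix of length `depth` starts at 0-based index n - depth.
        text[n - depth] = symbols[cls]
    return "".join(text)
```

Calling `reconstruct(BANANA, "$abn")` visits every node once. The symbols are handed out in the order of the edges leaving the root, which matches the lexicographical order only if those edges are stored in sorted order.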

This solution assumes that such a string does exist. If we cannot rely on that assumption, we can construct some string using this algorithm, then build a suffix tree for it in linear time using Ukkonen's algorithm and check that it is exactly the same as the given one.

kraskevich