substring searching in a string using suffix tree..?

Question

I have read that :

Searching for a substring, pat[1..m], in txt[1..n], can be solved in O(m) time (after the suffix tree for txt has been built in O(n) time).

but at each point, we will have to choose which branch to take, so like in n-ary tree, at each node, we will have to compare with all max n pointers in that node to decide which branch to take. Will this not bring n factor in complexity of this algorithm, somehow in picture

Then how above it says that substring can be found in O(m)?

What am I missing here?

How is suffix tree represented in practice? could we represent it using adjacency list as in graph? — xyz, Jun 08 '11 at 09:24

Konrad Rudolph · Accepted Answer · 2011-06-08T09:22:41.867

Then how above it says that substring can be found in O(m)?

By omission. You are correct that the runtime of searching in suffix trees is more complex than merely O(m).

However, it can indeed be sped up to O(m) if we trade off space requirements: we need to get the search at each node down to O(1) and we can do this by using an appropriate data structure (e.g. an array) which gives us the appropriate sub-tree for each letter in constant time.

For instance, assume that you’re using C++ for the implementation and your character (char) can contain 256 different values. Then the implementation of a node could look as follows:

struct node {
    char current_character;
    node* children[256];
};

Now, current_character refers to the character of the branch leading to the current node, and children is an array of child nodes. During the search, assume that you are currently at node u, and the next character in the input text is c. Then you will choose the next node as follows:

node* next = u->children[c];
if (next == 0) {
    // Child node does not exist => nothing found.
}
else {
    u = next;
    // Continue search with next …
}

Of course, this is only viable for very small alphabets (e.g. DNA for genomic sequences). In most common cases, the worst-case runtime of a suffix tree is indeed higher than O(m).

ok, any pointers to actual implementation of suffix trees? Please also look at my comment to other answer above. — xyz, Jun 08 '11 at 09:17

score 0 · Answer 2 · answered Jun 08 '11 at 09:12

0

If pointers to childs are in array indexed by the letter, only constant time is needed for each pattern letter

node = tree root
FOR i in 1..m
   node = child[pat[i]]

so the complexity is O(m).

answered Jun 08 '11 at 09:12

Michy

623
9
15

ok, I am not aware of how suffix trees are implemented in practice. but how do you actually keep an array indexible on characters? what is space complexity of this? there are 26 lower case alphabet characters. so with each node, do we keep an array of size 26*pointers. That sound too much wastage of space..? – xyz Jun 08 '11 at 09:15
In the other approach you can have a list of pointers in each node, so the time in one node is O(A) (A is alphabet size) and the time for finding substring is O(m * A). Assuming A as the constant we get O(m) again. – Michy Jun 08 '11 at 09:21
Is there a clear functional implementation, and description of this online. I'm surprised google searches have not been fruitful. Beyond the theory, how do you implement a suffix tree in javascript, and search for matching substrings, so that the closest result is returned? The purpose is an autocomplete field. C'mon gods of javascript. – zero_cool Feb 24 '17 at 20:05

substring searching in a string using suffix tree..?

2 Answers2