Why is the hidden constant factor in suffix tree space usage 20?

Question

Suffix trees are generally said to be less space-efficient than suffix arrays. More specifically, the O(n) space bound for a suffix tree hides a constant factor of about 20, compared with roughly 4 for a suffix array. Why is that?

score 3 · Answer 1 · answered Apr 06 '16 at 18:09

Typically, a suffix tree is represented by having each node store one pointer per character in the alphabet, with that pointer indicating where the child node for that character is. Each child pointer is also annotated with a pair of indices into the original string, giving the range of characters that labels the corresponding edge. So for each character in your alphabet (plus the $ character), each suffix tree node needs to store one pointer and two machine words, or three machine words in total. If you're working on a computational genomics application where the alphabet is {A, C, T, G}, for example, that is five slots of three words each, so you'd need fifteen machine words per node in the suffix tree. The number of nodes in a suffix tree is at most 2n - 1, where n is the number of suffixes of the string, so you're talking about needing roughly 15 × 2n = 30n machine words.
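To make the arithmetic concrete, here is a minimal sketch that just multiplies out the numbers above. The function name `suffix_tree_words` and its parameters are made up for illustration, and it assumes the naive layout described here (one three-word slot per alphabet character plus `$` in every node), not any particular library's representation.

```python
def suffix_tree_words(n, alphabet_size):
    """Upper bound on machine words for the naive suffix tree layout.

    Assumes every node reserves a slot for each alphabet character plus '$',
    and each slot holds a child pointer, a start index, and an end index.
    """
    slots_per_node = alphabet_size + 1      # alphabet characters plus '$'
    words_per_slot = 3                      # pointer + start index + end index
    words_per_node = slots_per_node * words_per_slot
    max_nodes = 2 * n - 1                   # at most 2n - 1 nodes for n suffixes
    return words_per_node * max_nodes


# DNA alphabet {A, C, T, G} and a string with 1000 suffixes:
print(suffix_tree_words(1000, 4))  # 15 words per node * 1999 nodes = 29985
```

For n = 1000 this gives 29985 words, which is the "roughly 30n" figure. A larger alphabet makes the per-node cost grow linearly under this layout.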

Contrast this with a suffix array, where for each character in the string you just need one machine word (the index of the suffix), so a total of n machine words is needed to store the suffix array. Usually, suffix arrays are paired with LCP arrays (which give more insight into the structure of the array), which require another n - 1 machine words, so you come out to a total of 2n - 1 machine words. Compared with the roughly 30n words of the suffix tree, this is a huge savings, which is one of the reasons why suffix arrays are used so much in practice.
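As a rough illustration of that word count, the sketch below builds a suffix array and its LCP array for a small string and counts the entries. The helper names are hypothetical, and the construction is deliberately naive (sorting whole suffixes), which is far slower than the algorithms used in practice; it is only meant to show that the two arrays hold n and n - 1 integers.

```python
def build_suffix_array(text):
    # Naive construction: sort suffix start positions by the suffix itself.
    return sorted(range(len(text)), key=lambda i: text[i:])


def build_lcp_array(text, sa):
    # lcp[i] = length of the longest common prefix of the suffixes
    # starting at sa[i] and sa[i + 1] (adjacent in sorted order).
    lcp = []
    for a, b in zip(sa, sa[1:]):
        k = 0
        while a + k < len(text) and b + k < len(text) and text[a + k] == text[b + k]:
            k += 1
        lcp.append(k)
    return lcp


text = "GATTACA$"                 # n = 8 suffixes, '$' sorts before the letters
sa = build_suffix_array(text)     # n entries, one word each
lcp = build_lcp_array(text, sa)   # n - 1 entries, one word each
total_words = len(sa) + len(lcp)

print(sa)           # [7, 6, 4, 1, 5, 0, 3, 2]
print(lcp)          # [0, 1, 1, 0, 0, 0, 1]
print(total_words)  # 15, i.e. 2n - 1 for n = 8
```

Under the assumptions in this answer, a suffix tree for the same string could need up to 15 × (2 × 8 - 1) = 225 words, against 15 here.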

I got a bit confused by your answer. Do we agree that we have at most n-1 leaves, 2n-2 edges and n-1 internal nodes? How is this 30 computed, then? How many pointers do I need for an internal node, an edge and a leaf? — curious, Apr 07 '16 at 00:45
There are two separate quantities: the total number of nodes and the total space per node. The huge space blowup in suffix trees comes from the amount of memory stored per node in the tree (15 machine words) and from the fact that there are more nodes in a suffix tree (at most 2n-1) than entries in a suffix array (n), so in the suffix tree you have more space per entry times more entries. — templatetypedef, Apr 07 '16 at 01:36
If your alphabet has four characters in it, then each node potentially needs to store a child pointer for each of those characters plus the end-of-string marker. Each pointer is a word. Additionally, you need to know the whole substring associated with that child pointer, which requires pointers to the start and end of the range of the string that appears on the edge. That's two more words per character. In total, that's three words for each of five characters, which works out to 15 words per node. — templatetypedef, Apr 07 '16 at 04:47
I literally just taught a class on this today! I talk about the space usage of suffix trees toward the middle and end. In case you're curious, the slides are [available online](http://web.stanford.edu/class/cs166/lectures/03/Slides03.pdf). — templatetypedef, Apr 08 '16 at 02:10
