I implemented (see the code below) a minimal generalized suffix tree construction algorithm. I wrote a unit test and it seems to work as expected (it finds the right substring at the right position). But the tree is impractically large. Question: did I make a mistake somewhere, or are suffix trees in this basic form only usable for very short texts?
Statistics
I want to use this to search large bodies of text: multiple 15-20 MB text files, or e.g. 40'000 strings of ~60 characters each.
When I build the suffix tree for the 40'000 strings (2.5 MB in total; it's a list of firm names), the tree takes 400 MB. I could probably optimize it by about 4x, but even then it'd be more than 40 bytes per character of original text. Is that normal? The estimates in the literature suggest to me that this is high.
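To put the numbers above in context, here is a rough sketch of the per-character arithmetic and a lower bound on the size of one node. `VertexLayout`, `TerminatorLayout`, `minBytesPerChainNode` and `bytesPerChar` are hypothetical names used only for illustration; the struct merely mirrors the fields of the `Vertex` class in the code further down, and the actual sizes depend on the compiler, the standard library and the heap allocator.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical stand-in for the Terminator class (two ints).
struct TerminatorLayout
{
    int id;
    int pos;
};

// Hypothetical mirror of the Vertex fields: one char plus two vectors.
struct VertexLayout
{
    char c;
    std::vector<VertexLayout*> children;
    std::vector<TerminatorLayout> terminators;
};

// Lower bound on the bytes used by one node in a non-branching chain:
// the node itself plus the pointer slot its parent stores for it.
// Heap allocator overhead and vector over-allocation are NOT counted.
constexpr std::size_t minBytesPerChainNode()
{
    return sizeof(VertexLayout) + sizeof(VertexLayout*);
}

// Tree size divided by source text size, in bytes per character.
constexpr double bytesPerChar(double treeBytes, double textBytes)
{
    return treeBytes / textBytes;
}
```

With the figures quoted above, `bytesPerChar(400e6, 2.5e6)` gives 160 bytes per character before any optimization, and 40 after a 4x reduction.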
Looking at the average branching factor per level of the tree, I get 80, 40, 8, 3, 2, 1, 1, 1, ... i.e. only the first 3-5 levels of the tree actually branch. I'd be much better off building 3-5 levels and keeping the long remainder of each suffix as text in a single node. Everything below the 5th level is basically a linked list of characters. Is that expected? It makes intuitive sense: there aren't many 6+ character substrings shared between firm names.
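A minimal sketch of how such per-level figures can be derived, assuming the branching factor at a level is the ratio of node counts at consecutive depths (that definition is an assumption, and `branchingFactors` is a hypothetical helper). Its input has the same shape as the vector filled by `getDepthCounts` in the code further down.

```cpp
#include <cstddef>
#include <vector>

// Average branching factor per level:
// factors[d] = depthCounts[d + 1] / depthCounts[d].
std::vector<double> branchingFactors(const std::vector<unsigned> &depthCounts)
{
    std::vector<double> factors;
    for (std::size_t d = 0; d + 1 < depthCounts.size(); ++d)
    {
        if (depthCounts[d] == 0)
            break; // no nodes at this depth, so nothing below it either
        factors.push_back(static_cast<double>(depthCounts[d + 1]) / depthCounts[d]);
    }
    return factors;
}
```

For example, depth counts of 1, 80, 3200 and 25600 yield factors of 80, 40 and 8. A long run of factors equal to 1 means the nodes at those depths form non-branching chains.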
In a further experiment with a 15 MB text, I run out of 2 GB of memory after adding the first 10'000 suffixes (long to short). There are almost no repeating substrings, so the tree cannot reuse much. I can absolutely see how a suffix array is practical: it takes a fixed 2-4 bytes per character and requires only about 24 string compares per search in 20 MB of text. I can't see how a suffix tree with as many vertices as there are unique substrings in the text can fit in memory. That's O(n^2) nodes for a string with all unique characters, and it still seems very superlinear for English text. How can a suffix tree work for large texts? The papers and questions I've read seem to imply it's supposed to be usable, hence my questions here.
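For comparison, the suffix array lookup referred to above can be sketched as follows. This is a naive illustration, not a tuned implementation: `buildSuffixArray` and `findAll` are hypothetical names, construction simply sorts all suffix offsets (far too slow for 20 MB of text, where a dedicated construction algorithm would be used), and the lookup runs two binary searches, so it does roughly twice the ~24 comparisons (about log2 of 20 million) quoted above.

```cpp
#include <algorithm>
#include <cassert>
#include <numeric>
#include <string>
#include <vector>

// Naive suffix array: all suffix start offsets, sorted lexicographically.
std::vector<int> buildSuffixArray(const std::string &text)
{
    std::vector<int> sa(text.size());
    std::iota(sa.begin(), sa.end(), 0);
    std::sort(sa.begin(), sa.end(), [&](int a, int b) {
        return text.compare(a, std::string::npos, text, b, std::string::npos) < 0;
    });
    return sa;
}

// All positions where pattern occurs in text, in increasing order.
// Only the first pattern.size() characters of each suffix are compared.
std::vector<int> findAll(const std::string &text, const std::vector<int> &sa,
                         const std::string &pattern)
{
    assert(sa.size() == text.size());
    // First suffix whose prefix is not less than the pattern.
    auto lo = std::lower_bound(sa.begin(), sa.end(), pattern,
        [&](int suffix, const std::string &p) {
            return text.compare(suffix, p.size(), p) < 0;
        });
    // First suffix whose prefix is greater than the pattern.
    auto hi = std::upper_bound(lo, sa.end(), pattern,
        [&](const std::string &p, int suffix) {
            return text.compare(suffix, p.size(), p) > 0;
        });
    std::vector<int> out(lo, hi);
    std::sort(out.begin(), out.end());
    return out;
}
```

With `text = "banana"`, the suffix array is 5, 3, 1, 0, 4, 2 and `findAll(text, sa, "ana")` returns positions 1 and 3. The only storage beyond the text itself is one integer per character.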
Questions
Did I make a mistake somewhere that makes the tree bigger than it should be?
Is it correct that the method used to build a suffix tree should have no bearing on the final tree shape?
Mine is not Ukkonen's algorithm but a brute-force insertion of all suffixes into the tree, to simplify the code and data structure (no suffix link field) and to compare performance against Ukkonen's algorithm later. As I understand it, the construction method should only affect the speed of construction, not the size or shape of the tree. Either way, the tree is built incrementally, so there's no intermediate result that's bigger than the final tree.
Here's the code:
#include <vector>
#include <assert.h>
#include <string.h> // strlen, used in appendString and findSubstr
class Node
{
public:
char c; // 0 is terminator. terminators have no children
Node(char _c) { c = _c; }
};
class Terminator
{
public:
Terminator(int i, int p ) { id = i; pos = p; }
bool operator == (const Terminator &that)const { return id == that.id && pos == that.pos; }
int id; // id of the string
int pos; //position of this substring in the string
};
class Vertex : public Node
{
public:
static size_t s_nCount;
std::vector< Vertex* > children; // interior tree nodes; char != 0
std::vector< Terminator >terminators;
Vertex(char c) :Node(c) { s_nCount++; }
~Vertex() { for (Vertex*v : children) delete v; }
//void* operator new (size_t count) { return g_GstAlloc.Alloc(count); }
//void operator delete(void *p) { g_GstAlloc.Free(p); }
void getDepthCounts(std::vector<unsigned> &depth, size_t nLevel = 0)const
{
if (depth.size() <= nLevel)
depth.resize(nLevel + 1);
depth[nLevel]++;
for (Vertex*v : children)
v->getDepthCounts(depth,nLevel + 1);
}
Vertex *getOrCreateChildVertex(char c )
{
Vertex *out = getChild(c);
if (!out)
{
out = new Vertex(c);
children.push_back(out);
}
return out;
}
void getTerminators(std::vector<Terminator> &out, size_t limit )
{
if (out.size() >= limit)
return;
out.insert(out.end(), terminators.begin(), terminators.end());
for (Vertex* c: children) {
c->getTerminators(out, limit);
if (out.size() >= limit)
break;
}
}
Vertex *getChild(char c)
{
for (Vertex *p : children)
if (p->c == c)
return p;
return nullptr;
}
size_t memSize()const
{
size_t out = sizeof(*this) + terminators.size() * sizeof(terminators[0]);
for (Vertex*v : children)
out += sizeof(v) + v->memSize();
return out;
}
};
size_t Vertex::s_nCount = 0; // out-of-class definition of the static node counter
class Root : public Vertex
{
public:
Root():Vertex(0) { }
void appendString(const char *str, int id )
{
// insert every suffix of str, starting with the shortest (last character) and ending with the whole string
for (size_t len = strlen(str), suffix = len; suffix-- > 0;)
{
Vertex* parent = this;
for (size_t pos = suffix; pos < len; pos++)
{
parent = parent->getOrCreateChildVertex(str[pos]);
}
parent->terminators.push_back(Terminator(id, (int)suffix));
}
}
void findSubstr(std::vector<Terminator> &out, const char *substr, size_t limit )
{
Vertex *parent = this;
for (size_t len = strlen( substr ), i = 0; i < len; i++)
{
parent = parent->getChild(substr[i]);
if (!parent)
return;
}
parent->getTerminators(out, limit);
}
};