I have to build a kind of "string catalogue" from large PDF documents to speed up string/substring searches.
The mechanism should work like this: a PDF scanner scans the PDF document for strings and, for each string it finds, invokes a callback method in my catalogue to index that string.
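
To make the mechanism concrete, here is a minimal sketch of the callback flow. Python is used only for brevity, and every name in it (`StringCatalogue`, `index_string`, `scan_document`) is hypothetical; the "scanner" is a stand-in that walks a list of already-extracted strings rather than a real PDF. The catalogue here is deliberately naive (a linear scan), which is exactly the part I want to replace with a suffix-based index.

```python
from typing import Callable, List


class StringCatalogue:
    """Naive placeholder catalogue: stores strings and scans them linearly."""

    def __init__(self) -> None:
        self.strings: List[str] = []

    def index_string(self, text: str) -> None:
        # Callback invoked by the scanner for every string it extracts.
        self.strings.append(text)

    def find(self, pattern: str) -> List[int]:
        # Linear scan over all stored strings; returns the ids of the
        # strings that contain the pattern as a substring.
        return [i for i, s in enumerate(self.strings) if pattern in s]


def scan_document(chunks: List[str], on_string: Callable[[str], None]) -> None:
    # Hypothetical stand-in for the PDF scanner: it reports each
    # already-extracted string through the callback.
    for chunk in chunks:
        on_string(chunk)


catalogue = StringCatalogue()
scan_document(["suffix tree", "suffix array", "PDF scanner"], catalogue.index_string)
print(catalogue.find("suffix"))  # [0, 1]
```

The scanner only needs to know the callback signature, so the index behind `index_string` can be swapped out without touching the scanning code.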
Now, what technique should be used to build such a catalogue? I have heard of:
- Suffix trees
- Generalized suffix trees
- Suffix arrays
I am mainly leaning toward generalized suffix trees. Is that the right choice? As far as I understand, "normal" suffix trees are only good for indexing a SINGLE string.
But what about suffix arrays? Are there generalized suffix arrays out there?
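
For illustration, this is roughly what I would understand by a "generalized" suffix array: one sorted array holding the suffixes of all indexed strings, with each entry remembering which string it came from. The sketch below is Python for brevity, the class and method names are my own invention, and the construction is the naive "sort all suffixes" approach, not one of the linear-time algorithms. It is meant to show the idea, not a production implementation.

```python
from typing import List, Tuple


class GeneralizedSuffixArray:
    """Sorted (string id, offset) pairs covering the suffixes of several strings."""

    def __init__(self, strings: List[str]) -> None:
        self.strings = strings
        # One entry per suffix of every string.
        self.entries: List[Tuple[int, int]] = [
            (sid, off) for sid, s in enumerate(strings) for off in range(len(s))
        ]
        # Naive construction: sort entries by the suffix text they denote.
        self.entries.sort(key=lambda e: strings[e[0]][e[1]:])

    def _suffix(self, i: int) -> str:
        sid, off = self.entries[i]
        return self.strings[sid][off:]

    def find(self, pattern: str) -> List[Tuple[int, int]]:
        # Binary search for the first suffix that is >= pattern.
        lo, hi = 0, len(self.entries)
        while lo < hi:
            mid = (lo + hi) // 2
            if self._suffix(mid) < pattern:
                lo = mid + 1
            else:
                hi = mid
        # All suffixes starting with the pattern form one contiguous run.
        hits: List[Tuple[int, int]] = []
        while lo < len(self.entries) and self._suffix(lo).startswith(pattern):
            hits.append(self.entries[lo])
            lo += 1
        return sorted(hits)


gsa = GeneralizedSuffixArray(["banana", "bandana"])
print(gsa.find("ana"))  # [(0, 1), (0, 3), (1, 4)]
```

Each hit is a (string id, offset) pair, so a substring match can be traced back to the string it was found in, which is what the catalogue needs.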
I found a lot of code in C/C++ for building a suffix tree out of a string, but none for building generalized suffix trees!