
I have to build some kind of a "string catalogue" out of large PDF documents for faster string/substring searches.

The mechanism should work like this: A PDF scanner scans the PDF document for strings and invokes a callback method on my catalogue to index each string it finds.

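Just to make the setup concrete, here is a minimal sketch of that callback interface; the names (`StringCatalogue`, `onStringFound`, `buildIndex`) are made up for illustration and are not from any real PDF library:

```cpp
#include <string>
#include <vector>

// Hypothetical catalogue that the PDF scanner feeds via a callback.
class StringCatalogue {
public:
    // Called by the scanner once per extracted string.
    void onStringFound(const std::string& s) {
        strings_.push_back(s);   // collect now, build the index afterwards
    }

    // After scanning, build whatever index structure is chosen
    // (generalized suffix tree, suffix array, ...).
    void buildIndex();

private:
    std::vector<std::string> strings_;
};
```
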
Now, what technique should be used to build such a catalogue? I have heard of:
  • Suffix trees
  • Generalized suffix trees
  • Suffix arrays

I am mainly leaning towards generalized suffix trees. Am I right or wrong? I guess "normal" suffix trees are only good for indexing a SINGLE string.

But what about suffix arrays? Are there generalized suffix arrays out there?

I found a lot of code in C/C++ for building a suffix tree out of a string, but none for building generalized suffix trees!

Hasib Samad
  • A generalized suffix tree can be built with any suffix tree algorithm. For strings `s1, s2, ..., sn`, construct a new string as the concatenation of `s1, s2, ..., sn` separated by characters `$1, $2, ...` that are not contained in the main alphabet: `S = s1$1s2$2...sn$n` (see the sketch after these comments). – pogorskiy Nov 08 '12 at 11:01
  • The main drawback of the suffix tree is the large amount of memory it requires (10n to 20n bytes, where n is the length of the text). – pogorskiy Nov 08 '12 at 11:11
  • That's exactly how I am doing it in the meantime. I am loading the whole content as a SINGLE string and creating one tree out of it. It works, but I would like to do it the way the generalized suffix tree does it. – Hasib Samad Nov 08 '12 at 17:11
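
Here is a minimal sketch of the separator-concatenation idea from the first comment. It assumes the extracted strings contain no control characters (so bytes like `'\x01'` can serve as unique separators) and that the number of strings is small. For brevity the index built on `S` is a naive generalized suffix array (sort all suffixes, then binary-search), purely for illustration; the same concatenated string could instead be fed to any single-string suffix tree construction, as the comment suggests:

```cpp
#include <algorithm>
#include <iostream>
#include <string>
#include <vector>

int main() {
    std::vector<std::string> docs = {"banana", "bandana"};

    // S = s1 $1 s2 $2 ... sn $n, with each $i outside the text alphabet.
    std::string S;
    for (std::size_t i = 0; i < docs.size(); ++i) {
        S += docs[i];
        S += static_cast<char>(1 + i);   // unique separator per string
    }

    // Naive generalized suffix array: sort all suffix start positions.
    std::vector<std::size_t> sa(S.size());
    for (std::size_t i = 0; i < S.size(); ++i) sa[i] = i;
    std::sort(sa.begin(), sa.end(), [&](std::size_t a, std::size_t b) {
        return S.compare(a, std::string::npos, S, b, std::string::npos) < 0;
    });

    // Substring search: find the first suffix that has the query as a prefix.
    std::string query = "ana";
    auto it = std::lower_bound(sa.begin(), sa.end(), query,
        [&](std::size_t pos, const std::string& q) {
            return S.compare(pos, q.size(), q) < 0;
        });
    if (it != sa.end() && S.compare(*it, query.size(), query) == 0)
        std::cout << "found \"" << query << "\" at position " << *it << "\n";
}
```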

0 Answers