0

I have a collection of around 50 million strings, each around 100 characters long. I am looking for a very efficient (in running time and memory usage) generalized suffix tree implementation.

I have tried https://github.com/npgall/concurrent-trees, but it uses a huge amount of memory even though the running time is good. With 2.5 million strings of length 100, it already took about 50GB of memory.

Benjamin Nguyen
  • The things we actually do with data sets like that are not done with generalized suffix trees. If you explain what you need to be able to do with these 50M strings, i.e., what kinds of queries you need to be able to answer, we might be able to suggest a practical data structure. Also say what kind of data it is and how much RAM you have available – Matt Timmermans Nov 09 '15 at 05:08
  • For a given substring, I would like to retrieve all the occurrences: the strings containing the substring and the positions of the occurrences. I would like to squeeze this into less than 16GB, or even 8GB. There might be a time/space trade-off here. If so, I would prefer that the query time and the build time do not degrade too badly (a log factor is fine) – Benjamin Nguyen Nov 10 '15 at 15:28
  • Please read [What topics can I ask about here?](http://stackoverflow.com/help/on-topic), [What types of questions should I avoid asking?](http://stackoverflow.com/help/dont-ask) and [How do I ask a good question?](http://stackoverflow.com/help/how-to-ask) before attempting to ask more questions. An excessive number of poorly received questions that are off-topic will get you banned from asking questions, and you do not want that do you? –  Mar 03 '16 at 17:56

1 Answer

0

Not an ideal solution, but you could use CritBit. It has a CritBit1D version, where you can store arbitrary-length keys.

Disadvantage #1: You would have to convert your strings to long[] first, i.e., 4-8 characters per long.
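A minimal sketch of that conversion, assuming the strings fit in ISO-8859-1 so that 8 characters go into each long (with 16-bit chars it would be 4 per long). The class and method names (`StringPacker`, `pack`) are made up for illustration and are not part of CritBit; whether the key array has to be padded to a fixed length should be checked against the library.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of any library.
final class StringPacker {

    // Packs the ISO-8859-1 bytes of s into longs, 8 bytes per long,
    // most significant byte first. The last long is zero-padded.
    // Assumption: characters outside ISO-8859-1 do not occur
    // (getBytes would replace them with '?').
    static long[] pack(String s) {
        byte[] bytes = s.getBytes(StandardCharsets.ISO_8859_1);
        long[] out = new long[(bytes.length + 7) / 8];
        for (int i = 0; i < bytes.length; i++) {
            int shift = 56 - 8 * (i % 8);               // byte 0 goes to the top of the long
            out[i / 8] |= (bytes[i] & 0xFFL) << shift;  // mask to avoid sign extension
        }
        return out;
    }
}
```

A 100-character string becomes a long[13] this way. Because of the zero padding, two strings that differ only in trailing NUL characters would map to the same key.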

Disadvantage #2: If you need a concurrent version, you would have to look at CritBit64COW, which uses copy-on-write concurrency. However, this is not implemented for CritBit1D yet, so you would need to do that yourself, using CritBit64COW as a template.

However, you could simply store only a 64-bit hash code as the key; then you could use CritBit64 (single-threaded) or CritBit64COW (multithreaded). Btw, reading concurrently is not a problem, even with CritBit64.
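Java's `String.hashCode()` is only 32 bits, so a 64-bit hash has to come from somewhere else. Here is a rough sketch using FNV-1a, which is just one possible choice and not something CritBit prescribes; the names `Hash64` and `fnv1a64` are made up for illustration.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of any library.
final class Hash64 {
    private static final long FNV_OFFSET_BASIS = 0xcbf29ce484222325L;
    private static final long FNV_PRIME = 0x100000001b3L;

    // 64-bit FNV-1a over the UTF-8 bytes of s.
    static long fnv1a64(String s) {
        long h = FNV_OFFSET_BASIS;
        for (byte b : s.getBytes(StandardCharsets.UTF_8)) {
            h ^= (b & 0xFFL);   // xor in the next byte (unsigned)
            h *= FNV_PRIME;     // multiply, wrapping modulo 2^64
        }
        return h;
    }
}
```

Keep in mind that a hash key only supports exact-match lookups of whole strings, and that different strings can collide, so the original string should be stored alongside the key and compared on lookup.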

Disclaimer: I'm the author of CritBit.

TilmannZ