I have a huge list of unique strings (1,000,000,000+ lines). I need to know whether a given string exists in this list or not. What is the fastest way to do that?

I guess I need a very simple database engine with a B-tree index that supports fast lookups; MySQL may be too slow and complex for this.

Antares

1 Answer

If this is all you need to do, take a long look at tries and related data structures specialized for strings (e.g. suffix arrays). With this many strings, you are practically guaranteed to have a lot of overlap, such as shared prefixes, and these data structures can eliminate that overlap, saving not only memory but also processing time.
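
To make the idea concrete, here is a minimal sketch of a trie used purely for membership tests. The `Trie` class and its method names are made up for illustration, and a dict-per-node layout in Python is far too memory-hungry for a billion strings; a real solution would more likely use a compact on-disk or succinct trie library. It only shows how shared prefixes are stored once and how a lookup walks one node per character.

```python
class Trie:
    """Illustrative dict-based trie for exact membership tests (hypothetical names)."""

    # Sentinel key marking "a stored string ends here". Iterating over a
    # string never yields the empty string, so it cannot clash with a character.
    _END = ""

    def __init__(self):
        self.root = {}

    def insert(self, word):
        node = self.root
        for ch in word:
            # Reuse the existing child if the prefix is already present.
            node = node.setdefault(ch, {})
        node[self._END] = True

    def __contains__(self, word):
        node = self.root
        for ch in word:
            node = node.get(ch)
            if node is None:
                return False
        # A prefix of a stored string is not itself a member unless marked.
        return self._END in node


trie = Trie()
for s in ["apple", "apply", "ape", "banana"]:
    trie.insert(s)
```

After this, `"apple" in trie` is true while `"app" in trie` is false, because `app` is only a prefix of stored strings. The lookup cost depends on the length of the query string, not on how many strings are stored.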

  • This. What OP really needs is a trie, not a complete RDBMS or NoSQL solution. – DaSourcerer Dec 14 '13 at 21:49
  • Do you know of parallel implementations of such structures? With tens of gigabytes of strings, I imagine that parallelism would be a benefit. – Gordon Linoff Dec 14 '13 at 21:54
  • @GordonLinoff Depends on what you want to parallelize. It's trivial to run several read-only queries in parallel. Construction should be easy to parallelize: At each level, you bucket the strings according to their next letter, and then construction proceeds for each bucket independently. It doesn't appear to be possible to parallelize parts of one search, but since trie lookup is O(string length), this seems like a non-issue. –  Dec 14 '13 at 21:57
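
The bucketing idea from the last comment could look roughly like the sketch below. It is an assumption-laden illustration, not a tuned implementation: the function names are invented, it buckets only on the first character rather than at every level, and it uses a thread pool merely to show that the buckets are independent. In CPython, threads will not speed up this CPU-bound work, so a real build would use processes or a language with true parallelism.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor

END = ""  # sentinel key marking the end of a stored string


def build_subtrie(suffixes):
    """Build a dict-based trie from an iterable of strings."""
    root = {}
    for s in suffixes:
        node = root
        for ch in s:
            node = node.setdefault(ch, {})
        node[END] = True
    return root


def build_trie_bucketed(words, max_workers=4):
    """Bucket words by first character, then build each bucket independently."""
    buckets = defaultdict(list)
    root = {}
    for w in words:
        if w:
            buckets[w[0]].append(w[1:])
        else:
            root[END] = True  # the empty string ends at the root
    letters = list(buckets)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Each bucket touches only its own subtrie, so no locking is needed.
        subtries = pool.map(build_subtrie, (buckets[c] for c in letters))
        for letter, sub in zip(letters, subtries):
            root[letter] = sub
    return root


def contains(root, word):
    """Read-only lookup; safe to call from several threads at once."""
    node = root
    for ch in word:
        node = node.get(ch)
        if node is None:
            return False
    return END in node


root = build_trie_bucketed(["car", "cat", "dog", "do"])
```

Because lookups never modify the structure, several `contains` calls can run concurrently against the finished trie without coordination.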