
I have a large set of strings (roughly 50k to 1M, all of them short, about 1-20 chars each). Given an arbitrary RegExp, I need to return a list/iterator of all matching strings. This must be as fast as possible.

What are good index structures to do that?

Currently, I build a tree over the chars of the strings (a trie), convert the RegExp into a deterministic finite automaton, and then compute the intersection of that automaton with the tree. That looks like a fast approach, but I wonder about other possibilities.
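
To make the approach above concrete, here is a minimal sketch of the trie/DFA intersection. The names `Trie`, `Dfa`, `match_all`, and `make_ab_star_c` are hypothetical and not taken from my real code; the DFA is built by hand for the anchored expression `ab*c`, since compiling a RegExp into a DFA is a separate step that is assumed to exist already.

```cpp
#include <cassert>
#include <map>
#include <memory>
#include <set>
#include <string>
#include <utility>
#include <vector>

// Hypothetical byte-wise trie; one node per distinct prefix.
struct TrieNode {
    std::map<char, std::unique_ptr<TrieNode>> children;
    bool terminal = false;  // true if a stored string ends at this node
};

struct Trie {
    TrieNode root;

    void insert(const std::string& s) {
        TrieNode* node = &root;
        for (char c : s) {
            std::unique_ptr<TrieNode>& child = node->children[c];
            if (!child) child = std::make_unique<TrieNode>();
            node = child.get();
        }
        node->terminal = true;
    }
};

// Hypothetical DFA representation: a missing transition means "dead state".
struct Dfa {
    int start = 0;
    std::set<int> accepting;
    std::map<std::pair<int, char>, int> delta;
};

// Walk the trie and the DFA in lockstep. A trie subtree is skipped as soon
// as the DFA has no transition for the edge character.
void collect(const TrieNode& node, const Dfa& dfa, int state,
             std::string& prefix, std::vector<std::string>& out) {
    if (node.terminal && dfa.accepting.count(state)) out.push_back(prefix);
    for (const auto& edge : node.children) {
        auto it = dfa.delta.find(std::make_pair(state, edge.first));
        if (it == dfa.delta.end()) continue;  // prune this subtree
        prefix.push_back(edge.first);
        collect(*edge.second, dfa, it->second, prefix, out);
        prefix.pop_back();
    }
}

std::vector<std::string> match_all(const Trie& trie, const Dfa& dfa) {
    std::vector<std::string> out;
    std::string prefix;
    collect(trie.root, dfa, dfa.start, prefix, out);
    return out;
}

// Hand-built DFA for the anchored expression ab*c (illustration only).
Dfa make_ab_star_c() {
    Dfa d;
    d.start = 0;
    d.accepting.insert(2);
    d.delta[std::make_pair(0, 'a')] = 1;
    d.delta[std::make_pair(1, 'b')] = 1;
    d.delta[std::make_pair(1, 'c')] = 2;
    return d;
}
```

The work done is bounded by the number of trie nodes whose prefix is still alive in the DFA, so a selective expression touches only a small part of the tree. An expression that starts with something like `.*` keeps every prefix alive and degrades to a full traversal.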

An extra challenge is to support Unicode/UTF-8, but I don't want to focus this question on that for now.

Albert
  • What OS and language are you trying to use (or able to use)? – rob May 09 '14 at 08:31
  • @rob: I'm mostly asking for the algorithm / data structure here, so that shouldn't matter. However, I'm coding that in C++. – Albert May 09 '14 at 09:45
  • It depends on how often the list of strings needs to be looked up and how often it needs to be added to. At a previous job there was a similar project where the strings were changed rarely and read often. For that case, a graph of the characters was used and traversed to reach the appropriate endpoints. It resulted in a considerable speedup over the old system. (I don't remember the specific graph structure; all I remember is that they were using the `c++` boost library.) Good luck. – Mike H-R May 09 '14 at 10:51
  • @MikeH-R: They are added iteratively, and adding a new one to the list should be possible in reasonable time; however, it is much more important that the query is fast. So I guess my conditions are similar to yours. – Albert May 09 '14 at 11:34

1 Answer


I just found the codesearch project, which seems to implement just that. It is explained in the article "Regular Expression Matching with a Trigram Index".
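
As a rough illustration of the trigram idea (not codesearch's actual code), here is a minimal sketch in C++. The class `TrigramIndex` and its `search` signature are hypothetical. It also makes one big simplifying assumption: the caller passes a literal `required` that every match must contain, whereas the article derives a full AND/OR trigram query from the RegExp itself. The index only narrows the candidates; each candidate is still verified with `std::regex_match`.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <iterator>
#include <numeric>
#include <regex>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical trigram index: maps each 3-char substring to the sorted
// list of ids of the strings that contain it.
class TrigramIndex {
public:
    void add(const std::string& s) {
        const std::size_t id = strings_.size();
        strings_.push_back(s);
        for (std::size_t i = 0; i + 3 <= s.size(); ++i) {
            std::vector<std::size_t>& ids = postings_[s.substr(i, 3)];
            // ids only grow, so each posting list stays sorted and unique
            if (ids.empty() || ids.back() != id) ids.push_back(id);
        }
    }

    // `required` is an assumption of this sketch: a literal that every
    // whole-string match of `re` must contain.
    std::vector<std::string> search(const std::regex& re,
                                    const std::string& required) const {
        std::vector<std::size_t> candidates;
        if (required.size() < 3) {
            // No trigram available: fall back to scanning every string.
            candidates.resize(strings_.size());
            std::iota(candidates.begin(), candidates.end(), std::size_t{0});
        } else {
            for (std::size_t i = 0; i + 3 <= required.size(); ++i) {
                auto it = postings_.find(required.substr(i, 3));
                if (it == postings_.end()) return {};  // trigram never seen
                if (i == 0) {
                    candidates = it->second;
                } else {
                    std::vector<std::size_t> tmp;
                    std::set_intersection(candidates.begin(), candidates.end(),
                                          it->second.begin(), it->second.end(),
                                          std::back_inserter(tmp));
                    candidates.swap(tmp);
                }
            }
        }
        // The index can report false positives, so verify each candidate.
        std::vector<std::string> out;
        for (std::size_t id : candidates) {
            if (std::regex_match(strings_[id], re)) out.push_back(strings_[id]);
        }
        return out;
    }

private:
    std::vector<std::string> strings_;
    std::unordered_map<std::string, std::vector<std::size_t>> postings_;
};
```

With strings of only 1-20 chars, many of them yield few or no trigrams, and expressions without a literal of at least three chars fall back to a full scan, so it is not obvious that this beats the tree/automaton intersection for this data.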

Another related article might be "Regular Expression Matching Can Be Simple And Fast".

(I haven't really investigated it further. I will extend this answer later.)

Albert