I have a large set of strings (roughly 50k to 1M, all fairly short, about 1-20 characters each). Given an arbitrary RegExp, I need to return a list/iterator of all strings that match it. This must be as fast as possible.
What are good index structures to do that?
Currently, I'm building a prefix tree (trie) over the characters of the strings and converting the RegExp into a deterministic finite automaton (DFA). I then compute the intersection of that automaton with the tree: I walk both in lockstep and abandon any branch for which the automaton has no transition. That looks like a fast approach, but I wonder about other possibilities.
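To make the approach concrete, here is a minimal sketch of the tree/automaton intersection. It assumes the DFA is already available as a transition table; the names `TrieNode`, `build_trie`, `DFA`, and `match_all` are made up for illustration, and the hand-written automaton for `ab*c` stands in for a real RegExp-to-DFA compiler, which is not shown.

```python
from typing import Dict, Iterable, Iterator, Set, Tuple


class TrieNode:
    """One node of the prefix tree; `terminal` marks the end of a stored string."""
    __slots__ = ("children", "terminal")

    def __init__(self) -> None:
        self.children: Dict[str, "TrieNode"] = {}
        self.terminal = False


def build_trie(words: Iterable[str]) -> TrieNode:
    """Insert every word character by character, sharing common prefixes."""
    root = TrieNode()
    for word in words:
        node = root
        for ch in word:
            node = node.children.setdefault(ch, TrieNode())
        node.terminal = True
    return root


class DFA:
    """Hypothetical DFA representation: a partial transition table.

    A missing (state, char) entry means the automaton rejects on that input.
    """

    def __init__(self, start: int, accepting: Set[int],
                 transitions: Dict[Tuple[int, str], int]) -> None:
        self.start = start
        self.accepting = accepting
        self.transitions = transitions


def match_all(root: TrieNode, dfa: DFA) -> Iterator[str]:
    """Walk the trie and the DFA in lockstep, pruning dead branches."""
    stack = [(root, dfa.start, "")]
    while stack:
        node, state, prefix = stack.pop()
        # A stored string matches if the DFA is in an accepting state here.
        if node.terminal and state in dfa.accepting:
            yield prefix
        for ch, child in node.children.items():
            nxt = dfa.transitions.get((state, ch))
            if nxt is not None:  # no transition: skip the whole subtree
                stack.append((child, nxt, prefix + ch))


# Hand-built DFA for the RegExp `ab*c` (assumed, not generated from the pattern):
# state 0 --a--> 1, state 1 --b--> 1, state 1 --c--> 2 (accepting).
dfa = DFA(start=0, accepting={2},
          transitions={(0, "a"): 1, (1, "b"): 1, (1, "c"): 2})

trie = build_trie(["ac", "abc", "abbc", "abd", "bc", "a", "abcc"])
matches = sorted(match_all(trie, dfa))
```

The work done is bounded by the part of the tree that the automaton can still accept, so a RegExp with a selective prefix touches only a small fraction of the strings, while a pattern such as `.*x` still has to visit most of the tree.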
An extra challenge is supporting Unicode/UTF-8, but I don't want to focus this question on that for now.