I need to efficiently match an input string against a large set of previously inserted strings, all of known length N.
This problem can usually be tackled with radix trees (an example question here), but my case has some particular properties which I believe make it different from what I've seen so far:
- Both the input and stored strings can contain wildcard characters `_` (no string, stored or input, can contain only wildcards). For example, `a_c` would match `_bc`, but not `__b`.
- The set changes, so it must be easy to insert/remove entries.
- The strings are not arbitrary; the allowed characters are different for each position in the string. For example, the first character might only be in `[a-c]`, while the second char could be in `[a-z]`. This is known in advance and never changes.
- I do not need to actually get the string matches back. Instead, I need the IDs of the matching strings. I mention this in case there's an efficient way to store the graph without representing all inserted strings.
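The matching rule from the first point can be sketched as follows. This is only an illustration of the semantics, not part of the actual code; `matches` and `kWildcard` are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Wildcard character, as used in the examples above.
constexpr char kWildcard = '_';

// Returns true if two equal-length patterns match: at every position the
// characters are equal, or at least one of them is the wildcard.
bool matches(const std::string& a, const std::string& b) {
    assert(a.size() == b.size());  // all strings share the known length N
    for (std::size_t i = 0; i < a.size(); ++i) {
        if (a[i] != b[i] && a[i] != kWildcard && b[i] != kWildcard) {
            return false;
        }
    }
    return true;
}
```

With this definition, `matches("a_c", "_bc")` is true and `matches("a_c", "__b")` is false, because the third position compares `c` against `b`.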
My current solution is to store a 3-dimensional structure of vectors, where the first dimension corresponds to the position of a char in the string, the second to its value (including the wildcard character), and the third contains all IDs that match that particular position/value. To find the set of matches, I compute the `set_intersection` of all the ID lists that I get by looking up this matrix with each character of the input string (similar to this proposed solution).
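A minimal sketch of this kind of position/value index is shown below. It is not the actual implementation, and several details are assumptions: ID lists are sorted `std::vector<int>`, each position has 256 value slots for simplicity (the real per-position alphabet is smaller), each bucket holds the IDs whose stored character is exactly that value, and the wildcard bucket is merged in at query time. The names `WildcardIndex`, `insert`, `remove`, and `query` are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <iterator>
#include <string>
#include <vector>

class WildcardIndex {
public:
    static constexpr char kWildcard = '_';

    // n is the fixed string length N.
    explicit WildcardIndex(std::size_t n)
        : buckets_(n, std::vector<std::vector<int>>(256)) {}

    void insert(int id, const std::string& s) {
        assert(s.size() == buckets_.size());
        insert_sorted(ids_, id);
        for (std::size_t i = 0; i < s.size(); ++i) {
            insert_sorted(bucket(i, s[i]), id);
        }
    }

    // Removal needs the stored string to know which buckets hold the ID.
    void remove(int id, const std::string& s) {
        assert(s.size() == buckets_.size());
        erase_sorted(ids_, id);
        for (std::size_t i = 0; i < s.size(); ++i) {
            erase_sorted(bucket(i, s[i]), id);
        }
    }

    // Returns the sorted IDs of all stored strings that match the input.
    std::vector<int> query(const std::string& input) const {
        assert(input.size() == buckets_.size());
        std::vector<int> result = ids_;  // start from every stored ID
        std::vector<int> candidates, next;
        for (std::size_t i = 0; i < input.size() && !result.empty(); ++i) {
            if (input[i] == kWildcard) continue;  // matches anything here
            const std::vector<int>& exact = bucket(i, input[i]);
            const std::vector<int>& wild = bucket(i, kWildcard);
            // IDs compatible at position i: same character or stored wildcard.
            candidates.clear();
            std::set_union(exact.begin(), exact.end(), wild.begin(), wild.end(),
                           std::back_inserter(candidates));
            next.clear();
            std::set_intersection(result.begin(), result.end(),
                                  candidates.begin(), candidates.end(),
                                  std::back_inserter(next));
            result.swap(next);
        }
        return result;
    }

private:
    std::vector<int>& bucket(std::size_t pos, char c) {
        return buckets_[pos][static_cast<unsigned char>(c)];
    }
    const std::vector<int>& bucket(std::size_t pos, char c) const {
        return buckets_[pos][static_cast<unsigned char>(c)];
    }
    static void insert_sorted(std::vector<int>& v, int id) {
        auto it = std::lower_bound(v.begin(), v.end(), id);
        if (it == v.end() || *it != id) v.insert(it, id);
    }
    static void erase_sorted(std::vector<int>& v, int id) {
        auto it = std::lower_bound(v.begin(), v.end(), id);
        if (it != v.end() && *it == id) v.erase(it);
    }

    std::vector<std::vector<std::vector<int>>> buckets_;  // [position][value] -> sorted IDs
    std::vector<int> ids_;                                // all stored IDs, sorted
};
```

For example, after inserting `_bc` as ID 1, `__b` as ID 2, and `abc` as ID 3, a query for `a_c` yields IDs 1 and 3. Every non-wildcard input position costs one union and one intersection over ID lists, which is where the time goes when the lists are long.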
However, this solution is still not quite fast enough (it's my current bottleneck), and I am wondering whether there is a way to do better.