
I need to efficiently match an input string against a large set of previously inserted strings, all of known length N.

This kind of problem can usually be tackled with radix trees (an example question here), but my case has some particular properties that I believe make it different from what I've seen so far:

  • Both the input and the stored strings can contain the wildcard character _, which matches any single character (no string, stored or input, consists only of wildcards). For example, a_c would match _bc, but not __b.
  • The set changes, so it must be easy to insert/remove entries.
  • The strings are not arbitrary; the allowed characters are different for each position in the string. For example, the first character might only be in [a-c], while the second char could be in [a-z]. This is known in advance and never changes.
  • I do not need to actually get the string matches back. Instead, I need the IDs of the matching strings. I mention this in case there's an efficient way to store the graph without representing all inserted strings.
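To make the matching rule concrete, here is a minimal sketch of how I understand it: two strings match when, at every position, the characters are equal or at least one of them is the wildcard. The function name `matches` is only illustrative and not part of my actual code.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Illustrative helper (hypothetical name): returns true when two patterns of
// the same length are compatible, i.e. at every position the characters are
// equal or at least one of them is the wildcard '_'.
inline bool matches(const std::string& a, const std::string& b) {
    assert(a.size() == b.size());  // all strings share the known length N
    for (std::size_t i = 0; i < a.size(); ++i) {
        const bool either_wild = (a[i] == '_' || b[i] == '_');
        if (!either_wild && a[i] != b[i]) {
            return false;  // two different concrete characters: no match
        }
    }
    return true;
}
```

With this definition, `matches("a_c", "_bc")` is true and `matches("a_c", "__b")` is false, as in the example above.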

My current solution is to store a 3-dimensional structure of vectors: the first dimension corresponds to the position of a char in the string, the second to its value (including the wildcard character), and the third contains all IDs that match that particular position/value. To find the set of matches, I compute the set_intersection of all the ID lists that I get by looking up each character of the input string in this structure (similar to this proposed solution).
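A simplified sketch of this approach is below. It is not my exact code: the class name `PositionIndex`, the `Id` alias, and the fixed 256 slots per position are assumptions made for illustration (the real per-position alphabets are smaller). In this sketch each ID is stored only under its literal character, and the wildcard bucket is merged in at query time.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <iterator>
#include <string>
#include <vector>

using Id = int;  // assumed ID type

// Hypothetical sketch: ids_[pos][ch] is a sorted vector of the IDs whose
// stored string has character ch at position pos ('_' has its own bucket).
class PositionIndex {
public:
    explicit PositionIndex(std::size_t n)
        : ids_(n, std::vector<std::vector<Id>>(256)) {}

    void insert(Id id, const std::string& s) {
        assert(s.size() == ids_.size());
        addSorted(all_, id);
        for (std::size_t i = 0; i < s.size(); ++i)
            addSorted(ids_[i][static_cast<unsigned char>(s[i])], id);
    }

    void remove(Id id, const std::string& s) {
        assert(s.size() == ids_.size());
        removeSorted(all_, id);
        for (std::size_t i = 0; i < s.size(); ++i)
            removeSorted(ids_[i][static_cast<unsigned char>(s[i])], id);
    }

    // Returns the sorted IDs of all stored strings matching the query.
    std::vector<Id> query(const std::string& q) const {
        assert(q.size() == ids_.size());
        std::vector<Id> result = all_;
        for (std::size_t i = 0; i < q.size(); ++i) {
            if (q[i] == '_') continue;  // wildcard input: every ID passes here
            const auto& exact = ids_[i][static_cast<unsigned char>(q[i])];
            const auto& wild = ids_[i][static_cast<unsigned char>('_')];
            // Candidates at this position: same character or stored wildcard.
            std::vector<Id> candidates;
            std::set_union(exact.begin(), exact.end(), wild.begin(), wild.end(),
                           std::back_inserter(candidates));
            std::vector<Id> next;
            std::set_intersection(result.begin(), result.end(),
                                  candidates.begin(), candidates.end(),
                                  std::back_inserter(next));
            result.swap(next);
        }
        return result;
    }

private:
    static void addSorted(std::vector<Id>& v, Id id) {
        auto it = std::lower_bound(v.begin(), v.end(), id);
        if (it == v.end() || *it != id) v.insert(it, id);
    }
    static void removeSorted(std::vector<Id>& v, Id id) {
        auto it = std::lower_bound(v.begin(), v.end(), id);
        if (it != v.end() && *it == id) v.erase(it);
    }

    std::vector<std::vector<std::vector<Id>>> ids_;  // [position][char] -> IDs
    std::vector<Id> all_;                            // every inserted ID, sorted
};
```

For example, after inserting ID 1 as `_bc`, ID 2 as `__b` and ID 3 as `abc`, querying `a_c` yields IDs 1 and 3. Each query performs one union and one intersection per non-wildcard input position.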

However, this solution is still not quite fast enough (it's my current bottleneck), and I am wondering whether there is a way to do better.

Svalorzen
