2

I am dealing with a collection of objects where the reasonable size of it could be anywhere between 1 and 50K (but there's no set upper limit). Each object contains a handful of strings.

I want to implement to a search function that can partially, exactly, or RegEx match any of one these strings and subsequently return a list of objects.

If each object only contained a single string then I could simply lexicographically sort them, and pull out ranges fairly easily - but I am reluctant to implement a map-like structure for each of the contained strings due to speed/memory concerns.

Is there a data structure well suited to this kind of operation for speed and memory efficiency? I'm sensing a database maybe on the horizon, but I know little about them, so I want to hold off researching until someone more knowledgeable can nudge me in the right direction!

cmannett85
  • 21,725
  • 8
  • 76
  • 119

3 Answers3

1

a map-like collection is probably your best bet, the key will be the string, and the value is a reference to the containing object. If your strings are held inside the objects as a stl string, then you could store a reference to the data in the key part of the map instead (alternatively use a shared_ptr for the strings and reference them in both the object and the map)

Searching, sorting just becomes a matter of implementing a custom search functor that uses the dereferenced data. The size of the map will be 2 references plus the map overhead which isn't going to be that bad if you consider the alternatives will be as large, if not larger.

gbjbaanb
  • 51,617
  • 12
  • 104
  • 148
1

partially, exactly, or RegEx match any of one these strings and subsequently return a list of objects

Well, for exact matches, you could have a std::map<std::string, std::vector<object*> >. The key would be the exact string, and the vector holds pointers to matching objects, many of these pointers may point to a single object instance.

You could have a front-end map from partial strings to full strings: say the string is "dogged", you'd sadly have to put entries in for "dogged", "ogged", "gged", "ged", "ed" and "d" (stop wherever you like if you want a minimum match size)... then use lower_bound to search. That way, say you search on "dog" you could still see that there was a match for "dogged" (doesn't matter if it matches say "dogfood" instead. This would be a simple std::map<string, string>. While you increment forwards from the lower_bound position and the string still matches (i.e. from dogfood to dogged to ... until it doesn't start with dog), you can search for that in the "exact match" map and aggregate results.

For regular expressions, I have no good suggestion... I'd start with a brute force search through all the full strings. If it really isn't good enough, then you do some rough optimisations like checking for a constant substring to filter by before doing the brute force matching, but it's beyond me to imagine how to do this very thoroughly and fast.

(substitute your favourite smart pointers for object*s if useful)

Tony Delroy
  • 102,968
  • 15
  • 177
  • 252
1

Thanks for all the replies, but following on from techniques mentioned in this post, I've decided to use an enhanced suffix array from the header-only SeqAn project.

Community
  • 1
  • 1
cmannett85
  • 21,725
  • 8
  • 76
  • 119