
I need to check a short string for matches against a list of substrings. Currently, I do it as shown below (working code on ideone):

#include <iostream>
#include <string>

bool ContainsMyWords(const std::wstring& input)
{
    if (std::wstring::npos != input.find(L"white"))
        return true;
    if (std::wstring::npos != input.find(L"black"))
        return true;
    if (std::wstring::npos != input.find(L"green"))
        return true;
    // ...
    return false;
}


int main() {
  std::wstring input1 = L"any text goes here";
  std::wstring input2 = L"any text goes here black";

  std::cout << "input1 " << ContainsMyWords(input1) << std::endl;
  std::cout << "input2 " << ContainsMyWords(input2) << std::endl;
  return 0;
}

I have 10-20 substrings that I need to match against each input. My goal is to reduce CPU utilization and average-case time complexity. Input strings arrive at a rate of 10 Hz, with bursts up to 10 kHz (which is what I am worried about).

There is the agrep library, with source code written in C; I wonder whether there is a standard equivalent in C++. From a quick look, it may be a bit difficult (but doable) to integrate it with what I have.

Is there a better way to match an input string against a set of predefined substrings in C++?

oleksii
  • [This](http://coliru.stacked-crooked.com/a/738a7aaf8b4910ad) probably has the same performance, but IMHO it is easier to read/maintain. – NathanOliver May 16 '17 at 14:02
  • Bug in the code (forgot to compare against `std::string::npos`). [This](http://coliru.stacked-crooked.com/a/9eb2a6183836ccc9) example works. – NathanOliver May 16 '17 at 14:10
  • Possible duplicate of [Algorithm to find multiple string matches](http://stackoverflow.com/questions/3260962/algorithm-to-find-multiple-string-matches) – knst May 16 '17 at 14:10
  • Do you need to match `"textblack"` (i.e. no space)? – apple apple May 16 '17 at 15:38
  • @appleapple no, each word is delimited. `textblack` - should not be matched, `text black` - should be matched – oleksii May 16 '17 at 15:46
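
The whole-word requirement stated in the last comment is not met by a plain `find`, which also matches `textblack`. A minimal sketch of one way to handle it, assuming words are separated by whitespace only (punctuation attached to a word, as in `black,`, would not match); the function name `ContainsMyWordsAsTokens` and the keyword set are hypothetical:

```cpp
#include <sstream>
#include <string>
#include <unordered_set>

// Sketch: treat the input as whitespace-delimited words and look each
// word up in a fixed set. "textblack" does not match, "text black" does.
bool ContainsMyWordsAsTokens(const std::wstring& input)
{
    // Hypothetical keyword set; built once and reused across calls.
    static const std::unordered_set<std::wstring> keywords = {
        L"white", L"black", L"green"
    };

    std::wistringstream words(input);
    std::wstring word;
    while (words >> word)
        if (keywords.count(word) != 0)
            return true;
    return false;
}
```

Each word costs one hash lookup on average, regardless of how many keywords are in the set. Whether this beats repeated `find` calls on short inputs would need to be measured.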

2 Answers


The best approach is a regular expression search. Use the following regular expression:

"(white)|(black)|(green)"

That way, a single pass over the string tells you which word was found: group 1 is set if "white" matched, group 2 if "black" matched, and group 3 if "green" matched, each with its start and end positions. Group 0 gives you the end position of the overall match, so you can start a new search from there to look for more matches, all in one pass over the string.
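
A minimal sketch of this idea with the standard `<regex>` header, using `std::wregex` because the question works with `std::wstring`. The function name `FirstMatchingGroup` is hypothetical, and this has not been benchmarked against the `find`-based version:

```cpp
#include <cstddef>
#include <regex>
#include <string>

// Returns the number of the capture group that matched first
// (1 = white, 2 = black, 3 = green), or 0 if nothing matched.
int FirstMatchingGroup(const std::wstring& input)
{
    // Built once; constructing a regex is comparatively expensive.
    static const std::wregex pattern(L"(white)|(black)|(green)");

    std::wsmatch match;
    if (!std::regex_search(input, match, pattern))
        return 0;

    // match[0] is the whole match; groups 1..3 are the alternatives.
    for (std::size_t group = 1; group < match.size(); ++group)
        if (match[group].matched)
            return static_cast<int>(group);
    return 0;
}
```

For the question's `input2` (`L"any text goes here black"`) this returns 2, and for `input1` it returns 0. To continue searching after a match, `match.suffix().first` gives the iterator from which the next `std::regex_search` call can start.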

Luis Colorado

You could use one big if instead of several if statements. However, NathanOliver's solution with std::any_of is faster than that, provided the array of substrings is made static (so that it is not recreated on every call), as shown below.

#include <algorithm>
#include <iterator>
#include <string>

bool ContainsMyWordsNathan(const std::wstring& input)
{
    // do not forget to make the array static!
    static const std::wstring keywords[] = {L"white", L"black", L"green" /* , ... */};
    return std::any_of(std::begin(keywords), std::end(keywords),
      [&](const std::wstring& str){return input.find(str) != std::wstring::npos;});
}

PS: As discussed in Algorithm to find multiple string matches:

The "grep" family implement the multi-string search in a very efficient way. If you can use them as external programs, do it.
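
If calling an external program is not an option, a multi-pattern search can also be done in-process. The quote does not say which algorithm is meant; the sketch below swaps in Aho-Corasick, a classic algorithm that finds any of a set of keywords in a single pass over the text. The class name `KeywordMatcher` is hypothetical and the code has not been benchmarked. Like `find`, it matches substrings, not whole words.

```cpp
#include <map>
#include <queue>
#include <string>
#include <vector>

// Minimal Aho-Corasick automaton answering "does the text contain any keyword?"
class KeywordMatcher
{
public:
    explicit KeywordMatcher(const std::vector<std::wstring>& keywords)
        : nodes_(1)  // node 0 is the root
    {
        for (const std::wstring& word : keywords)
            if (!word.empty())
                Insert(word);
        BuildFailureLinks();
    }

    bool ContainsAny(const std::wstring& text) const
    {
        int state = 0;
        for (wchar_t ch : text) {
            state = Step(state, ch);
            if (nodes_[state].terminal)
                return true;
        }
        return false;
    }

private:
    struct Node {
        std::map<wchar_t, int> next;  // trie edges
        int fail = 0;                 // longest proper suffix that is also a trie path
        bool terminal = false;        // some keyword ends at this state
    };

    void Insert(const std::wstring& word)
    {
        int state = 0;
        for (wchar_t ch : word) {
            auto it = nodes_[state].next.find(ch);
            if (it != nodes_[state].next.end()) {
                state = it->second;
                continue;
            }
            nodes_.emplace_back();
            const int child = static_cast<int>(nodes_.size()) - 1;
            nodes_[state].next[ch] = child;
            state = child;
        }
        nodes_[state].terminal = true;
    }

    // Breadth-first pass so that a node's failure link is known before
    // its children are processed.
    void BuildFailureLinks()
    {
        std::queue<int> pending;
        for (const auto& edge : nodes_[0].next)
            pending.push(edge.second);  // depth-1 nodes fall back to the root

        while (!pending.empty()) {
            const int state = pending.front();
            pending.pop();
            for (const auto& edge : nodes_[state].next) {
                const int child = edge.second;
                nodes_[child].fail = Step(nodes_[state].fail, edge.first);
                if (nodes_[nodes_[child].fail].terminal)
                    nodes_[child].terminal = true;
                pending.push(child);
            }
        }
    }

    // Follows the edge for ch, falling back along failure links as needed.
    int Step(int state, wchar_t ch) const
    {
        for (;;) {
            auto it = nodes_[state].next.find(ch);
            if (it != nodes_[state].next.end())
                return it->second;
            if (state == 0)
                return 0;
            state = nodes_[state].fail;
        }
    }

    std::vector<Node> nodes_;
};
```

The matcher would be constructed once (for example as a function-local `static`) from the keyword list and then reused for every incoming string, so the per-input cost depends on the input length rather than on the number of keywords.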

gsamaras
  • You are right, one big if works faster, but that has nothing to do with the branch predictor: one big if is a number of NESTED ifs, not a sequence of ifs. – Andriy Berestovskyy May 16 '17 at 15:35
  • You are still comparing one big IF vs a sequence of IFs. "if (a||b) {}" is not the same as "if (a) {}; if (b) {}"; it is "if (a) {} else { if (b) {} }". Do you see the difference? – Andriy Berestovskyy May 16 '17 at 15:45
  • @gsamaras, could you please give a hint as to why I cannot use std::wstring keywords[] as the argument? – MathArt Dec 01 '21 at 09:52