1

Suppose I want to create a regular expression that searches for two words in a string, but with the condition that it only matches them if there isn't one of several other words in between the two I'm looking for. For example:

string input {"Somebody has typed in some words here."}

I'm looking for the words somebody and words, but I only want the regular expression to match these if there isn't the word typed somewhere between them (with typed being only one of several words I don't want to stand between somebody and words). Which regular expression fulfills this? I've tried several approaches, but none of them worked as I intended. Can anybody please help me?

AlexM

2 Answers

2

I'd avoid the regex here, because once you introduce a regex, "now you have two problems."

Given:

  1. The beginning of our search range: const auto first = "Somebody"s
  2. The end of our search range: const auto second = "words"s
  3. The collection of words that shouldn't exist in the range: const vector<string> words = { "in"s }
  4. The input string: const auto input = "Somebody has typed in some words here."s

We can do this:

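// Locate first before adding its length: npos + size would wrap around and hide a failed search
const auto pos = input.find(first);
const auto start = pos == string::npos ? string::npos : pos + first.size();
const auto finish = start == string::npos ? string::npos : input.find(second, start);

if (finish != string::npos) {
    // Split the text between the two words into whitespace-separated tokens
    istringstream range(input.substr(start, finish - start));

    // It is a match only if no token equals one of the excluded words
    if (none_of(istream_iterator<string>(range), istream_iterator<string>(), [&](const auto& i) { return find(cbegin(words), cend(words), i) != cend(words); })) {
        cout << "match\n";
    } else {
        cout << "not a match\n";
    }
} else {
    cout << "not a match\n";
}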

Live Example
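
Pulled together into one function, the token-based approach above might look like the following sketch. The name matches_without and its parameter names are made up for illustration and are not part of the original answer; the logic is the same find / substr / none_of sequence.

```cpp
#include <algorithm>
#include <iterator>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper: true when `first` is followed by `second` and none of the
// whitespace-separated tokens between them equals an entry of `excluded`.
bool matches_without(const std::string& input, const std::string& first,
                     const std::string& second, const std::vector<std::string>& excluded) {
    const auto pos = input.find(first);
    if (pos == std::string::npos) return false;

    const auto start = pos + first.size();
    const auto finish = input.find(second, start);
    if (finish == std::string::npos) return false;

    // Tokenize only the text between the two boundary words
    std::istringstream range(input.substr(start, finish - start));
    return std::none_of(std::istream_iterator<std::string>(range),
                        std::istream_iterator<std::string>(),
                        [&](const std::string& token) {
                            return std::find(excluded.cbegin(), excluded.cend(), token) != excluded.cend();
                        });
}
```

Because the tokens are split on whitespace only, an excluded word with punctuation attached (for example "in,") would not be recognized by this sketch.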


If you're set on a regex, though, it can be done. For example, if words contained "in", "lorem", and "ipsum", you'd want something like:

\bSomebody\b(?:(\bin\b|\blorem\b|\bipsum\b).*|.)*?\bwords\b

Then we just need to test whether capture group 1 captured anything; if it did, one of the excluded words lies between the two boundary words:

const regex re("\\b" + first +
               accumulate(next(cbegin(words)), cend(words), "\\b(?:(\\b" + words.front(),
                          [](const auto& lhs, const auto& rhs) { return lhs + "\\b|\\b" + rhs; }) +
               "\\b).*|.)*?\\b" + second + "\\b");
smatch sm;

if (regex_search(input, sm, re) && sm[1].length() == 0U) {
    cout << "match\n";
} else {
    cout << "not a match\n";
}

Live Example
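
As a self-contained sketch of the regex variant above: the helper name regex_matches_without is invented for illustration, and the function assumes that excluded is non-empty and that none of the words contain regex metacharacters, since nothing is escaped. It builds the same pattern with a plain loop instead of accumulate.

```cpp
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper. Assumes `excluded` is non-empty and that no word
// contains regex metacharacters (nothing is escaped here).
bool regex_matches_without(const std::string& input, const std::string& first,
                           const std::string& second, const std::vector<std::string>& excluded) {
    // Join the excluded words into \bw1\b|\bw2\b|...
    std::string alternatives;
    for (const auto& word : excluded) {
        if (!alternatives.empty()) alternatives += '|';
        alternatives += "\\b" + word + "\\b";
    }

    const std::regex re("\\b" + first + "\\b(?:(" + alternatives + ").*|.)*?\\b" + second + "\\b");
    std::smatch sm;

    // Group 1 is non-empty only if an excluded word was met on the way to `second`
    return std::regex_search(input, sm, re) && sm[1].length() == 0;
}
```

The design relies on alternation order: at every position the excluded words are tried before the catch-all `.`, so an excluded word between the boundary words always ends up in group 1.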

Community
Jonathan Mee
  • This looks good, though it's a bit hard for me as a newbie to understand. Does this whole thing, or parts of it, have a name so that I can do some more research in order to better understand it? – AlexM Dec 01 '16 at 13:30
  • Are you asking if the algorithm has a name? Naw, but everything I'm using is from the standard so you could look stuff up on http://en.cppreference.com and the live example is super helpful if you want to tinker with things. Could I answer a specific question? – Jonathan Mee Dec 01 '16 at 13:42
  • I'll first look up the things that are new to me and then come back here to see if I understand everything, and if not, I will pose the question here. Thank you! – AlexM Dec 01 '16 at 13:54
  • I've looked over the second solution, but I'm still too preoccupied with the first one. I roughly understand what's going on there, but I don't understand why the condition of the second if looks like it does. I think this is the hardest part for me. – AlexM Dec 01 '16 at 14:46
  • @AlexM First off I'd warn that the regex version is not the fastest or best way to solve the problem. That said, it is a reasonably fast regex because it avoids look aheads/behinds. As far as an explanation, you'll find an exact one in the http://regex101.com description if you click on the regex. Basically I'm looking at every character between "Somebody" and "words". If I ever find a character beginning an element of `words` I'll immediately read to the end of the range. Again tinkering with the live example may be very helpful to you. – Jonathan Mee Dec 01 '16 at 14:58
  • I think I can understand the above lines, now that I have acquainted myself with the functions that occur in them. There is just one remaining thing I'm not sure about: is there actually a difference between stringstream and istringstream as in the example above? (And while we are at it: ostringstream?) – AlexM Dec 01 '16 at 19:12
  • @AlexM So [here](http://upload.cppreference.com/mwiki/images/0/06/std-io-complete-inheritance.svg) is a diagram of how the different streams relate to each other. The most direct way to say this is `ostringstream` inherits `ostream` functionality (the insertion operator for example) `istringstream` inherits `istream` functionality (the extraction operator for example) and `stringstream` inherits both. – Jonathan Mee Dec 01 '16 at 20:01
  • Is there any possibility to ignore the case of the words that I compare? With the regex, I simply wrote regex_constants::icase. Is there something similar for this case, too? – AlexM Dec 02 '16 at 15:08
  • @AlexM So there is: http://stackoverflow.com/q/37482246/2642059 but I'm thinking at that point the hassles of using the `istringstream` are starting to overcome the value. You'd need to use a `find_if` instead of a `find` inside your `none_of`, and I have a rule of thumb that nested ternaries or lambdas become too hard to read. So you can go ahead and tackle it if you like, but I'd say it's time to go with the `regex` :( – Jonathan Mee Dec 02 '16 at 16:14
  • A pity, I liked the other approach more, but never mind. So I tried to fully understand this very long regex from above and tested it in my IDE. Surprisingly, I keep getting a compiler error which says that neither cbegin nor cend were declared in this scope. I wonder how this can be because I've included of course. I tried to rewrite these with words.begin and words.end and the program ran, but it gave me back a match in cases where a match is impossible. Only if I remove one or both of the boundary words does the program give back no match. What is going on here? – AlexM Dec 03 '16 at 10:02
  • So `cbegin` and `cend` are C++14, which means you'll need to tell your compiler to build for that. For example, g++'s argument is "-std=gnu++14"; you'll also need to `#include `. Now as far as what's going on, could you paste your code into http://ideone.com and send me the link so that I can have a look? We know that the regex itself works from testing on http://regex101.com so there must be some sort of mistake in the translation to C++; let's see if we can find it. – Jonathan Mee Dec 05 '16 at 00:54
  • I've done the same thing for -std=gnu++11 and it worked fine (my compiler is MinGW GCC 4.9.2), but if I change it to -std=gnu++14, as for the warning that cend and cbegin are unknown, nothing changes. I included the header . Is my compiler wrong? My web search wasn't very illuminating. But of course I will send you the code as soon as the compiler doesn't complain anymore and by the way, at this point I want to thank you for your help. – AlexM Dec 05 '16 at 23:56
  • I think this is a good opportunity to ask another question. I'd suggest posting the minimal version of the code that you get the error from, along with the error that you're getting. If you'll post a link here I'll come have a look at it too. – Jonathan Mee Dec 06 '16 at 00:18
  • This is the important portion of the code (basically the one you've initially written): http://ideone.com/XzS3HI. If I compile it with ideone.com it works as it is supposed to. But if I compile it with the aforesaid compiler of mine (MinGW 4.9.2 with the argument -std=gnu++14), then I keep getting this error: [Error] 'cbegin' was not declared in this scope. Same thing for cend, but these are the only errors. I don't understand why the argument -std=gnu++14 is apparently not recognized. I know the meaning of something not being declared in a scope, but it doesn't make sense to me here. – AlexM Dec 06 '16 at 10:52
  • I need to correct my last answer: It seems that the argument was recognized, because another error I had forgotten to tell you about disappeared. It had to do with the const auto &lhs thing in the regex, using const auto in a lambda was only possible when using C++14, it said. So the argument was obviously recognized, but the other error remains. It seems to me it has to do with the code itself. – AlexM Dec 06 '16 at 11:44
  • @AlexM Hmmm... This would make a fine question if you can give your MinGW version and the error along with this code. It seems there's something nonconforming about the compiler. You can change a couple of things to see if you can get around the bug: 1) Initialize on one line `vector words = { "chair"s, "table"s }` 2) If that doesn't work use the pre-C++14 `words.cbegin()` and `words.cend()`. These shouldn't produce a change in the expected output, but perhaps they sidestep the bug in the compiler. – Jonathan Mee Dec 06 '16 at 12:15
  • What do you mean by the version of my MinGW? Isn't MinGW GCC 4.9.2 the version? As for the workarounds: I've already tried these and received another error: 20 187 [Error] no matching function for call to 'next(, , std::basic_string, main()::)' I'm not even sure if I understand what that means. – AlexM Dec 06 '16 at 13:04
  • @AlexM Haha, it means you're missing a parenthesis. It should read: `next(cbegin(words)),`, your code reads: `next(cbegin(words),` – Jonathan Mee Dec 06 '16 at 13:08
  • Oh dear, what a dumb mistake of mine. Now the compiler doesn't complain anymore, however, like before, the whole thing doesn't do what it's supposed to do. Basically, it always gives me back a match except when I change one or both of the boundary words to words that don't show up in the input. I really have no idea anymore what to do to fix this. Just in case this could be important: My IDE is Dev C++ 5.8.2. – AlexM Dec 06 '16 at 13:38
  • @AlexM Let's look at your `regex`. Copy what you're putting into the regex constructor into a `cout` statement and paste the result here so we can look at what it's doing. – Jonathan Mee Dec 06 '16 at 13:50
  • Like this? ("\\b" + first + accumulate(next(words.cbegin()), words.cend(), "\\b(?:(\\b" + words.front(), [](const auto& lhs, const auto& rhs) { return lhs + "\\b|\\b" + rhs; }) + "\\b.*).*|.)*?" + second + "\\b", regex_constants::icase) This is what's inside the brackets of the regex without the final semicolon. If I don't print out the regex itself but what it's doing, I get back 1. – AlexM Dec 06 '16 at 14:05
  • What does this print out: `cout << "\\b" + first + accumulate(next(words.cbegin()), words.cend(), "\\b(?:(\\b" + words.front(), [](const auto& lhs, const auto& rhs) { return lhs + "\\b|\\b" + rhs; }) + "\\b.*).*|.)*?" + second + "\\b" << endl` – Jonathan Mee Dec 06 '16 at 14:20
  • This prints out the following: \bagain\b(?:(\btable\b|\bexample\b.*).*|.)*?be\b Note: I've chosen the boundary words to be again and be, so this accounts for the front and the back of the expression. In the vector words, there are the two words table and example. – AlexM Dec 06 '16 at 14:31
  • A couple of points here: 1) after looking at your regex I realized I had a flaw in my answer: `second` needs to be preceded by a `\b`. I don't think this is your problem, but I've updated the answer for correctness 2) When I post your exact regex online it seems to work as expected: https://regex101.com/r/dcjeRw/1 even when I change the `second` token. Could you give me an input which this regex returns a false positive on? It's possible that MinGW has a bug in its `regex` code... – Jonathan Mee Dec 06 '16 at 14:41
  • That was also my observation when I compiled the little program this morning on ideone.com. It behaved as it should. Indeed, this little flaw didn't do any harm. Here is an input giving back a false positive: the sentence I applied it to was "Again this is the example in which shall be searched" with the boundary words 'again' and 'be'. So all the words in between should return no match, but every single one of them did, no matter if I searched for only one or more. So for example, the vector with the two words "table" and "which" should return no match, but it matched. – AlexM Dec 06 '16 at 14:52
  • @AlexM In my answer I used "Match" to mean the string was valid. That is to say that it had the word `first` followed by the word `second` with none of the intervening words being elements of `words`, so "Match" should be printed in your example case, per your statement of the problem: "I want to create a regular expression that searches for two words in a string, but with the condition that it only matches them if there isn't one of several other words in between the two I'm looking for." Are we having a miscommunication? – Jonathan Mee Dec 06 '16 at 15:21
  • No, we're talking about the same problem. If I use for example 'this' and 'shall', both of which appear between the boundary words, then it still gives me back a match although here clearly no match was intended. – AlexM Dec 06 '16 at 15:30
  • @AlexM OK good, that would have been sad to have done all this only to be on completely different pages. If you can type the same regex into http://regex101.com with the same `input` and it works there but not in your program, you need to post a new question. (Make sure to be explicit about the environment you're working in, and to post your code, cause the downvoters are picky about that.) Then link the question here and I'll come work with you on it in that post. – Jonathan Mee Dec 06 '16 at 15:36
  • Before I open a new discussion, I want to try out compiling with another compiler. I've read about similar problems. I think I should use the latest version of MinGW which is 5.3 if I don't get it wrong. However, finding a download link has turned out to be not that easy. Can you think of a suitable compiler for windows and C++14? If the problem lies in the compiler, then using another one could fix it, because the regular expression itself seems to be flawless. – AlexM Dec 06 '16 at 16:59
  • @AlexM You're dipping into one of my personal soapbox issues, but the Community version of Visual Studio 2015 is completely free, with only minimal requirements if you're trying to make money off it: https://www.visualstudio.com/downloads/ I can't sing enough praise for Microsoft and their generosity. – Jonathan Mee Dec 06 '16 at 17:06
  • Finally, with the new IDE, everything is working as it should. Only today I've started using Visual Studio and I already like it. So the flaw really seems to have been in MinGW (or the Dev C++ IDE? I'm not sure). As for the making money part: it isn't my goal to make money with my programming (how could it be after just two months of training, haha), but because you've mentioned it explicitly, it made me curious: what exactly do you mean by minimal requirements? And once again: thank you for your extensive help! – AlexM Dec 07 '16 at 11:29
  • 1) This does seem like a bug with MinGW, you've already put a lot of effort into testing this so it may not be too much more for you to write a bug: http://www.mingw.org/Reporting_Bugs 2) [Visual Studio Community's Licenses Terms](https://www.visualstudio.com/license-terms/mt171547/) TLDR: Any *individual* developer can use Visual Studio Community to create their own free or paid apps. – Jonathan Mee Dec 07 '16 at 12:37
  • All right. Before I report the bug, I want to compile the program with the latest version of MinGW. I've read another thread here on Stack Overflow that also deals with a regex problem, and there the regex works with the latest version of MinGW. I think if the latest version compiles it correctly, then reporting a bug would be superfluous, wouldn't it? As for the other things discussed, I think the problem has been fully solved now. You were really helpful, thank you for your patience with me as a newbie. – AlexM Dec 07 '16 at 15:35
  • @AlexM Yeah, if you compile and now it works then don't report a bug (but I'd still stick with Visual Studio 2015.) And you're absolutely welcome. C++ is an incredible language, and there's a ton of wisdom on this site to help you leverage that. I'm happy to have helped! – Jonathan Mee Dec 07 '16 at 16:00
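
For the case-insensitive comparison asked about in the comments above, one option that avoids a nested lambda is to lower-case every string up front and then reuse the exact-match logic. This is only a sketch of that swapped-in idea, not the `find_if` version mentioned in the thread; the helper names to_lower and matches_without_icase are hypothetical, and the conversion is ASCII-only.

```cpp
#include <algorithm>
#include <cctype>
#include <iterator>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper: ASCII-only lower-casing of a copy of `s`
std::string to_lower(std::string s) {
    std::transform(s.begin(), s.end(), s.begin(),
                   [](unsigned char c) { return static_cast<char>(std::tolower(c)); });
    return s;
}

// Hypothetical helper: same token check as the first approach, ignoring case
bool matches_without_icase(const std::string& input, const std::string& first,
                           const std::string& second, std::vector<std::string> excluded) {
    const auto text = to_lower(input);
    const auto lo_first = to_lower(first);
    const auto lo_second = to_lower(second);
    for (auto& word : excluded) word = to_lower(word);

    const auto pos = text.find(lo_first);
    if (pos == std::string::npos) return false;

    const auto start = pos + lo_first.size();
    const auto finish = text.find(lo_second, start);
    if (finish == std::string::npos) return false;

    // Every token and every excluded word is lower-case by now, so plain find suffices
    std::istringstream range(text.substr(start, finish - start));
    return std::none_of(std::istream_iterator<std::string>(range),
                        std::istream_iterator<std::string>(),
                        [&](const std::string& token) {
                            return std::find(excluded.cbegin(), excluded.cend(), token) != excluded.cend();
                        });
}
```

The trade-off is a few extra string copies in exchange for keeping the predicate as simple as in the case-sensitive version.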
0

Try this regex: (somebody)(?!.*(?:typed|nice)).*(words). It matches the first word, followed by any number of characters, and then the second word. The negative lookahead makes the match fail if typed or nice appears anywhere after somebody. Group 1 matches somebody and group 2 matches words.

Nicolas
  • This wouldn't fit because (maybe I wasn't accurate enough in my question) it is the presence or absence of specific words at any position between the two words I'm searching for. Let's say I always want to match the words somebody and words unless there is either the word typed or the word nice somewhere between them (typed and nice are just random examples). So the expression should match the sentence "Somebody has written down some words" but not the sentence "Somebody has written down nice words" or the sentence "Somebody has typed in words". That's what I'm looking for. – AlexM Dec 01 '16 at 12:59
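
For the requirement spelled out in this comment, a different pattern from the one in the answer may fit better: the negative lookahead can be applied at every character between the two words, so that only the text up to words is checked. The sketch below hard-codes the excluded words typed and nice purely for illustration, and the function name somebody_then_words is made up.

```cpp
#include <regex>
#include <string>

// Assumption: the boundary words and the excluded words "typed" and "nice"
// are hard-coded for illustration only.
bool somebody_then_words(const std::string& input) {
    // After "somebody", consume one character at a time, but only while no
    // excluded word starts at the current position; stop at the first "words".
    static const std::regex re(R"(\bsomebody\b(?:(?!\b(?:typed|nice)\b).)*?\bwords\b)",
                               std::regex_constants::icase);
    return std::regex_search(input, re);
}
```

Unlike a single lookahead placed right after somebody, this version does not reject a sentence merely because an excluded word appears after words.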