
I am a beginner with regex in C++, and I was wondering why this code:

#include <iostream>
#include <string>
#include <boost/regex.hpp>

int main() {

   std::string s = "? 8==2 : true ! false";
   boost::regex re("\\?\\s+(.*)\\s*:\\s*(.*)\\s*\\!\\s*(.*)");

   boost::sregex_token_iterator p(s.begin(), s.end(), re, -1);  // iterate over the sequence with that regex
   boost::sregex_token_iterator end;    // default-constructed end-of-sequence marker
   while (p != end)
      std::cout << *p++ << '\n';
}

prints an empty string. I put the regex into regexTester and it matches the string correctly, but when I try to iterate over the matches here, it returns nothing.


1 Answer


I think the tokenizer is actually meant to split text by some delimiter, and the delimiter is not included. Compare with std::regex_token_iterator:

std::regex_token_iterator is a read-only LegacyForwardIterator that accesses the individual sub-matches of every match of a regular expression within the underlying character sequence. It can also be used to access the parts of the sequence that were not matched by the given regular expression (e.g. as a tokenizer).

Indeed you invoke exactly this mode as per the docs:

if submatch is -1, then enumerates all the text sequences that did not match the expression re (that is to performs field splitting).

(emphasis mine).

So, just fix that:

using It = std::string::const_iterator;  // iterator type behind sregex_token_iterator

for (boost::sregex_token_iterator p(s.begin(), s.end(), re), e; p != e; ++p)
{
    boost::sub_match<It> const& current = *p;
    if (current.matched) {
        std::cout << std::quoted(current.str()) << '\n';
    } else {
        std::cout << "non matching" << '\n';
    }
}

Other Observations

All the greedy Kleene stars are a recipe for trouble. You won't ever find a second match, because the first one's .* at the end will by definition gobble up all remaining input.

Instead, make them non-greedy (.*?) and/or much more precise (like isolating some character set, or mandating non-space characters?).

boost::regex re(R"(\?\s+(.*?)\s*:\s*(.*?)\s*\!\s*(.*?))");

// Or, if you don't want raw string literals:
boost::regex re("\\?\\s+(.*?)\\s*:\\s*(.*?)\\s*\\!\\s*(.*?)");

Live Demo

#include <boost/regex.hpp>
#include <iomanip>
#include <iostream>
#include <string>

int main() {
    using It = std::string::const_iterator;
    std::string const s = 
        "? 8==2 : true ! false;"
        "? 9==3 : 'book' ! 'library';";
    boost::regex re(R"(\?\s+(.*?)\s*:\s*(.*?)\s*\!\s*(.*?))");

    {
        std::cout << "=== regex_search:\n";
        boost::smatch results;
        for (It b = s.begin(); boost::regex_search(b, s.end(), results, re); b = results[0].end()) {
            std::cout << results.str() << "\n";
            std::cout << "remain: " << std::quoted(std::string(results[0].second, s.end())) << "\n";
        }
    }

    std::cout << "=== token iteration:\n";
    for (boost::sregex_token_iterator p(s.begin(), s.end(), re), e; p != e;
         ++p)
    {
        boost::sub_match<It> const& current = *p;
        if (current.matched) {
            std::cout << std::quoted(current.str()) << '\n';
        } else {
            std::cout << "non matching" << '\n';
        }
    }
}

Prints

=== regex_search:
? 8==2 : true ! 
remain: "false;? 9==3 : 'book' ! 'library';"
? 9==3 : 'book' ! 
remain: "'library';"
=== token iteration:
"? 8==2 : true ! "
"? 9==3 : 'book' ! "

BONUS: Parser Expressions

Instead of abusing regexen to do parsing, you could generate a parser, e.g. using Boost Spirit:

Live On Coliru

#include <boost/spirit/home/x3.hpp>
#include <boost/fusion/adapted.hpp>
#include <iomanip>
#include <iostream>
#include <string>
#include <tuple>
#include <vector>
namespace x3 = boost::spirit::x3;

int main() {
    std::string const s = 
        "? 8==2 : true ! false;"
        "? 9==3 : 'book' ! 'library';";

    using expression = std::string;
    using ternary = std::tuple<expression, expression, expression>;
    std::vector<ternary> parsed;

    auto expr_ = x3::lexeme [+(x3::graph - ';')];
    auto ternary_ = "?" >> expr_ >> ":" >> expr_ >> "!" >> expr_;

    std::cout << "=== parser approach:\n";
    if (x3::phrase_parse(begin(s), end(s), *x3::seek[ ternary_ ], x3::space, parsed)) {

        for (auto [cond, e1, e2] : parsed) {
            std::cout
                << " condition " << std::quoted(cond) << "\n"
                << " true expression " << std::quoted(e1) << "\n"
                << " else expression " << std::quoted(e2) << "\n"
                << "\n";
        }
    } else {
        std::cout << "non matching" << '\n';
    }
}

Prints

=== parser approach:
 condition "8==2"
 true expression "true"
 else expression "false"

 condition "9==3"
 true expression "'book'"
 else expression "'library'"

This is much more extensible, will easily support recursive grammars and will be able to synthesize a typed representation of your syntax tree, instead of just leaving you with scattered bits of string.

  • I would also advise to use a parser instead of regex to do this. Sadly I'm out of time for now so might add later – sehe Feb 07 '21 at 18:02
  • Added a parser approach http://coliru.stacked-crooked.com/a/8e08c020fd61cdfc – sehe Feb 07 '21 at 19:42
  • How would you go about removing the semicolons. I don't want there to be any semicolons. – Ashtyn Feb 07 '21 at 22:06
  • What do you mean? Remove them in the code, the grammar, the input or the output? – sehe Feb 07 '21 at 22:19
  • I created a chat here https://chat.stackoverflow.com/rooms/228385/brainstorm-parser-questions – sehe Feb 07 '21 at 22:25
  • In case you cannot join the chat, or have simply forgotten. I had this guess that might be helpful: http://coliru.stacked-crooked.com/a/c8e4bc52d11d09db removes the `;`. Note the changed `expr_` rule (to exclude "?:!" characters). – sehe Feb 10 '21 at 01:51
  • Receiving late accepts always makes me smile. Cheers :) Consider also [voting](https://stackoverflow.com/help/privileges/vote-up) if you feel that an answer was useful. It may seem redundant as there's only one answer, but it makes it clear whether the answer actually helped or was merely accepted as "okay, I guess I'll have to live with that". – sehe May 22 '21 at 13:35