Boost regex is not matching the same as several regex websites

Question

I am attempting to parse a string using regex, so that when I iterate over its matches, it will give me only the results. My goal is to find all

#include <stuff.h>
#include "stuff.h"

while ignoring them if they are part of a comment block such as

/*
     #include "stuff.h"
*/

Here is my function to read a file, convert it to a string, and parse the string, creating tokens which are then iterated over to print them all. The tokens would contain stuff.h, stuff.h based on the previous lines.

The problem I ran into is that this regex, https://regex101.com/r/tQFDr4/2, does not give the same matches in Boost as it does on regex101.

The question is, is my regex wrong or is it something in the function?

void header_check::filename(const boost::filesystem::directory_iterator& itr)  // takes a directory entry
{
    std::string delimeter ("#include.+(?:<|\\\")(.+)(?:>|\\\")(?![^*\\/]* (?:\\*+(?!\\/)[^*\\/]*|\\/+(?!\\*)[^*\\/]*)*\\*\\/)");  // regex storage
    boost::regex regx(delimeter, boost::regex::perl);  // set up regex
    boost::smatch match;
    std::ifstream file (itr->path().string().c_str());  // file stream to read from
    std::string content((std::istreambuf_iterator<char>(file)),
                        std::istreambuf_iterator<char>());  // string to be parsed
    boost::sregex_token_iterator iter (content.begin(), content.end(), regx, 0);  // one token per match
    boost::sregex_token_iterator end;
    for (int attempt = 1; iter != end; ++iter) {
        std::cout << *iter << " include #" << attempt++ << "\n";  // prints results
    }
}

score 1 · Accepted Answer · answered Apr 24 '17 at 00:32

First off, you have an extra space character in the regex.

But the real problem is you're treating the whole input as a single line, so the dot also matches newlines. If you set the corresponding single line (s) flag on regex101, you will find that it shows the same results.

In regex, all open quantifiers are greedy by default. As such, you must be a lot more specific. At the very start you have

#include.+

This is already the end of it, since .+ simply matches all of the content (up to and including the last line). Your only reprieve is that backtracking will occur so that at least 1 "tail" of the regex matches, but all the rest is "souped up" in between. Because .+ literally asks for 1 or as many as possible of any character!
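To make the greedy behaviour concrete, here is a minimal sketch. It uses std::regex rather than Boost.Regex so that it builds with the standard library alone, and it writes [\s\S]+ as a stand-in for a dot that also matches newlines (which is what the dot does here in Boost). The names first_match, greedy_pattern and lazy_pattern are made up for this illustration.

```cpp
#include <regex>
#include <string>

// Returns the text of the first match of `pattern` in `text` ("" if none).
std::string first_match(const std::string& text, const std::string& pattern) {
    const std::regex re(pattern);
    std::smatch m;
    return std::regex_search(text, m, re) ? m.str() : std::string();
}

// Greedy: grabs as much as possible, then backtracks to the LAST quote.
const std::string greedy_pattern = R"re(#include[\s\S]+")re";

// Lazy: stops at the FIRST quote it can reach.
const std::string lazy_pattern = R"re(#include[\s\S]+?")re";
```

Fed a few lines containing two quoted includes, the greedy pattern yields one match that runs from the first #include to the last quote in the input, while the lazy one stops right at the first quote. Neither is what you want, which is why the pattern has to be more specific.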

Attempted Fixes...

Make .+ be \s+ or so. In fact, it needs to be \s* because #include<iostream> is perfectly valid C++.
Next, you cannot match like you did because you'd happily match #include <iostream" or #include "iostream>. And again, .+ needs to be limited. In this case, you can make the closing delimiter completely deterministic (because the opening delimiter completely predicts it), so you can use a non-greedy Kleene star:
```
#include\s*("(.*?)"|<(.*?)>)
```
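A minimal sketch of how that pattern could be applied, assuming std::regex instead of Boost.Regex (note that with std::regex's default ECMAScript grammar the dot does not match newlines, so a name cannot run past the end of its line). The function name extract_includes is hypothetical.

```cpp
#include <regex>
#include <string>
#include <vector>

// Collects the header name of every #include the pattern finds.
// Group 2 holds a "quoted" name, group 3 an <angle-bracket> name.
std::vector<std::string> extract_includes(const std::string& content) {
    static const std::regex re(R"re(#include\s*("(.*?)"|<(.*?)>))re");
    std::vector<std::string> names;
    for (std::sregex_iterator it(content.begin(), content.end(), re), end;
         it != end; ++it) {
        const std::smatch& m = *it;
        names.push_back(m[2].matched ? m[2].str() : m[3].str());
    }
    return names;
}
```

Mismatched delimiters such as #include <iostream" are rejected, but the pattern knows nothing about comments: an include inside /* ... */ is still reported.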

HOWEVER

The real problem is that you're trying to parse a full on grammar with ... regexen¹.

All I can say is

Could you not?!

Here's a suggestion using Boost Spirit:

auto comment_ = space
              | "//" >> *(char_ - eol)
              | "/*" >> *(char_ - "*/") >> "*/"
              ;

Woah. That's a breath of fresh air. It's almost like programming, instead of wizardry and crossing your thumbs!

Now for the real meat:

auto include_ = "#include" >> (
        '<' >> *~char_('>') >> '>'
      | '"' >> *~char_('"') >> '"'
      );

And of course you want to have the proof of the pudding too:

std::string header;
bool ok = phrase_parse(content.begin(), content.end(), seek[include_], comment_, header);

std::cout << "matched: " << std::boolalpha << ok << ": " << header << "\n";

This parses a single header (Live On Coliru) and prints:

matched: true: iostream

Would it be a piece of cake to scale up to all the non-commented includes?

std::vector<std::string> headers;
bool ok = phrase_parse(content.begin(), content.end(), *seek[include_], comment_, headers);

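Oops. Two bugs. Firstly, we should not be matching our own grammar: the demo parses its own source file, and the "#include" literal in the parser definition gets picked up together with the lines that follow it. The best way would be to ensure we are at the start of a line, but that complicates the grammar. For now, let's disallow names spanning multiple lines: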

auto name_ = rule<struct _, std::string> {} = lexeme[
      '<' >> *(char_ - '>' - eol) >> '>'
    | '"' >> *(char_ - '"' - eol) >> '"'
];

auto include_ = "#include" >> name_;

That helps a bit. The other bug is actually tougher, and I think it's a library bug. The problem is that it sees all the includes as active, including the commented-out ones. It turns out that seek does not correctly use the skipper after the first match.² For now, let's work around it:

bool ok = phrase_parse(content.begin(), content.end(), *(omit[*(char_ - include_)] >> include_) , comment_, headers);

It does take away from the elegance a bit, but it does work:

The Full Monty

The full demo Live On Coliru

// #include <boost/graph/adjacency_list.hpp>

#include "iostream"

#include<fstream> /*
#include <boost/filesystem.hpp>
#include <boost/regex.hpp> */ //
#include <boost/spirit/home/x3.hpp>


void filename(std::string const& fname)  // takes the name of the file to scan
{
    using namespace boost::spirit::x3;

    auto comment_ = space
          | "//" >> *(char_ - eol)
          | "/*" >> *(char_ - "*/") >> "*/"
          ;

    auto name_ = rule<struct _, std::string> {} = lexeme[
          '<' >> *(char_ - '>' - eol) >> '>'
        | '"' >> *(char_ - '"' - eol) >> '"'
    ];

    auto include_ = "#include" >> name_;

    auto const content = [&]() -> std::string {
        std::ifstream file(fname);
        return { std::istreambuf_iterator<char>{file}, {} };//string to be parsed
    }();

    std::vector<std::string> headers;
    /*bool ok = */phrase_parse(content.begin(), content.end(), *(omit[*(char_ - include_)] >> include_) , comment_, headers);

    std::cout << "matched: " << headers.size() << " active includes:\n";
    for (auto& header : headers)
        std::cout << " - " << header << "\n";
}

int main() {
    filename("main.cpp");
}

Printing

matched: 3 active includes:
 - iostream
 - fstream
 - boost/spirit/home/x3.hpp
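
For comparison, here is a rough hand-written equivalent in plain C++, for when pulling in Spirit is not an option. It is only a sketch under the same simplifications as the grammar above: it skips // and /* */ comments, does not treat string literals specially, and rejects names that span lines. The function name active_includes is made up for this example.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Collects the names of all #include directives that are not inside
// a // line comment or a /* */ block comment.
std::vector<std::string> active_includes(const std::string& src) {
    static const std::string keyword = "#include";
    std::vector<std::string> headers;
    std::size_t i = 0;
    while (i < src.size()) {
        if (src.compare(i, 2, "//") == 0) {             // line comment: skip to end of line
            i = src.find('\n', i);
            if (i == std::string::npos) break;
        } else if (src.compare(i, 2, "/*") == 0) {      // block comment: skip past "*/"
            const std::size_t close = src.find("*/", i + 2);
            if (close == std::string::npos) break;      // unterminated comment: stop scanning
            i = close + 2;
        } else if (src.compare(i, keyword.size(), keyword) == 0) {
            std::size_t j = i + keyword.size();
            while (j < src.size() && (src[j] == ' ' || src[j] == '\t')) ++j;  // optional blanks
            if (j < src.size() && (src[j] == '<' || src[j] == '"')) {
                const char closer = (src[j] == '<') ? '>' : '"';
                // The name ends at the matching delimiter and must not cross a newline.
                const std::size_t end =
                    src.find_first_of(std::string(1, closer) + "\n", j + 1);
                if (end != std::string::npos && src[end] == closer) {
                    headers.push_back(src.substr(j + 1, end - j - 1));
                    i = end + 1;
                    continue;
                }
            }
            i += keyword.size();                        // not a well-formed include
        } else {
            ++i;
        }
    }
    return headers;
}
```

Run over the include lines at the top of the demo source, this should report the same three headers as the Spirit version.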

¹ And it's not in Perl6, in which case you could be forgiven.

² I'll try to fix/report this tomorrow

I notice you tagged [tag:c++11]. Here's the trivial adaptation to [c++11 using Qi](http://coliru.stacked-crooked.com/a/7e36199ed576ea12). (On a side-note, ironically, `repository::qi::seek[]` [appears to have the same skipper bug](http://coliru.stacked-crooked.com/a/dd2f64a03cf3b7a2)). — sehe, Apr 24 '17 at 00:43
THANK YOU SO MUCH. IM SCREAMING BECAUSE I CAN'T UPVOTE YOUR ANSWER YET. — Anton, Apr 24 '17 at 04:14
Take your time, you can always accept once you find out whether it solves your problem (see also https://meta.stackexchange.com/questions/5234/how-does-accepting-an-answer-work) — sehe, Apr 24 '17 at 06:24
