Parsing *.cpp file containing enum using boost::regex.

Question

I have already parsed the file and split its content into enums and enum classes.

std::string sourceString = readFromFile(typesHDestination);
boost::smatch xResults;
std::string::const_iterator Start = sourceString.cbegin();
std::string::const_iterator End = sourceString.cend();

while (boost::regex_search(Start, End, xResults, boost::regex("(?<data_type>enum|enum\\s+class)\\s+(?<enum_name>\\w+)\\s*\\{(?<content>[^}]+?)\\s*\\}\\s*")))
{
    std::cout << xResults["data_type"]
        << " " << xResults["enum_name"] << "\n{\n";

    std::string::const_iterator ContentStart = xResults["content"].begin();
    std::string::const_iterator ContentEnd = xResults["content"].end();
    boost::smatch xResultsInner;

    while (boost::regex_search(ContentStart, ContentEnd, xResultsInner, boost::regex("(?<name>\\w+)(?:(?:\\s*=\\s*(?<value>[^,\\s]+)(?:(?:,)|(?:\\s*)))|(?:(?:\\s*)|(?:,)))")))
    {
        std::cout << xResultsInner["name"] << ": " << xResultsInner["value"] << std::endl;

        ContentStart = xResultsInner[0].second;
    }

    Start = xResults[0].second;
    std::cout << "}\n";
}

This works as long as the enums contain no comments.

I tried to add a named group <comment> to capture the comments inside enums, but failed every time. (\/{2}\s*.+) is my sample pattern for comments that start with a double slash.

I tested both with this online regex tester and with boost::regex.

The first step - from *.cpp file to <data_type> <enum_name> <content> regex:

(?'data_type'enum|enum\s+class)\s+(?'enum_name'\w+)\s*{\s*(?'content'[^}]+?)\s*}\s*

From <content> to <name> <value> <comment> regex:

(?'name'\w+)(?:(?:\s*=\s*(?'value'[^\,\s/]+)(?:(?:,)|(?:\s*)))|(?:(?:\s*)|(?:,)))

The last one contains an error. Is there any way to fix it and add the ability to store comments in a group?

Pure regex is not the correct approach for parsing source code — UnholySheep, Jul 27 '17 at 14:57
put the output here and do not use **images** for it, the server for images is blocked for some countries like mine. — Shakiba Moshiri, Jul 27 '17 at 14:57
This looks like a job for [libTooling](https://clang.llvm.org/docs/LibTooling.html), *not* regex. — Jesper Juhl, Jul 27 '17 at 15:20

Shakiba Moshiri · Answer 1 · 2017-07-27T16:22:38.037

As some comments said, it may not be a good idea to parse a source file with regular expressions, except in some simple cases.

For example, take this source file from http://en.cppreference.com/w/cpp/language/enum

#include <iostream>

// enum that takes 16 bits
enum smallenum: int16_t
{
    a,
    b,
    c
};


// color may be red (value 0), yellow (value 1), green (value 20), or blue (value 21)
enum color
{
    red,
    yellow,
    green = 20,
    blue
};

// altitude may be altitude::high or altitude::low
enum class altitude: char
{ 
     high='h',
     low='l', // C++11 allows the extra comma
}; 

// the constant d is 0, the constant e is 1, the constant f is 3
enum
{
    d,
    e,
    f = e + 2
};

//enumeration types (both scoped and unscoped) can have overloaded operators
std::ostream& operator<<(std::ostream& os, color c)
{
    switch(c)
    {
        case red   : os << "red";    break;
        case yellow: os << "yellow"; break;
        case green : os << "green";  break;
        case blue  : os << "blue";   break;
        default    : os.setstate(std::ios_base::failbit);
    }
    return os;
}

std::ostream& operator<<(std::ostream& os, altitude al)
{
    return os << static_cast<char>(al);
}

int main()
{
    color col = red;
    altitude a;
    a = altitude::low;

    std::cout << "col = " << col << '\n'
              << "a = "   << a   << '\n'
              << "f = "   << f   << '\n';
}

The key pattern here is that each declaration starts with enum and ends with ; and you cannot predict the text between enum and ; because there are so many possibilities. For that part you can use the lazy star .*?

Thus, if I want to extract all enums, I use:

NOTE: this is not the most efficient way.

boost::regex rx( "^\\s*(enum.*?;)" );

boost::match_results< std::string::const_iterator > mr; // or boost::smatch


std::ifstream ifs( "file.cpp" );
const uintmax_t file_size = ifs.seekg( 0, std::ios_base::end ).tellg();
                            ifs.seekg( 0, std::ios_base::beg );   // rewind

std::string whole_file( file_size, ' ' );
ifs.read( &*whole_file.begin(), file_size );
ifs.close();

while( boost::regex_search( whole_file, mr, rx ) ){
    std::cout << mr.str( 1 ) << '\n';
    whole_file = mr.suffix().str();
}

The output will be:

enum smallenum: int16_t
{
    a,
    b,
    c
};
enum color
{
    red,
    yellow,
    green = 20,
    blue
};
enum class altitude: char
{
     high='h',
     low='l', // C++11 allows the extra comma
};
enum
{
    d,
    e,
    f = e + 2
};

And of course, for such a simple thing I prefer to use:

perl -lne '$/=undef;print $1 while/^\s*(enum.*?;)/smg' file.cpp

that has the same output.
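
As a hedged alternative to the copying loop above, here is a minimal sketch that walks the matches with an iterator instead of reassigning the string after every hit. It uses std::regex rather than boost::regex so that it builds without Boost, which is my substitution and not part of the original answer. The function name extract_enums is hypothetical. Because `.` does not match newlines in std::regex's default grammar, the sketch uses `[^;]*` instead of `.*?`, and it matches the start of a line with `(?:^|\n)` rather than relying on multiline `^`.

```cpp
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper: returns every "enum ... ;" block that starts at the
// beginning of a line (optionally after whitespace).
// Assumption: no ';' appears inside the enum body or its comments.
std::vector<std::string> extract_enums(const std::string& source)
{
    // (?:^|\n) stands in for a line start; [^;]* also spans newlines.
    static const std::regex rx(R"((?:^|\n)\s*(enum\b[^;]*;))");

    std::vector<std::string> result;
    for (std::sregex_iterator it(source.begin(), source.end(), rx), end;
         it != end; ++it)
    {
        result.push_back((*it)[1].str());  // group 1 is the enum text itself
    }
    return result;
}
```

Calling extract_enums(whole_file) on the file content read above should give one string per enum. The word "enum" inside a `//` comment is skipped only because the line then starts with `//` and not with `enum`.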

This pattern may also help if you want to match each section separately:

`^\s*(enum[^{]*)\s*({)\s*([^}]+)\s*(};)`

But again, this is not a good idea except for simple source files, since C++ source code is free-form and not all authors follow the same conventions. For example, the pattern above assumes that the } is immediately followed by ; as in (};). If someone separates them (which is still valid code), the pattern will fail to match.
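
To illustrate the point about `}` and `;` being separated, here is a small sketch of a more tolerant variant that allows whitespace between them. It is written with std::regex instead of boost::regex (my substitution, so it compiles without Boost), and the names EnumParts and split_enum are hypothetical. It is still a regex over source text, so it shares the limits described above: nested braces, macros, or a `}` inside a comment will break it.

```cpp
#include <regex>
#include <string>

// Hypothetical result type: the text before '{' and the text between the braces.
struct EnumParts {
    std::string head;  // e.g. "enum class altitude: char"
    std::string body;  // the enumerator list, trimmed at both ends
};

// Splits the first enum found in `text` into head and body.
// Unlike a pattern ending in "};", this one accepts "}  ;" or "}\n;" as well.
bool split_enum(const std::string& text, EnumParts& out)
{
    static const std::regex rx(R"((enum\b[^{;]*?)\s*\{\s*([^}]*?)\s*\}\s*;)");

    std::smatch m;
    if (!std::regex_search(text, m, rx))
        return false;

    out.head = m[1].str();
    out.body = m[2].str();
    return true;
}
```

The only change in spirit from the pattern above is `\}\s*;` in place of `};`.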

score 0 · Accepted Answer · answered Jul 28 '17 at 13:50

I agree that using regex to parse complicated data is not the best solution. I left out a few major conditions in the question. First of all, I was parsing generated source code containing enums and enum classes, so there were no surprises in the code and its layout was regular. In other words, I was parsing regular code with regex.

The answer (the first step is the same, the second one was fixed), i.e. how to parse enums/enum classes with regex:

The first step - from *.cpp file to <data_type> <enum_name> <content> regex:

(?'data_type'enum|enum\s+class)\s+(?'enum_name'\w+)\s*{\s*(?'content'[^}]+?)\s*}\s*

From <content> to <name> <value> <comment> regex:

^\s*(?'name'\w+)(?:(?:\s*=\s*(?'value'[^,\n/]+))|(?:[^,\s/]))(?:(?:\s$)|(?:\s*,\s*$)|(?:[^/]/{2}\s(?'comment'.*$)))

All tests passed, with the matched groups highlighted by color in the regex tester.
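
For readers who want to try the second step in code, here is a rough sketch of the same idea: one enumerator per line, with an optional value and an optional `//` comment. It is not the exact regex above. It uses std::regex, which has no named groups, so name, value and comment become groups 1, 2 and 3. It also reads the content line by line instead of using `^` and `$`. The names Enumerator and parse_enumerators are hypothetical, and like the original it assumes regular, generated code (a value containing `,` or `/` will not parse).

```cpp
#include <regex>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical record for one enumerator line.
struct Enumerator {
    std::string name;
    std::string value;    // empty when no "= value" is present
    std::string comment;  // empty when no "// comment" is present
};

// Parses the text between '{' and '}' of an enum, one enumerator per line.
// Lines that do not start with an identifier (blank or comment-only) are skipped.
std::vector<Enumerator> parse_enumerators(const std::string& content)
{
    // group 1: name, group 2: value, group 3: comment text after "//"
    static const std::regex line_rx(
        R"(\s*(\w+)\s*(?:=\s*([^,/]+?))?\s*,?\s*(?://\s*(.*))?)");

    std::vector<Enumerator> result;
    std::istringstream lines(content);
    std::string line;
    std::smatch m;
    while (std::getline(lines, line)) {
        if (std::regex_match(line, m, line_rx))  // the whole line must match
            result.push_back({m[1].str(), m[2].str(), m[3].str()});
    }
    return result;
}
```

Feeding it the content captured in the first step gives one record per enumerator. For a line such as `low='l', // C++11 allows the extra comma` the three fields are `low`, `'l'` and the comment text.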
