
I have a requirement to build an automated system to parse a C++ .h file with a lot of #define statements in it and do something with the value that each #define works out to. The .h file has a lot of other junk in it besides the #define statements.

The objective is to create a key-value list, where the keys are all the keywords defined by the #define statements and the values are the evaluations of the macros which correspond to the definitions. The #defines define the keywords with a series of nested macros that ultimately resolve to compile-time integer constants. There are some that do not resolve to compile-time integer constants, and these must be skipped.

The .h file will evolve over time, so the tool cannot be a long hardcoded program which instantiates a variable to be equal to each keyword. I have no control over the contents of the .h file. The only guarantees are that it can be built with a standard C++ compiler, and that more #defines will be added but never removed. The macro formulas may change at any time.

The options I see for this are:

  1. Implement a partial (or hook into an existing) C++ compiler and intercept the value of the macros during the preprocessor step.
  2. Use regexes to dynamically build a source file which will consume all the macros currently defined, then compile and execute the source file to get the evaluated form of all the macros. Somehow (?) skip the macros which do not evaluate to compile-time integer constants. (Also, not sure if regex is expressive enough to capture all possible multi-line macro definitions)

Both of these approaches would add substantial complexity and fragility to the build process for this project which I would like to avoid. Is there a better way to evaluate all the #define macros in a C++ .h file?

Below is an example of what I am looking to parse:

#ifndef Constants_h
#define Constants_h

namespace Foo
{
#define MAKE_CONSTANT(A, B) (A | (B << 4))
#define MAGIC_NUMBER_BASE 40
#define MAGIC_NUMBER MAGIC_NUMBER_BASE + 0x2
#define MORE_MAGIC_1 345
#define MORE_MAGIC_2 65


    // Other stuff...


#define CONSTANT_1 MAKE_CONSTANT (MAGIC_NUMBER + 564, MORE_MAGIC_1 | MORE_MAGIC_2)
#define CONSTANT_2 MAKE_CONSTANT (MAGIC_NUMBER - 84, MORE_MAGIC_1 & MORE_MAGIC_2 ^ 0xA)
    // etc...

#define SKIP_CONSTANT "What?"

    // More CONSTANT_N mixed with more other stuff and constants which do
    // not resolve to compile-time integers and must be skipped


}

#endif // Constants_h

What I need to get out of this is the names and evaluations of all the defines which resolve to compile-time integer constants. In this case, for the defines shown it would be

MAGIC_NUMBER_BASE 40
MAGIC_NUMBER 42
MORE_MAGIC_1 345
MORE_MAGIC_2 65
CONSTANT_1 1887
CONSTANT_2 -9

It doesn't really matter what format this output is in as long as I can work with it as a list of key-value pairs further down the pipe.

Techrocket9
  • Just use an existing C preprocessor to help you out. The regular GNU `cpp` with the `-dU` option should get you quite near to the result you are after. – Matteo Italia Mar 16 '17 at 20:57
  • Why are `CONSTANT_1` and `CONSTANT_2` in the output but `MAGIC_NUMBER_BASE`, `MAGIC_NUMBER`, `MORE_MAGIC_1`, `MORE_MAGIC_2` aren't? It appears they meet your criteria (defines which resolve to compile-time integer constants) at least as well as the other two. – Ben Voigt Mar 16 '17 at 20:57
  • @BenVoigt They should be in the output, I'll fix it now. – Techrocket9 Mar 16 '17 at 20:59
  • @MatteoItalia I'm not getting any of the output values expected when I use the -dU flag on cpp. Is there another flag I need? – Techrocket9 Mar 16 '17 at 21:03
  • Uh sorry, it outputs only macros that are used. You'd need something like `-dM` but without the predefined macros (`-dD` does that, but also prints the processed output). – Matteo Italia Mar 16 '17 at 21:08
  • @MatteoItalia I could screen out the predefined macros, but even with -dM it's not expanding the #defines for CONSTANT_1 and 2. It's just outputting the whole #define line with the MAKE_CONSTANT and everything. – Techrocket9 Mar 16 '17 at 21:11
  • Yep, that is unfortunate, I remembered those options to be a bit smarter. – Matteo Italia Mar 16 '17 at 21:17
  • @Techrocket9: The issue is that the rules of the preprocessor don't do what you want at all. Macros used by macros get expanded when the outer macro gets used, not where it gets defined. So `CONSTANT_1` is not a compile-time integral expression until it gets used... and it's possible that some uses are and some are not. – Ben Voigt Mar 17 '17 at 05:57
  • @BenVoigt, Ok, so tweak the definition to "resolves to a compile-time integer constant when used in an expression of the form 'int n = CONSTANT_N;'" – Techrocket9 Mar 17 '17 at 06:01
  • *so the tool cannot be a long hardcoded program which instantiates a variable to be equal to each keyword* I don't see why that is such a bad idea. – R Sahu Mar 21 '17 at 15:05
  • @RSahu Because the hardcoded list will quickly become outdated when the .h file changes. If the list is dynamically generated then this approach is fine (that's more or less option 2 above), but the parameters of this project don't allow for human intervention when the .h file changes. – Techrocket9 Mar 21 '17 at 18:51
  • @Techrocket9, in theory that's a problem. In practice, it might not be. After all, you don't keep adding macros to a header file on a regular basis. – R Sahu Mar 21 '17 at 19:02
  • @RSahu Before making this post I submitted that solution to the project team and it was rejected due to requiring a human to update the hardcoded file. I need a fully automated solution. – Techrocket9 Mar 21 '17 at 19:07
  • @Techrocket9, Fair enough. Hope you are able to find a solution. Best of luck. – R Sahu Mar 21 '17 at 19:09
  • The options you describe seem (in combination) to be part of a workable solution to your question as asked. I wouldn't worry about those options making the build process more complex or fragile - the requirement to do such a thing does that, all on its own. My concern is that this question is an example of the XY problem (need to do X, someone decides it is necessary to do Y to achieve X, question appears about how to do Y, nobody can provide a worthwhile answer or offer useful alternatives because there is no actual mention of X in the question). – Peter Mar 24 '17 at 07:23
  • @Peter The X is populate and update a DB lookup table so that when we get a record coming in with a numeric ID it can be joined with data in the DB provided by other teams which is only associated with the name. The convention is that the name is a #define in this header file with the definition resolving to the ID. The only authoritative source in the company of the name-ID pairings is this header file, which is used and modified by multiple teams around the world. – Techrocket9 Mar 24 '17 at 09:32
  • Then I'd argue you need to change approach. Allow the teams to populate the DB, and have a program that generates the header file from the names in the DB. In a makefile all you need to do is set up a dependency between the header file and the database so, if the database is changed, the header is regenerated. With other dependencies set appropriately, all objects that depend on the header would be rebuilt. – Peter Mar 24 '17 at 10:16
  • @Peter I agree with you that the way it's done is nowhere close to ideal, and so does the team which owns the header file. However, when we asked them to consider alternative approaches to this data authority problem we were told that they would like to fix it, but they have too many higher priority items and it's on the technical debt backlog with no execution date in sight. Thus, my team is stuck with this unorthodox parsing problem. – Techrocket9 Mar 24 '17 at 10:35
  • Boost Wave is a preprocessor designed for embedding into other software so if you don't want to invoke a compiler to do this you could use Boost Wave to do it as a standalone program. – Jerry Jeremiah Mar 27 '17 at 02:07

3 Answers


An approach could be to write a "program generator" that generates a program (the printDefines program) comprising statements like std::cout << "MAGIC_NUMBER" << " " << (MAGIC_NUMBER_BASE + 0x2) << std::endl;. Obviously, executing such statements will resolve the respective macros and print out their values.

The list of macros in a header file can be obtained by running g++ with the -dM -E options. Feeding this list of #defines to the "program generator" produces a source file (defines.cpp in the script below) containing all the required cout statements. Compiling and executing the generated program then yields the final output. It resolves all the macros, including those that themselves use other macros.

See the following shell script and the following program generator code that together implement this approach:

Script printing the values of #define-statements in "someHeaderfile.h":

#  printDefines.sh
g++ -std=c++11 -dM -E someHeaderfile.h > defines.txt
./generateDefinesCpp someHeaderfile.h defines.txt > defines.cpp
g++ -std=c++11 -o defines.o defines.cpp
./defines.o

Code of program generator "generateDefinesCpp":

#include <cstdio>
#include <cstring>
#include <fstream>
#include <iostream>
#include <string>

using std::cout;
using std::endl;

/*
 * Argument 1: name of the header file to scan
 * Argument 2: name of the file holding the "g++ -dM -E" output for that header
 * The generated C++ source is written to standard output.
 */
int main(int argc, char* argv[])
{
    if (argc < 3) {
        std::cerr << "usage: generateDefinesCpp <header> <defines-list>" << endl;
        return 1;
    }
    cout << "#include<iostream>" << endl;
    cout << "#include<stdio.h>" << endl;
    cout << "#include \"" << argv[1] << "\"" << endl;
    cout << "int main() {" << endl;
    std::ifstream definesFile(argv[2], std::ios::in);
    std::string buffer;
    char macroName[1000];
    int macroValuePos = 0;
    while (std::getline(definesFile, buffer)) {
        const char* bufferCStr = buffer.c_str();
        // %999s bounds the write into macroName; %n records where the macro body starts.
        if (sscanf(bufferCStr, "#define %999s %n", macroName, &macroValuePos) == 1) {
            const char* macroValue = bufferCStr + macroValuePos;
            // Skip names starting with '_', function-like macros, and empty definitions.
            if (macroName[0] != '_' && strchr(macroName, '(') == NULL && *macroValue) {
                cout << "std::cout << \"" << macroName << "\" << \" \" << (" << macroValue << ") << std::endl;" << std::endl;
            }
        }
    }
    cout << "return 0; }" << endl;

    return 0;
}

The approach could be optimised so that the intermediate files defines.txt and defines.cpp are not necessary; for demonstration purposes, however, they are helpful. When applied to your header file, the content of defines.txt and defines.cpp will be as follows:

defines.txt:

#define CONSTANT_1 MAKE_CONSTANT (MAGIC_NUMBER + 564, MORE_MAGIC_1 | MORE_MAGIC_2)
#define CONSTANT_2 MAKE_CONSTANT (MAGIC_NUMBER - 84, MORE_MAGIC_1 & MORE_MAGIC_2 ^ 0xA)
#define Constants_h 
#define MAGIC_NUMBER MAGIC_NUMBER_BASE + 0x2
#define MAGIC_NUMBER_BASE 40
#define MAKE_CONSTANT(A,B) (A | (B << 4))
#define MORE_MAGIC_1 345
#define MORE_MAGIC_2 65
#define OBJC_NEW_PROPERTIES 1
#define SKIP_CONSTANT "What?"
#define _LP64 1
#define __APPLE_CC__ 6000
#define __APPLE__ 1
#define __ATOMIC_ACQUIRE 2
#define __ATOMIC_ACQ_REL 4
...

defines.cpp:

#include<iostream>
#include<stdio.h>
#include "someHeaderfile.h"
int main() {
std::cout << "CONSTANT_1" << " " << (MAKE_CONSTANT (MAGIC_NUMBER + 564, MORE_MAGIC_1 | MORE_MAGIC_2)) << std::endl;
std::cout << "CONSTANT_2" << " " << (MAKE_CONSTANT (MAGIC_NUMBER - 84, MORE_MAGIC_1 & MORE_MAGIC_2 ^ 0xA)) << std::endl;
std::cout << "MAGIC_NUMBER" << " " << (MAGIC_NUMBER_BASE + 0x2) << std::endl;
std::cout << "MAGIC_NUMBER_BASE" << " " << (40) << std::endl;
std::cout << "MORE_MAGIC_1" << " " << (345) << std::endl;
std::cout << "MORE_MAGIC_2" << " " << (65) << std::endl;
std::cout << "OBJC_NEW_PROPERTIES" << " " << (1) << std::endl;
std::cout << "SKIP_CONSTANT" << " " << ("What?") << std::endl;
return 0; }

And the output of executing defines.o is then:

CONSTANT_1 1887
CONSTANT_2 -9
MAGIC_NUMBER 42
MAGIC_NUMBER_BASE 40
MORE_MAGIC_1 345
MORE_MAGIC_2 65
OBJC_NEW_PROPERTIES 1
SKIP_CONSTANT What?
Techrocket9
Stephan Lechner
  • This is pretty close to the implementation I've been working on in powershell and MSVC. Is there a substantial advantage to printing macroValue in each print statement instead of just macroName again? – Techrocket9 Mar 27 '17 at 21:39
  • No substantial advantage; just a little bit more documentation in the cpp file. – Stephan Lechner Mar 28 '17 at 03:12

Here is a concept, based on assumptions from a clarification comment.

  • only one header
  • no includes
  • no dependency on the including code file
  • no dependency on previously included headers
  • no dependency on include order

Main requirements otherwise:

  • do not risk influence on binary build process (being the part which makes the actual software product)
  • do not try to emulate the binary build compiler/parser

How to:

  • make a copy
  • include it from a dedicated code file,
    which only contains the line #include "copy.h";
    or directly preprocess the header
    (this just feels weirdly against my habits)
  • delete everything except preprocessor and pragmas, paying attention to line-continuation
  • replace all "#define"s by "HaPoDefine", except one (e.g. the first)
  • repeat
    • preprocess the including code file (most compilers have a switch to do this)
    • save the output
    • turn another "HaPoDefine" back into "#define"
  • until no "HaPoDefine" is left
  • harvest all macro expansions from the deltas of intermediate saves
  • discard everything which is not of relevance
  • since the final numerical value is most likely computed by the compiler (not the preprocessor), use a tool like bash's "expr" to calculate values for human-eye readability,
    and be careful not to introduce differences from the binary build process
  • use some regex magic to achieve any desired format
Yunnosch
  • To clarify: At least for now, it's a single header file that I have to parse, so I don't need to worry about includes. The .h file is used and modified by many teams using multiple compilers (GCC and MSVC that I know about), and should remain C++14 compliant. – Techrocket9 Mar 22 '17 at 01:14
  • The file only needs to remain compliant for the binary build process. For the docu build process, and only temporarily, it can go through iterations of totally individual content. The manipulated versions could also have a different name and/or extension. Actually the extension ".i" is very likely. It being only one file makes things just easier. But be careful with thinking of "one header". The content of a header is always seen from point of view of a code file being compiled. And the content can differ between code files (e.g. AUTOSAR concepts). Lucky you if your systems are simpler. – Yunnosch Mar 22 '17 at 01:47

Can you use g++ or gcc with the -E option, and work with that output?

-E Stop after the preprocessing stage; do not run the compiler proper. The output is in the form of preprocessed source code, which is sent to the standard output. Input files which don't require preprocessing are ignored.

With this, I imagine:

  1. Create the list of all #define keys from the source
  2. Run the appropriate command below against the source file(s), and let the GNU preprocessor do its thing
  3. Grab the preprocessed result from stdout, filter to take only those in integer form, and output it to however you want to represent key/value pairs

One of these two commands:

gcc -E myFile.c
g++ -E myFile.cpp

https://gcc.gnu.org/onlinedocs/gcc-2.95.2/gcc_2.html
https://gcc.gnu.org/onlinedocs/cpp/Preprocessor-Output.html

kmiklas