I am extending a software tool that calculates metrics for software projects. The metrics are then used for static code analysis. My task is to implement the metric calculation for C and C++ projects.
During development I ran into problems that forced me to start over several times with a different tool or programming language. I will describe the process, the problems, and what I tried to solve them, in chronological order and as precisely as I can.
Some metrics:
- Lines of Code for Classes, Structs, Unions, Functions/Methods and source files
- Method Count for Classes and Structs
- Complexity for Classes, Structs and Functions/Methods
- Dependencies for/between Classes and Structs
Since C++ is a difficult language to parse and writing a C++ parser on my own is out of scope, I decided to use an existing one. I therefore started with the libraries from the LLVM project to gather syntactic and semantic information about a source file.
LLVM Tooling link: https://clang.llvm.org/docs/Tooling.html
First, I started with LibTooling, written in C++, since it promised "full control" over the Abstract Syntax Tree (AST). I tried both the RecursiveASTVisitor and the MatchFinder approaches without success.
I dismissed LibTooling because I could not retrieve context information about the surroundings of a node in the AST. I was only able to react to a callback when a specific node was visited, but I did not know in what context I currently was. For example, when I visit a CXXRecordDecl (class, struct, union), I do not know whether it is a nested record or not. That information is needed to calculate the lines of code for a single class.
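To make the nesting requirement concrete, here is a minimal sketch of the calculation I am after. It runs on a hypothetical toy tree, not on the real Clang AST: the `Node` class, the kind labels and the `record_metrics` function are made up for illustration. It also assumes that the "own" lines of a record are its extent minus the extents of directly nested records, which is just one possible definition.

```python
from dataclasses import dataclass, field
from typing import List

# Hypothetical kind labels; a real AST would use its own node kinds.
RECORD_KINDS = {"class", "struct", "union"}


@dataclass
class Node:
    """Toy AST node with a source extent given as 1-based line numbers."""
    kind: str
    name: str
    start_line: int
    end_line: int
    children: List["Node"] = field(default_factory=list)


def extent_lines(node):
    """Number of source lines covered by a node, both ends inclusive."""
    return node.end_line - node.start_line + 1


def record_metrics(root):
    """Walk the tree and report each record with its enclosing records."""
    results = []
    enclosing = []  # stack of names of the records we are currently inside

    def visit(node):
        is_record = node.kind in RECORD_KINDS
        if is_record:
            # Simplification: only records that are direct children are
            # subtracted, so a class declared inside a method is not handled.
            nested = sum(extent_lines(c) for c in node.children
                         if c.kind in RECORD_KINDS)
            results.append({
                "name": node.name,
                "enclosing": list(enclosing),
                "total_lines": extent_lines(node),
                "own_lines": extent_lines(node) - nested,
            })
            enclosing.append(node.name)
        for child in node.children:
            visit(child)
        if is_record:
            enclosing.pop()

    visit(root)
    return results


# Toy input: class Outer (lines 2-15) contains struct Inner (lines 6-10).
tree = Node("file", "example.cpp", 1, 20, [
    Node("class", "Outer", 2, 15, [
        Node("method", "f", 3, 5),
        Node("struct", "Inner", 6, 10, [Node("method", "g", 7, 9)]),
    ]),
    Node("function", "main", 17, 19),
])

metrics = record_metrics(tree)
for entry in metrics:
    print(entry)
```

The important part is the `enclosing` stack: without it, `Inner` cannot be told apart from a top-level struct, and its lines would be counted twice.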
My second approach was the LibClang interface via its Python bindings. With LibClang I was able to traverse the AST recursively, node by node, and store the needed context information on a stack. Here I encountered a general problem with LibClang:
Before the AST for a file is created, the preprocessor runs and resolves all preprocessor directives, just as it is supposed to.
- This is good because resolved includes are what make the AST complete; if the preprocessor cannot resolve all include directives, the resulting AST is incomplete.
- This is very bad because I will not be able to provide all the include files or directories for an arbitrary C++ project.
- This is bad because code guarded by conditional preprocessor directives ends up in the AST or not depending on whether a preprocessor macro is defined. Parsing the same file multiple times with different combinations of defined and undefined macros is out of scope.
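To illustrate why parsing every configuration is out of scope, here is a rough sketch that collects the macro names tested in conditional directives and counts the naive number of defined/undefined combinations. The function name `conditional_macros` and the sample input are hypothetical, and the regular expressions are a simplification: for `#if`/`#elif` only `defined(...)` tests are recognised, so value comparisons such as `#if FOO > 2` are ignored.

```python
import re

# Matches the conditional directives that open or continue a branch.
DIRECTIVE = re.compile(r"^\s*#\s*(ifdef|ifndef|if|elif)\b(.*)$")
# Matches `defined(NAME)` as well as `defined NAME`.
DEFINED = re.compile(r"\bdefined\s*\(?\s*([A-Za-z_]\w*)")
IDENT = re.compile(r"[A-Za-z_]\w*")


def conditional_macros(source):
    """Return the set of macro names tested in conditional directives."""
    macros = set()
    for line in source.splitlines():
        match = DIRECTIVE.match(line)
        if not match:
            continue
        keyword, rest = match.groups()
        if keyword in ("ifdef", "ifndef"):
            ident = IDENT.search(rest)
            if ident:
                macros.add(ident.group(0))
        else:
            macros.update(DEFINED.findall(rest))
    return macros


sample = """\
#ifdef DEBUG
void log_state();
#endif
#if defined(_WIN32) && !defined(USE_POSIX)
void win_impl();
#elif defined(__linux__)
void linux_impl();
#endif
"""

macros = conditional_macros(sample)
combinations = 2 ** len(macros)
print(sorted(macros), combinations)
```

Even this tiny snippet tests four macros, which gives 16 naive combinations. Real headers test far more, so exhaustively re-parsing each file is not feasible for me.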
This led to the third and current attempt: a C++ parser generated by ANTLR from a C++14 grammar.
No preprocessor is executed before the parser. This is good because the full source code is parsed and preprocessor directives are ignored. The bad thing is that the parser does not seem to be very robust. It fails on code that compiles fine, which leads to a broken AST. So this solution is not sufficient either.
My questions are:
- Is there an option to deactivate the preprocessor before parsing a C/C++ source or header file with LibClang, so that the source code is left untouched and the AST is complete and detailed?
- Is there a way to parse a C/C++ source file without providing all the necessary include directories and still get a detailed AST?
- Since I am running out of options: what other approaches are worth looking at for analysing/parsing C/C++ source code?
If you think this is not the right place to ask such questions, feel free to redirect me to another place.