
I am extending a software tool that calculates metrics for software projects. The metrics are then used for static code analysis. My task is to implement the metric calculation for C and C++ projects.

During development I ran into problems that made me start over with a different tool or programming language. I will describe the process, the problems, and what I tried in order to solve them, in chronological order and as well as I can.

Some metrics:

  • Lines of Code for Classes, Structs, Unions, Functions/Methods and Source Files
  • Method Count for Classes and Structs
  • Complexity for Classes, Structs and Functions/Methods
  • Dependencies for/between Classes and Structs

Since C++ is a difficult language to parse and writing a C++ parser on my own is out of scope, I decided to use an existing C++ parser. I therefore began using libraries from the LLVM project to gather syntactic and semantic information about a source file.

LLVM Tooling link: https://clang.llvm.org/docs/Tooling.html


First I started with LibTooling, which is written in C++, since it promised "full control" over the Abstract Syntax Tree (AST). I tried the RecursiveASTVisitor and the MatchFinder approaches without success.

I dismissed LibTooling because I couldn't retrieve context information about the surroundings of a node in the AST. I was only able to react to a callback when a specific node in the AST was visited, but I didn't know in what context I currently was. E.g. when I visit a CXXRecordDecl (class, struct, union), I don't know whether it is a nested record or not. But that information is needed to calculate the lines of code for a single class.
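A minimal sketch of the kind of context tracking meant here, written in plain Python. The Node class and the find_records function are hypothetical stand-ins for illustration, not Clang or LibClang types. The idea is only that an explicit stack of ancestors is pushed on entering a node and popped on leaving it, so a record can tell whether it sits inside another record.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    """Hypothetical stand-in for an AST node (not a Clang type)."""
    kind: str   # e.g. "file", "record", "method"
    name: str
    children: List["Node"] = field(default_factory=list)


def find_records(node, stack=None, found=None):
    """Walk the tree depth-first while keeping the ancestors on a stack."""
    stack = [] if stack is None else stack
    found = [] if found is None else found
    if node.kind == "record":
        # A record is nested if any ancestor on the stack is also a record.
        enclosing = [a.name for a in stack if a.kind == "record"]
        found.append((node.name, enclosing))
    stack.append(node)   # entering the node
    for child in node.children:
        find_records(child, stack, found)
    stack.pop()          # leaving the node
    return found


tree = Node("file", "a.cpp", [
    Node("record", "Outer", [
        Node("method", "f"),
        Node("record", "Inner"),
    ]),
    Node("record", "Other"),
])
records = find_records(tree)
# Each entry pairs a record name with the names of its enclosing records.
```

With this bookkeeping, the lines of an inner record could be subtracted from (or attributed to) the outer one, depending on how the metric is defined.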


My second approach was using the LibClang interface via its Python bindings. With the LibClang interface I was able to traverse the AST recursively, node by node, and store the needed context information on a stack. Here I encountered a general problem with LibClang:

Before the AST for a file is created, the preprocessor runs and resolves all preprocessor directives, just as it is supposed to.

  • This is good because the AST is only complete if the preprocessor can resolve all the include directives.
  • This is very bad because I won't be able to provide all the include files or directories for an arbitrary C++ project.
  • This is bad because code surrounded by conditional preprocessor directives is or is not part of the AST, depending on whether a preprocessor variable is defined. Parsing the same file multiple times with different combinations of defined and undefined preprocessor variables is out of scope.
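To get a feeling for how much code is affected by the last point, the conditional regions of a file can be located without any parser. The following is a rough sketch under stated assumptions: the function name conditional_regions is made up for illustration, line continuations are ignored, and directive-like text inside comments or string literals would be counted by mistake.

```python
import re

# Matches the directives that open, continue, or close a conditional region.
DIRECTIVE = re.compile(r"^\s*#\s*(if|ifdef|ifndef|elif|else|endif)\b")


def conditional_regions(source):
    """Return (start_line, end_line, nesting_depth) per #if...#endif region."""
    regions = []
    open_starts = []   # stack of line numbers of still unmatched #if* lines
    for lineno, line in enumerate(source.splitlines(), start=1):
        match = DIRECTIVE.match(line)
        if not match:
            continue
        keyword = match.group(1)
        if keyword in ("if", "ifdef", "ifndef"):
            open_starts.append(lineno)
        elif keyword == "endif" and open_starts:
            start = open_starts.pop()
            regions.append((start, lineno, len(open_starts) + 1))
    return regions


sample = """\
#ifndef FOO_H
#define FOO_H
#ifdef USE_FAST_PATH
int compute() { return 1; }
#else
int compute() { return 2; }
#endif
#endif
"""
regions = conditional_regions(sample)
# Inner regions are reported before the regions that enclose them.
```

In the sample, only one of the two compute definitions would end up in an AST built after preprocessing, which is exactly the loss described above.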

This led to the third and current attempt: using a C++ parser generated by ANTLR from a C++14 grammar.

No preprocessor runs before the parser. This is good because the full source code is parsed and preprocessor directives are ignored. The bad thing is that the parser does not seem to be very robust. It fails on code that compiles, which leads to a broken AST. So this solution is not sufficient either.


My questions are:

  • Is there an option to deactivate the preprocessor before parsing a C/C++ source or header file with LibClang, so that the source code is untouched and the AST is complete and detailed?
  • Is there a way to parse a C/C++ source file without providing all the necessary include directories and still get a detailed AST?
  • Since I am running out of options: what other approaches may be worth looking at when it comes to analysing/parsing C/C++ source code?

If you think this is not the right place to ask such questions, feel free to redirect me to another place.

Tarexx
  • Why is it impossible for you to provide the correct include paths? You can't be "parsing the full source code" without running the preprocessor. It's impossible to build a correct AST for C++ without having seen the declarations of everything that the given piece of C++ refers to. It may be possible to get an OK approximation of the metrics you seek most of the time in practice. But to get that, you probably neither really need nor want to build an AST to begin with. Whatever you do, you'll almost certainly have to resort to heuristics to make up for all the information you don't have… – Michael Kenzel Mar 21 '19 at 13:58
  • @MichaelKenzel The workflow for analysing a project is that I get the root source code folder of a project without third-party includes such as the Boost library files, since these were not developed by the customer and are therefore not of interest for the static code analysis. So I am not able to provide the preprocessor with the needed includes. tl;dr: I don't have these include files/directories. – Tarexx Mar 21 '19 at 14:27
  • I don't understand the comment regarding libTooling about not being able to `"retrieve context information about the surrounding of a node in the AST"`. You have the full AST (I think) so what context is missing? – G.M. Mar 21 '19 at 19:15
  • @G.M. By the surroundings of a node I mean its parent node or its child nodes. But the RecursiveASTVisitor only provides a callback when a node of a specified type is met while traversing the AST. So in my opinion I am not able to determine whether the current node (the node which led to the callback) is, for example, a class declaration within another class declaration, because I can't tell in what order the callbacks will happen. Maybe my view on tree traversal is too limited. If I am not able to push and pop nodes on a stack to keep track of what came before the current node, I am pretty lost. – Tarexx Mar 22 '19 at 07:46
  • This is an interesting question/project, but IMO far too broad for SO, so I lean towards closing it. Still, a note from me: in general you cannot parse C++ without preprocessing includes and macros. Many times macros contain a part of the source code and without resolving them you won't have valid C++ code. There's no way around running a preprocessor and hence no way to do what you want without the necessary include files. – Mike Lischke Mar 22 '19 at 08:10
  • @MikeLischke Thanks for your feedback on the question in general. I somehow expected that it won't be possible to parse C or C++ without resolving all the preprocessor directives and includes. Thanks for the clarification on that. About the question being too broad for SO: do you know another place where my question would be more appropriate to ask? It is more like an engineering question on how to proceed or tackle a problem. Or would it be better to cut the big question down into smaller ones more suitable for SO and ask them here? This goes into meta discussion. Maybe I should ask 'there'... – Tarexx Mar 22 '19 at 08:28
  • Yes, splitting your problem into smaller pieces that can be answered with a few sentences is much preferred. Also, always include what you already tried to solve your problem. Many people otherwise assume you ask them to do your homework (and vote to close the question). Good luck and welcome to Stack Overflow, btw. :-) – Mike Lischke Mar 22 '19 at 13:52

1 Answer


To answer your last question,

Since I am running out of options: what other approaches may be worth looking at when it comes to analysing/parsing C/C++ source code?

Another approach is to parse the source code as if it were merely text. This avoids the need to preprocess the source and to bring in a complex parser. See this paper for an example/introduction: "The Conceptual Cohesion of Classes" by Andrian Marcus and Denys Poshyvanyk. You can still collect information such as LOC and number of methods with this approach, without needing a full parser.
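As a rough illustration of treating the source as text, here is a minimal Python sketch that classifies physical lines as blank, comment-only, or code. It is not taken from the paper; the function name count_lines is made up, and it assumes that comment markers never appear inside string literals, which real code can violate.

```python
def count_lines(source):
    """Classify each physical line as blank, comment-only, or code."""
    counts = {"total": 0, "blank": 0, "comment": 0, "code": 0}
    in_block = False   # True while inside a /* ... */ comment
    for line in source.splitlines():
        counts["total"] += 1
        text = line.strip()
        if not text and not in_block:
            counts["blank"] += 1
            continue
        has_code = False
        i = 0
        while i < len(text):
            if in_block:
                end = text.find("*/", i)
                if end == -1:
                    break          # comment continues on the next line
                in_block = False
                i = end + 2
            elif text.startswith("//", i):
                break              # rest of the line is a comment
            elif text.startswith("/*", i):
                in_block = True
                i += 2
            else:
                if not text[i].isspace():
                    has_code = True
                i += 1
        counts["code" if has_code else "comment"] += 1
    return counts


sample = """\
/* header
   comment */
int add(int a, int b) {   // adds

    return a + b; /* sum */
}
"""
counts = count_lines(sample)
```

A line that mixes code and a comment is counted as code here; whether that matches the intended LOC definition is a design choice for the metric.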

This approach has drawbacks (as does any approach):

  • It either 1) parses comments along with the source code, or 2) requires that you remove comments from the source. But the latter is an easy step. The reason the former might be OK is that even the comments contain information regarding the code, which may help determine which modules are more closely coupled, etc.
  • It will lump local variables, method names, parameter names, etc. all into the "bag of words" that you are working with.
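Both drawbacks can be seen in a small sketch. The Python below is only an illustration under assumptions, not the method from the paper: the names strip_comments and bag_of_words are hypothetical, the comment pattern does not account for string literals containing comment markers, and language keywords such as int are deliberately not filtered out.

```python
import re
from collections import Counter

# Line comments and block comments (string literals are not considered).
COMMENT = re.compile(r"//[^\n]*|/\*.*?\*/", re.DOTALL)
IDENTIFIER = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")
# Splits camelCase / PascalCase runs such as "totalPrice" or "HTTPServer".
WORD_PART = re.compile(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|[0-9]+")


def strip_comments(source):
    """Replace every comment with a single space."""
    return COMMENT.sub(" ", source)


def bag_of_words(source, keep_comments=True):
    """Count lower-cased word parts of all identifier-like tokens."""
    text = source if keep_comments else strip_comments(source)
    words = Counter()
    for identifier in IDENTIFIER.findall(text):
        for chunk in identifier.split("_"):
            for part in WORD_PART.findall(chunk):
                words[part.lower()] += 1
    return words


sample = """\
// total price
int totalPrice(int unit_price) { return unit_price * 2; }
"""
with_comments = bag_of_words(sample)
code_only = bag_of_words(sample, keep_comments=False)
```

In the sample, the word "price" is counted once more when comments are kept, and the function name, the parameter name and the keywords all land in the same bag, as described in the second bullet.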