Instrumenting C/C++ code using LLVM

Question

I want to write an LLVM pass that instruments every memory access. Here is what I am trying to do.

Given any C/C++ program (like the one below), I want to insert a call to some function before and after every instruction that reads from or writes to memory. For example, consider the following C++ program (Account.cpp):

#include <stdio.h>

class Account {
  int balance;

public:
  Account(int b)
  {
    balance = b;
  }
  ~Account() { }

  int read()
  {
    int r;
    r = balance;
    return r;
  }

  void deposit(int n)
  {
    balance = balance + n;
  }

  void withdraw(int n)
  {
    int r = read();
    balance = r - n;
  }
};

int main()
{
  Account* a = new Account(10);
  a->deposit(1);
  a->withdraw(2);
  delete a;
}

After instrumentation, my program should look like this:

#include <stdio.h>

class Account {
  int balance;

public:
  Account(int b)
  {
    balance = b;
  }
  ~Account() { }

  int read()
  {
    int r;
    foo();
    r = balance;
    foo();
    return r;
  }

  void deposit(int n)
  {
    foo();
    balance = balance + n;
    foo();
  }

  void withdraw(int n)
  {
    foo();
    int r = read();
    foo();
    foo();
    balance = r - n;
    foo();
  }
};

int main()
{
  Account* a = new Account(10);
  a->deposit(1);
  a->withdraw(2);
  delete a;
}

where foo() may be any function, for example one that gets the current system time or increments a counter.
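To make the goal concrete, here is a minimal sketch of the instrumented class with foo() implemented as a call counter. The counter variable g_hook_calls and the helper runScenario() are assumptions added for illustration; they are not part of the original program. The foo() calls sit exactly where the hand-instrumented listing above puts them.

```cpp
// Hypothetical hook: counts how many times it is called.
static int g_hook_calls = 0;
void foo() { ++g_hook_calls; }

class Account {
  int balance;

public:
  explicit Account(int b) : balance(b) {}

  int read() {
    int r;
    foo();
    r = balance;
    foo();
    return r;
  }

  void deposit(int n) {
    foo();
    balance = balance + n;
    foo();
  }

  void withdraw(int n) {
    foo();
    int r = read();
    foo();
    foo();
    balance = r - n;
    foo();
  }
};

// Hypothetical helper: replays the call sequence from main() in the
// question and returns the number of hook calls it triggered.
int runScenario() {
  g_hook_calls = 0;
  Account a(10);
  a.deposit(1);
  a.withdraw(2);
  return g_hook_calls;
}
```

With the placement shown in the question, runScenario() returns 8: two calls from deposit(), four from withdraw(), and two from the nested read().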

Please give me examples (source code, tutorials, etc.) and the steps to run them. I have read the tutorial on writing an LLVM pass at http://llvm.org/docs/WritingAnLLVMPass.html, but I couldn't figure out how to write a pass for the problem above.

Well, you could probably overload the operators so that they not only perform the actual addition, subtraction, and assignment but also call your custom function. — vishakvkt, Oct 18 '11 at 11:59
Why do you want to add these functions? If you want to debug your program, there are better methods available. — tune2fs, Oct 18 '11 at 12:38
I was going to vote to close this question as a duplicate of http://stackoverflow.com/questions/7526550/instrumenting-c-c-codes-using-llvm (I mean, look at the titles) and I noticed that you were the author of both. How do you expect StackOverflow users to give you answers this time that they didn't give you last time (short of doing it for you, which won't happen)? — Pascal Cuoq, Oct 18 '11 at 14:42
In your example you've missed a lot of potential memory accesses (function calls, pointer dereferencing, variable reads, etc). Virtually every instruction of IR code can potentially access memory and you can't know for sure until the final assembly is generated. Instrumenting every single IR line is most definitely a bad idea and a tool like [valgrind](http://valgrind.org/) might be better suited for your problem. Can you give us a bit more detail as to what you're trying to accomplish? — Ze Blob, Oct 18 '11 at 14:59
@Ze Blob, why "every instruction"? Only `store` and `load` can do a memory access. But of course one should exclude local variable reads/writes (i.e., pointers returned by `alloca` which can be potentially optimised out). — SK-logic, Dec 20 '11 at 07:46
@SK-logic, My understanding is that CPUs have a limited number of registers and it's possible that there will not be enough registers to store all the values you're working with. In which case the compiler has to write one of the registers to memory (stack) in order to make room. This is determined during the register allocation pass, which is dependent on the target architecture. Because this pass is executed when transforming IR code into assembly, it's difficult to determine whether a value will be written to or read from memory by just looking at IR code. — Ze Blob, Dec 21 '11 at 00:38
@Ze Blob, I doubt anyone would be interested in instrumenting stack frame access. Anyway, an order of such operations is not guaranteed, LLVM will reshuffle the entries with no side effects (and can even do so across the basic blocks), so adding something around every instruction is pointless. — SK-logic, Dec 21 '11 at 08:12
@SK-logic I understand that and now that I take another look at the example, I think that I may have missed the fact that the `balance` variable isn't on the stack (oops!). It still doesn't explain the instrumenting of the `read()` function (maybe he wants to instrument `this`?). — Ze Blob, Dec 22 '11 at 23:56
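For comparison, the operator-overloading route suggested in the first comment can be sketched without touching the compiler at all. This is only an illustration under assumptions: the wrapper name Tracked and the counter g_hook_calls are hypothetical, and the approach only catches accesses to members you explicitly wrap, which is why it does not scale to "any C/C++ program".

```cpp
// Hypothetical hook standing in for foo() from the question: counts its calls.
static int g_hook_calls = 0;
void foo() { ++g_hook_calls; }

// Wrapper that calls foo() before and after each read or write of the value.
template <typename T>
class Tracked {
  T value_;

public:
  explicit Tracked(T v) : value_(v) {}  // initialization is not instrumented

  // Read access: implicit conversion back to T.
  operator T() const {
    foo();
    T v = value_;
    foo();
    return v;
  }

  // Write access: assignment from a plain T.
  Tracked &operator=(T v) {
    foo();
    value_ = v;
    foo();
    return *this;
  }
};

class Account {
  Tracked<int> balance;  // only this declaration changes

public:
  explicit Account(int b) : balance(b) {}
  int read() { return balance; }
  void deposit(int n) { balance = balance + n; }
  void withdraw(int n) { balance = read() - n; }
};
```

Note that this counts per access rather than per statement: `balance = balance + n` triggers four hook calls here (two around the read, two around the write), whereas the hand-instrumented listing in the question wraps the whole statement in two.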

score 9 · Answer 1 · answered Oct 20 '11 at 16:47

I'm not very familiar with LLVM, but I am a bit more familiar with GCC (and its plugin machinery), since I am the main author of GCC MELT (a high level domain specific language to extend GCC, which by the way you could use for your problem). So I will try to answer in general terms.

You should first know why you want to adapt a compiler (or a static analyzer). It is a worthwhile goal, but it does have drawbacks (in particular, w.r.t. redefining some operators or other constructs in your C++ program).

The main point when extending a compiler (be it GCC or LLVM or something else) is that you very probably should handle all of its internal representation (and you probably cannot skip parts of it, unless you have a very narrowly defined problem). For GCC it means handling the more than 100 kinds of Tree-s and nearly 20 kinds of Gimple-s: in the GCC middle end, the tree-s represent the operands and declarations, and the gimple-s represent the instructions. The advantage of this approach is that once you've done that, your extension should be able to handle any software acceptable to the compiler. The drawback is the complexity of compilers' internal representations (which is explained by the complexity of the definitions of the C & C++ source languages accepted by the compilers, by the complexity of the target machine code they generate, and by the increasing distance between source & target languages).

So hacking a general compiler (be it GCC or LLVM), or a static analyzer (like Frama-C), is quite a big task (more than a month of work, not a few days). To deal only with tiny C++ programs like the one you are showing, it is not worth it. But it is definitely worth the effort if you plan to deal with large source code bases.

Regards

score 3 · Accepted Answer · answered Dec 20 '11 at 06:03

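Try something like this (you still need to fill in the two blanks marked INSERT THIS with the calls to your own function):

// Legacy pass manager API; BasicBlockPass was removed in LLVM 10.
class ThePass : public llvm::BasicBlockPass {
public:
  static char ID;  // the legacy pass constructor takes a reference to a per-pass ID
  ThePass() : BasicBlockPass(ID) {}
  virtual bool runOnBasicBlock(llvm::BasicBlock &bb);
};
char ThePass::ID = 0;

bool ThePass::runOnBasicBlock(llvm::BasicBlock &bb) {
  bool retval = false;
  for (llvm::BasicBlock::iterator bbit = bb.begin(), bbie = bb.end();
       bbit != bbie;) {
    llvm::Instruction *i = &*bbit;
    ++bbit;  // advance first, so the calls inserted below are never visited

    if (llvm::isa<llvm::PHINode>(i))
      continue;  // PHI nodes must stay grouped at the top of the block

    llvm::CallInst *beforeCall = /* INSERT THIS */;
    beforeCall->insertBefore(i);

    if (!i->isTerminator()) {
      llvm::CallInst *afterCall = /* INSERT THIS */;
      afterCall->insertAfter(i);
    }
    retval = true;  // the block was modified
  }
  return retval;
}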

Hope this helps!
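The tricky part of the loop above is inserting into a list while iterating over it. The following toy model shows one way to handle that, using a std::list of opcode names in place of a real basic block. It deliberately uses no LLVM API; Block, instrumentBlock and the "call foo" marker are hypothetical names chosen for illustration, and the filter on "load" and "store" is an assumption about which instructions you care about.

```cpp
#include <iterator>
#include <list>
#include <string>

// Toy stand-in for a basic block: an ordered list of opcode names.
using Block = std::list<std::string>;

// Inserts `hook` before and after every "load" and "store" entry.
// Returns true if the block was modified.
bool instrumentBlock(Block &block, const std::string &hook = "call foo") {
  bool changed = false;
  for (auto it = block.begin(); it != block.end();) {
    // Remember the successor before inserting anything, so that the
    // hook placed after *it is never visited (and never instrumented).
    auto next = std::next(it);
    if (*it == "load" || *it == "store") {
      block.insert(it, hook);    // lands directly before *it
      block.insert(next, hook);  // lands directly after *it
      changed = true;
    }
    it = next;
  }
  return changed;
}
```

std::list::insert does not invalidate existing iterators, which mirrors how instruction lists behave in LLVM: saving the next position first is enough to keep the traversal stable.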

You should not do it before and after every instruction, but only for `store` and `load` for genuine pointers (not the reducible local `alloca`s). — SK-logic, Dec 20 '11 at 07:47
You should be returning true from the function runOnBasicBlock to indicate that the instructions in the basic block have been changed. — ConfusedAboutCPP, Sep 26 '12 at 12:19
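Putting the two comments above together, a pass along these lines might look like the following. This is a rough sketch under several assumptions, not a tested implementation: it targets the new pass manager as found in recent LLVM releases (header paths and plugin registration have changed between versions, so check them against yours), the names MemAccessHook and mem-access-hook are made up, and the hook is assumed to be declared as `extern "C" void foo()` so that its symbol name is not mangled. It only skips pointers that are directly an `alloca`; addresses computed from an `alloca` (for example elements of a local array) are still instrumented.

```cpp
#include "llvm/ADT/SmallVector.h"
#include "llvm/IR/Function.h"
#include "llvm/IR/IRBuilder.h"
#include "llvm/IR/Instructions.h"
#include "llvm/IR/Module.h"
#include "llvm/IR/PassManager.h"
#include "llvm/Passes/PassBuilder.h"
#include "llvm/Passes/PassPlugin.h"

using namespace llvm;

namespace {
// Hypothetical pass: wraps every non-local load/store in calls to foo().
struct MemAccessHook : PassInfoMixin<MemAccessHook> {
  PreservedAnalyses run(Function &F, FunctionAnalysisManager &) {
    if (F.getName() == "foo")  // do not instrument the hook itself
      return PreservedAnalyses::all();

    Module *M = F.getParent();
    FunctionCallee Hook = M->getOrInsertFunction(
        "foo", FunctionType::get(Type::getVoidTy(M->getContext()), false));

    // Collect first, then insert, so iteration never sees the new calls.
    SmallVector<Instruction *, 16> Accesses;
    for (BasicBlock &BB : F)
      for (Instruction &I : BB) {
        Value *Ptr = nullptr;
        if (auto *LI = dyn_cast<LoadInst>(&I))
          Ptr = LI->getPointerOperand();
        else if (auto *SI = dyn_cast<StoreInst>(&I))
          Ptr = SI->getPointerOperand();
        // Skip accesses that go straight to a local stack slot.
        if (Ptr && !isa<AllocaInst>(Ptr->stripPointerCasts()))
          Accesses.push_back(&I);
      }

    for (Instruction *I : Accesses) {
      IRBuilder<> Before(I);  // inserts in front of I
      Before.CreateCall(Hook);
      // Loads and stores are never terminators, so a next node exists.
      IRBuilder<> After(I->getNextNode());
      After.CreateCall(Hook);
    }
    return Accesses.empty() ? PreservedAnalyses::all()
                            : PreservedAnalyses::none();
  }

  // Run even on functions marked optnone (as clang does at -O0).
  static bool isRequired() { return true; }
};
} // namespace

extern "C" LLVM_ATTRIBUTE_WEAK PassPluginLibraryInfo llvmGetPassPluginInfo() {
  return {LLVM_PLUGIN_API_VERSION, "MemAccessHook", "0.1",
          [](PassBuilder &PB) {
            PB.registerPipelineParsingCallback(
                [](StringRef Name, FunctionPassManager &FPM,
                   ArrayRef<PassBuilder::PipelineElement>) {
                  if (Name != "mem-access-hook")
                    return false;
                  FPM.addPass(MemAccessHook());
                  return true;
                });
          }};
}
```

Assuming the plugin is built as a shared library (say MemAccessHook.so, a hypothetical file name), the usual flow would be to compile Account.cpp to bitcode with `clang++ -O0 -emit-llvm -c`, run `opt -load-pass-plugin=./MemAccessHook.so -passes=mem-access-hook` on the result, and then link the instrumented bitcode with an object file that defines foo().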
