I am done with my project. Posting my approach for the benefit of others.
The main job of the LLVM backend is to convert the Intermediate Representation (IR)
into target-specific machine code (assembly or object files), depending on the
target architecture and other specifications. The backend itself consists of several
phases, which perform target-specific optimization, instruction selection, scheduling
and instruction emission. These phases are required because the IR is a very generic
representation and needs many transformations before it becomes target-specific code.
1) Logging every time the compiler generates jmp *(eax)
We can achieve this by adding print statements to the instruction emitting/printing phase. After most of the conversion from IR is done, there is an AsmPrinter pass which goes through each machine instruction in every basic block of every function. The main loop is AsmPrinter::EmitFunctionBody() in lib/CodeGen/AsmPrinter/AsmPrinter.cpp. There are other related functions such as EmitFunctionEpilogue and EmitFunctionPrologue. These functions finally call EmitInstruction for the specific architecture, e.g. lib/Target/X86/X86AsmPrinter.cpp. If you tinker around a bit, you can call MI->getOpcode() there and compare it with the opcode enums defined for the architecture to print a log.
For example, a jump through a register on X86 is X86::JMP64r. You can get the associated register using MI->getOperand(0), and so on.
// Inside the target's EmitInstruction; dbgs() is declared in llvm/Support/Debug.h
if (MI->getOpcode() == X86::JMP64r)
    dbgs() << "Found jmp *x instruction\n";
2) Replacing the instruction
The required changes depend on the kind of replacement you need. If you need more context about registers or previous instructions, you have to implement the change higher up in the pass chain. There is a representation of instructions called the SelectionDAG (a directed acyclic graph), which stores the dependencies of each instruction on earlier instructions. For example, in the sequence
mov myvalue,%rax
jmp *%rax
the DAG would have the jmp node pointing to the mov node (and possibly other nodes before it), since the value of rax depends on the mov instruction. You can replace the node here with the nodes you need. If done correctly, the change carries through to the final instructions.
The SelectionDAG code is at lib/CodeGen/SelectionDAG/SelectionDAGISel.cpp. It is always best to poke around first to figure out the ideal place to change. Each IR statement goes through multiple transformations before the DAG is topologically sorted so that the instructions end up in a linear sequence. The graphs can be viewed using the -view-*-dags options (e.g. -view-isel-dags) listed by llc --help-hidden.
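To make the "DAG is topologically sorted into a linear sequence" step more concrete, here is a minimal standalone sketch. It does not use LLVM at all: ToyNode, linearize and exampleDag are hypothetical names made up for this illustration, and the sort is plain Kahn's algorithm, not LLVM's actual scheduler. It only shows why the mov has to come out before the jmp that consumes its value.

```cpp
#include <cassert>
#include <cstddef>
#include <queue>
#include <string>
#include <vector>

// Hypothetical toy node: not LLVM's SDNode, just a label plus the indices
// of the nodes whose results it consumes.
struct ToyNode {
    std::string label;
    std::vector<std::size_t> operands;  // indices of nodes this one depends on
};

// Kahn's algorithm: a node is emitted only after every node it depends on.
std::vector<std::string> linearize(const std::vector<ToyNode>& dag) {
    const std::size_t n = dag.size();
    std::vector<std::size_t> pending(n, 0);          // unmet dependencies per node
    std::vector<std::vector<std::size_t>> users(n);  // reverse edges: who uses me
    for (std::size_t i = 0; i < n; ++i) {
        pending[i] = dag[i].operands.size();
        for (std::size_t op : dag[i].operands) {
            assert(op < n && "operand index out of range");
            users[op].push_back(i);
        }
    }

    std::queue<std::size_t> ready;  // nodes with no unmet dependencies
    for (std::size_t i = 0; i < n; ++i)
        if (pending[i] == 0) ready.push(i);

    std::vector<std::string> order;
    while (!ready.empty()) {
        const std::size_t cur = ready.front();
        ready.pop();
        order.push_back(dag[cur].label);
        for (std::size_t user : users[cur])
            if (--pending[user] == 0) ready.push(user);
    }
    assert(order.size() == n && "graph has a cycle, so it is not a DAG");
    return order;
}

// The two-instruction example from the text: the jump uses the value
// produced by the mov, so node 0 lists node 1 as an operand.
std::vector<ToyNode> exampleDag() {
    return {
        {"jmp *%rax", {1}},
        {"mov myvalue,%rax", {}},
    };
}
```

Calling linearize(exampleDag()) puts the mov first and the jmp second, even though the jmp node is stored first. The real SelectionDAG carries far more information (value types, chains, glue), so treat this only as a picture of the dependency idea.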
In my case, I just added a specific check in EmitInstruction and added code to emit the two instructions that I wanted.
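The shape of that change can be sketched without LLVM as follows. Everything here is a hypothetical stand-in (ToyOpcode, ToyInstr, ToyStreamer, emitInstruction, emitBlock), and the replacement pair "and $-32,%rax" followed by "jmp *%rax" is purely an assumption for illustration; it is not the pair from my project. The point is only the pattern: one special-cased opcode expands into two emitted instructions, and everything else passes through unchanged.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical stand-ins for illustration only; these are not LLVM types.
enum class ToyOpcode { Mov, JmpReg, Ret };

struct ToyInstr {
    ToyOpcode opcode;
    std::string text;  // assembly text that would normally be printed
};

// Collects what the "emitter" prints so the result can be inspected.
struct ToyStreamer {
    std::vector<std::string> lines;
    void emit(const std::string& line) { lines.push_back(line); }
};

// Mirrors the shape of an EmitInstruction hook: the indirect jump is
// expanded into two instructions, all other instructions are emitted as-is.
void emitInstruction(const ToyInstr& mi, ToyStreamer& out) {
    assert(!mi.text.empty() && "every toy instruction needs printable text");
    if (mi.opcode == ToyOpcode::JmpReg) {
        // Assumed replacement pair, chosen only for the example.
        out.emit("and $-32,%rax");
        out.emit("jmp *%rax");
        return;  // the original instruction is not emitted a second time
    }
    out.emit(mi.text);
}

// Mirrors the per-basic-block loop that feeds instructions to the hook.
std::vector<std::string> emitBlock(const std::vector<ToyInstr>& block) {
    ToyStreamer out;
    for (const ToyInstr& mi : block)
        emitInstruction(mi, out);
    return out.lines;
}
```

In the real backend the same early-return structure applies: handle your opcode first, emit the replacement through the streamer, and return before the default lowering runs.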
The LLVM documentation is always there, but I found two articles by Eli Bendersky more helpful than any other resource: "Life of an instruction in LLVM" and "A deeper look into the LLVM code generator". The articles also discuss the very complex TableGen descriptions and the instruction matching process, which is kind of cool if you are interested.