Static analysis of the Linux kernel: on source code or on LLVM IR?

Question

In https://www.usenix.org/system/files/sec21-tan.pdf the authors perform static analysis on the LLVM IR of the Linux kernel (one pass for call graph construction, another for data-flow and alias analysis, and so on). Several other papers I have read also run their static analysis on LLVM IR rather than on the source code. Why do they analyze LLVM IR instead of the Linux kernel's source code? For example, the call graph could be constructed by analyzing the source code, yet they construct it by analyzing the LLVM IR.

score 1 · Accepted Answer · answered Oct 20 '22 at 22:24

Analyzing LLVM IR simplifies analysis of the program's semantics, while analyzing the source code is needed to see what the program does in terms of the programming language. What I mean is that the C expression *x is definitely "performing an indirection", but it may or may not load from or store to memory; for instance, the larger expression &*x does not, even though it contains *x. This sort of thing doesn't happen with LLVM IR: every memory access is either a load or store instruction, or occurs inside a called function reached through a call instruction. However, if x is NULL then *x is still undefined behaviour even if the larger expression is &*x, and you won't be able to see that bug by looking only at the LLVM IR.
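
As a minimal sketch of the &*x point (the function names below are hypothetical, chosen only for illustration): both functions contain the sub-expression *x, but only the second uses it as a value. Compiling with something like `clang -O1 -S -emit-llvm` should show a load instruction through x only in the second function, though the exact IR depends on the compiler version and flags.

```c
/* Hypothetical helpers, named for illustration only. */

/* Contains the sub-expression *x, but the expression as a whole
   only yields the pointer itself; the pointee is never read. */
int *address_of_deref(int *x) {
    return &*x;
}

/* Here *x is used as a value, so the pointee is read from memory. */
int load_value(int *x) {
    return *x;
}
```

A source-level analysis has to understand the surrounding expression to tell these two cases apart, whereas an IR-level analysis can simply look for load and store instructions.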

LLVM also has a bunch of analyses built in; for instance, it can already build a call graph. Sometimes the call graph isn't immediately obvious from the source code, and you need to run some optimizations to see what the callee is (or to remove dead code, eliminating function calls along with it). LLVM performs those optimizations quite well too.
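
A small sketch of a call whose callee is not obvious at the call site (all names here are hypothetical): the call inside `apply` goes through a function pointer, so reading that line alone does not reveal the target. After inlining and constant propagation, an optimizer would typically turn it into a direct call to `add_one`, or fold it away entirely, which makes the call-graph edge explicit.

```c
/* Hypothetical example of an indirect call with a statically known target. */

static int add_one(int v) {
    return v + 1;
}

/* The callee is whatever fn points to; this line alone does not say. */
static int apply(int (*fn)(int), int v) {
    return fn(v);
}

/* The only caller passes add_one, so the target is in fact fixed. */
int compute(int v) {
    return apply(add_one, v);
}
```

In kernel code the same pattern shows up at a much larger scale through tables of function pointers, where resolving the target usually needs dedicated analysis rather than optimization alone.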

Thanks. Can every analysis be done on LLVM IR? I mean, is there any information that exists in the source code but is hard to obtain from LLVM IR? Is there any situation where analyzing the source code is better than analyzing LLVM IR? — saha, Oct 21 '22 at 17:29
By definition, LLVM IR must contain enough information to produce the native machine instructions that actually run. There absolutely are cases where different source code produces the same LLVM IR: macro expansion (#define), which of two typedef'd types was named, and on and on. I phrased this in my answer as being about vocabulary: if you want your analysis to operate _in terms of_ the original source language, then you must work on the original source text. — Nick Lewycky, Oct 21 '22 at 21:29
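
To illustrate the macro and typedef cases from the comment above, here is a small sketch with hypothetical names. The preprocessor replaces `BUF_SIZE` before the compiler proper runs, and both typedefs name the same underlying type, so the functions in each pair would be expected to compile to equivalent IR (debug info aside), even though the source text says different things.

```c
/* Hypothetical names, for illustration only. */
#define BUF_SIZE 64

typedef unsigned int user_id_t;
typedef unsigned int group_id_t;

/* The macro is expanded before compilation; the IR only sees 64. */
unsigned int size_via_macro(void)   { return BUF_SIZE; }
unsigned int size_via_literal(void) { return 64; }

/* Both parameter types are plain unsigned int after typedef resolution. */
unsigned int next_user(user_id_t id)   { return id + 1; }
unsigned int next_group(group_id_t id) { return id + 1; }
```

A checker that wants to flag, say, a user ID being passed where a group ID is expected has to work on the source, because that distinction is not part of the types the IR carries.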
