
I have a legacy C++ application that constructs a tree of C++ objects. I want to use LLVM to call the class constructors to create said tree. The generated LLVM code is fairly straightforward and looks like repeated sequences of:

; ...
%11 = getelementptr [11 x i8*]* %Value_array1, i64 0, i64 1
%12 = call i8* @T_string_M_new_A_2Pv(i8* %heap, i8* getelementptr inbounds ([10 x i8]* @0, i64 0, i64 0))
%13 = call i8* @T_QueryLoc_M_new_A_2Pv4i(i8* %heap, i8* %12, i32 1, i32 1, i32 4, i32 5)
%14 = call i8* @T_GlobalEnvironment_M_getItemFactory_A_Pv(i8* %heap)
%15 = call i8* @T_xs_integer_M_new_A_Pvl(i8* %heap, i64 2)
%16 = call i8* @T_ItemFactory_M_createInteger_A_3Pv(i8* %heap, i8* %14, i8* %15)
%17 = call i8* @T_SingletonIterator_M_new_A_4Pv(i8* %heap, i8* %2, i8* %13, i8* %16)
store i8* %17, i8** %11, align 8
; ...

Where each T_ function is a C "thunk" that calls some C++ constructor, e.g.:

void* T_string_M_new_A_2Pv( void *v_heap, void *v_value ) {
  string *const value = static_cast<string*>( v_value );
  return new string( *value );
}

The thunks are necessary, of course, because LLVM knows nothing about C++. The T_ functions are added to the ExecutionEngine in use via ExecutionEngine::addGlobalMapping().
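For illustration, here is a self-contained, LLVM-free sketch of the same mechanism: an `extern "C"` thunk plus a name-to-address table standing in for `ExecutionEngine::addGlobalMapping()` (which internally maps an IR function to a host address). `std::string` stands in for the application's `string` class, and `addGlobalMappingSketch` is a made-up name, not an LLVM API:

```cpp
#include <cassert>
#include <map>
#include <string>

// The thunk must be extern "C" so its symbol name is not mangled and
// matches the function name declared in the generated LLVM IR.
extern "C" void* T_string_M_new_A_2Pv( void *v_heap, void *v_value ) {
  (void)v_heap;  // heap handle threaded through every thunk; unused here
  std::string *const value = static_cast<std::string*>( v_value );
  return new std::string( *value );  // forward to the C++ copy constructor
}

// Stand-in for what addGlobalMapping() does conceptually: the JIT keeps a
// map from external symbol name to host address and resolves calls via it.
std::map<std::string, void*>& symbolTable() {
  static std::map<std::string, void*> table;
  return table;
}

void addGlobalMappingSketch( const char *name, void *addr ) {
  symbolTable()[name] = addr;
}
```

JIT'd code that calls `@T_string_M_new_A_2Pv` then ends up jumping to the registered host address, exactly as if the function had been linked in normally.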

When this code is JIT'd, the performance of the JIT'ing itself is very poor. I've generated a call-graph using kcachegrind. I don't understand all the numbers (and this PDF seems not to include commas where it should), but if you look at the left fork, the bottom two ovals, Schedule... is called 16K times and setHeightToAtLeas... is called 37K times. On the right fork, RAGreed... is called 35K times.

Those are far too many calls to anything for what's mostly a simple sequence of LLVM `call` instructions. Something seems horribly wrong.

Any ideas on how to improve the performance of the JIT'ing?

Paul J. Lucas
  • This would be a good question to take to the llvmdev mailing list – Eli Bendersky Nov 12 '12 at 17:11
  • I already asked on the LLVM mailing list. Only 1 guy replied initially, but, after continued discussion, stopped replying. – Paul J. Lucas Nov 12 '12 at 17:17
  • Are you, by any chance, compiling with -o3? Also, it might be worth taking a look at the generated code. The functions you mentioned - scheduling and register allocation (RAGreed) - are part of the codegen phase, long after the LLVM IR has already been lowered and no longer looks like the original. – Oak Nov 13 '12 at 10:18
  • I changed it to use `CodeGenOpt::None` and the performance improved by an order of magnitude. However, I'd still like to improve it by yet another order of magnitude. – Paul J. Lucas Nov 13 '12 at 18:04

1 Answer


Another order of magnitude is unlikely to happen without a huge change in how the JIT works, or without looking at the particular calls you're trying to JIT. You could enable -fast-isel-verbose on llc at -O0 (e.g. llc -O0 -fast-isel-verbose mymodule.[ll,bc]) and have it tell you whether it's falling back to the selection DAG for instruction generation. You may also want to profile again and see what the current hot spots are.

echristo