How to improve inline function efficiency?

Question

I profiled my code and found that one inline function takes about 8% of the samples. The function converts matrix subscripts to linear indices, much like the MATLAB function sub2ind.

inline int sub2ind(const int sub_height, const int sub_width, const int width) {
    return sub_height * width + sub_width;
}

I guess the compiler does not perform inline expansion, but I don't know how to check that.

Is there any way to improve this, or to explicitly make the compiler perform inline expansion?

Assembly verification would be good, but from my experience using profilers, if the function symbol shows up in a profiler, it's not being inlined, as inlined functions make the hotspot show up in the caller rather than the function. If the assembly verification reveals a function call, I would suggest making sure the function is visible in the header (linkers can inline code, but I've had better success when the function can be inlined during compilation -- though I unfortunately have to work with old compilers at my work and that advice may be poor for newer ones). — , May 05 '15 at 14:31
Then another thing to try is to verify the build settings, make sure that optimizations are properly turned on. I'd consider something like __forceinline as a last resort. — , May 05 '15 at 14:32

score 5 · Accepted Answer · edited May 23 '17 at 11:51

Did you remember to compile with optimizations? Some compilers have an attribute to force inlining, even when the compiler doesn't want to: see this question.

But the function has probably been inlined already; you can have your compiler output the assembly code to check for sure.
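As a rough sketch of the force-inline attributes mentioned above: GCC and Clang accept `__attribute__((always_inline))`, and MSVC accepts `__forceinline`. The macro name `FORCE_INLINE` is made up for this illustration and is not part of any standard or library.

```cpp
// FORCE_INLINE is a hypothetical macro name chosen for this sketch.
#if defined(_MSC_VER)
#  define FORCE_INLINE __forceinline
#elif defined(__GNUC__) || defined(__clang__)
#  define FORCE_INLINE inline __attribute__((always_inline))
#else
#  define FORCE_INLINE inline  // fallback: an ordinary inline hint only
#endif

// Same function as in the question, with the stronger inlining request.
FORCE_INLINE int sub2ind(const int sub_height, const int sub_width, const int width) {
    return sub_height * width + sub_width;
}
```

This only changes whether a call is emitted; it does not make the multiplication itself any cheaper, so it is worth checking the assembly before and after.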

It is not implausible that index calculations can be a significant fraction of your time -- e.g. if your algorithm reads from a matrix, does a little bit of calculation, then writes back, then index calculations really are a significant fraction of your compute time.

Or, you've written your code in a way that the compiler can't prove that width remains constant throughout your loops*, and so it has to reread it from memory every time, just to be sure. Try copying width to a local variable and use that in your inner loops.
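A minimal sketch of the local-copy suggestion, assuming a matrix type that stores its dimensions next to a buffer. The `Matrix` struct and the `add_one` function are hypothetical and only stand in for whatever your real code looks like.

```cpp
#include <vector>

// Hypothetical matrix layout, for illustration only.
struct Matrix {
    int width;
    int height;
    std::vector<int> data;
};

inline int sub2ind(const int sub_height, const int sub_width, const int width) {
    return sub_height * width + sub_width;
}

// Adds 1 to every entry. The dimensions and the data pointer are copied into
// locals first, so writes through `data` cannot be mistaken for writes to
// m.width or m.height inside the loops.
void add_one(Matrix& m) {
    const int width = m.width;
    const int height = m.height;
    int* const data = m.data.data();
    for (int r = 0; r < height; ++r) {
        for (int c = 0; c < width; ++c) {
            data[sub2ind(r, c, width)] += 1;
        }
    }
}
```

Whether this makes a measurable difference depends on the compiler and on how the original loop was written, so compare the generated assembly or re-profile.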

Now, you've said that this takes 8% of your time -- that means it is unlikely that you can possibly get anything more than an 8% improvement to your runtime, and probably much less. If that's really worth it, then the thing to do is probably to fundamentally change how you iterate through the array.

e.g.

- if you tend to access the matrix in a linear fashion, you could write some sort of two-dimensional iterator class that you can advance up, down, left, or right, and it will use additions everywhere instead of multiplication
- same thing, but writing an "index" class that just holds the numbers rather than pretending to be a pointer
- if width is a compile-time constant, you could make it explicitly so, e.g. as a template parameter, and your compiler might be able to do more clever things with the multiplication
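The second and third ideas could look roughly like this. `Index2D` and `sub2ind_fixed` are hypothetical names, and the convention that "up" means a smaller row number is an assumption of this sketch.

```cpp
// Hypothetical "index" class: one multiplication at construction,
// then only additions and subtractions while moving around.
class Index2D {
public:
    Index2D(int row, int col, int width)
        : index_(row * width + col), width_(width) {}

    void up()    { index_ -= width_; }  // previous row (assumed convention)
    void down()  { index_ += width_; }  // next row
    void left()  { --index_; }          // previous column
    void right() { ++index_; }          // next column

    int value() const { return index_; }

private:
    int index_;
    int width_;
};

// Width as a template parameter: the compiler sees a constant multiplier
// and may be able to replace the multiplication with cheaper instructions.
template <int Width>
inline int sub2ind_fixed(const int sub_height, const int sub_width) {
    return sub_height * Width + sub_width;
}
```

Note that `Index2D` does no bounds checking, so stepping off an edge silently wraps into the neighbouring row or leaves the buffer.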

*: You could have done something silly, like put the data structure for your matrix in the very memory where you're storing the matrix entries! So when you update the matrix, you might change the width. The compiler has to guard against these loopholes, so it can't do optimizations it 'obviously should' be able to do. And sometimes, the sort of thing that is a loophole in one context may well be the programmer's obvious intent in another context. Generally speaking, these sorts of loopholes tend to be all over the place, and the compiler is better at finding them than humans are at noticing them.

Would you please explain your last point? In my code, the matrix size (i.e.`width`) remains constant, but `sub_height` and `sub_width` are changing over the loop. — ZHOU, May 04 '15 at 04:12
But does the compiler *know* it's constant? There are some very easy ways to open up loopholes without knowing it. e.g. if you're always calling `sub2ind(x, y, mymatrix.width());`, and the compiler doesn't have a way to absolutely prove that wherever the width is being stored in memory is not being changed (e.g. because there is nothing to deny the possibility that you decided to store the contents of your matrix starting at the very memory address where the width is stored, so `mymatrix[0] = 1;` would change the width), then it has to write code that rereads the width each time. — , May 04 '15 at 04:27

vitaut · Answer 2 · 2015-05-04T23:45:23.550

As @user3528438 mentioned, you can look at the assembly output. Consider the following example:

inline int sub2ind(const int sub_height, const int sub_width, const int width) {
    return sub_height * width + sub_width;
}

int main() {
    volatile int n[] = {1, 2, 3};
    return sub2ind(n[0], n[1], n[2]);
}

Compiling it without optimization (g++ -S test.cc) results in the following code with sub2ind not inlined:

main:
.LFB1:
    .cfi_startproc
    pushq   %rbp
    .cfi_def_cfa_offset 16
    .cfi_offset 6, -16
    movq    %rsp, %rbp
    .cfi_def_cfa_register 6
    subq    $32, %rsp
    movl    $1, -16(%rbp)
    movl    $2, -12(%rbp)
    movl    $3, -8(%rbp)
    movq    -16(%rbp), %rax
    movq    %rax, -32(%rbp)
    movl    -8(%rbp), %eax
    movl    %eax, -24(%rbp)
    movl    -24(%rbp), %edx
    movl    -28(%rbp), %ecx
    movl    -32(%rbp), %eax
    movl    %ecx, %esi
    movl    %eax, %edi
    call    _Z7sub2indiii   # call to sub2ind
    leave
    .cfi_def_cfa 7, 8
    ret
    .cfi_endproc

while compiling with optimization (g++ -S -O3 test.cc) results in sub2ind being inlined and mostly optimized away:

main:
.LFB1:
    .cfi_startproc
    movl    $1, -24(%rsp)
    movl    $2, -20(%rsp)
    movq    -24(%rsp), %rax
    movl    $3, -16(%rsp)
    movq    %rax, -40(%rsp)
    movl    $3, -32(%rsp)
    movl    -32(%rsp), %eax
    movl    -36(%rsp), %edx
    movl    -40(%rsp), %ecx
    imull   %ecx, %eax
    addl    %edx, %eax
    ret
    .cfi_endproc

So if you are convinced that your function is not inlined, first make sure that you enable optimization in the compiler options.
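The same check can be done on a caller that gets its inputs at run time, which may be closer to the real code. This is only a sketch: `sum_row` is a hypothetical caller invented for the example. Compile it with g++ -S -O3 as above and look in the body of `sum_row` for a `call` to `_Z7sub2indiii`; if there is none, the function was inlined.

```cpp
inline int sub2ind(const int sub_height, const int sub_width, const int width) {
    return sub_height * width + sub_width;
}

// Hypothetical caller: sums one row of a row-major matrix.
// All inputs arrive as arguments, so nothing can be precomputed
// at compile time.
int sum_row(const int* data, const int row, const int width) {
    int sum = 0;
    for (int c = 0; c < width; ++c) {
        sum += data[sub2ind(row, c, width)];
    }
    return sum;
}
```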

It might be more reasonable to try compiling the program w/ user input rather than hardcoded parameters. Think the function call will still be eliminated, but good to verify. — jma127, May 04 '15 at 03:37
@jma127 Yes, but this is just an example to demonstrate how to check whether the function is actually called or not. — vitaut, May 04 '15 at 03:41
Sure, but this is an example of just precomputing the value rather than inlining -- think the intent was to verify inlining on arbitrary input. — jma127, May 04 '15 at 23:29
@jma127 Fair enough, I made the input mutable to avoid the computation being optimized away. — vitaut, May 04 '15 at 23:46
