I'd like to write a function that has some optional code which is executed or not depending on user settings. The function is CPU-intensive, and having ifs in it would be slow since the branch predictor is not that good.

My idea is to make a copy of the function in memory and replace NOPs with a jump when I don't want to execute some code. My working example goes like this:

int Test()
{
    int x = 2;
    for (int i=0 ; i<10 ; i++)
    {
        x *= 2;

        __asm {NOP}; // to skip it replace this
        __asm {NOP}; // by JMP 2 (skip the next 2 bytes)
            x *= 2; // Op to skip or not

        x *= 2;
    }
    return x;
}

In my test's main, I copy this function into a newly allocated executable memory and replace the NOPs by a JMP 2 so that the following x *= 2 is not executed. JMP 2 is really "skip the next 2 bytes".
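
The patching step described above can be sketched on a plain byte buffer. This is only an illustration under assumptions: patchNopPair is a hypothetical helper name, the scan is naive (a 0x90 byte inside another instruction's operand would also match), and the buffer is never executed here. The opcodes themselves are standard x86: 0x90 is NOP and 0xEB is a short JMP followed by a signed 8-bit displacement.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// x86 opcodes used by the patch.
const std::uint8_t kNop      = 0x90; // one-byte NOP
const std::uint8_t kJmpShort = 0xEB; // JMP rel8, followed by an 8-bit displacement

// Hypothetical helper: overwrite the first pair of adjacent NOPs in `code`
// with a short jump that skips `skipBytes` bytes after the jump itself.
// Returns the offset of the patch, or -1 if no suitable NOP pair was found
// or the displacement is out of range.
long patchNopPair(std::vector<std::uint8_t>& code, int skipBytes)
{
    if (skipBytes < 0 || skipBytes > 127) return -1; // must fit a forward rel8
    for (std::size_t i = 0; i + 1 < code.size(); ++i) {
        if (code[i] == kNop && code[i + 1] == kNop) {
            // Refuse to jump past the end of the copied code.
            if (i + 2 + static_cast<std::size_t>(skipBytes) > code.size()) return -1;
            code[i]     = kJmpShort;
            code[i + 1] = static_cast<std::uint8_t>(skipBytes);
            return static_cast<long>(i);
        }
    }
    return -1;
}
```

With this shape the displacement is an argument, which makes the maintenance problem visible: skipBytes has to match the compiled size of the code being skipped. To actually run the patched copy on Windows, the buffer would also have to live in executable memory (for example from VirtualAlloc with PAGE_EXECUTE_READWRITE).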

The problem is that I would have to change the JMP operand every time I edit the code to be skipped and change its size.

An alternative that would fix this problem would be:

__asm {NOP}; // to skip it replace this
__asm {NOP}; // by JMP 2 (after the goto)
goto dont_do_it;
    x *= 2; // Op to skip or not
dont_do_it:
x *= 2;

I would then want to skip or not the goto, which has a fixed size. Unfortunately, in full optimization mode, the goto and the x*=2 are removed because they are unreachable at compilation time.

Hence the need to keep that dead code.

I'm using VStudio 2008.

Gabriel
  • I smell premature optimization and bug prone code! :D – Billy ONeal Apr 01 '10 at 20:23
  • "branch predictor is not that good" - who told you that? I heard about 99% prob. I think sub-instruction optimization is really something messy – Andrey Apr 01 '10 at 20:26
  • Also auto-modifying code is not reentrant. You should mind that if you plan, one day, to program for multi-core systems. – baol Apr 01 '10 at 20:29
  • If this is a real optimization attempt (rather than just a fun project in polymorphous code), I would recommend entirely separate methods, one with the extra step, and one without. This way the full optimization mode will be able to work whatever magic it can on both versions. – Jeffrey L Whitledge Apr 01 '10 at 20:29
  • @Andrey: branch prediction depends heavily on actual probabilities. Prediction for loop termination is very good because the loop only terminates once. Fill an array with random ints and perform different tasks depending on whether the values are even or odd and you will see roughly 50%. – danben Apr 01 '10 at 20:29
  • @danben - but it will not hurt performance so much that you notice! It is so internal – Andrey Apr 01 '10 at 20:34
  • @danben: in this case the branch presumably will be taken (or not) the same for each of the 10 loop iterations, so the predictor could in theory do an even better job on this than it does for loop terminations. – Steve Jessop Apr 01 '10 at 20:36
  • It might very well be premature optimization, but I'm still interested in toying with that (: – Gabriel Apr 01 '10 at 20:43
  • @Andrey - everything you have said so far is a meaningless generalization. There are some applications where micro-optimizations are necessary. If everyone coded using vague rules of thumb, the most interesting problems would never be solved. As the OP has not provided any context, we have no way to know whether this is one of those cases or not, so why not entertain the question for the sake of it? – danben Apr 01 '10 at 20:51
  • @Steve Jessop - not sure if I'm missing something, but I don't see where he has mentioned the condition that determines when the function should be executed, so how can you tell? – danben Apr 01 '10 at 20:52
  • Questioner says "depending on user settings", not "depending on something that varies within the loop". Besides, if the condition changed during the loop, then he would have to monkey-patch the code again, and flush the icache, while the loop is running. I suppose it's possible that this (a) happens rarely enough that it's cheaper than one branch per loop, and (b) occurs unconditionally within some part of the loop, and therefore isn't just introducing one branch in order to remove another one. So I'm very glad I said "presumably" and not "definitely" ;-) – Steve Jessop Apr 01 '10 at 21:00
  • What CPU are you coding for? (be specific, please). – Paul Nathan Apr 01 '10 at 21:22
  • The question is not CPU specific, probably compiler specific (VStudio 2008) or maybe language specific, which is C++. Anyhow I'd like it to work with any x86. I actually use a Pentium 4 (x86 family 15 model 6 stepping 5 GenuineIntel ~2999 MHz), and a core 2 (T5800). Is that specific enough ? (: – Gabriel Apr 02 '10 at 18:49
  • @Gabriel: most CPUs use somewhat different pipeline prediction/branch strategies, i.e., how many bits in the branch predictor buffer, the depth of the pipeline, prefetch tuning, etc. As soon as you start considering the branch predictor costs, you are optimizing for a CPU family. – Paul Nathan Apr 02 '10 at 20:24

7 Answers

You can cut the cost of the branch by up to a factor of 10, just by moving it out of the loop, so that it is evaluated once instead of on every iteration:

int Test()
{
    int x = 2;
    if (should_skip) {
        for (int i=0 ; i<10 ; i++)
        {
            x *= 2;
            x *= 2;
        }
    } else {
        for (int i=0 ; i<10 ; i++)
        {
            x *= 2;
            x *= 2;
            x *= 2;
        }
    }

    return x;
}

In this case, and others like it, that might also provoke the compiler into doing a better job of optimising the loop body, since it will consider the two possibilities separately rather than trying to optimise conditional code, and it won't optimise anything away as dead.

If this results in too much duplicated code to be maintainable, use a template that takes x by reference:

    int x = 2;
    if (should_skip) {
        doLoop<true>(x);
    } else {
        doLoop<false>(x);
    }

And check that the compiler inlines it.
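
A minimal sketch of what such a template might look like, under a couple of assumptions: the name TestTemplated and the should_skip parameter are mine, and x is unsigned here because the non-skipping variant reaches 2·8^10 = 2^31, which would overflow a signed 32-bit int.

```cpp
// Skip is a compile-time constant, so each instantiation of doLoop
// contains no runtime test on it.
template <bool Skip>
inline void doLoop(unsigned int& x)
{
    for (int i = 0; i < 10; i++)
    {
        x *= 2;
        if (!Skip) {
            x *= 2; // optional step, compiled out of doLoop<true>
        }
        x *= 2;
    }
}

// The only runtime branch is here, outside the loop.
unsigned int TestTemplated(bool should_skip)
{
    unsigned int x = 2;
    if (should_skip) {
        doLoop<true>(x);
    } else {
        doLoop<false>(x);
    }
    return x;
}
```

The loop body is written once, and the compiler generates both versions from it.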

Obviously this increases code size a bit, which will occasionally be a concern. Whichever way you do it though, if this change doesn't produce a measurable performance improvement then I'd guess that yours won't either.

Steve Jessop

If the number of permutations for the code is reasonable, you can define your code as C++ templates and generate all variants.
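
As a rough sketch of that idea with two independent options: the steps, the names kernel and selectKernel, and the table layout below are all hypothetical and only illustrate the pattern of generating every variant and picking one at runtime.

```cpp
// Two hypothetical optional steps, each controlled by a compile-time flag.
template <bool StepA, bool StepB>
unsigned int kernel()
{
    unsigned int x = 2;
    for (int i = 0; i < 10; i++)
    {
        x *= 2;
        if (StepA) x += 1; // optional step A
        if (StepB) x += 3; // optional step B
    }
    return x;
}

typedef unsigned int (*KernelFn)();

// Pick one of the four generated variants once, from the user settings.
KernelFn selectKernel(bool stepA, bool stepB)
{
    static const KernelFn table[2][2] = {
        { &kernel<false, false>, &kernel<false, true> },
        { &kernel<true,  false>, &kernel<true,  true> }
    };
    return table[stepA ? 1 : 0][stepB ? 1 : 0];
}
```

The number of instantiations doubles with every flag, which is why this only pays off while the number of permutations stays reasonable.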

ndim

You do not specify what compiler and platform you are using, which will prevent most people from being able to help you. For example, on some platforms, the code section is not going to be writeable, so you won't be able to replace the NOPs with a JMP.

You are trying to pick and choose the optimizations offered to you by the compiler and second-guessing it. In general, it's a bad idea. Either write the whole inner loop block in assembly, which would prevent the compiler eliminating it as dead code, or put the damn if statement in there and let the compiler do its thing.

I'm also dubious that the branch prediction is bad enough that you will gain any sort of a net win from doing what you're proposing. Are you sure this isn't a case of premature optimization? Have you written the code in the most obvious way possible and only then determined that its performance isn't good enough? That would be my suggested start.

RarrRarrRarr

Here's an actual answer to the actual question!

volatile int y = 0;

int Test() 
{
    int x = 2; 
    for (int i=0 ; i<10 ; i++) 
    { 
        x *= 2; 

        __asm {NOP}; // to skip it replace this 
        __asm {NOP}; // by JMP 2 (after the goto) 
        goto dont_do_it;
    keep_my_code:
        x *= 2; // Op to skip or not 
    dont_do_it: 
        x *= 2; 
    }
    if (y) goto keep_my_code;
    return x; 
} 
Jeffrey L Whitledge
  • Thanks for the suggestion. I actually tried that. You wouldn't believe what the compiler generates... The result is too messy to be changed: (1) the code inside the loop, after goto dont_do_it, is still dead, so it's removed from the loop; (2) the end of the loop, including the code removed, is replicated in place of an actual jump, so keep_my_code is executed only after if (y). You're still the first one to try to actually answer the actual question (: – Gabriel Apr 01 '10 at 20:53
  • @Gabriel - Oh wow. It sounds like the shape of the highly optimized code is just too slippery to ever allow for run-time code manipulation. It sounds like you're going to have to write the whole thing yourself in assembler, if you still want to go down that road. – Jeffrey L Whitledge Apr 01 '10 at 20:59
  • I've thought of that, but I'm afraid the compiler might throw away unreachable asm. – Gabriel Apr 02 '10 at 18:50

Is this x64? You might be able to use function pointers and a conditional move to avoid the branch predictor. Load the address of the procedure based on the user settings; one of the procedures could be a dummy that does nothing. You should be able to do this without any inline ASM at all.
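
Something along these lines, as a sketch only: extraStep, noStep and TestIndirect are made-up names, x is unsigned to avoid signed overflow in the non-skipping case, and whether the compiler really emits a conditional move for the selection is not guaranteed.

```cpp
typedef void (*StepFn)(unsigned int&);

// Hypothetical optional step and a do-nothing stand-in.
void extraStep(unsigned int& x) { x *= 2; }
void noStep(unsigned int&) {}

unsigned int TestIndirect(bool should_skip)
{
    // Chosen once from the user setting, outside the loop.
    StepFn step = should_skip ? &noStep : &extraStep;

    unsigned int x = 2;
    for (int i = 0; i < 10; i++)
    {
        x *= 2;
        step(x); // indirect call instead of a conditional branch
        x *= 2;
    }
    return x;
}
```

The trade-off is that the conditional branch is replaced by an indirect call on every iteration, which has its own cost.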

danben
  • I'd think that if you're trying to avoid the cost of a branch, then introducing an out-of-line function call isn't acceptable. But I may be missing something. – Steve Jessop Apr 01 '10 at 20:33
  • @Steve Jessop - if I'm understanding the OP correctly, he wants to avoid premature execution of the body of his expensive function. Also, it is not the branching itself but the branch penalty that is expensive. (Keep in mind I'm just playing devil's advocate - I have no idea if the branch penalty is making an actual difference in his application, but I see no reason not to give the benefit of the doubt.) – danben Apr 01 '10 at 20:44
  • I'm not sure a CALL would be faster than an IF, even an IF that breaks the branch predictor, which would not be the case anyway. – Gabriel Apr 01 '10 at 20:56
  • If you aren't worried about the branch penalty then what is the point of the question? – danben Apr 01 '10 at 21:04
  • @danben: sure, I'm waving my hands a lot by saying that a call to a variable address is even more expensive than a mis-predicted branch. It might not be, I don't actually know. But I suspect they both play merry havoc with the instruction pipeline. – Steve Jessop Apr 01 '10 at 21:08
  • @Steve Jessop - I believe that is not the case; the conditional move removes any possibility of a branch penalty (as there is no branch). The worst thing it does is delay the call until the address is known. I would have to imagine this would be less expensive than flushing and reloading the pipeline. – danben Apr 02 '10 at 00:16
  • OK, I don't know how x64 (or rather, any specific x64 CPU) does a computed call. You're saying it does it without interrupting the pipeline, so the pipeline will simultaneously contain instructions from here, and instructions from way-megabytes-over-there-determined-at-runtime, just as if it were sequential instructions or a correctly-predicted branch? In that case, I see why you're saying a call through a function pointer is cheaper than a possibly-mispredicted branch, at least ignoring possible overhead of the calling convention. Thanks. – Steve Jessop Apr 02 '10 at 01:01
  • No, that's not what I'm saying at all - I'm just saying that with an indirect function call, the pipeline is only stalled, whereas with a mispredicted branch it needs to be flushed, which I believe (but am not entirely certain) is more expensive. – danben Apr 02 '10 at 02:53

This may give insight:

#pragma optimize for Visual Studio.

That said, for this particular problem I would hand-code into ASM, using the VS asm output as a reference point.

At the meta level, I would have to be very certain this was the best design & algorithm for what I was doing before I started optimizing for the CPU pipe.

Paul Nathan
  • Using the vs asm output would be quite painful coz I would have to rework a bunch of asm every time I change the base C++. Remember I'm trying to have a solution that does not require much work whenever I change the C++ code. – Gabriel Apr 02 '10 at 18:46
  • @Gabriel: No, what I am suggesting is writing the whole routine in asm. It would be modified separately, like any other function. – Paul Nathan Apr 02 '10 at 20:06

If you get this to work then I would profile it to make sure that it really is faster for you. On modern CPUs there is very little you can do that is slower than modifying code that is already in the cpu cache, or worse, the cpu pipeline. The cpu basically has to throw out all the work that is in the pipeline and start again.
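
For a first rough measurement, a timing harness could look like the sketch below. It assumes a compiler with C++11 `<chrono>` (so not Visual Studio 2008 itself), and timeNanos and sampleWork are hypothetical names. The volatile sink forces a store per call, but an optimizer may still fold constant work, so a real profiler and realistic input are the better check.

```cpp
#include <chrono>

// Hypothetical harness: run f() `reps` times and return elapsed nanoseconds.
template <typename F>
long long timeNanos(F f, int reps)
{
    volatile unsigned int sink = 0; // store each result so calls are not dropped
    std::chrono::steady_clock::time_point start = std::chrono::steady_clock::now();
    for (int i = 0; i < reps; ++i) {
        sink = f();
    }
    std::chrono::steady_clock::time_point stop = std::chrono::steady_clock::now();
    (void)sink;
    return std::chrono::duration_cast<std::chrono::nanoseconds>(stop - start).count();
}

// Stand-in workload with the same shape as the loop in the question.
unsigned int sampleWork()
{
    unsigned int x = 2;
    for (int i = 0; i < 10; i++) {
        x *= 2;
        x *= 2;
    }
    return x;
}
```

Calling timeNanos(&sampleWork, 1000000) for each candidate version, several times over, gives numbers to compare before investing in the patched variant.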

jcoder