
I have a bottleneck, which looks like this:

void function(int type) {
    for (int i = 0; i < m; i++) {
        // do some stuff A
        switch (type) {
        case 0:
            // do some stuff 0
            break;
        [...]
        case n:
            // do some stuff n
            break;
        }
        // do some stuff B
    }
}

n and m are both large.

m is in the millions, sometimes hundreds of millions.

n is in the range 2^7 – 2^10 (128 – 1024).

The code chunks A and B are fairly large.

I rewrote the code (via macros) as follows:

void function(int type) {
    switch (type) {
    case 0:
        for (int i = 0; i < m; i++) {
            // do some stuff A
            // do some stuff 0
            // do some stuff B
        }
        break;
    [...]
    case n:
        for (int i = 0; i < m; i++) {
            // do some stuff A
            // do some stuff n
            // do some stuff B
        }
        break;
    }   
}

As a result, the function looks like this in IDA: [IDA graph]

Is there a way to remove the switch from the loop:

  1. without creating a bunch of copies of the loop,
  2. without creating a huge function with macros,
  3. without losing performance?

A possible solution seems to me to be a goto with a variable target. Something like this (using GCC's labels-as-values extension):

void function(int type) {
    void *typeLabel;   /* GCC labels-as-values extension */
    switch (type) {
    case 0:
        typeLabel = &&label_1;
        break;
    [...]
    case n:
        typeLabel = &&label_n;
        break;
    }

    for (int i = 0; i < m; i++) {
        // do some stuff A
        goto *typeLabel;
        back:
        // do some stuff B
    }

    goto end;

    label_1:
    // do some stuff 0
    goto back;
    [...]
    label_n:
    // do some stuff n
    goto back;

    end: ;
}

The matter is complicated further by the fact that all of this will run on many different Android devices with different speeds.

The architectures are ARM and x86.

Perhaps this can be done with assembler inserts rather than pure C?

EDIT:

I ran some tests with m = 45,734,912:

loop-within-switch: 891,713 μs

switch-within-loop: 976,085 μs

So loop-within-switch is about 9.5% faster than switch-within-loop.

For comparison: a simple implementation without the switch takes 1,746,947 μs.

Enyby
    What you are trying to do with jumping around is `switch`. – Weak to Enuma Elish Sep 14 '15 at 03:19
  • @JamesRoot Run a different code depending on the value of the variable `type`. – Enyby Sep 14 '15 at 03:22
  • 2
    You can expect quite different answers, depending on whether you are targeting C or C++. Why have you tagged and titled this question with both languages? – paddy Sep 14 '15 at 03:41
  • @paddy Most likely C89. – Enyby Sep 14 '15 at 03:52
  • @Enyby You could create functions and then use a pointer to a function to jump to the correct one. You could also probably do what you want better in pure assembly, but I believe a C/C++ compiler is usually a better optimizer. I think your second solution, with for loops in each case, is the best one. – Weak to Enuma Elish Sep 14 '15 at 04:03
  • @JamesRoot I know about that. A similar question: http://stackoverflow.com/questions/2662442/c-function-pointers-vs-switch That way has another problem: low performance. A jump table is much faster than a function call: no pushing and restoring registers, no stack-frame setup, and so on. – Enyby Sep 14 '15 at 04:07
  • 1
    It sounds like you are already tending towards the notion of a jump table. In that case, I would suggest that you remove the `switch` entirely. Since `n` is reasonably small, you could store the jump table as a static array, and all you are left with is the loop and a heap of line labels. However, you might find that the compiler has a hard time optimising all these jumps. It would be interesting to know what the "Do some stuff X" variants are. Another question is whether the "A->X->B" operations need to be in sequence, or if they can be split into three separate loops. – paddy Sep 14 '15 at 05:16
  • @paddy "A->X->B" must be performed in that order. `A` prepares the data, `X` works with it, and `B` saves the results. Memory is limited, so there is no opportunity to make three separate loops; disk usage would ruin performance. At the same time, a static array of labels sounds like what I need. But how do I do this? – Enyby Sep 14 '15 at 07:53
  • Write a small JIT engine? – auselen Sep 14 '15 at 08:21
  • To get this straight: the low performance is on the switch-within-loop or in the loops-within-switch version? The latter should be as fast as possible, while the former should be slow as molasses. Or am I wrong? – Rudy Velthuis Sep 14 '15 at 13:51
  • @RudyVelthuis You're right, but this code is difficult to maintain, and compiling it takes 2 minutes on a powerful computer; before that, compilation took 5-10 seconds. And I suspect that a function of this size performs poorly on modern processors because it does not fit in the cache and there are a lot of predictor misses. To make it clearer, here are example timings: on one device the loop was called 11 million times in 530 ms; on another, 43.3 million times in 68 seconds. – Enyby Sep 14 '15 at 14:10
  • @Enyby: What are those times for? The same code on two different Android devices? Was it loop-inside-switch, or switch-inside-loop? Cache operates in terms of cache lines, not whole functions. If most of the code in a function never runs, it doesn't matter that it's there. The code that does run just has to fit in the cache. The important question is whether any Android devices use ARM cores that can't predict indirect jumps. (http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.ddi0438d/BABEHAJJ.html, which says that Cortex A15 can, for example.) switch-inside-loop is fine then. – Peter Cordes Sep 14 '15 at 18:30
  • @PeterCordes 1. We cannot write an application only for specific CPUs, so considering specific processors makes no sense; we must target the architecture in general. 2. Yes: loop-inside-switch on different devices, same code. – Enyby Sep 14 '15 at 19:41
  • @Enyby: I wasn't suggesting making CPU-specific versions. I was asking whether there were *any* ARM CPUs where one indirect branch is a showstopper. I just linked to the A15 docs as an example of what I'm talking about. If all ARM cores have at least minimal indirect-branch performance, then the goto version should be fine to ship. It would be fine on x86. Even Atom has simple indirect-branch prediction. Fancier indirect branch predictors can handle a pattern, but predicting same-address-as-last-time for an unconditional indirect jump is the simplest case by far. – Peter Cordes Sep 14 '15 at 20:46
  • Were those two calls (0.53s for 11M and 68s for 43.3M) with the same `type`, and the same size of any global arrays that the code touches? And were the two CPUs you tested on very different in performance? If you didn't control for those factors, it's probably just a general difference in CPU performance, and/or doing a different amount of work. To conclude `switch` is causing a perf problem on the CPU where it's slow, you should compare switch-inside-loop vs. a custom function, on the *same* CPU, with the *same* `type`, with the *same* data in globals. – Peter Cordes Sep 14 '15 at 20:55
  • just saw you'd already edited a probably better controlled test into your question. So it looks like the compiler isn't doing great with the switch inside the loop. It'd be worth trying the version that computes the branch target once, and uses it repeatedly to emulate a switch. It may not be the indirect jump itself that was the problem, but rather mapping the `type` to a branch target. Although if there aren't gaps in the possible values, gcc probably used a simple table lookup. – Peter Cordes Sep 14 '15 at 21:11
  • @PeterCordes Of course a comparison needs the same environment; `(0.53s for 11M and 68s for 43.3M)` was only an example of the numbers, not a comparison. Performance is very different even on a single processor if you simply select a different type or other conditions of the problem. – Enyby Sep 14 '15 at 21:44

5 Answers


At the moment, the best solution I see is:

Generate, with macros, n functions that each look like this:

void func_n() {
    for (int i = 0; i < m; i++) {
        // do some stuff A
        // do some stuff n
        // do some stuff B
    }
}

Then make an array of pointers to them and call through it from the dispatching function:

void function(int type) {
    typedef void (*func)(void);
    static func table[n];
    // fill table with pointers to func_0 .. func_n

    table[type](); // call the appropriate function
}

This lets the compiler's optimizer work on each of func_0 .. func_n separately. Moreover, none of them will be so big.

Enyby
  • 1
    I guess that is the best solution, indeed. It is similar to the loop-within-switch solution, but it will very likely have a much better locality. The only problem perhaps is if you must pass variables. – Rudy Velthuis Sep 14 '15 at 14:57
  • 1
    @RudyVelthuis: I doubt there's much difference in locality between this and loop-within-switch. Once you get to the loop, all that matters is how many cache lines it takes to hold all the instructions for that copy of the loop. I don't think putting all the loops inside switch cases would make the compiler add extra jumps *inside* each loop. – Peter Cordes Sep 14 '15 at 18:24
  • @PeterCordes For now I can only see one possible advantage: a function of this volume may adversely affect the compiler's optimizer. But this question must be investigated in more detail to be sure. – Enyby Sep 14 '15 at 19:56
  • @Enyby: That's true, having one giant function might be harder for the compiler. It will spend a lot of time looking for ways to factor things out, and eliminate redundancy, when what you want it do to is generate separate loops for each case. If there isn't much code outside the loop, there's not much harm in the table-of-functions approach. Separate functions will make the code size the same or larger, but it probably compiles a lot faster at `-O3`. – Peter Cordes Sep 14 '15 at 20:59

Realistically, a static array of labels is likely the fastest sane option (array of pointers being the sanest fast option). But, let's get creative.

(Note that this should have been a comment, but I need the space).

Option 1: Exploit the branch predictor

Let's build on the fact that if a certain outcome of a branch happens, the predictor will likely predict the same outcome in the future. Especially if it happens more than once. The code would look something like:

for (int i = 0; i < m; i++) 
{
    // do some stuff A
    if (type < n/2) 
    {
        if (type < n/4) 
        {
            if (type < n/8) 
            {
                if (type == 0) // do some stuff 0
                else           // do some stuff 1
            } 
            else 
            {
                ...
            }
        } 
        else 
        {
             ...
        }
    } 
    else 
    {
        ...
        // do some stuff n
    }

    // do some stuff B
}

Basically, you binary-search what to do, in log(n) steps. Those are log(n) possible jumps, but after only one or two iterations the branch predictor will predict them all correctly and speculatively execute the proper instructions without a problem. Depending on the CPU, this could be faster than a `goto *typeLabel; back:` construct, as some CPUs are unable to prefetch instructions when the jump address is computed dynamically.

Option 2: JIT load the proper 'stuff'

So, ideally, your code would look like:

void function(int type) {
    for (int i = 0; i < m; i++) {
        // do some stuff A
        // do some stuff [type]
        // do some stuff B
    }
}

With all the other 0..n "stuffs" being junk in the current function invocation. Well, let's make it like that:

void function(int type) {
    prepare(type);
    for (int i = 0; i < m; i++) {
        // do some stuff A
        reserved:
        doNothing(); doNothing(); doNothing(); doNothing(); doNothing();
        // do some stuff B
    }
}

The doNothing() calls are there just to reserve the space in the function. Best implementation would be goto B. The prepare(type) function will look in the lookup table for all the 0..n implementations, take the type one, and copy it over all those goto Bs. Then, when you are actually executing your loop, you have the optimal code where there are no needless jumps.

Just be sure to have some final goto B instruction in the stuff implementation - copying a smaller one over a larger one could cause problems otherwise. Alternatively, before exiting function you can restore all the placeholder goto B; instructions. It's a small cost, since you're only doing it once per invocation, not per iteration.

prepare() would be much easier to implement in assembly than in C, but it is doable. You just need the start/end addresses of all stuff_i implementations (in your post, these are label_[i] and label_[i+1]), and memcpy that into reserved.

Maybe the compiler will even let you do:

memcpy((uint8_t*)reserved, (uint8_t*)label_1, (uint8_t*)label_2 - (uint8_t*)label_1);

Likely not, though. You can, however, get the proper locations using setjmp or something like __builtin_return_address / _ReturnAddress within a function call.

Note that this will require write access to the instruction memory. Getting that is OS specific, and likely requires su/admin privileges.

mtijanic
  • Option 1: I think the compiler already does something like this if it sees fit; I do not think this code will compile into anything faster than the switch. Option 2: It looks like a solution, but also like overhead. Plus, if the reserved space is not big enough, either everything breaks or, if we check the size, we cannot run the function at all. And it cannot be optimally compiled, because it is a dirty hack: we would have to store all 1024 pieces of code somewhere to copy from. Plus there are plenty of pitfalls with the optimizer. – Enyby Sep 14 '15 at 14:36
  • @Enyby: agree about Option1 that trying to outsmart compilers at optimizing `switch` directly is a mistake. About option2, though: if a call/return isn't too much overhead, you could write your loop with a regular function call, and binary patch it with the address of the appropriate function. Or an unconditional jump. Otherwise you might just get libclang to JIT-compile a version of the loop each time a different `type` is needed. Cache the 10 most recently used versions. (This allows branch predictor history for `do some stuff [type]` in recently called copies to maybe still be useful.) – Peter Cordes Sep 14 '15 at 17:46
  • Otherwise, yeah you have to get your hands dirty with machine code if you want to memcpy in code for `do some stuff [type]`. You can't just compile the function with labels, and take chunks of instructions. Different chunks of code will assume different variables are live in registers, and stuff like that, because you're taking them from places where the compiler made that happen, and putting them into a place where it didn't. The limited-size thing isn't really a problem. If a block of code is too big for your reserved space, put it elsewhere and put in jumps to it and back. – Peter Cordes Sep 14 '15 at 17:53
  • @PeterCordes A call is slower than a jump table: it needs to push and pop registers, which are unnecessary steps. JIT is certainly an interesting option, but it is overhead in my case. All the variants are known to me in advance, and there are not that many of them. In addition, on Android a JIT is a problem for both quality and performance; the device may not have much memory or a powerful CPU. Given that everything is known in advance, it is logical to use pre-compiled, optimized code built on a powerful processor. – Enyby Sep 14 '15 at 17:57
  • @PeterCordes Polymorphic code is generally very difficult. If something goes wrong there, determining the cause would be incredibly difficult. I would like to avoid it. The matter is further complicated by the fact that this code must use the same variables, registers, and so forth, yet the optimizer can make them different. – Enyby Sep 14 '15 at 18:02
  • @Enyby: yeah, I figured linking in a whole optimizing compiler library would be too heavy a solution to be good in an android app. LLVM can JIT from an intermediate representation, not just C++ source, so that could reduce the amount of work the JIT compiler had to do by a large margin. Still, it's not a great solution. However, if `stuffA` and `stuffB` take a lot of code, this lets you get an optimal version of the loop for each `type` without paying the cost of having them stored n times in your executable, where they have to get mapped read from storage into main memory before use. – Peter Cordes Sep 14 '15 at 18:14
  • I agree about `memcpy`ing compiler output around, though. You're absolutely right that debugging it would be horrible even if you did get it working once with one compiler's output. I already pointed out the problem of different register states at entry to different blocks. – Peter Cordes Sep 14 '15 at 18:16

The compiler is generally good at choosing an optimal form for the switch. On an ARM device there are a few dense code forms available: either a branch table (like a bunch of function pointers) or, if the code in each case is near-identical in size, a computed jump by array index. Semantically, something like this:

 dest = &first_switch_pc;
 dest += n*switch_code_size;
 current_pc = dest;

An ARM CPU may do this in a single instruction. This is probably not profitable in your case, as the type seems to be constant across loop iterations.

However, I would definitely explore restructuring your code like this,

void function(int type) {
    int i = 0;
    if (m == 0) return;
    void *type_label;
    // initialize type_label from type, e.g. type_label = &&label_1;
    goto entry;
    while (1) {
        // do some stuff B
        i++;
        if (i >= m) break;
    entry:
        // do some stuff A
        goto *type_label;

    label_1:
        // do some stuff 0
        continue;
        [...]
    label_n:
        // do some stuff n
        continue;
    }
}

This will merge 'A' and 'B' so that they fit well in the code cache. The 'control flow' from the computed goto then returns to the top of the loop. You may be able to simplify the control-flow logic depending on how i is used in the unknown snippets. A compiler may do this for you automatically depending on optimization levels, etc. No one can really give an answer without more information and profiling. The cost of 'stuff A', 'stuff B' and the size of the switch snippets are all important. Examining the assembler output is always helpful.

artless noise
  • I'd expect most compilers to compile a `switch` inside a loop to your `goto *type_label` anyway. Although maybe they wouldn't think to spill the result to memory if they ran out of registers, even if it was expensive to compute each time in the loop. Also, a `do { stuffA; goto *type_label; ... ; endswitch: stuffB; } while(i++ < m);` loop structure would be more readable. Note that that `break` will take you out of the loop. You need `goto endswitch;` – Peter Cordes Sep 14 '15 at 18:03
  • Thanks, I had `break` from the original switch statement; I changed them to `continue`. We don't know about the snippets, so it is hard to say whether there are spills or not. It is not even 100% clear whether the OP thinks `type` is constant (maybe 'do some stuff' changes it). – artless noise Sep 15 '15 at 13:06

This pdf of slides from a presentation about getting gcc to thread jumps is interesting. This is the exact optimization gcc needs to do to compile the switch-inside-loop version similarly to the loop-inside-switch version.

BTW, the loop-inside-switch version should be equivalent in performance to the loop-inside-separate-functions version. Cache operates in terms of cache lines, not whole functions. If most of the code in a function never runs, it doesn't matter that it's there. Only the code that does run takes space in the cache.

If all ARM cores in Android devices have branch-target prediction for indirect jumps, your second implementation of doing the compiler's job for it, and doing an indirect goto inside the loop, is probably the best tradeoff between code size and performance. A correctly-predicted unconditional indirect branch costs about the same as a couple add instructions on x86. If ARM is similar, the savings in code size should pay for it. Those slides talk about some ARM cores having indirect-branch prediction, but doesn't say that all do.

This Anandtech article about A53 cores (the little cores in big.LITTLE) says that A53 vastly increased the indirect-branch prediction resources compared to A7. A7 cores have an 8-entry indirect branch target buffer. That should be enough to make the goto *label in your loop efficient, even on very weak LITTLE cores, unless the rest of your loop has some indirect branches inside it. One mispredict on the occasional iteration should only cost maybe 8 cycles. (A7 has a short 8-stage pipeline, and is "partial dual issue, in-order", so branch mispredicts are cheaper than on more powerful CPUs.)

Smaller code size means less code to be loaded from flash, and also less I-cache pressure if the function is called with different arguments for type while the do stuff for A and do stuff for B code is still present in I$, and has its branch-prediction history still available.

If the do stuff for [type] code changes how branches in the stuff for A and B code behave, it may be best to have the entire loop body duplicated, so different copies of the branches have their own prediction entries.


If you want to sort out what's actually slow, you're going to have to profile your code. If ARM is like x86 in having hardware performance counters, you should be able to see which instructions are taking a lot of cycles. Also actually count branch mispredicts, I$ misses, and lots of other stuff.

To make any further suggestions, we'd need to see how big your pieces of code are, and what sort of thing they're doing. Clearly you think loop and switch overhead are making this hot function more of a bottleneck than it needs to be, but you haven't actually said that loop-inside-switch gave better performance.

Unless all the do stuff A, do stuff B, and many of the do stuff [type] blocks are very small, the switch is probably not the problem. If they are small, then yes, it is probably worth duplicating the loop N times.

Peter Cordes
  • I ran other tests with changed code: the loop with the switch and without it. With the switch it runs in approx. 1.6 s, without it, 1.0 s. That is a very big difference. Sometimes switches can be very slow. – Enyby Sep 15 '15 at 02:38

Another solution is to use labels as values:

void function(int type) {
    void *type_switch = &&type_break;

    switch (type) {
    case 0:
        type_switch = &&type_0;
        break;
    [...]
    case n:
        type_switch = &&type_n;
        break;
    }

    for (int i = 0; i < m; i++) {
        // do some stuff A

        goto *type_switch;

        type_0: {
            // do some stuff 0
            goto type_break;
        }
        [...]
        type_n: {
            // do some stuff n
            goto type_break;
        }
        type_break: ;

        // do some stuff B
    }
}

This solution is worse than the version with many separate functions:

  1. If optimization is not enabled, the variables will be reloaded from the stack each time in the code parts 0 .. n.
  2. The goto target address may also be loaded from the stack each time.
  3. Two extra gotos per iteration.
Enyby