
Some processors have special instructions that decrement a counter and branch if the counter is zero, with very low latency because the branch does not have to wait for the decrement to pass through an integer unit.

Here is a link to the ppc instruction:

https://www.ibm.com/support/knowledgecenter/ssw_aix_53/com.ibm.aix.aixassem/doc/alangref/bc.htm

My usual way of writing a loop that I believe prompts a compiler to generate the appropriate instructions is as follows:

unsigned int ctr = n;
while(ctr--)
  a[ctr] += b[ctr];

Readability is high, and it is a decrementing loop that branches on zero. As you can see, the branch technically occurs if the counter is zero before the decrement. I was hoping the compiler could do some magic and make it work anyway. Q: Would a compiler have to break any fundamental rules of C in order to turn this into the special decrement-and-branch-conditional instructions (if any)?

Another approach:

unsigned int ctr = n+1;
while(--ctr) {
  a[ctr-1] += b[ctr-1];
}

The branch now happens after the decrement, but the stray constants make the code ugly. An "index" variable that is one less than the counter would make it look a little prettier, I guess. Looking at the available PPC instructions, the extra calculation for the addresses of a and b can still fit in a single instruction, as a load may also perform address arithmetic (an add). I am not so sure about other instruction sets. My main problem, though, is what happens if n+1 is larger than the maximum value of the type. Q: Will the decrement pull it back to the maximum and loop as usual?
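A minimal sketch of the wraparound case, assuming the counter has an unsigned type: unsigned arithmetic in C is defined to wrap modulo the type's range, so when n is the maximum value, n + 1 becomes 0 and the first --ctr yields the maximum again. The helper names below are hypothetical, and unsigned short is used for the counting loop only so that the worst case finishes quickly.

```c
#include <limits.h>

/* Hypothetical helper: counts how often the body of
   `ctr = n + 1; while (--ctr) { ... }` runs, with an unsigned short
   counter so the wraparound case stays cheap to execute. */
static unsigned long count_iterations(unsigned short n)
{
    unsigned short ctr = (unsigned short)(n + 1u); /* 0 when n == USHRT_MAX */
    unsigned long iterations = 0;

    while (--ctr) {
        iterations++;
    }
    return iterations;
}

/* The same wraparound for unsigned int, without running the loop:
   returns the value seen by the first `--ctr` test. */
static unsigned int first_test_value(unsigned int n)
{
    unsigned int ctr = n + 1u; /* wraps to 0 when n == UINT_MAX */
    return --ctr;              /* wraps back to UINT_MAX in that case */
}
```

Under these assumptions the loop still runs n times even when n is the maximum value. With a signed counter the same n + 1 would overflow, which is undefined behavior, so the pattern relies on the counter being unsigned.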

Q: Is there a more commonly used pattern in C for allowing the common instruction?


Edit: ARM has a decrement and branch operation, but it branches only if the value is NOT zero. There appears to be an extra condition, just like the PPC bc. As I see it, from a C point of view it is very much the same thing, so I expect a code snippet to be compilable to that form too without any C standard violation. http://www.heyrick.co.uk/armwiki/Conditional_execution


Edit: Intel has virtually the same branching instruction as ARM: http://cse.unl.edu/~goddard/Courses/CSCE351/IntelArchitecture/InstructionSetSummary.pdf

Andreas
  • Readability is high? Readability is so high that pre/post increment/decrement have been removed from Swift3. I'd try memcpy or memmove. – gnasher729 May 31 '16 at 21:05
  • I personally have no problem with pre/post increment. memcpy/memmove is not an option: He is not copying, he is adding values (`+=` instead of `=`). – Aconcagua May 31 '16 at 21:16

3 Answers


This is going to depend on how much effort the authors of your compiler's optimizer have put in.

For instance, a bdz opcode could be used at the bottom of a loop to "jump over" a different jump that returned to the top. (This would be a bad idea, but it could happen.)

loop:
     blah
     blah

     bdz  ... out
     b loop
out: 

Far more likely would be to decrement and branch if NOT zero, which the PPC also supports.

loop:
    blah
    blah

    bdnz ... loop

fallthru:
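As a rough C-level picture of that second shape, here is a sketch written with goto so the branch positions are visible. The function name is hypothetical, and the guard in front of the loop is an assumption about how the n == 0 case would be handled, since the decrement-and-branch test only runs after the body has executed once.

```c
#include <stddef.h>

/* Hypothetical helper: adds b[i] to a[i] for i in [0, n), written in the
   bottom-tested shape that a decrement-and-branch-if-not-zero opcode suits. */
static void add_arrays_bottom_tested(int *a, const int *b, size_t n)
{
    size_t ctr = n;

    if (ctr == 0)
        goto out;               /* guard: skip the loop entirely for n == 0 */
loop:
    a[ctr - 1] += b[ctr - 1];   /* loop body */
    if (--ctr != 0)             /* the spot where a bdnz would sit */
        goto loop;
out:
    return;
}
```

The observable result is the same as for the while(ctr--) loop in the question, which is why a compiler is free to pick this shape without bending any rule of C.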

Unless you have a compelling reason to try to game the opcodes, I'd suggest that you try to write clean, readable code that minimizes side effects. Your own change from post-decrement to pre-decrement is a good example of that: one less (unused) side effect for the compiler to worry about.

That way, you'll get the most bang for your optimizing buck. If there's a platform that needs a special version of your code, you can #ifdef the whole thing, and either include inline assembly, or rewrite the code in conjunction with reading the assembly output and running the profiler.

aghast
  • How do those manage n=0? Your answer gave me more ideas. Before, I had the idea that the conditional branch was always at the beginning of the loop. Now I'm not so sure... Having a conditional check for zero before the loop and then having the conditional branch check whether the loop should run again might be more useful, depending on whether there is a branch predictor, the common loop size, and how many instructions are fetched at a time. Ugh, I'm starting to regret asking the question in the first place. So much to consider. – Andreas Jun 01 '16 at 16:09
  • 1
    Back in the day, when dec/bnz was new and cool, and data buses were 16 bits wide, max, there were a lot of compilers that generated loops that basically started with a "jump to bottom", then the bottom of the loop had a `decrement-branch-not-zero to top` opcode. (This was before cache misses became the dominant factor in code generation.) Essentially, a do-while loop was the "natural" form, and while and for loops were "do while" loops with a jump-to-bottom at the top. :-) – aghast Jun 03 '16 at 21:21

Definitely depends on the compiler, but it's an instruction that is great for performance, so I'd expect compilers to try and maximize its usage.

Since you're linking an AIX reference, I'm assuming you're running xlc. I don't have access to an AIX machine but I do have access to xlc on a Z machine. The equivalent Z counterpart is the Branch On Count (BCTR) instruction.

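I tried 5 examples and checked the listings:

int len = strlen(argv[1]);
// Loop header (one of the five variants listed below)
argv[1][counter] += argv[2][counter];  // counter stands for the header's index variable (i or len)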

With the following loop headers:

for (int i = 0; i < len; i++)
for (int i = len-1; i >= 0; i--)
while(--len)
while(len--)
while(len){
   len--;

All 5 examples use branch on count at -O1 and higher, and none of them use it at optimization level 0.

I'd trust a modern compiler to be able to find branch on zero opportunities with any standard loop structure.
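For anyone who wants to repeat the experiment on their own compiler, here is a sketch of the five loop headers as separate functions. The function names, the int arrays, and MAX_LEN are assumptions made for illustration, not the code that was actually compiled above. The while(--len) variant is adjusted (start at len + 1, index with len - 1) so that all five compute the same result, and len is assumed to be non-negative and smaller than INT_MAX.

```c
#include <string.h>

enum { MAX_LEN = 16 }; /* hypothetical upper bound for the self-check below */

typedef void (*add_fn)(int *a, const int *b, int len);

static void add_for_up(int *a, const int *b, int len)
{
    for (int i = 0; i < len; i++)
        a[i] += b[i];
}

static void add_for_down(int *a, const int *b, int len)
{
    for (int i = len - 1; i >= 0; i--)
        a[i] += b[i];
}

static void add_pre_dec(int *a, const int *b, int len)
{
    len++;                      /* assumes len < INT_MAX */
    while (--len)
        a[len - 1] += b[len - 1];
}

static void add_post_dec(int *a, const int *b, int len)
{
    while (len--)
        a[len] += b[len];
}

static void add_while_body(int *a, const int *b, int len)
{
    while (len) {
        len--;
        a[len] += b[len];
    }
}

/* Returns 1 if all five variants produce the same array, 0 otherwise. */
static int all_variants_agree(const int *a0, const int *b, int len)
{
    static const add_fn variants[] = {
        add_for_up, add_for_down, add_pre_dec, add_post_dec, add_while_body
    };
    int expected[MAX_LEN];
    int actual[MAX_LEN];

    if (len < 0 || len > MAX_LEN)
        return 0;

    memcpy(expected, a0, (size_t)len * sizeof *a0);
    add_for_up(expected, b, len);   /* reference result */

    for (size_t v = 0; v < sizeof variants / sizeof variants[0]; v++) {
        memcpy(actual, a0, (size_t)len * sizeof *a0);
        variants[v](actual, b, len);
        if (memcmp(actual, expected, (size_t)len * sizeof *a0) != 0)
            return 0;
    }
    return 1;
}
```

Compiling each function at different optimization levels and reading the generated assembly shows whether a given compiler picks its counting branch for every form or only for some of them.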

Artur Kink
  • Do all examples generate the same assembly? Because that would be really cool. Btw, while(--len) has to watch out for starting with zero. – Andreas Jun 01 '16 at 16:02
  • No, the code is slightly different, but overall pretty similar. Completely identical code would be impressive lol. – Artur Kink Jun 03 '16 at 20:54

What about this:

unsigned int ctr = n;
do
{
    a[ctr - 1] += b[ctr - 1];  /* ctr runs from n down to 1, so index with ctr - 1 */
}
while(--ctr);

You'd need an additional check, however:

if(n != 0)
{
    /*...*/
}

if you cannot guarantee this by other means...

Oh, and be aware that ctr has different final values depending on which loop variant you select (0 in mine and your second one, ~0 in your first)...
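To make the difference in final values concrete, here is a small sketch with the loop bodies left empty. The function names are hypothetical, and the do-while variant is assumed to be wrapped in the n != 0 check shown above.

```c
#include <limits.h>

/* Question's first variant: while (ctr--) */
static unsigned int final_ctr_post_decrement(unsigned int n)
{
    unsigned int ctr = n;
    while (ctr--) {
        /* body would index with ctr */
    }
    return ctr; /* the failing test still decrements, so 0 wraps to UINT_MAX */
}

/* Question's second variant: ctr = n + 1; while (--ctr) */
static unsigned int final_ctr_pre_decrement(unsigned int n)
{
    unsigned int ctr = n + 1u;
    while (--ctr) {
        /* body would index with ctr - 1 */
    }
    return ctr; /* the loop exits exactly when ctr reaches 0 */
}

/* Guarded do-while variant */
static unsigned int final_ctr_do_while(unsigned int n)
{
    unsigned int ctr = n;
    if (n != 0) {
        do {
            /* body */
        } while (--ctr);
    }
    return ctr; /* 0, whether or not the loop ran */
}
```

This only matters if ctr is read after the loop; if it is not, the compiler is free to drop the extra decrement.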

Aconcagua