As usual with "which is faster" questions, the first question back is: what have you tried so far? Did you compile it, disassemble it, and see what the compiler produces?
unsigned int mfun ( unsigned int a, unsigned int b, unsigned int c, unsigned int d )
{
    if ( a * b * c * d == 0 ) return(7);
    else return(11);
}

unsigned int ofun ( unsigned int a, unsigned int b, unsigned int c, unsigned int d )
{
    if ( a == 0 || b == 0 || c == 0 || d == 0 ) return(7);
    else return(11);
}
For ARM, one compiler gives this:
00000000 <mfun>:
0: e0010190 mul r1, r0, r1
4: e0020291 mul r2, r1, r2
8: e0110293 muls r1, r3, r2
c: 13a0000b movne r0, #11
10: 03a00007 moveq r0, #7
14: e12fff1e bx lr
00000018 <ofun>:
18: e3500000 cmp r0, #0
1c: 13510000 cmpne r1, #0
20: 0a000004 beq 38 <ofun+0x20>
24: e3520000 cmp r2, #0
28: 13530000 cmpne r3, #0
2c: 13a0000b movne r0, #11
30: 03a00007 moveq r0, #7
34: e12fff1e bx lr
38: e3a00007 mov r0, #7
3c: e12fff1e bx lr
So the equals-and-ors version short-circuits, and the branches that implement the short circuit are themselves costly; its worst-case path takes longer, so its performance is erratic. The multiply version's performance is more deterministic. By inspection, the multiply solution should be faster for the above code.
MIPS gave me this:
00000000 <mfun>:
0: 00a40018 mult a1,a0
4: 00002012 mflo a0
...
10: 00860018 mult a0,a2
14: 00002012 mflo a0
...
20: 00870018 mult a0,a3
24: 00002012 mflo a0
28: 10800003 beqz a0,38 <mfun+0x38>
2c: 00000000 nop
30: 03e00008 jr ra
34: 2402000b li v0,11
38: 03e00008 jr ra
3c: 24020007 li v0,7
00000040 <ofun>:
40: 10800009 beqz a0,68 <ofun+0x28>
44: 00000000 nop
48: 10a00007 beqz a1,68 <ofun+0x28>
4c: 00000000 nop
50: 10c00005 beqz a2,68 <ofun+0x28>
54: 00000000 nop
58: 10e00003 beqz a3,68 <ofun+0x28>
5c: 00000000 nop
60: 03e00008 jr ra
64: 2402000b li v0,11
68: 03e00008 jr ra
6c: 24020007 li v0,7
Unless the branches are too costly, here the equals-and-ors version looks faster.
OpenRISC 32:
00000000 <mfun>:
0: e0 64 1b 06 l.mul r3,r4,r3
4: e0 a3 2b 06 l.mul r5,r3,r5
8: e0 c5 33 06 l.mul r6,r5,r6
c: bc 26 00 00 l.sfnei r6,0x0
10: 0c 00 00 04 l.bnf 20 <mfun+0x20>
14: 9d 60 00 0b l.addi r11,r0,0xb
18: 44 00 48 00 l.jr r9
1c: 15 00 00 00 l.nop 0x0
20: 44 00 48 00 l.jr r9
24: 9d 60 00 07 l.addi r11,r0,0x7
00000028 <ofun>:
28: e0 e0 20 02 l.sub r7,r0,r4
2c: e0 87 20 04 l.or r4,r7,r4
30: bd 64 00 00 l.sfgesi r4,0x0
34: 10 00 00 10 l.bf 74 <ofun+0x4c>
38: e0 80 18 02 l.sub r4,r0,r3
3c: e0 64 18 04 l.or r3,r4,r3
40: bd 63 00 00 l.sfgesi r3,0x0
44: 10 00 00 0c l.bf 74 <ofun+0x4c>
48: e0 60 30 02 l.sub r3,r0,r6
4c: e0 c3 30 04 l.or r6,r3,r6
50: bd 66 00 00 l.sfgesi r6,0x0
54: 10 00 00 08 l.bf 74 <ofun+0x4c>
58: e0 60 28 02 l.sub r3,r0,r5
5c: e0 a3 28 04 l.or r5,r3,r5
60: bd 85 00 00 l.sfltsi r5,0x0
64: 0c 00 00 04 l.bnf 74 <ofun+0x4c>
68: 9d 60 00 0b l.addi r11,r0,0xb
6c: 44 00 48 00 l.jr r9
70: 15 00 00 00 l.nop 0x0
74: 44 00 48 00 l.jr r9
78: 9d 60 00 07 l.addi r11,r0,0x7
This depends on the implementation of multiply; if it is one clock, then the multiplies have it.
If your hardware doesn't support multiply, then a call has to be made to simulate it in software (MSP430 here):
00000000 <mfun>:
0: 0b 12 push r11
2: 0a 12 push r10
4: 09 12 push r9
6: 09 4d mov r13, r9
8: 0b 4c mov r12, r11
a: 0a 4e mov r14, r10
c: 0c 4f mov r15, r12
e: b0 12 00 00 call #0x0000
12: 0a 4e mov r14, r10
14: 0c 49 mov r9, r12
16: b0 12 00 00 call #0x0000
1a: 0a 4e mov r14, r10
1c: 0c 4b mov r11, r12
1e: b0 12 00 00 call #0x0000
22: 0e 93 tst r14
24: 06 24 jz $+14 ;abs 0x32
26: 3f 40 0b 00 mov #11, r15 ;#0x000b
2a: 39 41 pop r9
2c: 3a 41 pop r10
2e: 3b 41 pop r11
30: 30 41 ret
32: 3f 40 07 00 mov #7, r15 ;#0x0007
36: 39 41 pop r9
38: 3a 41 pop r10
3a: 3b 41 pop r11
3c: 30 41 ret
0000003e <ofun>:
3e: 0f 93 tst r15
40: 09 24 jz $+20 ;abs 0x54
42: 0e 93 tst r14
44: 07 24 jz $+16 ;abs 0x54
46: 0d 93 tst r13
48: 05 24 jz $+12 ;abs 0x54
4a: 0c 93 tst r12
4c: 03 24 jz $+8 ;abs 0x54
4e: 3f 40 0b 00 mov #11, r15 ;#0x000b
52: 30 41 ret
54: 3f 40 07 00 mov #7, r15 ;#0x0007
58: 30 41 ret
You would hope that the two are equivalent, and in a pure mathematical sense they should be: for the product of the multiplies to be zero, at least one operand needs to be zero. The problem is that this is software running on a processor: you can easily overflow a multiply and get a zero result from non-zero operands, so to properly implement the code the multiplies have to actually happen.
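For instance, a minimal sketch assuming a 32-bit unsigned int (main here is just a made-up driver around the two functions above):

#include <stdio.h>

unsigned int mfun ( unsigned int a, unsigned int b, unsigned int c, unsigned int d );
unsigned int ofun ( unsigned int a, unsigned int b, unsigned int c, unsigned int d );

int main ( void )
{
    /* 0x10000 * 0x10000 is 0x100000000, which wraps to 0 in a 32-bit unsigned int,
       so the multiply version claims an operand was zero even though none were */
    printf("mfun: %u\n", mfun(0x10000, 0x10000, 1, 1)); /* prints 7  */
    printf("ofun: %u\n", ofun(0x10000, 0x10000, 1, 1)); /* prints 11 */
    return(0);
}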
Because of the cost of mul, and divide in particular, you should avoid them as much as possible in your software. In this case, for your two solutions to actually be equivalent, the multiply solution would require even more code to detect or prevent the overflow cases that can lead to a false positive (a sketch of that is below). Yes, many processors perform mul in one clock, and divide as well; the reason you don't see divide, and sometimes don't see mul, implemented in the instruction set is the chip real estate required, and the expense now is power, heat, the cost of the part, etc. So mul and divide remain expensive. They are not the only long poles in the tent, of course, but they do affect the performance of the part and its clock rate: folks want single-clock operation, not realizing that one instruction may slow the whole chip down, while allowing it to be multi-clock might bring your overall clock rate up. So many things are long poles in the tent that removing mul might not change performance at all; it all depends...
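Just to illustrate how much extra code that is, here is one possible sketch (mfun_safe is a made-up name; it assumes a 32-bit unsigned int and a 64-bit unsigned long long) that keeps the multiplies but gives the same answer as ofun even when the 32-bit product would overflow:

unsigned int mfun_safe ( unsigned int a, unsigned int b, unsigned int c, unsigned int d )
{
    unsigned long long p;

    /* widen to 64 bits so the first multiply cannot wrap */
    p = (unsigned long long)a * b;
    /* once a partial product no longer fits in 32 bits it is certainly non-zero,
       so the true product is zero only if one of the remaining operands is zero */
    if ( p > 0xFFFFFFFFULL ) return((c == 0 || d == 0) ? 7 : 11);
    p *= c;
    if ( p > 0xFFFFFFFFULL ) return((d == 0) ? 7 : 11);
    p *= d;
    if ( p == 0 ) return(7);
    else return(11);
}

It ends up doing zero tests anyway, on top of wider multiplies, which only reinforces the point: the equals-and-ors version already says exactly what you mean without any of this.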