1

That is, what is the fastest way to do the test

if( a >= ( b - c > 0 ? b - c : 0 ) &&
    a <= ( b + c < 255 ? b + c : 255 ) )

    ...

if a, b, and c are all unsigned char aka BYTE. I am trying to optimize an image scanning process to find a sub-image, and a comparison such as this is done about 3 million times per scan, so even minor optimizations could be helpful.

Not sure, but maybe some sort of bitwise operation? Maybe adding 1 to c and testing for less-than and greater-than without the or-equal-to part? I don't know!

Lightness Races in Orbit
  • 378,754
  • 76
  • 643
  • 1,055
rotanimod
  • 518
  • 4
  • 13
  • do you mean a <= b <= c? – Jason D May 07 '11 at 23:18
  • No I mean a is within c units of b, ( a >= b - c AND a <= b + c ) – rotanimod May 07 '11 at 23:19
  • Generally c is rather small, and a and b can be whatever. a and b represent R-, G-, or B-pixel values, and c is the acceptable threshold they must fall within to 'match'. I have this threshold to account for equivalent images that are very slightly differently shaded and only off by a small-interval of pixel-value. – rotanimod May 07 '11 at 23:22

4 Answers4

4

Well, first of all let's see what you are trying to check without all kinds of over/underflow checks:

a >= b - c
a <= b + c
subtract b from both:
a - b >= -c
a - b <= c

Now that is equal to

abs(a - b) <= c

And in code:

(a>b ? a-b : b-a) <= c

Now, this code is a tad faster and doesn't contain (or need) complicated underflow/overflow checks.


I've profiled mine and 6502's code with 1000000000 repitions and there officially was no difference whatsoever. I would suggest to pick the most elegant solution (which is IMO mine, but opinions differ), since performance is not an argument.


However, there was a notable difference between my and the asker's code. This is the profiling code I used:

#include <iostream>

int main(int argc, char *argv[]) {  
    bool prevent_opti;
    for (int ai = 0; ai < 256; ++ai) {
        for (int bi = 0; bi < 256; ++bi) {
            for (int ci = 0; ci < 256; ++ci) {
                unsigned char a = ai;
                unsigned char b = bi;
                unsigned char c = ci;
                if ((a>b ? a-b : b-a) <= c) prevent_opti = true;
            }
        }
    }

    std::cout << prevent_opti << "\n";

    return 0;
}

With my if statement this took 120ms on average and the asker's if statement took 135ms on average.

jweyrich
  • 31,198
  • 5
  • 66
  • 97
orlp
  • 112,504
  • 36
  • 218
  • 315
  • @Jason D: What does it mean to steal someone's thunder? English is not my mother language. But I guess it's something extremely positive :) – orlp May 07 '11 at 23:28
  • Beautiful. I knew there was something like this lurking around. Thanks. – rotanimod May 07 '11 at 23:31
  • This is still using two conditionals (one for the ternary operator and one for the comparison with `c`) – 6502 May 07 '11 at 23:47
  • @6502: And these conditionals are blazing fast! – orlp May 07 '11 at 23:49
  • @Martin: Except a cannot be negative since it is unsigned. – rotanimod May 08 '11 at 00:11
  • @Martin: `a` can __not__ be -2 since it's a unsigned char. -2 automatically gets converted to 254. Now think again for a(254) b(50) c(200). – orlp May 08 '11 at 00:11
  • Interesting, you guys rock, thanks for the benchmarking too. I think out of curiosity I'll benchmark my original comparison against yours to see how much of a real speed gain I'll get. – rotanimod May 08 '11 at 00:12
  • @Martin: since when -2 is between 0 and 255 ? – 6502 May 08 '11 at 00:13
  • @rotanimod: I profiled yours against mine and it will be a speed gain of 15ms (Ubuntu 10.10, Q6600 single-threaded app, g++ -O3). – orlp May 08 '11 at 00:18
  • I think this question is nearly answered to death. It's tough to pick a winner as you guys have all given pretty solid tips/information. I like the 1-less comparison from 6502, but since performance is equivalent and yours is a little more intuitive, I'm going with it. – rotanimod May 08 '11 at 00:26
  • @nightcracker: Removed comment. Obviously it works now that I understand. – Martin York May 08 '11 at 20:22
  • @Tomalak Geret'kal: Telling a blind man the light is on does not help. You must first help him to see before he knows the light is on! – Martin York Sep 11 '11 at 19:56
2

Just using plain ints for a, b and c will allow you to change the code to the simpler

if (a >= b - c && a <= b + c) ...

Also, as an alternative, 256*256*256 is just 16M and a map of 16M bits is 2 MBytes. This means that it's feasible to use a lookup table like

int index = (a<<16) + (b<<8) + c;
if (lookup_table[index>>3] & (1<<(index&7))) ...

but I think that the cache trashing will make this much slower even if modern processors hate conditionals...

Another alternative is to use a bit of algebra

b - c <= a <= b + c
      iff
- c <= a - b <= c        (subtracted b from all terms)
      iff
0 <= a - b + c <= 2*c    (added c to all terms)

this allows to use just one test

if ((unsigned)(a - b + c) < 2*c) ...

assuming that a, b and c are plain ints. The reason is that if a - b + c is negative then unsigned arithmetic will make it much bigger than 2*c (if c is 0..255). This should generate efficent machine code with a single branch if the processor has dedicated signed/unsigned comparison instructions like x86 (ja/jg).

#include <stdio.h>

int main()
{
    int err = 0;

    for (int ia=0; ia<256; ia++)
        for (int ib=0; ib<256; ib++)
            for (int ic=0; ic<256; ic++)
            {
                unsigned char a = ia;
                unsigned char b = ib;
                unsigned char c = ic;
                int res1 = (a >= ( b - c > 0 ? b - c : 0 ) &&
                            a <= ( b + c < 255 ? b + c : 255 ));
                int res2 = (unsigned(a - b + c) <= 2*c);

                err += (res1 != res2);
            }
    printf("Errors = %i\n", err);
    return 0;
}

On x86 with g++ the assembler code generated for the res2 test only includes one conditional instruction.

The assembler code for the following loop is

void process(unsigned char *src, unsigned char *dst, int sz)
{
    for (int i=0; i<sz; i+=3)
    {
        unsigned char a = src[i];
        unsigned char b = src[i+1];
        unsigned char c = src[i+2];
        dst[i] = (unsigned(a - b + c) <= 2*c);
    }
}


.L3:
    movzbl  2(%ecx,%eax), %ebx    ; This loads c
    movzbl  (%ecx,%eax), %edx     ; This loads a
    movzbl  1(%ecx,%eax), %esi    ; This loads b
    leal    (%ebx,%edx), %edx     ; This computes a + c
    addl    %ebx, %ebx            ; This is c * 2
    subl    %esi, %edx            ; This is a - b + c
    cmpl    %ebx, %edx            ; Comparison
    setbe   (%edi,%eax)           ; Set 0/1 depending on result
    addl    $3, %eax              ; next group
    cmpl    %eax, 16(%ebp)        ; did we finish ?
    jg  .L3                   ; if not loop back for next

Using instead dst[i] = (a<b ? b-a : a-b); the code becomes much longer

.L9:
    movzbl  %dl, %edx
    andl    $255, %esi
    subl    %esi, %edx
.L4:
    andl    $255, %edi
    cmpl    %edi, %edx
    movl    12(%ebp), %edx
    setle   (%edx,%eax)
    addl    $3, %eax
    cmpl    %eax, 16(%ebp)
    jle .L6
.L5:
    movzbl  (%ecx,%eax), %edx
    movb    %dl, -13(%ebp)
    movzbl  1(%ecx,%eax), %esi
    movzbl  2(%ecx,%eax), %edi
    movl    %esi, %ebx
    cmpb    %bl, %dl
    ja  .L9
    movl    %esi, %ebx
    movzbl  %bl, %edx
    movzbl  -13(%ebp), %ebx
    subl    %ebx, %edx
    jmp .L4
    .p2align 4,,7
    .p2align 3
.L6:

And I'm way too tired now to try to decipher it (2:28 AM here)

Anyway longer doesn't mean necessarely slower (at a first sight seems g++ decided to unroll the loop by writing a few elements at a time in this case).

As I said before you should do some actual profiling with your real computation and your real data. Note that if true performance is needed may be that the best strategy will differ depending on the processor.

For example Linux during bootstrap makes ae test to decide what is the faster way to perform a certain computation that is needed in the kernel. The variables are just too many (cache size/levels, ram speed, cpu clock, chipset, cpu type...).

6502
  • 112,025
  • 15
  • 165
  • 265
  • @6502: How can you ever assume `a, b, c` are `int`s while the question explicitly states `unsigned char`. – orlp May 07 '11 at 23:37
  • I need to use unsigned char, though, because using int will quadruple the size of all my image objects and that is no good either. I like the looks of your second option... but I must admit I don't fully understand what the second line is doing. – rotanimod May 07 '11 at 23:38
  • If `a`, `b` and `c` are variables you can have them `int` but still having `unsigned char`s inside the images. `char`, `short` and `float` by the way are for C++ just "storage types": they're automatically converted to int/doubles in every expression (i.e. `int x = a + b;` when `a` and `b` are `unsigned char`s with value 255 gives 510 ... in other words there is no wrap even if it's a sum between unsigned chars). – 6502 May 07 '11 at 23:43
  • OK, so my understanding is that `if ((unsigned)(a - b + c) < 2*c) ...` should work regardless then? – rotanimod May 07 '11 at 23:45
  • Looks very nice. Could you post the full assembler dump for the res2 statement? I just see that my initial code has 13 asm statements, nightcrawler's has 8, and yours has an unknown number but one less compare for sure. So that seems like a likely winner. – rotanimod May 08 '11 at 00:09
  • @rotanimod: Offtopic, it's nightcracker or NC, but not nightcrawler. – orlp May 08 '11 at 00:14
  • @nightcracker: Whoops my bad, i'll fix that. – rotanimod May 08 '11 at 00:19
2

It think you will get the best performance by writing it in clearest way possible then turning on the compilers optimizers. The compiler is rather good at this kind of optimization and will beat you most of the time (in the worst case it will equal you).

My preference would be:

int min = (b-c) > 0  ? (b-c) : 0 ;
int max = (b+c) < 255? (b+c) : 255;

if ((a >= min) && ( a<= max))

The original code: (in assembley)

movl    %eax, %ecx
movl    %ebx, %eax
subl    %ecx, %eax
movl    $0, %edx
cmovs   %edx, %eax
cmpl    %eax, %r12d
jl  L13
leal    (%rcx,%rbx), %eax
cmpl    $255, %eax
movb    $-1, %dl
cmovg   %edx, %eax
cmpl    %eax, %r12d
jmp L13

My Code (in assembley)

movl    %eax, %ecx
movl    %ebx, %eax
subl    %ecx, %eax
movl    $0, %edx
cmovs   %edx, %eax
cmpl    %eax, %r12d
jl  L13
leal    (%rcx,%rbx), %eax
cmpl    $255, %eax
movb    $-1, %dl
cmovg   %edx, %eax
cmpl    %eax, %r12d
jg  L13

nightcracker's code (in assembley)

movl    %r12d, %edx
subl    %ebx, %edx
movl    %ebx, %ecx
subl    %r12d, %ecx
cmpl    %ebx, %r12d
cmovle  %ecx, %edx
cmpl    %eax, %edx
jg  L16
Martin York
  • 257,169
  • 86
  • 333
  • 562
  • I'm sorry martin, but your preference would take me 20 seconds to grasp. I think my `(a>b ? a-b : b-a) <= c` is faster and easier to read. If not then I would put this comment behind it: `// abs(a - b) <= c`. – orlp May 07 '11 at 23:32
  • Interesting approach, use verbosity so the compiler knows exactly what you are doing and can then use it's clever tricks to do it best. Still using so many operations makes me nervous. I should really compile all these and see which one has the best assembler. – rotanimod May 07 '11 at 23:40
  • @night: I think mine is easier to read. Its hard to grok that you you are checking that `a` has to be in the range. As for more efficient your is 5 instructions less (with -O3). – Martin York May 07 '11 at 23:43
  • 1
    @rotanimod: The verbosity is not for the compiler. It is for the maintainer. The compiler is good at sorting you mess out just fine. – Martin York May 07 '11 at 23:44
  • Given the difference I think you will find that there is no difference in speed in a real program as any processor stall in the rest of the code will completely wipe out any advantage. So the importance becomes for readability and maintainability. – Martin York May 07 '11 at 23:56
  • I dunno 13 operations to 8 seems pretty big. This is not a section of code I will change frequently and I can comment on the purpose. This really is a performance-critical section so I think it's definitely worth it. I am going to try to profile the assembly of 6502's solution as well. – rotanimod May 08 '11 at 00:00
  • @rotanimod: the only way to know if it's faster is to do actual profiling (and unfortunately this means that what is an optimization or not depends on CPU model). Modern CPU execution times are quite hard to predict and experimenting (with real data!) is the only sensible approach... this area of programming left math and became like physics. – 6502 May 08 '11 at 00:08
  • @rotanimod: The trouble with special optimizations like that are that they are hard to get correct in the first place, and nightcrawlers is no exception in being wrong. Try a(-2) b(50) c(200). Should be false his code will return true. I doubt the section is as performance critical as you think. Humans are notoriously bad at determining critical sections. Have you profiled your code? These kind of macro optimizations are usually a waste of time. It is more important to reduce the big O() complexity and let the compiler worry about the simple things. – Martin York May 08 '11 at 00:09
  • @Martin: Also for you, it's nightCRACKER, not nightcrawler. And my code example works perfectly for the stated domain, `0 <= (a,b,c) < pow(2,sizeof(char))`. – orlp May 08 '11 at 00:20
  • @nightcracker: Sorry. But also proves my point. Normal humans (like myself) should not be making these optimizations because we have a hard time validating the code. – Martin York May 08 '11 at 11:15
1

Rarely does embedding the ternary operator in another statement improve performance :)

If every single op code matters, write the op codes yourself - use assembler. Also consider using simd instructions if possible. I'd also be interested in the target platform. ARM assembler loves compares of this sort and has opcodes to speed up saturated math of this type.

Michael Dorgan
  • 12,453
  • 3
  • 31
  • 61
  • I have extremely little assembler experience. I think I have to do my best in C++ and then trust the compiler. But point well-taken, I should consider digging a bit deeper for some things where performance is huge, and this is one of them. – rotanimod May 07 '11 at 23:33