Just using plain int
s for a
, b
and c
will allow you to change the code to the simpler
if (a >= b - c && a <= b + c) ...
Also, as an alternative, 256*256*256 is just 16M and a map of 16M bits is 2 MBytes. This means that it's feasible to use a lookup table like
int index = (a<<16) + (b<<8) + c;
if (lookup_table[index>>3] & (1<<(index&7))) ...
but I think that the cache trashing will make this much slower even if modern processors hate conditionals...
Another alternative is to use a bit of algebra
b - c <= a <= b + c
iff
- c <= a - b <= c (subtracted b from all terms)
iff
0 <= a - b + c <= 2*c (added c to all terms)
this allows to use just one test
if ((unsigned)(a - b + c) < 2*c) ...
assuming that a
, b
and c
are plain int
s. The reason is that if a - b + c
is negative then unsigned arithmetic will make it much bigger than 2*c
(if c
is 0..255).
This should generate efficent machine code with a single branch if the processor has dedicated signed/unsigned comparison instructions like x86 (ja/jg).
#include <stdio.h>
int main()
{
int err = 0;
for (int ia=0; ia<256; ia++)
for (int ib=0; ib<256; ib++)
for (int ic=0; ic<256; ic++)
{
unsigned char a = ia;
unsigned char b = ib;
unsigned char c = ic;
int res1 = (a >= ( b - c > 0 ? b - c : 0 ) &&
a <= ( b + c < 255 ? b + c : 255 ));
int res2 = (unsigned(a - b + c) <= 2*c);
err += (res1 != res2);
}
printf("Errors = %i\n", err);
return 0;
}
On x86 with g++ the assembler code generated for the res2
test only includes one conditional instruction.
The assembler code for the following loop is
void process(unsigned char *src, unsigned char *dst, int sz)
{
for (int i=0; i<sz; i+=3)
{
unsigned char a = src[i];
unsigned char b = src[i+1];
unsigned char c = src[i+2];
dst[i] = (unsigned(a - b + c) <= 2*c);
}
}
.L3:
movzbl 2(%ecx,%eax), %ebx ; This loads c
movzbl (%ecx,%eax), %edx ; This loads a
movzbl 1(%ecx,%eax), %esi ; This loads b
leal (%ebx,%edx), %edx ; This computes a + c
addl %ebx, %ebx ; This is c * 2
subl %esi, %edx ; This is a - b + c
cmpl %ebx, %edx ; Comparison
setbe (%edi,%eax) ; Set 0/1 depending on result
addl $3, %eax ; next group
cmpl %eax, 16(%ebp) ; did we finish ?
jg .L3 ; if not loop back for next
Using instead dst[i] = (a<b ? b-a : a-b);
the code becomes much longer
.L9:
movzbl %dl, %edx
andl $255, %esi
subl %esi, %edx
.L4:
andl $255, %edi
cmpl %edi, %edx
movl 12(%ebp), %edx
setle (%edx,%eax)
addl $3, %eax
cmpl %eax, 16(%ebp)
jle .L6
.L5:
movzbl (%ecx,%eax), %edx
movb %dl, -13(%ebp)
movzbl 1(%ecx,%eax), %esi
movzbl 2(%ecx,%eax), %edi
movl %esi, %ebx
cmpb %bl, %dl
ja .L9
movl %esi, %ebx
movzbl %bl, %edx
movzbl -13(%ebp), %ebx
subl %ebx, %edx
jmp .L4
.p2align 4,,7
.p2align 3
.L6:
And I'm way too tired now to try to decipher it (2:28 AM here)
Anyway longer doesn't mean necessarely slower (at a first sight seems g++ decided to unroll the loop by writing a few elements at a time in this case).
As I said before you should do some actual profiling with your real computation and your real data. Note that if true performance is needed may be that the best strategy will differ depending on the processor.
For example Linux during bootstrap makes ae test to decide what is the faster way to perform a certain computation that is needed in the kernel. The variables are just too many (cache size/levels, ram speed, cpu clock, chipset, cpu type...).