I have a program which spends most of its time computing the Euclidean distance between RGB values (3-tuples of unsigned 8-bit Word8). I need a fast, branchless unsigned int absolute difference function such that
unsigned_difference :: Word8 -> Word8 -> Word8
unsigned_difference a b = max a b - min a b
in particular,
unsigned_difference a b == unsigned_difference b a
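As a baseline, the specification above can be written out and checked exhaustively, since Word8 has only 256 values. This is a minimal sketch; the camel-case name unsignedDifference and the helper symmetric are illustrative names, not part of my program.

```haskell
import Data.Word (Word8)

-- Reference definition from the question. max/min may compile to branches,
-- so this serves as a correctness oracle rather than the fast version.
unsignedDifference :: Word8 -> Word8 -> Word8
unsignedDifference a b = max a b - min a b

-- Exhaustive check of the symmetry property over all 65536 input pairs.
symmetric :: Bool
symmetric =
  and [ unsignedDifference a b == unsignedDifference b a
      | a <- [minBound .. maxBound]
      , b <- [minBound .. maxBound]
      ]

main :: IO ()
main = print symmetric
```

Any faster candidate can be compared against unsignedDifference in the same exhaustive way.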
I came up with the following, using new primops from GHC 7.8:
-- (a < b) * (b - a) + (a > b) * (a - b)
unsigned_difference (I# a) (I# b) =
    I# ((a <# b) *# (b -# a) +# (a ># b) *# (a -# b))
which ghc -O2 -S compiles to
.Lc42U:
movq 7(%rbx),%rax
movq $ghczmprim_GHCziTypes_Izh_con_info,-8(%r12)
movq 8(%rbp),%rbx
movq %rbx,%rcx
subq %rax,%rcx
cmpq %rax,%rbx
setg %dl
movzbl %dl,%edx
imulq %rcx,%rdx
movq %rax,%rcx
subq %rbx,%rcx
cmpq %rax,%rbx
setl %al
movzbl %al,%eax
imulq %rcx,%rax
addq %rdx,%rax
movq %rax,(%r12)
leaq -7(%r12),%rbx
addq $16,%rbp
jmp *(%rbp)
Compiling with ghc -O2 -fllvm -optlo -O3 -S produces the following asm:
.LBB6_1:
movq 7(%rbx), %rsi
movq $ghczmprim_GHCziTypes_Izh_con_info, 8(%rax)
movq 8(%rbp), %rcx
movq %rsi, %rdx
subq %rcx, %rdx
xorl %edi, %edi
subq %rsi, %rcx
cmovleq %rdi, %rcx
cmovgeq %rdi, %rdx
addq %rcx, %rdx
movq %rdx, 16(%rax)
movq 16(%rbp), %rax
addq $16, %rbp
leaq -7(%r12), %rbx
jmpq *%rax # TAILCALL
So LLVM manages to replace the comparisons with (more efficient?) conditional move instructions. Unfortunately, compiling with -fllvm has little effect on the runtime of my program.
However, there are two problems with this function.
- I want to compare Word8, but the comparison primops necessitate the use of Int. This causes needless allocation, as I'm forced to store a 64-bit Int rather than a Word8.
I've profiled and confirmed that the use of fromIntegral :: Word8 -> Int is responsible for 42.4 percent of the program's total allocations.
- My version uses 2 comparisons, 2 multiplications and 2 subtractions. I wonder if there is a more efficient method, using bitwise operations or SIMD instructions and exploiting the fact that I'm comparing Word8.
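To make the first problem concrete, here is a sketch of how the primop version ends up being wrapped for Word8 inputs. It assumes GHC 7.8 or later, where the comparison primops return Int# (1# or 0#) rather than Bool. The names unsignedDifferenceInt and unsignedDifferenceW8 are illustrative, and the exact shape of the wrapper in my program may differ.

```haskell
{-# LANGUAGE MagicHash #-}

import Data.Word (Word8)
import GHC.Exts (Int (I#), (*#), (+#), (-#), (<#), (>#))

-- Branchless select by multiplication: each comparison yields 1# or 0#,
-- so exactly one of the two products is non-zero (both are zero if a == b).
unsignedDifferenceInt :: Int -> Int -> Int
unsignedDifferenceInt (I# a) (I# b) =
  I# ((a <# b) *# (b -# a) +# (a ># b) *# (a -# b))

-- Word8 wrapper: both arguments are widened to Int via fromIntegral,
-- which is the conversion the profile points at.
unsignedDifferenceW8 :: Word8 -> Word8 -> Word8
unsignedDifferenceW8 a b =
  fromIntegral (unsignedDifferenceInt (fromIntegral a) (fromIntegral b))

main :: IO ()
main = print (unsignedDifferenceW8 3 10, unsignedDifferenceW8 10 3)
-- prints (7,7)
```

The result always fits in a Word8 because the inputs are in the range 0 to 255, so the final narrowing fromIntegral does not lose information.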
I had previously tagged the question C/C++ to attract attention from those more inclined to bit manipulation. My question uses Haskell, but I'd accept an answer implementing a correct method in any language.
Conclusion:
I've decided to use
w8_sad :: Word8 -> Word8 -> Int16
w8_sad a b = xor (diff + mask) mask
  where diff = fromIntegral a - fromIntegral b
        mask = unsafeShiftR diff 15
as it is faster than my original unsigned_difference function and simple to implement. SIMD intrinsics in Haskell haven't reached maturity yet, so while the SIMD versions are faster, I decided to go with a scalar version.
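For completeness, here is a self-contained sketch of w8_sad with the imports it needs, plus an exhaustive check against the max/min definition. The helpers agreesWithReference and rgbSad are hypothetical names added for illustration. rgbSad sums absolute differences over the three channels and is only one assumed way of using the function on RGB triples.

```haskell
import Data.Bits (unsafeShiftR, xor)
import Data.Int (Int16)
import Data.Word (Word8)

-- Widen to Int16 so the subtraction cannot wrap, then take the absolute
-- value without branching: mask is -1 (all ones) when diff is negative and
-- 0 otherwise, because the shift on a signed type is arithmetic.
w8_sad :: Word8 -> Word8 -> Int16
w8_sad a b = xor (diff + mask) mask
  where
    diff = fromIntegral a - fromIntegral b :: Int16
    mask = unsafeShiftR diff 15

-- Exhaustive comparison with the reference definition (65536 pairs).
agreesWithReference :: Bool
agreesWithReference =
  and [ fromIntegral (w8_sad a b) == (fromIntegral (max a b - min a b) :: Int)
      | a <- [minBound .. maxBound]
      , b <- [minBound .. maxBound]
      ]

-- Hypothetical usage: sum of absolute differences over an RGB triple.
-- The maximum is 3 * 255 = 765, which fits comfortably in Int16.
rgbSad :: (Word8, Word8, Word8) -> (Word8, Word8, Word8) -> Int16
rgbSad (r1, g1, b1) (r2, g2, b2) = w8_sad r1 r2 + w8_sad g1 g2 + w8_sad b1 b2

main :: IO ()
main = do
  print agreesWithReference
  print (rgbSad (10, 20, 30) (13, 18, 30))
```

When diff is negative, diff + mask is diff - 1, and xor with all ones is a bitwise complement, which gives -diff in two's complement. When diff is non-negative, mask is 0 and the value passes through unchanged.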