If you're not too concerned with precision then this inner loop should give you twice the compute throughput compared to the more accurate algorithm:
for (i=0; i<640; i+= 32)
{
uint8x16x2_t a, b;
uint8x16_t c, d;
/* load upper row, splitting even and odd pixels into a.val[0]
* and a.val[1] respectively. */
a = vld2q_u8(src1);
/* as above, but for lower row */
b = vld2q_u8(src2);
/* compute average of even and odd pixel pairs for upper row */
c = vrhaddq_u8(a.val[0], a.val[1]);
/* compute average of even and odd pixel pairs for lower row */
d = vrhaddq_u8(b.val[0], b.val[1]);
/* compute average of upper and lower rows, and store result */
vst1q_u8(dest, vrhaddq_u8(c, d));
src1+=32;
src2+=32;
dest+=16;
}
It works by using the vhadd
operation, which has a result the same size as the input. This way you don't have to shift the final sum back down to 8-bit, and all of the arithmetic throughout is eight-bit, which means you can perform twice as many operations per instruction.
However it is less accurate, because the intermediate sum is quantised, and GCC 4.7 does a terrible job of generating code. GCC 4.8 does just fine.
The whole operation has a good chance of being I/O bound, though. The loop should be unrolled to maximise separation between loads and arithmetic, and __builtin_prefetch()
(or PLD
) should be used to hoist the incoming data into caches before it's needed.