I'm trying to optimize code that converts 8-bit grayscale images to float images. It runs on ARMv8 (AArch64) with NEON.

The current implementation uses OpenCV's convertTo() (compiled for Android). It is quite fast, but it is still our bottleneck.

So I came up with the following code and would like to hear about possible improvements.

The image height and width are multiples of 16, if that helps.

I call this function in a for loop over the image:

static void u8_2_f(unsigned char* in, float* out)
{
    //1 u8x8->u16x8
    uint8x8_t u8x8src = vld1_u8(in);
    uint16x8_t u16x8src = vmovl_u8(u8x8src);

    //2 u16x8 -> u32x4high, u32x4low
    uint32x4_t u32x4srch = vmovl_u16(vget_high_u16(u16x8src));
    uint32x4_t u32x4srcl = vmovl_u16(vget_low_u16(u16x8src));

    //3 u32x4low, u32x4high -> f32x4low, f32x4high (low half first, to keep the pixel order)
    vst1q_f32(out, vcvtq_f32_u32(u32x4srcl));
    vst1q_f32(out+4, vcvtq_f32_u32(u32x4srch));
}
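For reference, here is a minimal sketch of what the surrounding loop could look like when it handles 16 pixels per iteration and stores the results in source order. The name u8_to_f32_row and its count parameter are hypothetical (they are not from my real code), and it assumes the image is contiguous with a pixel count that is a multiple of 16. The scalar branch is only there so the sketch also builds on non-NEON targets.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#if defined(__ARM_NEON)
#include <arm_neon.h>
#endif

// Hypothetical driver: converts `count` grayscale pixels to float.
// Assumes `count` is a multiple of 16 and the pixels are contiguous.
static void u8_to_f32_row(const uint8_t* in, float* out, std::size_t count)
{
    assert(count % 16 == 0);
#if defined(__ARM_NEON)
    for (std::size_t i = 0; i < count; i += 16) {
        // One 16-byte load, then widen u8 -> u16.
        const uint8x16_t u8 = vld1q_u8(in + i);
        const uint16x8_t lo16 = vmovl_u8(vget_low_u8(u8));
        const uint16x8_t hi16 = vmovl_u8(vget_high_u8(u8));
        // Widen u16 -> u32, convert, and store pixels 0..3, 4..7, 8..11, 12..15.
        vst1q_f32(out + i,      vcvtq_f32_u32(vmovl_u16(vget_low_u16(lo16))));
        vst1q_f32(out + i + 4,  vcvtq_f32_u32(vmovl_u16(vget_high_u16(lo16))));
        vst1q_f32(out + i + 8,  vcvtq_f32_u32(vmovl_u16(vget_low_u16(hi16))));
        vst1q_f32(out + i + 12, vcvtq_f32_u32(vmovl_u16(vget_high_u16(hi16))));
    }
#else
    // Scalar fallback for non-NEON builds.
    for (std::size_t i = 0; i < count; ++i)
        out[i] = static_cast<float>(in[i]);
#endif
}
```

For a contiguous image this would be called once as u8_to_f32_row(src, dst, width * height). Whether the 16-byte load actually beats the 8-byte version on a given core is something to measure, not assume.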
Adrian Mole
Chen
  • If memory bandwidth is a bottleneck, it might be worth doing this on the fly as part of some other pass over the image that's memory bottlenecked. Or as part of a pass that's ALU bottlenecked, and saving a float version of the image for later use along with using it while already loaded, so that pass is keeping memory busy as well as ALUs. Or maybe cache-blocking the conversion so you convert a part that fits in L1d or L2 cache, then loop over that with later passes. – Peter Cordes Aug 23 '20 at 20:26
  • One of the biggest reasons - if not THE biggest one - for writing assembly code is `vget_`. Compilers generate FUBAR machine code as soon as they see it. – Jake 'Alquimista' LEE Aug 25 '20 at 03:49
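
The cache-blocking idea from the first comment could be sketched roughly as follows. This is only an illustration under stated assumptions: convert_blocked, kTileFloats and the pass callback are hypothetical names, the tile size of 4096 floats (16 KiB) is a guess that should be tuned to the target's L1d or L2 cache, and the inner conversion is plain scalar code standing in for whatever NEON routine is actually used.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical tile size: 4096 floats = 16 KiB. Tune for the target cache.
constexpr std::size_t kTileFloats = 4096;

// Converts the image one tile at a time and hands each float tile to `pass`
// while it is still hot in cache. `pass` receives (tile pointer, tile length,
// offset of the tile's first pixel in the source image).
template <typename Pass>
void convert_blocked(const uint8_t* in, std::size_t count, Pass pass)
{
    assert(in != nullptr || count == 0);
    std::vector<float> tile(std::min(count, kTileFloats));
    for (std::size_t base = 0; base < count; base += kTileFloats) {
        const std::size_t n = std::min(kTileFloats, count - base);
        // Stand-in for the NEON conversion of this tile.
        for (std::size_t i = 0; i < n; ++i)
            tile[i] = static_cast<float>(in[base + i]);
        // The later processing step runs on the freshly converted tile.
        pass(tile.data(), n, base);
    }
}
```

The point of this shape is that the full-size float image never has to be written to and re-read from main memory, as long as the later pass can work tile by tile.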

1 Answer

As a possible improvement, try replacing vcvtq_f32_u32 with the function below. It takes 2 instructions instead of 1, but they might be faster on some CPUs.

// Convert bytes to float, assuming the input is within [ 0 .. 0xFF ] interval
inline float32x4_t byteToFloat( uint32x4_t u32 )
{
    // Floats have 23 bits of mantissa.
    // With the exponent set to 2^23, the lowest mantissa bit is worth exactly 1,
    // so a byte placed in the low 8 bits yields the float 2^23 + byte.
    // See this page for details: https://www.h-schmidt.net/FloatConverter/IEEE754.html
    // If you want output floats in the [ 0 .. 255.0 / 256.0 ] interval, use 2^15 instead (bit pattern 0x47000000)
    constexpr uint32_t offsetValue = 0x4b000000;
    // Check disassembly & verify your compiler has moved this initialization outside the loop
    const uint32x4_t offsetInt = vdupq_n_u32( offsetValue );
    // Bitwise is probably slightly faster than addition, delivers same results for our input
    u32 = vorrq_u32( u32, offsetInt );
    // The only FP operation required is subtraction, hopefully faster than UCVTF
    return vsubq_f32( vreinterpretq_f32_u32( u32 ), vreinterpretq_f32_u32( offsetInt ) );
}
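
To check the arithmetic without NEON hardware, here is a scalar model of the same trick. It is only a sketch: byteToFloatScalar and its offsetBits parameter are hypothetical names introduced for illustration, and memcpy is used for the bit cast. It ORs the byte into the mantissa of the offset and then subtracts the offset, exactly as the vector version does per lane.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Scalar model of the trick. offsetBits is the bit pattern of the offset float:
// 0x4b000000 is 2^23 (output in [0 .. 255]),
// 0x47000000 is 2^15 (output in [0 .. 255/256]).
inline float byteToFloatScalar(uint32_t byte, uint32_t offsetBits = 0x4b000000)
{
    assert(byte <= 0xFF);
    // The low 8 mantissa bits of the offset are zero, so OR equals integer addition here.
    const uint32_t bits = byte | offsetBits;
    float biased, offset;
    std::memcpy(&biased, &bits, sizeof biased);
    std::memcpy(&offset, &offsetBits, sizeof offset);
    // The subtraction is exact because both operands share the same exponent.
    return biased - offset;
}
```

For example, byteToFloatScalar(200) gives 200.0f, and byteToFloatScalar(128, 0x47000000) gives 0.5f.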
Soonts