Fast search/replace of matching single bytes in an 8-bit array, on ARM

Question

I develop image processing algorithms (using GCC, targeting ARMv7 (Raspberry Pi 2B)).

In particular, I use a simple algorithm that replaces one index value with another in a mask:

void ChangeIndex(uint8_t * mask, size_t size, uint8_t oldIndex, uint8_t newIndex)
{
    for(size_t i = 0; i < size; ++i)
    {
        if(mask[i] == oldIndex)
            mask[i] = newIndex;
    }
}

Unfortunately, it performs poorly on the target platform.

Is there any way to optimize it?

Not immediately obvious how to make that faster - there may be tricks if you know more about the data - for example, you could have a list of cells containing value `X` - but that's only really useful if the number of "hits" is fairly low - if you are hitting most entries in `mask` matching `oldIndex`, then it's unlikely to speed up. What value is `size` and how many percent of the table has value `oldIndex` on average? — Mats Petersson, Jan 28 '16 at 08:20
What compiler options are you using? Make sure that you've instructed it to use NEON instructions (`-mfpu=neon-vfpv4`, I think), otherwise it may be generating code compatible with older CPUs that don't have NEON. — Gilles 'SO- stop being evil', Jan 28 '16 at 12:28
You should also get some speedup using ternary operator: `mask[i] = (mask[i] == oldIndex) ? newIndex : mask[i];` — Miki, Jan 28 '16 at 12:37
@Miki: If you're lucky, the compiler will optimize that to be **not slower**. Realistically, it's significantly slower. This is _especially_ the case on ARM where simple if-statements like the original can be compiled into conditional moves. — MSalters, Jan 28 '16 at 21:34
@MSalters Good to know, thanks! It was just my 2 cents, since for me it works a little faster, but that probably depends on the compiler (I cannot test on ARM). Probably I just was lucky :D — Miki, Jan 28 '16 at 21:48

score 13 · Answer 1 · edited Jan 28 '16 at 15:18

The ARMv7 platform supports SIMD instructions called NEON. Using them, you can make your code faster:

#include <arm_neon.h>
#include <stddef.h> // size_t

void ChangeIndex(uint8_t * mask, size_t size, uint8_t oldIndex, uint8_t newIndex)
{
    size_t alignedSize = size/16*16, i = 0;

    uint8x16_t _oldIndex = vdupq_n_u8(oldIndex);
    uint8x16_t _newIndex = vdupq_n_u8(newIndex);

    for(; i < alignedSize; i += 16)
    {
        uint8x16_t oldMask = vld1q_u8(mask + i); // loading of 128-bit vector
        uint8x16_t condition = vceqq_u8(oldMask, _oldIndex); // compare two 128-bit vectors
        uint8x16_t newMask = vbslq_u8(condition, _newIndex, oldMask); // selective copying of 128-bit vector
        vst1q_u8(mask + i, newMask); // saving of 128-bit vector
    }

    for(; i < size; ++i)
    {
        if(mask[i] == oldIndex)
            mask[i] = newIndex;
    }
}
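
To make the compare-and-select step easier to follow, here is a minimal portable sketch of what the vector loop above does to each byte lane, written in plain C so it can be checked on any machine. The names `SelectByte` and `ChangeIndexBranchless` are hypothetical, introduced only for illustration. The lane behaviour modelled here (all-ones on equality from `vceqq_u8`, bitwise select from `vbslq_u8`) is how I understand those intrinsics; this is not a replacement for the NEON version.

```c
#include <stddef.h>
#include <stdint.h>

/* Scalar model of one byte lane (hypothetical helper, for illustration only):
   vceqq_u8 yields 0xFF where the lanes are equal and 0x00 otherwise;
   vbslq_u8 then takes bits from newIndex where the mask bit is 1
   and from the old value where it is 0. */
static uint8_t SelectByte(uint8_t value, uint8_t oldIndex, uint8_t newIndex)
{
    uint8_t condition = (value == oldIndex) ? 0xFF : 0x00;
    return (uint8_t)((condition & newIndex) | (~condition & value));
}

/* Branch-free scalar equivalent of the vector loop, one byte at a time. */
void ChangeIndexBranchless(uint8_t * mask, size_t size, uint8_t oldIndex, uint8_t newIndex)
{
    for(size_t i = 0; i < size; ++i)
        mask[i] = SelectByte(mask[i], oldIndex, newIndex);
}
```

The NEON loop applies the same select to 16 bytes per iteration, which is why it needs no branch inside the loop body.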

answered Jan 28 '16 at 08:25 by ErmIg
edited Jan 28 '16 at 15:18 by Uyghur Lives Matter

I checked your version of the algorithm. It runs 5 times faster than the original version. It's great! – Jan 28 '16 at 11:40
You could achieve a further, minor, speed improvement by working directly with the `mask` pointer, rather than `mask+i`. First precalculate your endpoint `uint8_t* maskEnd = mask+size;` then change the for loops to work directly with your pointer, e.g. `for(; mask < maskEnd; ++mask)` and refer to `*mask` directly rather than `mask[i]`. – Jack Aidley Jan 28 '16 at 11:47
You could probably make this even faster by writing the Neon assembly directly. IME GCC's Neon intrinsics are not very quick because they keep moving stuff between Neon and main registers, which stalls the pipeline. (Maybe they fixed that since I last used them, though.) – Dan Hulme Jan 28 '16 at 13:55
@JackAidley: I'm not familiar with the specific addressing modes that the ARMv7 instruction set provides, but your suggestion isn't always beneficial. Some instruction sets provide indexed addressing for essentially free (e.g. x86), so in those cases, the `mask + i` approach can actually be faster. It's worth a try looking at it both ways, though, if you have a particular hotspot. – Jason R Jan 28 '16 at 14:36
@JasonR: As always, profile your optimisations! My experience with working on ARM-based systems leads me to be confident an improvement would be seen, however. Dan's suggestion about assembly is probably correct too. – Jack Aidley Jan 28 '16 at 15:11
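
The pointer-walking variant suggested in the comments above might look like the sketch below. It is an illustration, not a measured improvement: the name `ChangeIndexPtr` is hypothetical, both endpoints are computed from `size` (my assumption about what the comment intends), and the NEON part is guarded by the `__ARM_NEON` / `__ARM_NEON__` macros so the same file still builds, scalar-only, on other targets. As the commenters say, profile both forms before choosing.

```c
#include <stddef.h>
#include <stdint.h>
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#define HAVE_NEON 1 /* hypothetical local macro, for this sketch only */
#endif

void ChangeIndexPtr(uint8_t * mask, size_t size, uint8_t oldIndex, uint8_t newIndex)
{
    uint8_t * const maskEnd = mask + size; /* one past the last byte */

#ifdef HAVE_NEON
    /* End of the part that can be processed in whole 16-byte blocks. */
    uint8_t * const alignedEnd = mask + size / 16 * 16;
    const uint8x16_t oldVec = vdupq_n_u8(oldIndex);
    const uint8x16_t newVec = vdupq_n_u8(newIndex);

    for(; mask < alignedEnd; mask += 16)
    {
        uint8x16_t oldMask = vld1q_u8(mask);              /* load 16 bytes */
        uint8x16_t condition = vceqq_u8(oldMask, oldVec); /* 0xFF where equal */
        vst1q_u8(mask, vbslq_u8(condition, newVec, oldMask)); /* select, store */
    }
#endif

    /* Scalar tail, or the whole array when NEON is unavailable. */
    for(; mask < maskEnd; ++mask)
    {
        if(*mask == oldIndex)
            *mask = newIndex;
    }
}
```

Whether this beats the indexed `mask + i` form depends on the compiler and its addressing-mode choices, so compare the generated assembly or timings on the actual target.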
