Fast Pixel Count on Binary Image- ARM neon intrinsics - iOS Dev

Question

Can someone suggest a fast function to count the number of white pixels in a binary image? I need it for an iOS app. I am working directly on the image memory, which is defined as

  bool *imageData = (bool *) malloc(noOfPixels * sizeof(bool));

I am implementing the function

             int whiteCount = 0;
             for (int q=i; q<i+windowHeight; q++)
             {
                 for (int w=j; w<j+windowWidth; w++)
                 { 
                     if (imageData[q*W + w] == 1)
                         whiteCount++;
                 }
             }

This is obviously the slowest function possible. I have heard that ARM NEON intrinsics on iOS can perform several operations in one cycle. Maybe that's the way to go?

The problem is that I am not very familiar with NEON and don't have enough time to learn assembly language at the moment. So it would be great if anyone could post NEON intrinsics code for the problem above, or any other fast implementation in C/C++.

The only code in neon intrinsics that I am able to find online is the code for rgb to gray http://computer-vision-talks.com/2011/02/a-very-fast-bgra-to-grayscale-conversion-on-iphone/

Also, what are the possible values in `imageData[]` ? Is it just 0 or 1, or can there be other non-zero values ? — Paul R, Jan 16 '12 at 22:35

Paul R · Accepted Answer · 2012-01-17T15:09:54.720

Firstly you can speed up the original code a little by factoring out the multiply and getting rid of the branch:

 int whiteCount = 0;
 for (int q = i; q < i + windowHeight; q++)
 {
     const bool * const row = &imageData[q * W];

     for (int w = j; w < j + windowWidth; w++)
     { 
         whiteCount += row[w];
     }
 }

(This assumes that imageData[] is truly binary, i.e. each element can only ever be 0 or 1.)

Here is a simple NEON implementation:

#include <arm_neon.h>

// ...

int q, w;
int whiteCount = 0;
uint32x4_t v_count = { 0 };

for (q = i; q < i + windowHeight; q++)
{
    const bool * const row = &imageData[q * W];

    uint16x8_t vrow_count = { 0 };

    for (w = j; w <= j + windowWidth - 16; w += 16) // SIMD loop
    {
        uint8x16_t v = vld1q_u8((const uint8_t *)&row[w]); // load 16 x 8 bit pixels
        vrow_count = vpadalq_u8(vrow_count, v);     // accumulate 16 bit row counts
    }
    for ( ; w < j + windowWidth; ++w)               // scalar clean up loop
    {
        whiteCount += row[w];
    }
    v_count = vpadalq_u16(v_count, vrow_count);     // update 32 bit image counts
}                                                   // from 16 bit row counts
// add 4 x 32 bit partial counts from SIMD loop to scalar total
whiteCount += vgetq_lane_u32(v_count, 0);
whiteCount += vgetq_lane_u32(v_count, 1);
whiteCount += vgetq_lane_u32(v_count, 2);
whiteCount += vgetq_lane_u32(v_count, 3);
// total is now in whiteCount

(This assumes that imageData[] is truly binary, windowWidth < 2^19 so that the 16 bit row counts cannot overflow, and sizeof(bool) == 1.)
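
The same idea can be wrapped in a self-contained function. The following is only a minimal sketch, not tested on a device: the name `count_white_window` and the parameters `stride`, `x0`, `y0`, `window_w` and `window_h` are hypothetical, the image is assumed to be stored as `uint8_t` bytes that are each 0 or 1, and the NEON path is compiled only when the compiler defines `__ARM_NEON`, with a plain scalar loop used otherwise.

```c
#include <stddef.h>
#include <stdint.h>

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#define HAVE_NEON 1
#else
#define HAVE_NEON 0
#endif

/* Hypothetical helper: counts bytes equal to 1 inside a window of a
 * row-major image whose rows are `stride` bytes apart.
 * Assumes every byte is 0 or 1 and window_w < 2^19, so the 16-bit
 * per-row counters cannot overflow. */
static uint32_t count_white_window(const uint8_t *image, size_t stride,
                                   size_t x0, size_t y0,
                                   size_t window_w, size_t window_h)
{
    uint32_t count = 0;
#if HAVE_NEON
    uint32x4_t v_count = vdupq_n_u32(0);
#endif
    for (size_t y = y0; y < y0 + window_h; y++) {
        const uint8_t *row = image + y * stride + x0;
        size_t x = 0;
#if HAVE_NEON
        uint16x8_t v_row = vdupq_n_u16(0);
        for (; x + 16 <= window_w; x += 16) {
            uint8x16_t v = vld1q_u8(row + x);  /* load 16 pixels */
            v_row = vpadalq_u8(v_row, v);      /* pairwise add into 16-bit lanes */
        }
        v_count = vpadalq_u16(v_count, v_row); /* widen row counts to 32 bits */
#endif
        for (; x < window_w; x++)              /* tail, or whole row without NEON */
            count += row[x];
    }
#if HAVE_NEON
    /* Fold the four 32-bit partial sums into the scalar total. */
    count += vgetq_lane_u32(v_count, 0);
    count += vgetq_lane_u32(v_count, 1);
    count += vgetq_lane_u32(v_count, 2);
    count += vgetq_lane_u32(v_count, 3);
#endif
    return count;
}
```

With the question's variable names, a call would look like `count_white_window(imageData, W, j, i, windowWidth, windowHeight)`, provided `imageData` is a `uint8_t` buffer.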

Updated version for unsigned char and values of 255 for white, 0 for black:

#include <arm_neon.h>

// ...

int q, w;
int whiteCount = 0;
const uint8x16_t v_mask = { 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1 };
uint32x4_t v_count = { 0 };

for (q = i; q < i + windowHeight; q++)
{
    const uint8_t * const row = &imageData[q * W];

    uint16x8_t vrow_count = { 0 };

    for (w = j; w <= j + windowWidth - 16; w += 16) // SIMD loop
    {
        uint8x16_t v = vld1q_u8(&row[w]);           // load 16 x 8 bit pixels
        v = vandq_u8(v, v_mask);                    // mask out all but LS bit
        vrow_count = vpadalq_u8(vrow_count, v);     // accumulate 16 bit row counts
    }
    for ( ; w < j + windowWidth; ++w)               // scalar clean up loop
    {
        whiteCount += (row[w] == 255);
    }
    v_count = vpadalq_u16(v_count, vrow_count);     // update 32 bit image counts
}                                                   // from 16 bit row counts
// add 4 x 32 bit partial counts from SIMD loop to scalar total
whiteCount += vgetq_lane_u32(v_count, 0);
whiteCount += vgetq_lane_u32(v_count, 1);
whiteCount += vgetq_lane_u32(v_count, 2);
whiteCount += vgetq_lane_u32(v_count, 3);
// total is now in whiteCount

(This assumes that imageData[] has values of 255 for white and 0 for black, and that windowWidth < 2^19 so that the 16 bit row counts cannot overflow.)
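
The width limit follows from the 16 bit lanes of `vrow_count`: each SIMD iteration adds two adjacent pixels into every lane, so the lane grows by at most twice the largest pixel value per iteration. The sketch below works through that arithmetic; the helper name `max_safe_simd_pixels` is hypothetical and is only meant to illustrate why the pixels are masked down to 0 or 1 before accumulating, rather than summing raw 255 values and dividing the total afterwards.

```c
#include <stdint.h>

/* Hypothetical helper: largest number of pixels per row that the SIMD loop
 * can consume before a 16-bit lane of vrow_count could wrap, given the
 * largest value a pixel can have when it reaches vpadalq_u8. */
static uint32_t max_safe_simd_pixels(uint32_t max_pixel_value)
{
    /* Each iteration adds two adjacent pixels into every 16-bit lane. */
    uint32_t per_iteration = 2u * max_pixel_value;
    /* Number of iterations that keep the lane at or below 65535. */
    uint32_t iterations = UINT16_MAX / per_iteration;
    /* Every iteration consumes 16 pixels. */
    return 16u * iterations;
}
```

Under these assumptions, `max_safe_simd_pixels(1)` gives 524272, just under 2^19, while `max_safe_simd_pixels(255)` gives only 2048, so accumulating unmasked 255 values would overflow on much narrower windows.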

Note that all the above code is untested and may need some further work.

Sorry - a couple of typos in there (now fixed) - I should have mentioned that this is untested code so it may need some further work - I was just trying to present the general idea. — Paul R, Jan 17 '12 at 09:40
Sorry to bother you again, but I'm getting an error at the line uint8x16_t v = vld1q_u8(&row[j]); saying: Cannot initialize a variable of type const uint8_t * (aka unsigned char *) with a value of type const bool *. Any idea what the problem might be? — shreyas253, Jan 17 '12 at 09:58
And I'm using a bool* for image data as I assumed it will be faster as it takes only 1 bit of memory per value — shreyas253, Jan 17 '12 at 10:00
@Shreyas: I asked you above what sizeof(bool) is and you didn't respond - it's usually either sizeof(char) or sizeof(int) - for now I've assumed it's sizeof(char) (i.e. 1 *byte*), but if that is not correct then either the code will need to be modified or you'll need to use a suitable 1 byte type (e.g. uint8_t). — Paul R, Jan 17 '12 at 10:05
k Thanks a lot.. Just found out that a bool variable also takes 1 byte to store. So I guess I'll just make the image as unsigned char. — shreyas253, Jan 17 '12 at 10:11
what would be the changes needed if the image was an unsigned char * image?? And I need to count the number of white pixels ie. if (imageData[i] == 255) — shreyas253, Jan 17 '12 at 14:19
It would be just a minor change - the data type is still one byte, but the 255 would need to be taken care of — Paul R, Jan 17 '12 at 15:03
I guess just divide the final whiteCount by 255 would be the simplest solution ?? — shreyas253, Jan 17 '12 at 15:08
No - that would make the implementation more prone to overflow - I've added an updated version above which handles the case where white == 255. — Paul R, Jan 17 '12 at 15:11

score 0 · Answer 2 · answered Jan 16 '12 at 22:38

http://gcc.gnu.org/onlinedocs/gcc/ARM-NEON-Intrinsics.html

Section 6.55.3.6

The vectorized algorithm will do the comparisons and put them in a structure for you, but you'd still need to go through each element of the structure and determine if it's a zero or not.

How fast does that loop currently run and how fast do you need it to run? Also remember that NEON will work in the same registers as the floating point unit, so using NEON here may force an FPU context switch.
