Fastest way of bitwise AND between two arrays on iPhone?

Question

I have two image blocks stored as 1D arrays and have to do the following bitwise AND operation on their elements.

int compare(unsigned char *a, int a_pitch, 
            unsigned char *b, int b_pitch, int a_lenx, int a_leny) 
{
    int overlap =0 ;

    for(int y=0; y<a_leny; y++) 
        for(int x=0; x<a_lenx; x++) 
        {
            if(a[x + y * a_pitch] & b[x+y*b_pitch]) 
                overlap++ ;
        }
    return overlap ;
}

Actually, I have to do this job about 220,000 times, so it becomes very slow on iPhone devices.

How could I accelerate this job on the iPhone?

I heard that NEON could be useful, but I'm not really familiar with it. In addition it seems that NEON doesn't have bitwise AND...

score 2 · Answer 1 · answered Jun 14 '11 at 03:49

Option 1 - Work in the native width of your platform (it's faster to fetch 32 bits into a register and then do operations on that register than it is to fetch and compare data one byte at a time):

#include <stdint.h>

/* Assumes a and b are 4-byte aligned and that a_lenx, a_pitch and
   b_pitch are all multiples of 4. */
int compare(unsigned char *a, int a_pitch, 
            unsigned char *b, int b_pitch, int a_lenx, int a_leny) 
{
    int overlap = 0;
    uint32_t* a_int = (uint32_t*)a;
    uint32_t* b_int = (uint32_t*)b;

    /* Convert byte counts to 32-bit word counts. a_leny is a row count,
       not a byte count, so it is left unchanged. */
    int a_lenx_int = a_lenx / 4;
    int a_pitch_int = a_pitch / 4;
    int b_pitch_int = b_pitch / 4;

    for(int y=0; y<a_leny; y++) 
        for(int x=0; x<a_lenx_int; x++) 
        {
            uint32_t aVal = a_int[x + y * a_pitch_int];
            uint32_t bVal = b_int[x + y * b_pitch_int];
            /* Test each of the four bytes packed in the word separately. */
            if ((aVal & 0xFF) & (bVal & 0xFF))
                overlap++;
            if (((aVal >> 8) & 0xFF) & ((bVal >> 8) & 0xFF))
                overlap++;
            if (((aVal >> 16) & 0xFF) & ((bVal >> 16) & 0xFF))
                overlap++;
            if (((aVal >> 24) & 0xFF) & ((bVal >> 24) & 0xFF))
                overlap++;
        }
    return overlap ;
}
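
One caveat with the pointer cast in Option 1: it only works when both buffers are 4-byte aligned and a_lenx and both pitches are multiples of 4. A minimal sketch that drops those assumptions, by loading each word with memcpy and finishing each row byte by byte, might look like this. The name compare_words is hypothetical and not part of the original answer, and whether it is faster on a given device would need measuring.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical variant of Option 1: same result as the byte-wise loop,
   but with no alignment or multiple-of-4 requirements. */
int compare_words(const unsigned char *a, int a_pitch,
                  const unsigned char *b, int b_pitch,
                  int a_lenx, int a_leny)
{
    int overlap = 0;

    for (int y = 0; y < a_leny; y++) {
        const unsigned char *pa = a + (size_t)y * a_pitch;
        const unsigned char *pb = b + (size_t)y * b_pitch;
        int x = 0;

        /* Four bytes at a time; compilers typically turn a fixed-size
           memcpy like this into a single load instruction. */
        for (; x + 4 <= a_lenx; x += 4) {
            uint32_t av, bv;
            memcpy(&av, pa + x, sizeof av);
            memcpy(&bv, pb + x, sizeof bv);
            uint32_t v = av & bv;
            overlap += (v & 0x000000ffu) != 0;
            overlap += (v & 0x0000ff00u) != 0;
            overlap += (v & 0x00ff0000u) != 0;
            overlap += (v & 0xff000000u) != 0;
        }

        /* Leftover bytes when a_lenx is not a multiple of 4. */
        for (; x < a_lenx; x++)
            overlap += (pa[x] & pb[x]) != 0;
    }
    return overlap;
}
```

Counting with `!= 0` instead of an if statement gives the same total and leaves the compiler free to avoid branches.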

Option 2 - Use a heuristic to get an approximate result using fewer calculations (a good approach if the absolute difference between 101 overlaps and 100 overlaps is not important to your application):

int compare(unsigned char *a, int a_pitch, 
            unsigned char *b, int b_pitch, int a_lenx, int a_leny) 
{
    int overlap =0 ;

    for(int y=0; y<a_leny; y+= 10) 
        for(int x=0; x<a_lenx; x+= 10) 
        {
            //we compare 1% of all the pixels, and use that as the result
            if(a[x + y * a_pitch] & b[x+y*b_pitch]) 
                overlap++ ;
        }
    return overlap * 100;
}

Option 3 - Rewrite your function in inline assembly code. You're on your own for this one.

Thank you for answers. Option 1 seems a good modification.. I'll try it. Option 2 is not really acceptable in my problem.. — wlee, Jun 14 '11 at 04:18
I tried option 1. The speed is a bit improved but performance gain is not huge. Maybe I have to consider direct implementations with NEON.. — wlee, Jun 14 '11 at 07:14
I have a few doubts about your first option: a_leny = a_leny / 4; a_pitch = a_pitch / 4; b_pitch = b_pitch / 4; Why are these values being divided? They are being used in the y direction. — Anoop K. Prabhu, Nov 04 '11 at 15:39
They're being divided because they are specifying coordinates/lengths/indices in bytes while the algorithm is indexed by 32-bit integers. To convert from an offset specified in bytes to one specified in 32-bit integers, you divide by 4. — aroth, Nov 05 '11 at 00:12

Jake 'Alquimista' LEE · Answer 2 · 2011-11-02T02:43:36.817

Your code is Rambo for the CPU - its worst nightmare:

Byte access. As aroth mentioned, ARM is VERY slow at reading bytes from memory.
Random access. Two absolutely unnecessary multiply/add operations per pixel, on top of the already steep performance penalty that random access carries by its nature.

Simply put, everything is wrong that can be wrong.

Don't call me rude. Let me be your angel instead.

First, I'll provide you a working NEON version. Then an optimized C version showing you exactly what you did wrong.

Just give me some time. I have to go to bed right now, and I have an important meeting tomorrow.

Why don't you learn ARM assembly? It's much easier and more useful than x86 assembly. It will also improve your C programming capabilities by a huge step. Strongly recommended.

cya

==============================================================================

Ok, here is an optimized version written in C with ARM assembly in mind.

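Please note that both pitches AND a_lenx are given in bytes and have to be multiples of 4. Because of the do/while loops, a_lenx must also be at least 4 and a_leny at least 1. Otherwise, it won't work properly.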

There isn't much room left for optimizations with ARM assembly upon this version. (NEON is a different story - coming soon)

Take a careful look at how to handle variable declarations, loop, memory access, and AND operations.

And make sure that this function runs in ARM mode and not Thumb for best results.

unsigned int compare(unsigned int *a, unsigned int a_pitch, 
            unsigned int *b, unsigned int b_pitch, unsigned int a_lenx, unsigned int a_leny) 
{
    unsigned int overlap =0;
    unsigned int a_gap = (a_pitch - a_lenx)>>2;
    unsigned int b_gap = (b_pitch - a_lenx)>>2;
    unsigned int aval, bval, xcount;

    do
    {
        xcount = (a_lenx>>2);
        do
        {
            aval = *a++;
            // ldr      aval, [a], #4
            bval = *b++;
            // ldr      bval, [b], #4
            aval &= bval;
            // and      aval, aval, bval

            if (aval & 0x000000ff) overlap += 1;
            // tst      aval, #0x000000ff
            // addne    overlap, overlap, #1
            if (aval & 0x0000ff00) overlap += 1;
            // tst      aval, #0x0000ff00
            // addne    overlap, overlap, #1
            if (aval & 0x00ff0000) overlap += 1;
            // tst      aval, #0x00ff0000
            // addne    overlap, overlap, #1
            if (aval & 0xff000000) overlap += 1;
            // tst      aval, #0xff000000
            // addne    overlap, overlap, #1
        } while (--xcount);

        a += a_gap;
        b += b_gap;
    } while (--a_leny);

    return overlap;
}

Could you explain the use of a_gap and b_gap a bit more clearly? Your code seems interesting, but the idea behind the 'gap' hasn't clicked for me yet. — Anoop K. Prabhu, Nov 04 '11 at 15:41
+= gap serves as a kind of "carriage return". pitch is the width of whole image while lenx specifies the width of the image actually being processed. — Jake 'Alquimista' LEE, Nov 04 '11 at 20:40
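
The NEON version promised above never appeared in this thread. For what it's worth, NEON does have a bitwise AND (VAND) and also a combined "test bits" instruction (VTST), both exposed as intrinsics in arm_neon.h. The following is only a rough sketch of how they could be applied here, not the answerer's code: the name compare_neon and its signature are hypothetical, it has not been tuned or timed on a device, and it falls back to a plain byte loop when NEON is not available.

```c
#include <stddef.h>
#include <stdint.h>
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#endif

/* Hypothetical helper, not from the thread: counts the positions where
   (a[i] & b[i]) != 0, row by row. Pitches and lenx are in bytes. */
unsigned int compare_neon(const uint8_t *a, size_t a_pitch,
                          const uint8_t *b, size_t b_pitch,
                          size_t lenx, size_t leny)
{
    unsigned int overlap = 0;

    for (size_t y = 0; y < leny; y++) {
        const uint8_t *pa = a + y * a_pitch;
        const uint8_t *pb = b + y * b_pitch;
        size_t x = 0;

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
        uint32x4_t acc = vdupq_n_u32(0);
        for (; x + 16 <= lenx; x += 16) {
            uint8x16_t va = vld1q_u8(pa + x);
            uint8x16_t vb = vld1q_u8(pb + x);
            /* vtst: a lane becomes 0xFF if (va & vb) != 0, else 0x00. */
            uint8x16_t hit = vtstq_u8(va, vb);
            /* Turn 0xFF/0x00 into 1/0, then widen and accumulate. */
            uint8x16_t ones = vshrq_n_u8(hit, 7);
            acc = vpadalq_u16(acc, vpaddlq_u8(ones));
        }
        overlap += vgetq_lane_u32(acc, 0) + vgetq_lane_u32(acc, 1)
                 + vgetq_lane_u32(acc, 2) + vgetq_lane_u32(acc, 3);
#endif
        /* Scalar tail (and the whole row on non-NEON builds). */
        for (; x < lenx; x++)
            overlap += (pa[x] & pb[x]) != 0;
    }
    return overlap;
}
```

The sketch processes 16 pixels per iteration and keeps the counts in a 32-bit accumulator so that long rows cannot overflow it. Unlike the C version above, it does not require lenx or the pitches to be multiples of 4.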

score 0 · Answer 3 · answered Jun 14 '11 at 03:42

First of all, why the double loop? You can do it with a single loop and a couple of pointers.

Also, you don't need to calculate x+y*pitch for every single pixel; just increment two pointers by one. Incrementing by one is a lot faster than x+y*pitch.
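
As a rough illustration of the pointer idea (a sketch only, with the hypothetical name compare_ptr): a truly single loop is only possible when both pitches equal a_lenx, so this version keeps one outer loop over rows and walks two pointers along each row, stepping to the next row with a single addition.

```c
/* Hypothetical sketch of the pointer-walking idea; not from the answer. */
int compare_ptr(const unsigned char *a, int a_pitch,
                const unsigned char *b, int b_pitch,
                int a_lenx, int a_leny)
{
    int overlap = 0;

    for (int y = 0; y < a_leny; y++) {
        const unsigned char *pa = a;          /* start of current row in a */
        const unsigned char *pb = b;          /* start of current row in b */
        const unsigned char *end = pa + a_lenx;

        /* Walk both rows in step; no index arithmetic per pixel. */
        while (pa < end)
            overlap += (*pa++ & *pb++) != 0;

        /* One addition per row instead of a multiply per pixel. */
        a += a_pitch;
        b += b_pitch;
    }
    return overlap;
}
```

An optimizing compiler may already perform this transformation on the original indexed loop, which would be consistent with the modest gain reported in the comment below.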

Why exactly do you need to perform this operation? I would make sure there are no high-level optimizations/changes available before looking into a low-level solution like NEON.

Andres Kievsky

Hello. It is a kind of image processing job. I tried to implement the function with pointers with a single for loop, but the speed is not that much faster. – wlee Jun 14 '11 at 04:17