ARM NEON Optimization for image transformation

Question

I'm applying an NV12 video transformation which shuffles pixels of the video. On an ARM device such as Google Nexus 7 2013, performance is pretty bad at 30fps for a 1024x512 area with the following C code:

* Pre-processing done only once at beginning of video *

//Temporary tables for the destination
for (j = 0; j < height; j++)
    for (i = 0; i < width; i++) {
        toY[i][j] = j * width + i;
        toUV[i][j] = j / 2 * width + ((int)(i / 2)) * 2;
    }

//Temporary tables for the source
for (j = 0; j < height; j++)
    for (i = 0; i < width; i++) {
        fromY[i][j] = funcY(i, j) * width + funcX(i, j);
        fromUV[i][j] = funcY(i, j) / 2 * width + ((int)(funcX(i, j) / 2)) * 2;
    }

* Process done at each frame *

for (j = 0; j < height; j++)
    for (i = 0; i < width; i++) {
        destY[ toY[i][j] ] = srcY[ fromY[i][j] ];
        if ((i % 2 == 0) && (j % 2 == 0)) {
            destUV[ toUV[i][j] ] = srcUV[ fromUV[i][j] ];
            destUV[ toUV[i][j] + 1 ] = srcUV[ fromUV[i][j] + 1 ];
        }
    }

Though it's computed only once, funcX/Y is a pretty complex transformation so it's not very easy to optimize this part.

Is there still a way to speed up the double loop computed at each frame with the given "from" tables?

It is probably much better not to use any index tables at all and to do the arithmetic every time instead. Memory bandwidth is scarcer than CPU. Maybe you can explain more clearly how you traverse the image, so people can give you better ideas on how to improve it. — auselen, Oct 18 '13 at 06:55

Jake 'Alquimista' LEE · Accepted Answer · 2013-10-23T05:59:09.583

You create FOUR lookup tables 8 times as large as the original image?

You put an unnecessary if statement in the innermost loop?

What about swapping i and j?

Honestly, your question should be tagged with [c] instead of arm, neon, or image-processing to start with.

Since you didn't show what funcY and funcX do, the best answer I can give is the following. (Performance skyrocketed. And it's something really, really fundamental.)

//Temporary table for the source: one interleaved table, stored in the order the frame loop reads it
pTemp = fromYUV;
for (j = 0; j < height; j+=2)
{
    //Even row: two Y indexes, then one UV index per 2x2 block
    for (i = 0; i < width; i+=2) {
        *pTemp++ = funcY(i, j) * width + funcX(i, j);
        *pTemp++ = funcY(i+1, j) * width + funcX(i+1, j);
        *pTemp++ = funcY(i, j) / 2 * width + ((int)(funcX(i, j) / 2)) * 2;
    }
    //Odd row: Y indexes only
    for (i = 0; i < width; i+=2) {
        *pTemp++ = funcY(i, j+1) * width + funcX(i, j+1);
        *pTemp++ = funcY(i+1, j+1) * width + funcX(i+1, j+1);
    }
}

* Process done at each frame *
pTemp = fromYUV;
pTempY = destY;
pTempUV = destUV;
for (j = 0; j < height; j+=2)
{
    for (i = 0; i < width; i+=2) {
        *pTempY++ = srcY[*pTemp++];
        *pTempY++ = srcY[*pTemp++];
        //NV12 interleaves U and V, so one UV index covers two bytes
        *pTempUV++ = srcUV[*pTemp];       //U
        *pTempUV++ = srcUV[*pTemp++ + 1]; //V
    }
    for (i = 0; i < width; i+=2) {
        *pTempY++ = srcY[*pTemp++];
        *pTempY++ = srcY[*pTemp++];
    }
}
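
To make the single-table idea concrete, here is a minimal self-contained sketch under some loudly stated assumptions. The question never shows funcX and funcY, so the versions below are hypothetical stand-ins that perform a horizontal mirror, and funcX takes width as an extra parameter purely for illustration. The names table_entries, build_table and remap_frame are also made up for this sketch. It assumes even width and height, 32-bit table entries, and the NV12 layout in which each 2x2 block shares one interleaved U,V byte pair.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

// Hypothetical stand-ins for the question's funcX/funcY: a horizontal mirror.
int funcX(int i, int j, int width) { (void)j; return width - 1 - i; }
int funcY(int i, int j) { (void)i; return j; }

// Entries per pair of rows: 3 per 2x2 block on the even row, 1 per pixel on the odd row.
size_t table_entries(int width, int height)
{
    return (size_t)width * (size_t)height * 5 / 4;
}

// Done once: store the source indexes in exactly the order the frame loop reads them.
void build_table(uint32_t *table, int width, int height)
{
    assert(width % 2 == 0 && height % 2 == 0);
    uint32_t *p = table;
    for (int j = 0; j < height; j += 2) {
        for (int i = 0; i < width; i += 2) {
            int x0 = funcX(i, j, width), y0 = funcY(i, j);
            int x1 = funcX(i + 1, j, width), y1 = funcY(i + 1, j);
            *p++ = (uint32_t)(y0 * width + x0);             // Y, left pixel
            *p++ = (uint32_t)(y1 * width + x1);             // Y, right pixel
            *p++ = (uint32_t)(y0 / 2 * width + x0 / 2 * 2); // offset of the U,V pair
        }
        for (int i = 0; i < width; i++)                     // odd row: Y only
            *p++ = (uint32_t)(funcY(i, j + 1) * width + funcX(i, j + 1, width));
    }
}

// Done per frame: one sequential read of the table, sequential writes to both planes.
void remap_frame(const uint32_t *table,
                 const uint8_t *srcY, const uint8_t *srcUV,
                 uint8_t *destY, uint8_t *destUV,
                 int width, int height)
{
    const uint32_t *p = table;
    for (int j = 0; j < height; j += 2) {
        for (int i = 0; i < width; i += 2) {
            *destY++ = srcY[*p++];
            *destY++ = srcY[*p++];
            uint32_t uv = *p++;
            *destUV++ = srcUV[uv];     // U
            *destUV++ = srcUV[uv + 1]; // V
        }
        for (int i = 0; i < width; i++)
            *destY++ = srcY[*p++];
    }
}
```

The caller would allocate table_entries(width, height) entries, call build_table once at the start of the video, and then call remap_frame for every frame. The only random accesses left are the reads from srcY and srcUV, which cannot be avoided without knowing the actual transformation.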

You should avoid, when possible:

accessing multiple memory areas
random memory access
if statements inside loops

The worst crime you committed is the order of i and j. (Which you don't need to start with)

If you access a pixel at coordinates (x, y), it's pixel = image[y][x] and NOT image[x][y].
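
As a small illustration of that index order, here is a sketch that assumes the image is a flat, row-major buffer of 8-bit pixels with no row padding. The helper names pixel_offset and sum_row_major are hypothetical and exist only for this example. Because a row is contiguous in memory, x is the fast-moving index, so the inner loop should run over x to touch memory sequentially.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

// Offset of pixel (x, y) in a row-major image: same element as image[y][x].
size_t pixel_offset(int x, int y, int width)
{
    assert(x >= 0 && x < width && y >= 0);
    return (size_t)y * (size_t)width + (size_t)x;
}

// Visit every pixel in memory order: outer loop over rows (y), inner loop over columns (x).
uint32_t sum_row_major(const uint8_t *image, int width, int height)
{
    uint32_t sum = 0;
    for (int y = 0; y < height; y++)
        for (int x = 0; x < width; x++)
            sum += image[pixel_offset(x, y, width)];
    return sum;
}
```

Swapping the two loops would still produce the same sum, but each step of the inner loop would then jump width bytes ahead instead of one, which is the cache-unfriendly pattern the question's [i][j] tables fall into.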
