I'm applying an NV12 video transformation which shuffles pixels of the video. On an ARM device such as Google Nexus 7 2013, performance is pretty bad at 30fps for a 1024x512 area with the following C code:
* Pre-processing done only once at beginning of video *
//Temporary tables for the destination
for (j = 0; j < height; j++)
for (i = 0; i < width; i++) {
toY[i][j] = j * width + i;
toUV[i][j] = j / 2 * width + ((int)(i / 2)) * 2;
}
//Temporary tables for the source
for (j = 0; j < height; j++)
for (i = 0; i < width; i++) {
fromY[i][j] = funcY(i, j) * width + funcX(i, j);
fromUV[i][j] = funcY(i, j) / 2 * width + ((int)(funcX(i, j) / 2)) * 2;
}
* Process done at each frame *
for (j = 0; j < height; j++)
for (i = 0; i < width; i++) {
destY[ toY[i][j] ] = srcY[ fromY[i][j] ];
if ((i % 2 == 0) && (j % 2 == 0)) {
destUV[ toUV[i][j] ] = srcUV[ fromUV[i][j] ];
destUV[ toUV[i][j] + 1 ] = srcUV[ fromUV[i][j] + 1 ];
}
}
Though it's computed only once, funcX/Y is a pretty complex transformation so it's not very easy to optimize this part.
Is there still a way to fasten the double loop computed at each frame with the given "from" tables?