Take a look here at the loop section - http://en.wikipedia.org/wiki/ARM_architecture
Basically you'll want something like:
void brighten(unsigned char* src, unsigned char* dst, int numPixels, int intensity) {
asm volatile (
"\t mov r3, #0\n"
"Lloop:\n"
"\t cmp r3, %2\n"
"\t bge Lend\n"
"\t ldrb r4, [%0, r3]\n"
"\t add r4, r4, %3\n"
"\t strb r4, [%1, r3]\n"
"\t add r3, r3, #1\n"
"\t b Lloop\n"
"Lend:\n"
: "=r"(src), "=r"(dst), "=r"(numPixels), "=r"(intensity)
: "0"(src), "1"(dst), "2"(numPixels), "3"(intensity)
: "cc", "r3", "r4");
}
Update:
And here's that NEON version:
void brighten_neon(unsigned char* src, unsigned char* dst, int numPixels, int intensity) {
asm volatile (
"\t mov r4, #0\n"
"\t vdup.8 d1, %3\n"
"Lloop2:\n"
"\t cmp r4, %2\n"
"\t bge Lend2\n"
"\t vld1.8 d0, [%0]!\n"
"\t vqadd.s8 d0, d0, d1\n"
"\t vst1.8 d0, [%1]!\n"
"\t add r4, r4, #8\n"
"\t b Lloop2\n"
"Lend2:\n"
: "=r"(src), "=r"(dst), "=r"(numPixels), "=r"(intensity)
: "0"(src), "1"(dst), "2"(numPixels), "3"(intensity)
: "cc", "r4", "d1", "d0");
}
So this NEON version will do 8 at a time. It does however not check that numPixels
is divisible by 8 so you'd definitely want to do that otherwise things will go wrong! Anyway, it's just a start at showing you what can be done. Notice the same number of instructions, but action on eight pixels of data at once. Oh and it's got the saturation in there as well that I assume you would want.