How to optimize a[i] = b[c[i]] with NEON

Question

I have a very simple loop, but n is large, so it runs for a long time:

for (i=0; i<n; i++)
{
    dst[i] = src[table[i]];
}

I want to optimize it using NEON, but I don't know how to deal with this part: src[table[i]]. Is it possible to optimize this? If so, how?

This is effectively a *gathered load*, and is not supported in NEON. See: http://stackoverflow.com/questions/11502332/arm-neon-how-can-i-change-value-with-a-index/11506069#11506069 — Paul R, Jul 15 '15 at 07:24
If this is `uint8_t table[n]` or there are fewer than 256 instances in this mapping, then `vtbl` comes to the rescue. The same is true if the mapping `table[i]` is locally monotonic. For the generic case the answer is no. Voting to reopen in an effort to get extra information; unless this is an ad hoc encryption function, I'd be tempted to believe there is exploitable order in the mapping table. — Aki Suihkonen, Jul 15 '15 at 08:07
@AkiSuihkonen: VTBL only handles (up to) the five least significant bits of the input. Do you see a way to break that barrier? — , Jul 15 '15 at 08:33
Using a chain `vtbx.8 d0, {d4,d5,d5,d6}, d1; vadd.8 d1, d2; vtbx.8 d0, {d7, d8, d9, d10}, d1;` with d2 filled with 32, one can extend the table size to 64. To get 256 bytes, one unfortunately needs to spill some registers. — Aki Suihkonen, Jul 15 '15 at 12:02
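
The chaining idea in the last comment can be illustrated with a portable scalar model. This is only a sketch, not NEON code: `vtbx32_model` and `lookup64_model` are hypothetical names, and the model assumes the documented VTBX behaviour that a lane whose index falls outside the table keeps its previous destination value. Between the two lookups the indices are re-based by subtracting 32 with 8-bit wraparound, so that indices 32..63 land in 0..31 for the second half of the table and indices 0..31 fall out of range.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Scalar model of one 8-lane VTBX against a 32-byte table:
   in-range lanes are looked up, out-of-range lanes keep their old value. */
static void vtbx32_model(uint8_t dst[8], const uint8_t table[32],
                         const uint8_t idx[8])
{
    for (int lane = 0; lane < 8; lane++) {
        if (idx[lane] < 32) {
            dst[lane] = table[idx[lane]];
        }
    }
}

/* Hypothetical helper: chain two 32-byte lookups to cover a 64-byte table.
   Lanes with an index of 64 or more are left untouched by both steps. */
void lookup64_model(uint8_t dst[8], const uint8_t table[64],
                    const uint8_t idx[8])
{
    uint8_t shifted[8];

    assert(dst != NULL && table != NULL && idx != NULL);

    /* First half: only indices 0..31 hit. */
    vtbx32_model(dst, table, idx);

    /* Re-base the indices; uint8_t arithmetic wraps modulo 256. */
    for (int lane = 0; lane < 8; lane++) {
        shifted[lane] = (uint8_t)(idx[lane] - 32);
    }

    /* Second half: only original indices 32..63 hit. */
    vtbx32_model(dst, table + 32, shifted);
}
```

Extending the same pattern to 256 entries would take eight such steps, which is where the register pressure mentioned in the comment comes from.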

score 1 · Answer 1 · edited May 23 '17 at 12:14

Thanks to @Paul R for his comment:

This is effectively a gathered load, and is not supported in NEON. See: stackoverflow.com/questions/11502332/…

Since it couldn't be optimized with NEON, I tried OpenMP instead and got a significant improvement. The code is rather simple, too:

#pragma omp parallel for
for (i=0; i<n; i++)
{
    dst[i] = src[table[i]];
}
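
For reference, here is a self-contained sketch of the same loop as a function. The element type (`uint8_t`), the index type (`uint32_t`), and the name `gather_u8` are assumptions, since the question doesn't state them. With GCC or Clang it needs `-fopenmp` to run in parallel; without that flag the pragma is ignored and the loop simply runs serially.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper; the types are assumed, not taken from the question.
   Every table[i] must be a valid index into src, and dst must not overlap
   src or table. */
void gather_u8(uint8_t *dst, const uint8_t *src,
               const uint32_t *table, size_t n)
{
    assert(dst != NULL && src != NULL && table != NULL);

    /* Each iteration writes a distinct dst[i], so the iterations are
       independent and can be split across threads. A signed counter is
       accepted by every OpenMP version. */
    #pragma omp parallel for
    for (ptrdiff_t i = 0; i < (ptrdiff_t)n; i++) {
        dst[i] = src[table[i]];
    }
}
```

Because the loop is memory-bound, how much this helps likely depends on n, the number of cores, and how scattered the indices in table are.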

edited by Community · answered Jul 16 '15 at 08:52 by luowenfeng
