0

I have a very simple but long-running loop (n is large):

for (i=0; i<n; i++)
{
    dst[i] = src[table[i]];
}

I want to optimize it using NEON, but I don't know how to deal with this part: src[table[i]]. Is it possible to optimize? If yes, how?

luowenfeng
  • This is effectively a *gathered load*, and is not supported in NEON. See: http://stackoverflow.com/questions/11502332/arm-neon-how-can-i-change-value-with-a-index/11506069#11506069 – Paul R Jul 15 '15 at 07:24
  • 2
    If this is `uint8_t table[n]`, or there are fewer than 256 entries in this mapping, then `vtbl` comes to the rescue. The same is true if the mapping `table[i]` is locally monotonic. For the generic case the answer is no. Voting to reopen in an effort to get extra information; unless this is an ad hoc encryption function, I'd be tempted to believe there is exploitable order in the mapping table. – Aki Suihkonen Jul 15 '15 at 08:07
  • 1
    @AkiSuihkonen: VTBL only handles (up to) the five least significant bits of the input. Do you see a way to break that barrier? –  Jul 15 '15 at 08:33
  • 2
    Using a chain `vtbx.8 d0, {d3, d4, d5, d6}, d1; vsub.8 d1, d2; vtbx.8 d0, {d7, d8, d9, d10}, d1;` with d2 filled with 32, one can augment the table size to 64. To get 256 bytes one unfortunately needs to spill some registers. – Aki Suihkonen Jul 15 '15 at 12:02
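
The two-step table lookup from the comments can be sketched with intrinsics. This is only a rough illustration under assumptions not stated in the question: `src` holds exactly 64 bytes, every `table[i]` is below 64, and the function name `gather_u8_64` is made up for the example. Indices that fall outside a 32-byte table are zeroed by `vtbl4_u8` and left untouched by `vtbx4_u8`, so subtracting 32 between the two lookups selects the upper half. On targets without NEON the scalar loop handles everything.

```c
#include <stddef.h>
#include <stdint.h>

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#endif

/* Hypothetical helper: dst[i] = src[table[i]] for a 64-byte src.
 * Assumes every table[i] < 64 and that dst does not overlap the inputs. */
void gather_u8_64(uint8_t *dst, const uint8_t *src,
                  const uint8_t *table, size_t n)
{
    size_t i = 0;
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
    uint8x8x4_t lo, hi;
    for (int k = 0; k < 4; k++) {
        lo.val[k] = vld1_u8(src + 8 * k);        /* src[0..31]  */
        hi.val[k] = vld1_u8(src + 32 + 8 * k);   /* src[32..63] */
    }
    const uint8x8_t bias = vdup_n_u8(32);
    for (; i + 8 <= n; i += 8) {
        uint8x8_t idx = vld1_u8(table + i);
        uint8x8_t out = vtbl4_u8(lo, idx);  /* lanes with idx >= 32 become 0 */
        idx = vsub_u8(idx, bias);           /* 32..63 -> 0..31, 0..31 wrap to 224..255 */
        out = vtbx4_u8(out, hi, idx);       /* out-of-range lanes keep their value */
        vst1_u8(dst + i, out);
    }
#endif
    /* Scalar tail (and full fallback when NEON is unavailable). */
    for (; i < n; i++) {
        dst[i] = src[table[i]];
    }
}
```

This only helps for byte-sized elements and small tables; for a general `table` with wide indices the loop stays a gather.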

1 Answer

1

Thanks to @Paul R for his comment:

This is effectively a gathered load, and is not supported in NEON. See: stackoverflow.com/questions/11502332/…

Since it couldn't be optimized with NEON, I tried OpenMP and got a significant improvement. The code is rather simple, too:

#pragma omp parallel for
for (i=0; i<n; i++)
{
    dst[i] = src[table[i]];
}
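
For completeness, here is a minimal self-contained sketch of the same loop wrapped in a function. The element types (`float` data, `int` indices) and the name `gather_omp` are assumptions, since the question does not say what `src`, `dst` and `table` hold. Each iteration writes a different `dst[i]`, so the iterations are independent as long as `dst` does not overlap `src` or `table`.

```c
#include <stddef.h>

/* Hypothetical wrapper around the loop above; types are assumed.
 * dst must not overlap src or table. */
void gather_omp(float *dst, const float *src, const int *table, int n)
{
    int i;
    /* The pragma is guarded so the code also builds without OpenMP,
     * in which case the loop simply runs serially. */
#ifdef _OPENMP
    #pragma omp parallel for schedule(static)
#endif
    for (i = 0; i < n; i++) {
        dst[i] = src[table[i]];
    }
}
```

With GCC or Clang, OpenMP has to be enabled at compile time with `-fopenmp`; otherwise the loop is not parallelized.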
luowenfeng