
I have programmed a routine to process single-precision float arrays using Neon intrinsics on the Android platform, specifically a Samsung S4, and find that my Neon routines are limited by access to the array data. For interest's sake, a snippet is below:

Neon

    m1 = vmulq_f32(*(float32x4_t *)&ey[i][j], *(float32x4_t *)&caey[i][j]);
    m2 = vsubq_f32(*(float32x4_t *)&hz[i-1][j], *(float32x4_t *)&hz[i][j]);
    m3 = vmulq_f32(*(float32x4_t *)&cbey[i][j], m2);
    m4 = vaddq_f32(m1, m3);
    vst1q_f32(&ey[i][j], m4);

Serial

ey[i][j] = caey[i][j] * ey[i][j] + cbey[i][j] * ( hz[i-1][j] - hz[i][j] ); 

Built on the Android phone using C4droid gcc and also with AIDE and JNI. The Neon intrinsics code above takes slightly longer to process than the serial equivalent. When the array data is replaced with dummy const floats, the code runs nearly 4 times as fast as the serial version with array data, although it of course produces nonsense results (this does confirm that the performance problem lies with the data access). My equivalent SSE and AVX code on other platforms produces good speedups.

I have tried 1D equivalent arrays and prefetching data with `__builtin_prefetch`, but cannot speed up the data access for the Neon intrinsics.
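
For reference, the 1D-indexed variant with `__builtin_prefetch` looks roughly like the sketch below. The function wrapper, the loop bounds `ie`/`je` (row length `je`, assumed a multiple of 4) and the prefetch distance are placeholders rather than my exact code:

    #include <arm_neon.h>

    /* Prefetch distance in floats -- a placeholder value, not tuned. */
    #define PF_DIST 64

    /* ie x je grid with row length je; row i = 0 is skipped because the
       update reads hz[i-1][j]. */
    void update_ey(float *ey, const float *caey, const float *cbey,
                   const float *hz, int ie, int je)
    {
        for (int i = 1; i < ie; i++) {
            for (int j = 0; j < je; j += 4) {
                int idx = i * je + j;

                __builtin_prefetch(&ey[idx + PF_DIST]);
                __builtin_prefetch(&hz[idx + PF_DIST]);

                float32x4_t vey   = vld1q_f32(&ey[idx]);
                float32x4_t vcaey = vld1q_f32(&caey[idx]);
                float32x4_t vcbey = vld1q_f32(&cbey[idx]);
                float32x4_t vhzm  = vld1q_f32(&hz[idx - je]);   /* hz[i-1][j] */
                float32x4_t vhz   = vld1q_f32(&hz[idx]);

                float32x4_t m1 = vmulq_f32(vey, vcaey);         /* caey * ey          */
                float32x4_t m2 = vsubq_f32(vhzm, vhz);          /* hz[i-1] - hz[i]    */
                float32x4_t m3 = vmulq_f32(vcbey, m2);          /* cbey * (...)       */
                vst1q_f32(&ey[idx], vaddq_f32(m1, m3));         /* ey = m1 + m3       */
            }
        }
    }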

Is there anything else I can try to improve the data access performance on the Android phone?

magicfoot
  • Manual prefetching is a bit of a black art, so without knowing exactly what the loop and data layout look like (i.e. how many iterations, ordering of `i` and `j`, cache line alignment, etc.) it's a bit hard to say. Disassembly of the NEON loop might be worth looking at too, just to check the compiler's not doing anything particularly silly. However, since you're doing effectively nothing but memory accesses (the calculations themselves are pretty trivial), it's possible you're already saturating L2 bandwidth thanks to the auto-prefetcher, in which case there's nowhere to go. – Notlikethat Dec 15 '14 at 11:50
  • 1) GCC is famous for producing bad code from intrinsics. 2) You are leaving the register/memory reads and writes in the hands of the compiler, with no interleaving to reduce stalls. 3) We don't know your memory alignment situation. 4) Prefetching properly will probably speed it up a bit. – BitBank Dec 15 '14 at 15:28
  • Neon intrinsics have been updated in gcc 4.9. The `pld` instruction will fill cache lines. Your L1/L2 line size/geometry and DDR speed will be a factor. You need to prefetch each of the five elements, and it seems that the prior `hz` array might be in your cache already. You might also try `-ffast-math`, as I think that the ARM floating point is not 100% IEEE compatible and the compiler will often insert some pedantic checks; but really you need to look at the assembler. Related: [Neon float](http://stackoverflow.com/questions/12420050/neon-float-multiplication-is-slower-than-expected). – artless noise Dec 15 '14 at 16:24
  • Also, this device comes with [different CPU configurations](http://en.wikipedia.org/wiki/Samsung_Galaxy_S4); a Snapdragon 600 SoC or a quad-core A15/A7 system. Results probably depend on this. The A15 L1 line is 16 words, whereas the A7's is only 8. [Here is a gcc patch](https://gcc.gnu.org/ml/gcc-patches/2010-06/msg02102.html) where the *de-norm* issues of NEON are mentioned. Obviously, if you need this precision (close to zero) then you cannot use `-ffast-math`. – artless noise Dec 15 '14 at 16:32
  • I smell a rat: GCC seems to be the culprit here again. I could give you a definitive answer if you posted the disassembly. Two things I can tell you now are that you should unroll so that the instruction latencies vanish/diminish, and replace the mul + add with mla (a sketch of this follows the comments). – Jake 'Alquimista' LEE Dec 16 '14 at 08:59
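
For readers following along, here is a rough sketch of what the last comment's suggestion (unrolling by two vectors and fusing the multiply-add with `vmla`) could look like for this kernel. The function name, bounds, and the assumption that `je` is a multiple of 8 are illustrative, not from the original post; `vmlaq_f32(a, b, c)` computes `a + b*c`, so the separate multiply and add collapse into one instruction per vector.

    #include <arm_neon.h>

    /* Unrolled-by-two variant using vmlaq_f32 (a + b*c). je is assumed to be
       a multiple of 8; row i = 0 is skipped because the update reads hz[i-1][j]. */
    void update_ey_unrolled(float *ey, const float *caey, const float *cbey,
                            const float *hz, int ie, int je)
    {
        for (int i = 1; i < ie; i++) {
            for (int j = 0; j < je; j += 8) {
                int idx = i * je + j;

                /* first vector of four floats */
                float32x4_t a0 = vmulq_f32(vld1q_f32(&ey[idx]),
                                           vld1q_f32(&caey[idx]));
                float32x4_t d0 = vsubq_f32(vld1q_f32(&hz[idx - je]),
                                           vld1q_f32(&hz[idx]));
                a0 = vmlaq_f32(a0, vld1q_f32(&cbey[idx]), d0);   /* a0 += cbey * d0 */

                /* second vector of four floats */
                float32x4_t a1 = vmulq_f32(vld1q_f32(&ey[idx + 4]),
                                           vld1q_f32(&caey[idx + 4]));
                float32x4_t d1 = vsubq_f32(vld1q_f32(&hz[idx + 4 - je]),
                                           vld1q_f32(&hz[idx + 4]));
                a1 = vmlaq_f32(a1, vld1q_f32(&cbey[idx + 4]), d1);

                vst1q_f32(&ey[idx],     a0);
                vst1q_f32(&ey[idx + 4], a1);
            }
        }
    }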

0 Answers