I am developing a native library for Android that uses ARM assembly optimizations and multithreading to get maximum performance on the dual-core ARM chipset MSM8660. While taking some measurements, I noticed the following:
- The single-threaded library with NEON optimizations is faster than the single-threaded library with ARMv6 optimizations (as expected).
- The multi-threaded library with ARMv6 optimizations is faster than the single-threaded library with ARMv6 optimizations (as expected).
- The multi-threaded library with NEON optimizations is slower than the single-threaded library with NEON optimizations (definitely not expected!).
I have searched the net for an explanation but have not found one so far. It almost seems as if the cores share a single NEON pipeline, yet every block diagram I have seen indicates that each core has its own NEON unit. Does anyone know why this is happening?
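For reference, a comparison like the one above can be timed with a harness along these lines. This is only a minimal sketch, not the actual library code. The names `time_run`, `work_slice`, `worker` and `MAX_THREADS` are made up for illustration, and the plain C loop in `worker` is a placeholder for the real ARMv6 or NEON routine. It measures wall-clock time around thread creation and joining, so the single-threaded and multi-threaded numbers are directly comparable.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <pthread.h>
#include <stddef.h>
#include <time.h>

#define MAX_THREADS 8 /* hypothetical upper bound for this sketch */

/* One contiguous slice of the input, processed by one thread. */
typedef struct {
    const float *data;
    size_t count;
    float sum; /* per-thread result, written once when the slice is done */
} work_slice;

/* Placeholder workload: stands in for the real ARMv6/NEON kernel. */
static void *worker(void *arg) {
    work_slice *s = (work_slice *)arg;
    float acc = 0.0f;
    for (size_t i = 0; i < s->count; ++i) {
        acc += s->data[i] * 0.5f;
    }
    s->sum = acc;
    return NULL;
}

/* Monotonic wall-clock time in seconds. */
static double now_seconds(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (double)ts.tv_sec + (double)ts.tv_nsec * 1e-9;
}

/*
 * Splits `data` across `nthreads` threads, stores the combined result in
 * `*out_sum`, and returns the elapsed wall-clock time in seconds.
 * Returns -1.0 if a thread could not be created.
 */
double time_run(const float *data, size_t count, int nthreads, float *out_sum) {
    assert(nthreads >= 1 && nthreads <= MAX_THREADS);

    pthread_t threads[MAX_THREADS];
    work_slice slices[MAX_THREADS];
    size_t chunk = count / (size_t)nthreads;
    int started = 0;

    double start = now_seconds();
    for (int i = 0; i < nthreads; ++i) {
        size_t offset = (size_t)i * chunk;
        slices[i].data = data + offset;
        /* The last thread also takes the remainder of the division. */
        slices[i].count = (i == nthreads - 1) ? count - offset : chunk;
        slices[i].sum = 0.0f;
        if (pthread_create(&threads[i], NULL, worker, &slices[i]) != 0) {
            break;
        }
        ++started;
    }

    float total = 0.0f;
    for (int i = 0; i < started; ++i) {
        pthread_join(threads[i], NULL);
        total += slices[i].sum;
    }
    double elapsed = now_seconds() - start;

    if (started != nthreads) {
        return -1.0;
    }
    *out_sum = total;
    return elapsed;
}
```

Calling `time_run(buf, n, 1, &sum)` and `time_run(buf, n, 2, &sum)` on the same buffer, once per build of the kernel, gives the four numbers being compared. The sketch does not pin threads to cores, so the scheduler is free to place both threads wherever it likes.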