I'm trying to understand the comment made by "Iwillnotexist Idonotexist" at SIMD optimization of cvtColor using ARM NEON intrinsics:
... why you don't use the ARM NEON intrinsics that map to the VLD3 instruction? That spares you all of the shuffling, both simplifying and speeding up the code. The Intel SSE implementation requires shuffles because it lacks 2/3/4-way deinterleaving load instructions, but you shouldn't pass on them when they are available.
The trouble I am having is that the solution's code both de-interleaves the data and performs fused multiplies on floating-point values. I'm trying to separate the two and understand just the interleaved loads.
According to the comment on the other question and Coding for NEON - Part 1: Load and Stores, the answer is probably going to use VLD3.

Unfortunately, I'm just not seeing it (probably because I'm less familiar with NEON and its intrinsic functions). It seems like VLD3 basically produces 3 outputs for each input, so my mental model is confused.
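For example, here's how I currently read the vld3q_u8 intrinsic from the ARM docs; the wrapper function is just my own sketch and my reading may well be wrong:

#include <arm_neon.h>
#include <stdint.h>

// My reading: one VLD3 takes 48 consecutive bytes and splits them
// into three 16-byte vectors, i.e. one load in, three registers out.
static inline uint8x16x3_t load_bgr16(const uint8_t* data /* BGRBGR... */)
{
    uint8x16x3_t bgr = vld3q_u8(data);
    // bgr.val[0] = B0..B15, bgr.val[1] = G0..G15, bgr.val[2] = R0..R15
    return bgr;
}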
Given the following SSE intrinsics that operate on data in BGR BGR BGR BGR... format, which needs a shuffle to get to BBBB GGGG RRRR... order:
#include <tmmintrin.h>  // SSSE3: _mm_shuffle_epi8

const uint8_t* data = ... // assume 16-byte aligned
// indices 0,3,6,... gather the B bytes; 1,4,... the G bytes; 2,5,... the R bytes
const __m128i mask = _mm_setr_epi8(0,3,6,9,12,15,1,4,7,10,13,2,5,8,11,14);
__m128i a = _mm_shuffle_epi8(_mm_load_si128((const __m128i*)data), mask);
How do we perform the de-interleaving loads using NEON intrinsics so that we don't need the SSE shuffles?
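For reference, here's my best guess at what the NEON version might look like, based only on my reading of the docs; the function name and the assumption that the pixel count is a multiple of 16 are mine, and I don't know if this is correct or idiomatic:

#include <arm_neon.h>
#include <stddef.h>
#include <stdint.h>

// My attempt (unverified): split a BGR buffer into three planes.
// 'count' is the pixel count, assumed a multiple of 16 to keep the sketch short.
void bgr_to_planes(const uint8_t* data, size_t count,
                   uint8_t* b, uint8_t* g, uint8_t* r)
{
    for (size_t i = 0; i < count; i += 16)
    {
        // One VLD3: reads 48 bytes and de-interleaves them into 3 vectors.
        uint8x16x3_t bgr = vld3q_u8(data + 3 * i);
        vst1q_u8(b + i, bgr.val[0]);  // B0..B15
        vst1q_u8(g + i, bgr.val[1]);  // G0..G15
        vst1q_u8(r + i, bgr.val[2]);  // R0..R15
    }
}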
Also note... I'm interested in intrinsics and not ASM. I can use ARM's intrinsics on Windows Phone, Windows Store, and Linux-powered devices under MSVC, ICC, Clang, etc. I can't do that with ASM, and I'm not trying to specialize the code three times (Microsoft 32-bit ASM, Microsoft 64-bit ASM, and GCC ASM).