
I'm trying to understand the comment made by "Iwillnotexist Idonotexist" at SIMD optimization of cvtColor using ARM NEON intrinsics:

... why you don't use the ARM NEON intrinsics that map to the VLD3 instruction? That spares you all of the shuffling, both simplifying and speeding up the code. The Intel SSE implementation requires shuffles because it lacks 2/3/4-way deinterleaving load instructions, but you shouldn't pass on them when they are available.

The trouble I am having is that the solution offers code that is non-interleaved, and it performs fused multiplies on floating-point values. I'm trying to separate the two and understand just the interleaved loads.

According to the other question's comment and Coding for NEON - Part 1: Load and Stores, the answer is probably going to use VLD3.

Unfortunately, I'm just not seeing it (probably because I'm less familiar with NEON and its intrinsic functions). It seems like VLD3 basically produces 3 outputs for each input, so my mental model is confused.

Given the following SSE intrinsics that operate on data in BGR BGR BGR BGR... format and need a shuffle to produce BBBB GGGG RRRR ...:

const byte* data = ...;  // assume 16-byte aligned
const __m128i mask = _mm_setr_epi8(0,3,6,9,12,15,1,4,7,10,13,2,5,8,11,14);
__m128i a = _mm_shuffle_epi8(_mm_load_si128((const __m128i*)(data)),mask);

How do we perform the interleaved loads using NEON intrinsics so that we don't need the SSE shuffles?


Also note... I'm interested in intrinsics and not ASM. I can use ARM's intrinsics on Windows Phone, Windows Store, and Linux-powered devices under MSVC, ICC, Clang, etc. I can't do that with ASM, and I'm not trying to specialize the code 3 times (Microsoft 32-bit ASM, Microsoft 64-bit ASM and GCC ASM).

jww
  • I don't know NEON either, so I was interested to read that it has a deinterleaving load. It's pretty clear that yeah, `vld3` produces three output registers. Your SSE `pshufb` shuffles your data into 6 bytes of B, 5 bytes of G, then 5 bytes of R, all in one register. That's different from what `vld3` gives you, and seems less useful. Why do you need different colour components mixed in the same register? – Peter Cordes May 09 '16 at 01:31
  • @PeterCordes - *"Why do you need different colour components mixed in the same register..."* - the actual problem is a BLAKE2 hash compression function. SSE2 and SSE4 versions are available at [blake2.cpp](https://github.com/weidai11/cryptopp/blob/arm-neon/blake2.cpp); we are cutting in NEON. I used the other Stack Overflow question as a reference point to help understanding and avoid confusion. I'm also guessing more people understand RGB colors than a BLAKE2 compression function. – jww May 09 '16 at 01:49
  • _"It seems like `vld3` basically produces 3 outputs for each input"_ - yes, because the input is a base address pointing to interleaved data. Say that points to an array of ABCABCABCABC... then what you get is one register full of As, one full of Bs, and one full of Cs. If you specifically need _one_ single register to contain an AAAAAABBBBBBCCCCC pattern, then I think you're going to need some `vtbl` permutation regardless of how you load it. – Notlikethat May 09 '16 at 07:50

1 Answer


According to this page:

The VLD3 intrinsic you need is:

int8x8x3_t  vld3_s8(__transfersize(24) int8_t const * ptr);
// VLD3.8 {d0, d1, d2}, [r0]

If the address pointed to by ptr holds this data:

0x00: 33221100
0x04: 77665544
0x08: bbaa9988
0x0c: ffddccbb
0x10: 76543210
0x14: fedcba98

You will end up with the following in the registers:

d0: ba54ffbb99663300
d1: dc7610ccaa774411
d2: fe9832ddbb885522
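
To check the register contents above by hand, here is a small scalar model of the de-interleave in plain C. It is only a sketch: `vld3_model`, `pack_lanes` and `example_mem` are names made up for illustration (they are not NEON intrinsics), and it assumes the 32-bit words in the dump are shown little-endian, so the byte at the lowest address is the rightmost pair of hex digits.

```c
#include <stdint.h>

/* Scalar model of VLD3.8 {d0, d1, d2}, [r0] (hypothetical helper):
   lane i of register k is taken from memory byte 3*i + k. */
static void vld3_model(const uint8_t *ptr, uint8_t regs[3][8])
{
    for (int i = 0; i < 8; i++)
        for (int k = 0; k < 3; k++)
            regs[k][i] = ptr[3 * i + k];
}

/* Pack 8 lanes into a 64-bit value with lane 0 in the low byte,
   which is how the d registers are printed in the answer. */
static uint64_t pack_lanes(const uint8_t lanes[8])
{
    uint64_t v = 0;
    for (int i = 0; i < 8; i++)
        v |= (uint64_t)lanes[i] << (8 * i);
    return v;
}

/* The 24 bytes of the memory dump, in address order. */
static const uint8_t example_mem[24] = {
    0x00, 0x11, 0x22, 0x33,   /* 0x00: 33221100 */
    0x44, 0x55, 0x66, 0x77,   /* 0x04: 77665544 */
    0x88, 0x99, 0xaa, 0xbb,   /* 0x08: bbaa9988 */
    0xbb, 0xcc, 0xdd, 0xff,   /* 0x0c: ffddccbb */
    0x10, 0x32, 0x54, 0x76,   /* 0x10: 76543210 */
    0x98, 0xba, 0xdc, 0xfe    /* 0x14: fedcba98 */
};
```

Calling `vld3_model(example_mem, regs)` and then `pack_lanes(regs[0])`, `pack_lanes(regs[1])` and `pack_lanes(regs[2])` gives the d0, d1 and d2 values listed above: every third byte, starting at offsets 0, 1 and 2.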

The int8x8x3_t structure is defined as:

struct int8x8x3_t
{
   int8x8_t val[3];
};
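
Applied to the BGR data from the question, a minimal sketch could look like the following. It uses the unsigned variant `vld3_u8` (which returns a `uint8x8x3_t`) and `vst1_u8` to store each plane. The function name `split_bgr8` is hypothetical, and the `__ARM_NEON` guard is an assumption that holds for GCC and Clang; MSVC would need a different check. The scalar branch is there only so the sketch also builds on machines without NEON.

```c
#include <stdint.h>

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#endif

/* Split 8 interleaved BGR pixels (24 bytes) into three 8-byte planes.
   split_bgr8 is an illustrative name, not part of any library. */
static void split_bgr8(const uint8_t *bgr,
                       uint8_t b[8], uint8_t g[8], uint8_t r[8])
{
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
    /* One de-interleaving load: val[0] = B, val[1] = G, val[2] = R */
    uint8x8x3_t px = vld3_u8(bgr);
    vst1_u8(b, px.val[0]);
    vst1_u8(g, px.val[1]);
    vst1_u8(r, px.val[2]);
#else
    /* Portable fallback with the same result as the NEON path */
    for (int i = 0; i < 8; i++) {
        b[i] = bgr[3 * i + 0];
        g[i] = bgr[3 * i + 1];
        r[i] = bgr[3 * i + 2];
    }
#endif
}
```

Each `val[k]` holds a single colour channel, so no shuffle is needed afterwards. The 128-bit form `vld3q_u8` works the same way but reads 48 bytes into three 16-lane registers.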
Dric512
  • The only significant point of difference between this and the original SSE code is the size of the transfer. The SSE code seemed to permute 128 bits of data, whereas NEON is going to permute 3*64-bit or 3*128-bit loads, so making this fit the original problem may require a bit of data size shuffling to match the new data sizes. – solidpixel May 11 '16 at 20:51