I've cobbled together a NEON equivalent to the SSSE3 intrinsic _mm_shuffle_epi8.
The code I currently have for this purpose is:
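    // Byte-wise table lookup: each byte of b is an index into the 16 bytes of a.
    static __forceinline __n128 shuffle8(
        const __n128& a,
        __n128 b) throw()
    {
        // Split the 128-bit table into the two 64-bit halves passed to vtbl2_u8.
        __n64x2 in =
        {
            a.DUMMYNEONSTRUCT.low64,
            a.DUMMYNEONSTRUCT.high64
        };
        // Look up each half of the index vector; b is overwritten in place.
        b.DUMMYNEONSTRUCT.low64 = vtbl2_u8(in, b.DUMMYNEONSTRUCT.low64);
        b.DUMMYNEONSTRUCT.high64 = vtbl2_u8(in, b.DUMMYNEONSTRUCT.high64);
        return b;
    }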
I'm not necessarily set on this being the final form, but that's not the question yet. In testing, I've found that this code works exactly as I intend when built and run in debug mode, but NOT when built and run in release mode. By way of example:
    #define simd_shuffle8(a, b) shuffle8(a, b)
    ...
    simd test = keyschedule[1];
    test = simd_shuffle8(test, test);
keyschedule[1] has an initial value of
    {0x858efc16, 0x8801f2e2, 0x1f0fb923, 0x11ecb78e}
In debug mode, test ends with a value of
    {0x00000000, 0x00fc0000, 0x00110000, 0x00000000}
which is as it should be. In release mode, test ends with a value of
    {0x16161616, 0x16001616, 0x16161616, 0x16001616}
which is not as it should be. What is likely to be causing this, and how might I fix it?
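
For reference, the debug-mode value is what a plain scalar model of the lookup produces. The following is only a minimal sketch, not the code under test. It assumes the four 32-bit words are laid out little-endian in the 128-bit value, and that vtbl2_u8 yields 0 for any index of 16 or more. The names Bytes16, Words4, to_bytes, to_words, and shuffle8_reference are made up for illustration.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

using Bytes16 = std::array<std::uint8_t, 16>;
using Words4  = std::array<std::uint32_t, 4>;

// Assumption: each 32-bit word is stored least-significant byte first.
Bytes16 to_bytes(const Words4& words)
{
    Bytes16 bytes{};
    for (std::size_t w = 0; w < 4; ++w)
        for (std::size_t b = 0; b < 4; ++b)
            bytes[4 * w + b] = static_cast<std::uint8_t>(words[w] >> (8 * b));
    return bytes;
}

// Inverse of to_bytes: reassemble four little-endian 32-bit words.
Words4 to_words(const Bytes16& bytes)
{
    Words4 words{};
    for (std::size_t w = 0; w < 4; ++w)
        for (std::size_t b = 0; b < 4; ++b)
            words[w] |= static_cast<std::uint32_t>(bytes[4 * w + b]) << (8 * b);
    return words;
}

// Scalar model of the two vtbl2_u8 lookups: every index byte selects one
// byte of the 16-byte table, and an index of 16 or more gives 0.
Words4 shuffle8_reference(const Words4& table, const Words4& indices)
{
    const Bytes16 t = to_bytes(table);
    const Bytes16 idx = to_bytes(indices);
    Bytes16 out{};
    for (std::size_t i = 0; i < out.size(); ++i)
        out[i] = (idx[i] < 16) ? t[idx[i]] : 0;
    return to_words(out);
}
```

Calling shuffle8_reference with the keyschedule[1] value above as both arguments gives {0x00000000, 0x00fc0000, 0x00110000, 0x00000000}, matching the debug build. Note that this models vtbl2_u8 rather than _mm_shuffle_epi8 itself: the SSSE3 instruction zeroes a lane only when bit 7 of the index is set and otherwise uses the low 4 bits, so the two differ for indices 16 through 127.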