According to the Searchable Neon Arm Intrinsic Guide, which I use regularly, there are only these (four classes of) intrinsics for table lookup with an 8-byte target register (uint8x8_t and poly8x8_t variants omitted for brevity):
int8x8_t vtbl1_s8 (int8x8_t a, int8x8_t b)
int8x8_t vtbl2_s8 (int8x8x2_t a, int8x8_t b)
int8x8_t vtbl3_s8 (int8x8x3_t a, int8x8_t b)
int8x8_t vtbl4_s8 (int8x8x4_t a, int8x8_t b)
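For reference, the behaviour of the single-register form can be sketched in portable scalar C++. This is only a model for reasoning about the semantics, not the intrinsic itself: the name tbl1_model and the use of std::array in place of int8x8_t are illustrative assumptions. Each index byte is read as unsigned, and an index outside the table yields 0.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

// Scalar reference model (hypothetical helper, not the real intrinsic) of an
// 8-byte table lookup in the style of vtbl1_s8(table, index).
std::array<std::int8_t, 8> tbl1_model(const std::array<std::int8_t, 8>& table,
                                      const std::array<std::int8_t, 8>& index) {
    std::array<std::int8_t, 8> out{};
    for (std::size_t i = 0; i < 8; ++i) {
        // The index byte is interpreted as unsigned, so -1 behaves like 255.
        const auto idx = static_cast<std::uint8_t>(index[i]);
        // In-range indices select a table byte; out-of-range indices give 0.
        out[i] = (idx < 8) ? table[idx] : std::int8_t{0};
    }
    return out;
}
```

The two- to four-register forms behave the same way, only with a 16-, 24- or 32-byte table, while the result stays 8 bytes wide.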
To my surprise, my source code
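#include <arm_neon.h>

uint8x16_t oddeven(uint8x16_t a) {
    auto l = vget_low_u8(a);    // bytes 0..7
    auto h = vget_high_u8(a);   // bytes 8..15
    auto lh = vuzp_u8(l, h);    // val[0]: even-indexed bytes, val[1]: odd-indexed bytes
    return vcombine_u8(lh.val[0], lh.val[1]);
}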
produced this practically single-instruction code for odd/even de-interleaving of a 16-byte vector:
    adrp x8, .LCPI0_0
    ldr q1, [x8, :lo12:.LCPI0_0]
    tbl v0.16b, { v0.16b }, v1.16b
    ret
So there it is: a tbl v0.16b, { } variant, apparently performing a full 16->16 permutation of the original data in a single instruction. Is this documented anywhere, or is it undocumented? And can it be produced directly with intrinsics?
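To make the claim about a full 16->16 permutation concrete, here is a minimal scalar sketch of what I believe the instruction does. The contents of the constant at .LCPI0_0 are not shown in the listing, so the index table below is an assumption inferred from what vuzp_u8 followed by vcombine_u8 computes. The names Bytes16, kUnzipIndex, tbl16_model and oddeven_model are hypothetical.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

using Bytes16 = std::array<std::uint8_t, 16>;

// ASSUMED contents of the constant loaded from .LCPI0_0 (not shown in the
// compiler output): even source positions first, then the odd ones.
constexpr Bytes16 kUnzipIndex = {0, 2, 4, 6, 8, 10, 12, 14,
                                 1, 3, 5, 7, 9, 11, 13, 15};

// Scalar model of a single-register, 16-byte table lookup:
// out[i] = src[index[i]], or 0 when index[i] is out of range.
Bytes16 tbl16_model(const Bytes16& src, const Bytes16& index) {
    Bytes16 out{};
    for (std::size_t i = 0; i < 16; ++i) {
        out[i] = (index[i] < 16) ? src[index[i]] : std::uint8_t{0};
    }
    return out;
}

// Scalar model of oddeven(): even-indexed bytes fill the low half,
// odd-indexed bytes fill the high half.
Bytes16 oddeven_model(const Bytes16& a) {
    Bytes16 out{};
    for (std::size_t i = 0; i < 8; ++i) {
        out[i] = a[2 * i];
        out[i + 8] = a[2 * i + 1];
    }
    return out;
}
```

Under that assumption, tbl16_model(a, kUnzipIndex) and oddeven_model(a) agree for every input, which would explain how the compiler can fold the three intrinsics into one table lookup plus a constant load.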