15

I can't find scatter or gather intrinsics in the Intel Intrinsics Guide v2.7. Do you know whether the AVX or AVX2 instruction sets support them?

elmattic
  • 2
    Gathered loads: http://software.intel.com/sites/products/documentation/studio/composer/en-us/2011Update/compiler_c/intref_cls/common/intref_bk_avx2_masked_gather.htm - I don't see the scattered store intrinsics though – Paul R Dec 24 '12 at 11:24
  • 3
    From RWT: _[AVX2 does not include scatter instructions (i.e., vector addressed stores), because of complications with the x86 memory ordering model and the load/store buffers.](http://www.realworldtech.com/haswell-cpu/2/)_ – elmattic Dec 27 '12 at 14:32

2 Answers

23
  • There are no scatter or gather instructions in the original AVX instruction set.

  • AVX2 adds gather, but not scatter instructions.

  • AVX512F includes both scatter and gather instructions.

  • AVX512PF additionally provides prefetch variants of gather and scatter instructions.

  • AVX512CD provides instructions to detect conflicts in scatter addresses.

  • Intel MIC (aka Xeon Phi, Knights Corner) does include gather and scatter instructions, but it is a separate coprocessor and cannot run normal x86-64 code.
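To make the list above concrete, here is a minimal sketch in C of what gather and scatter do and where the native instructions come in. It assumes a GCC- or Clang-style compiler: the `__AVX2__` and `__AVX512F__` macros are only defined when you build with the matching flags (e.g. `-mavx2`, `-mavx512f`), and without them the plain loops are used. The function names `gather8_f32` and `scatter16_f32` are made up for this illustration. The intrinsics are `_mm256_i32gather_ps` (AVX2) and `_mm512_i32scatter_ps` (AVX512F).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#if defined(__AVX2__) || defined(__AVX512F__)
#include <immintrin.h>
#endif

/* Gather: out[i] = base[idx[i]] for 8 floats.
 * Uses the AVX2 gather instruction when AVX2 is enabled at compile time,
 * otherwise a scalar loop with the same result. */
void gather8_f32(const float *base, const int32_t idx[8], float out[8])
{
    assert(base != NULL && idx != NULL && out != NULL);
#if defined(__AVX2__)
    __m256i vidx = _mm256_loadu_si256((const __m256i *)idx);
    __m256 v = _mm256_i32gather_ps(base, vidx, 4); /* scale 4 = sizeof(float) */
    _mm256_storeu_ps(out, v);
#else
    for (int i = 0; i < 8; ++i)
        out[i] = base[idx[i]];
#endif
}

/* Scatter: base[idx[i]] = vals[i] for 16 floats.
 * AVX2 has no scatter, so the native path needs AVX512F;
 * everything else falls back to scalar stores. */
void scatter16_f32(float *base, const int32_t idx[16], const float vals[16])
{
    assert(base != NULL && idx != NULL && vals != NULL);
#if defined(__AVX512F__)
    __m512i vidx = _mm512_loadu_si512(idx);
    __m512 v = _mm512_loadu_ps(vals);
    _mm512_i32scatter_ps(base, vidx, v, 4); /* scale 4 = sizeof(float) */
#else
    for (int i = 0; i < 16; ++i)
        base[idx[i]] = vals[i];
#endif
}
```

The widths differ (8 vs 16 floats) only because each function matches the native register size of the instruction set that provides it. A binary built with `-mavx512f` will of course only run on a CPU that has AVX512F.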

Marat Dukhan
  • 1
    @Jeff No it doesn't! KNC even has a separate ELF machine type – Marat Dukhan Nov 30 '15 at 06:31
  • 2
    @Jeff: KNL (Knights Landing) should run x86_64 machine code, though, right? It's even going to be available as a host CPU, rather than just a coprocessor. – Peter Cordes Nov 30 '15 at 07:22
  • 1
    @PeterCordes Yes. I have binaries that run on both Haswell Xeon E3 with AVX2 and Knights Landing with AVX-512. – Jeff Hammond Nov 30 '15 at 13:02
  • @MaratDukhan That's mixing two issues. Mac and Linux ELF binaries aren't compatible yet they may both be for x86_64. Let's not mix up HW and OS. – Jeff Hammond Nov 30 '15 at 13:04
12

As the other answer indicated, there is no native scatter instruction for now, even in AVX2. However, the Intel optimization manual does provide a hand-written version of the scatter operation, on page 11-17 of the 2013 edition. Basically, for each element they read the index into a general-purpose register, say rax, then shift the element they want into position in an xmm register using something like vpalignr, and finally store it to the indexed memory location with vmovss (move scalar single-precision to memory). I expect this to be inefficient, but as far as I know it is the only way to do a data scatter on mainstream x86 CPUs for now.

On Xeon Phi things are beautiful: it has native support for scatter operations, and the first operand, of course, is a memory location. So I believe that if your code involves a lot of gather and scatter, switching to Xeon Phi will be a good choice. Please do reply if there is anything wrong in my answer.
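Here is a rough sketch in C of the hand-written scatter described above, cut down to four floats in an xmm register. It is not the listing from the Intel manual: it assumes only SSE2 and uses a byte shift (`_mm_srli_si128`, i.e. psrldq) instead of vpalignr to bring each element down to lane 0, then stores it with `_mm_store_ss` (movss). The names `store_lane0`, `scatter4_f32` and `scatter4_f32_arrays` are made up for this example, and there is a plain loop fallback for non-x86 targets.

```c
#include <stdint.h>
#if defined(__SSE2__) || defined(_M_X64)
#include <emmintrin.h>
#define HAVE_SSE2 1
#endif

#ifdef HAVE_SSE2
/* Store lane 0 of v to base[lane 0 of vidx]. */
static inline void store_lane0(float *base, __m128i vidx, __m128 v)
{
    int32_t i = _mm_cvtsi128_si32(vidx); /* index into a general-purpose register */
    _mm_store_ss(base + i, v);           /* movss: scalar single-precision store */
}

/* Emulated scatter: base[idx[k]] = v[k] for k = 0..3.
 * Each step shifts both vectors right by one 4-byte lane so the
 * next index/value pair lands in lane 0. */
void scatter4_f32(float *base, __m128i vidx, __m128 v)
{
    __m128i vi = _mm_castps_si128(v); /* reinterpret bits, no conversion */
    store_lane0(base, vidx, v);
    store_lane0(base, _mm_srli_si128(vidx, 4),
                _mm_castsi128_ps(_mm_srli_si128(vi, 4)));
    store_lane0(base, _mm_srli_si128(vidx, 8),
                _mm_castsi128_ps(_mm_srli_si128(vi, 8)));
    store_lane0(base, _mm_srli_si128(vidx, 12),
                _mm_castsi128_ps(_mm_srli_si128(vi, 12)));
}
#endif

/* Convenience wrapper taking plain arrays. */
void scatter4_f32_arrays(float *base, const int32_t idx[4], const float vals[4])
{
#ifdef HAVE_SSE2
    scatter4_f32(base, _mm_loadu_si128((const __m128i *)idx), _mm_loadu_ps(vals));
#else
    for (int k = 0; k < 4; ++k)
        base[idx[k]] = vals[k];
#endif
}
```

That is four scalar stores plus the shuffling, so the cost grows with the number of elements, which matches the low efficiency expected above.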

Good Luck!

xiangpisaiMM

  • 1
    Thanks for your insight, my hope is more into AVX3 (because it will probably bring native scatter with the unification of Core and MIC simd instructions). – elmattic Jul 15 '13 at 08:06
  • 1
    shift and then store sounds slower than using `extractps`, since the element to extract is a compile-time constant. Or maybe the same speed, but smaller code-size, since it still has to use the shuffle port. – Peter Cordes Nov 30 '15 at 07:24
  • @xian, Is there a way to contact you? – Royi Jul 01 '16 at 11:41