
I tried to find a KNC broadcast instruction for the Xeon Phi platform, but I could not find one. Instead I tried to write the equivalent of the AVX-512 _mm512_set1_epi32 intrinsic in inline assembly. I have two questions. First, is there any KNC broadcast instruction? Second, when I compile the code below, I get the error: operand type mismatch for `vpbroadcastd'.

int op = 2;
__asm__("vmovdqa32 %0,%%zmm0\n\t"
            "mov %1, %%eax\n\t"
            "vpbroadcastd %%eax, %%zmm1\n\t"
            "vpsravd %%zmm1,%%zmm0,%%zmm1\n\t"
            "vmovdqa32 %%zmm1,%0;"
            : "=m" (tt[0]): "m" (op));

where tt is defined as below (I compiled this with the k1om-mpss-linux-gcc compiler):

int * tt = (int *) aligned_malloc(16 * sizeof(int),64);
Hamid_UMB
  • According to the Xeon Phi Instruction Set manual, VPBROADCASTD only takes a memory location as the source operand. The AVX2 version takes either a memory location or an XMM register. Neither allows EAX as the source. – Ross Ridge Dec 19 '15 at 20:24
  • @RossRidge thank you for your reply. My question is: what is the right way to do a broadcast or set with Xeon Phi instructions? – Hamid_UMB Dec 19 '15 at 23:07
  • @RossRidge: regular AVX512F *does* allow broadcast from a GP register. Xeon Phi doesn't have that? That would explain the problem. In that case, the solution is just to not load into eax first, since the OP is forcing the compiler to put it in memory anyway. – Peter Cordes Dec 20 '15 at 00:20
  • @RossRidge: Peter suggests the _mm512_set1_epi32(int) intrinsic. I just want to know the assembly version of this intrinsic. I don't know what r32 is, or how I can load a value into r32. – Hamid_UMB Dec 20 '15 at 02:56
  • I think maybe you've bitten off more than you can chew here. As Intel's documentation states, there is no assembly instruction that corresponds to `_mm512_set1_epi32`. Ideally this intrinsic doesn't generate an instruction; the broadcast is done for free using the `{1to16}` operand transformation. To make effective use of assembly with the Xeon Phi you need to know things like that. You shouldn't be asking about basic things like the meaning of `r32`. – Ross Ridge Dec 20 '15 at 06:44
  • @Hamid_UMB, since you have given me the accepted answer, could you please say whether or not my solution worked for you? I am particularly interested in this line: "vpsravd (%1)%{1to16}, %%zmm0, %%zmm0\n\t". Did this work? I don't have KNC or AVX512 hardware to test with. – Z boson Dec 23 '15 at 08:00

2 Answers


An earlier version of this answer was wrong. According to an Intel PDF of the KNC instruction set from Sep 2012, which I hope is current/up-to-date, 512b vpsrad is only available with an immediate count. That is rather inconvenient when you have the count in a GP register (rather than memory).

It appears that the variable-count shift (vpsravd) is the only way to do non-immediate-count shifts on KNC, even with the same count for every element. Since it can use a broadcast load for the shift count, that's not a huge problem. KNC also appears to have a "swizzle" shuffle or broadcast from a register source (zmm1 {aaaa}), but I'm not sure what the width of that broadcast is.

This doesn't compile on a normal compiler: the {1to16} is ignored, and you get an error that "broadcast is needed for operand of such type for `vpsravd'". IDK if that's just a syntax problem, with intel-syntax instead of AT&T.

// compile with -masm=intel
// todo: something clever to use vpsrad when the shift count is a compile-time constant
void shift_KNC(int *A, int n) {

  __asm__ volatile(
    // ".intel_syntax noprefix\n"
    "vmovdqa32      zmm0, %0\n\t"
    "vpsravd        zmm0, zmm0, %1 {1to16}\n\t"
    "vmovdqa32      %0,  zmm0\n\t"
    : "+m" (*(__m512i*)A)
    : "m" (n) /* force it to memory */
    : "%zmm0"
  );
}

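The "+m" (*(__m512i*)A) operand tells the compiler that the asm reads and writes the whole 64-byte vector. With just "+m" (*A) we would only be telling the compiler about the first integer, not the next 15, and a full "memory" clobber would be needed as well.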

If you can keep the value in a zmm register, instead of storing/reloading it between tiny fragments of inline asm, that will perform much better.
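To make the semantics concrete, here is a minimal scalar sketch in C of what the load / `vpsravd` with a broadcast count / store sequence above is expected to compute. It assumes the AVX2/AVX-512 definition of `vpsravd` (arithmetic shift, with counts above 31 filling each element with its sign bit); I have not verified that KNC treats out-of-range counts identically. The names `sra_lane` and `shift_ref` are made up for this illustration.

```c
#include <stdint.h>

/* One 32-bit lane of vpsravd, as defined for AVX2/AVX-512 (assumed, not
   verified, to match KNC): arithmetic right shift; a count above 31
   behaves like 31, i.e. the result is all sign bits. */
int32_t sra_lane(int32_t x, uint32_t count) {
    if (count > 31)
        count = 31;
    /* Right-shifting a negative signed value is implementation-defined in C,
       so shift the complement (which is non-negative) and complement back. */
    return x < 0 ? ~(~x >> count) : x >> count;
}

/* Hypothetical reference for the asm above: all 16 lanes are shifted by
   the same count n, which is what the {1to16} broadcast provides. */
void shift_ref(int32_t a[16], uint32_t n) {
    for (int i = 0; i < 16; i++)
        a[i] = sra_lane(a[i], n);
}
```

Calling `shift_ref(x, 2)` on an array holding 1, 2, 4, 8, ... turns it into 0, 0, 1, 2, ..., which is handy as a check against the inline-asm version.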


According to Xeon Phi Knights Corner intrinsics with GCC, gcc doesn't support intrinsics for KNC.


I think the PDF I have is for AVX512 (KNL/Skylake-E). IDK about KNC; it may not have this. (specifically: Intel® Architecture Instruction Set Extensions Programming Reference, from Oct 2014.)

There is a GP-register source form of VPBROADCASTD, requiring only AVX512F. VPBROADCASTD zmm1 {k1}{z}, r32. The intrinsic is

__m512i _mm512_maskz_set1_epi32( __mmask16 k, int a);

There isn't one listed without the mask, but maybe try just _mm512_set1_epi32(int).

BTW, your inline assembly compiles ok with a normal compiler on godbolt. (The "binary" checkbox makes it actually assemble and then disassemble, so I'm sure the instructions were accepted.)

If you still go with inline asm, instead of intrinsics, make sure you tidy up your code: If you're going to require the compiler to put op in memory, use a broadcast-load, rather than a mov into a GP register and broadcasting from there. Even better, use a broadcast-load memory operand for vpsravd: VPSRAVD zmm1 {k1}{z}, zmm2, zmm3/m512/m32bcst. Then you never need a VPBROADCAST instruction at all. (I assume the compiler would do this with intrinsics.)
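For comparison, on a compiler targeting regular AVX-512F (not KNC, for which gcc reportedly has no intrinsics), the whole operation could be sketched with intrinsics roughly as below. `_mm512_loadu_si512`, `_mm512_set1_epi32`, `_mm512_srav_epi32` and `_mm512_storeu_si512` are standard AVX-512F intrinsics; the function names `shift16` and `shift16_avx512`, the runtime dispatch through the GCC/Clang builtin `__builtin_cpu_supports`, and the scalar fallback are assumptions of this sketch. Whether the compiler folds the `set1` into a broadcast memory operand is up to the compiler.

```c
#include <stdint.h>

#if defined(__x86_64__) && (defined(__GNUC__) || defined(__clang__))
#define SHIFT16_HAVE_AVX512 1
#include <immintrin.h>

/* AVX-512F path; the target attribute lets this compile without -mavx512f. */
__attribute__((target("avx512f")))
static void shift16_avx512(int32_t *a, int n) {
    __m512i v   = _mm512_loadu_si512(a);   /* unaligned 64-byte load */
    __m512i cnt = _mm512_set1_epi32(n);    /* same count in all 16 lanes */
    _mm512_storeu_si512(a, _mm512_srav_epi32(v, cnt));
}
#endif

/* Hypothetical wrapper: use AVX-512F when the CPU has it, else plain C. */
void shift16(int32_t *a, int n) {
#ifdef SHIFT16_HAVE_AVX512
    if (__builtin_cpu_supports("avx512f")) {
        shift16_avx512(a, n);
        return;
    }
#endif
    /* Scalar fallback; relies on GCC/Clang shifting signed values
       arithmetically, and clamps the count the way vpsravd does. */
    unsigned c = (unsigned)n > 31 ? 31 : (unsigned)n;
    for (int i = 0; i < 16; i++)
        a[i] = a[i] >> c;
}
```

The point of the wrapper is only that the example runs on machines without AVX-512; in real code you would normally compile the whole file for the target instead.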

Peter Cordes
  • Thank you for your answer. My question is how to fill a 512-bit register with 16 integers. Right now I have to write a loop that iterates 16 times and then load the result into the register, which does not make any sense, because SSE and AVX have instructions for this. – Hamid_UMB Dec 20 '15 at 02:44
  • Also, I don't have the Intel C compiler, so I have to use asm. – Hamid_UMB Dec 20 '15 at 02:47
  • My other question is: what is r32? How can I load the count into r32? – Hamid_UMB Dec 20 '15 at 02:54
  • @Hamid_UMB: `r32` is any 32bit register. Read Intel's instruction reference manual if you want to use asm. Also, gcc/clang/msvc support intrinsics for AVX/AVX2/AVX512. Do they not have them for KNC's variant of AVX512? – Peter Cordes Dec 20 '15 at 02:57
  • @PeterCordes, do you have any clue why someone downvoted my answer? – Z boson Dec 20 '15 at 19:22
  • Since you're mixing intrinsics and inline assembly you could just use inline assembly all the way since AVX512 and KNC have the same intrinsics in this case (using `vpsrad`). – Z boson Dec 20 '15 at 21:03
  • @Zboson: The idea was to let the compiler see as much of what I'm doing as possible, if it doesn't support KNC 512b intrinsics. So multiple calls with the same shift count might not generate repeated `vmovd`. And if I'm lucky, it might even see through the `si32_si128` intrinsic and use an immediate operand when the count is a compile-time constant. That definitely won't happen if I use `vmovd` in inline asm. – Peter Cordes Dec 20 '15 at 21:07
  • What I mean is that the compiler supports AVX512 intrinsics and in this case KNC and AVX512 have the same intrinsics so there is no problem. – Z boson Dec 20 '15 at 21:09
  • @Zboson: Are you sure? http://stackoverflow.com/questions/26933394/xeon-phi-knights-corner-intrinsics-with-gcc from a year ago makes it sound like gcc would probably never have KNC support. Also, it seems that KNC's 512b `vpsrad` only takes an immediate, unlike AVX2. So the OP was right to try to broadcast if he needs a non-immediate count. – Peter Cordes Dec 20 '15 at 21:32
  • It appears one can't count on the AVX512 intrinsics, even when they are the same as KNC's, as much as I thought. This works for some operations such as load, store and mulps, as in this [answer](http://stackoverflow.com/questions/34148636/segmentation-fault-for-vmovaps/34221295#34221295), but clearly in the `vpsrad` case there is a disconnect. – Z boson Dec 21 '15 at 09:18
  • @Zboson: also, it doesn't look like there's a way to specify a register swizzle for an operand to an intrinsic. At least, a non-KNC compiler certainly won't fold a shuffle into a swizzled source operand in the asm. So you're losing out on a significant feature of KNC. I guess AVX512F broadcast-from-memory syntax is different, because I couldn't get it to assemble on godbolt. If I was really trying, I'd have tried yasm locally as well, not just the GNU assembler's Intel mode. – Peter Cordes Dec 21 '15 at 09:40
  • I found [this link ](http://reviews.llvm.org/D15076#504820a5) where it has `vpackssdw (%rdi){1to16}, %zmm0, %zmm0`. I tried `"vpsravd (%1){1to16}, %%zmm0, %%zmm0\n"` but GCC converts it to `vpsravd (%rsi)1to16, %zmm0, %zmm0`. So I think it's just missing the braces. – Z boson Dec 21 '15 at 09:54
  • Agner Fog has a great section "8.4 Assembly syntax for AVX-512 and Knights Corner instructions" for his objconv program http://www.agner.org/optimize/objconv-instructions.pdf – Z boson Dec 21 '15 at 09:58
  • I think I got it: `"vpsravd (%1)%{1to16}, %%zmm0, %%zmm0\n\t"` produces `vpsravd (%rdx){1to16}, %zmm0, %zmm0`. – Z boson Dec 21 '15 at 10:39
  • @Zboson: And that assembles? Was the lack of space between the operand and the `{1to16}` important? (Also, I don't think I ever tried `%`-escaping the open brace, but not the close.) If you're on godbolt, remember that unless you click "binary", it's just doing `gcc -S`. Or does it check for valid asm, too? I forget. – Peter Cordes Dec 21 '15 at 10:52
  • @PeterCordes, yes it assembles. I just tried binary and it's no problem. It even shows the opcodes. If I understand Agner correctly, the only difference between KNC and AVX512, at least when the instructions are equal, is a single bit. – Z boson Dec 21 '15 at 11:49
  • [I see where you got this escaping from](http://stackoverflow.com/questions/34327831/invalid-asm-nested-assembly-dialect-alternatives). I tried `"vpsravd (%1)%{1to16%}, %%zmm0, %%zmm0\n\t"` as well and it does the same thing so it appears the second escape is not necessary. If I don't use an escape it fails with `junk `1to16' after expression`. – Z boson Dec 21 '15 at 12:25
  • I was about to ask a question about memory broadcasts (with 1to16) because I could not get GCC to do it with intrinsics, and then I found [this](https://gcc.gnu.org/bugzilla/show_bug.cgi?id=63351), which is very interesting. As I suspected, if the value (`op` in the OP's case) is already in a register then an explicit broadcast instruction is better. – Z boson Dec 22 '15 at 08:10
  • @Zboson: yup, store integer reg -> memory and then broadcast is pretty terrible. `vmovd` / `vpbroadcast` is also pretty bad, so having shift counts in GP registers isn't a good place to be in the first place. Writing a function that takes a count as an integer parameter is a bad idea. The OP didn't give any context for where the count is coming from. :/ Hopefully it's in memory and needs to be loaded anyway. – Peter Cordes Dec 22 '15 at 08:27
  • I am not sure the OP is expecting to produce the most optimal code right now. I think it's mostly for hacking/learning. – Z boson Dec 22 '15 at 08:38
  • http://stackoverflow.com/questions/34415238/embedded-broadcasts-with-intrinsics-and-assembly – Z boson Dec 22 '15 at 11:47

I looked at how AVX2 would do this with intrinsics and noticed that the broadcast reads from memory just like with KNC. Looking at the assembly from the AVX2 intrinsics I wrote inline assembly which does the same thing.

#include <stdio.h>
#include <x86intrin.h>
void foo(int *A, int n) {
    __m256i a16 = _mm256_loadu_si256((__m256i*)A);
    __m256i t = _mm256_set1_epi32(n);
    __m256i s16 = _mm256_srav_epi32(a16,t);
    _mm256_storeu_si256((__m256i*)A, s16);
}

void foo2(int *A, int n) {
    __asm__("vmovdqu      (%0),%%ymm0\n"
            "vpbroadcastd (%1), %%ymm1\n"
            "vpsravd      %%ymm1, %%ymm0, %%ymm0\n"
            "vmovdqu      %%ymm0, (%0)"
            :
            : "r" (A), "r" (&n)
            : "memory", "%ymm0", "%ymm1"  // the asm overwrites ymm0 and ymm1
        );
}

int main(void) {
    int x[8];
    for(int i=0; i<8; i++) x[i] = 1<<i;
    for(int i=0; i<8; i++) printf("%8d ", x[i]); puts("");
    foo2(x,2);
    for(int i=0; i<8; i++) printf("%8d ", x[i]); puts("");
}

Here is my guess for KNC (using aligned loads):

void foo2_KNC(int *A, int n) {
    __asm__("vmovdqa32      (%0),%%zmm0\n"
            "vpbroadcastd   (%1), %%zmm1\n"
            "vpsravd        %%zmm1, %%zmm0, %%zmm0\n"
            "vmovdqa32      %%zmm0, (%0)"
            :
            : "r" (A), "r" (&n)
            : "memory", "%zmm0", "%zmm1"  // the asm overwrites zmm0 and zmm1
        );
}

There appears to be a more efficient way of doing this with KNC and AVX512.

Intel says in regard to AVX-512, in section "2.5.3 Broadcast":

EVEX encoding provides a bit-field to encode data broadcast for some load-op instructions

and then gives the example

vmulps zmm1, zmm2, [rax] {1to16}

where

The {1to16} primitive loads one float32 (single precision) element from memory, replicates it 16 times to form a vector of 16 32-bit floating-point elements, multiplies the 16 float32 elements with the corresponding elements in the first source operand vector, and put each of the 16 results into the destination operand.
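In scalar terms, Intel's `vmulps zmm1, zmm2, [rax] {1to16}` example amounts to something like the following C sketch. This is only a model of the quoted description; the function name `mulps_bcast_ref` and its parameter names are invented for illustration.

```c
#include <stddef.h>

/* Scalar model of: vmulps zmm1, zmm2, [rax] {1to16}
   dst ~ zmm1, src ~ zmm2, mem ~ the address in rax. */
void mulps_bcast_ref(float dst[16], const float src[16], const float *mem) {
    const float b = *mem;               /* one float32 is loaded from memory */
    for (size_t i = 0; i < 16; i++)     /* and used for all 16 elements */
        dst[i] = src[i] * b;
}
```

The broadcast itself never appears as a separate step: it is part of how the memory operand is read, which is why no extra broadcast instruction is needed.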

I have never used this {1to16} syntax before, but you could try

void foo2_KNC(int *A, int n) {
    __asm__("vmovdqa32      (%0),%%zmm0\n\t"
            "vpsravd        (%1)%{1to16}, %%zmm0, %%zmm0\n\t"
            "vmovdqa32      %%zmm0, (%0)"
            :
            : "r" (A), "r" (&n)
            : "memory", "%zmm0"
        );
}

This produces

vmovdqa32      (%rax),%zmm0
vpsravd        (%rdx){1to16}, %zmm0, %zmm0
vmovdqa32      %zmm0, (%rax)

Agner Fog incidentally has a section titled "8.4 Assembly syntax for AVX-512 and Knights Corner instructions" in the documentation for objconv where he says

these two instruction sets are very similar, but have different optional instruction attributes. Instructions from these two instruction sets differ by a single bit in the prefix, even for otherwise identical instructions.

According to his documentation NASM supports the AVX-512 and KNC syntax so you could try this syntax in NASM.

Z boson
  • Downvoter (or someone who understands why I was downvoted), could you please explain your downvote? – Z boson Dec 20 '15 at 18:54
  • Just noticed something, now that the asm is less noisy: Broadcast and then variable-shift???? Why not `vmovd %1, %%zmm1` (with an `rm` constraint) then `vpsrad %%zmm1, %%zmm0, %%zmm0`? I assume KNC's 512b vectors extend that SSE2 instruction to 512, taking the shift count for all elements from the low 64b of the `%zmm1`. – Peter Cordes Dec 20 '15 at 20:24
  • @PeterCordes I think I was downvoted based on Ross Ridge's comment "Ideally this intrinsic doesn't generate an instruction, the broadcast is done for free using the {1to16} operand transformation. To make effective use of assembly with the Xeon Phi you need to know things like that." – Z boson Dec 20 '15 at 20:27
  • @PeterCordes I am trying to get GCC with AVX512 to do this {1to16} thing with a multiply, but all I have seen it do so far is `vpbroadcastd %esi, %zmm0{%k1}{z}`. I have to admit I am new to this. – Z boson Dec 20 '15 at 20:31
  • I'd maybe try assembling a broadcast-load in YASM or NASM, then disassembling to get the AT&T syntax, if you can't find a reference. – Peter Cordes Dec 20 '15 at 20:36
  • @PeterCordes, you were right KNC and AVX512 both have `_mm512_srai_epi32` which does not need the broadcast. I updated my answer based on this. I hope you don't mind. – Z boson Dec 20 '15 at 20:46
  • not at all. I stole your inline asm as a starting point for a new first part of my answer :) – Peter Cordes Dec 20 '15 at 20:51