Why is the AVX-256 VMOVAPS Instruction only copying four single precision floats instead of 8?

Question

I am trying to familiarize myself with the 256-bit AVX instructions available on some of the newer Intel processors. I have already verified that my i7-4720HQ supports 256-bit AVX instructions. The problem I am having is that the VMOVAPS instruction, which should copy 8 single precision floating point values, is only copying 4.

dot PROC
    VMOVAPS YMM1, ymmword ptr [RCX]                
    VDPPS   YMM2, YMM1, ymmword ptr [RDX], 255      
    VMOVAPS ymmword ptr [RCX], YMM2                 
    MOVSS   XMM0, DWORD PTR [RCX]                  
    RET
dot ENDP

In case you aren't familiar with the calling convention: since this function returns a float, Visual C++ 2015 expects the result to be in XMM0 when the function returns.

In addition to this, the standard is for the first argument to be passed in RCX and the second argument to be passed in RDX.

Here is the C code that calls this function.

_declspec(align(32)) float d1[] = { 1.0f, 1.0f, 1.0f, 1.0f, 1.0f, 1.0f, 1.0f, 1.0f };
_declspec(align(32)) float d2[] = { 2.0f, 2.0f, 2.0f, 2.0f, 2.0f, 2.0f, 2.0f, 2.0f };
printf("Dot Product Test: %f\n", dot(d1, d2));

The return value of the dot function is always 8.0. In addition to this, I have debugged the function and found that after the first assembly instruction, only four values get copied into YMM1. The rest of YMM1 remains zeroed.

Am I doing something wrong here? I've looked through the Intel documentation and some third party documentation. As far as I can tell I'm doing everything right. Am I using the wrong instruction? By the way, if you are here to tell me to use the Intel compiler intrinsics, don't bother.

You could get the same (non-working) behaviour in fewer instructions by using `ymm0` as the destination for `vdpps`. Then you wouldn't need to store or reload, just return. If the `__vectorcall` ABI is like the SysV ABI, you're allowed to leave non-zero garbage in parts of the register outside the bits that hold the return value (e.g. high elements of a vector reg). For such a tiny function, writing it by hand in asm instead of something that can inline means the function-call overhead can be significant, esp. since you pass args by ref rather than in vector regs. – Peter Cordes Apr 23 '16 at 01:55
I just wrote that code in two minutes to test the assembly instruction. I'm well aware that it can be optimized. – A. Robinson Apr 23 '16 at 16:59

score 2 · Answer 1 · answered Apr 22 '16 at 16:16

You forgot to read the instruction set reference page for VDPPS. It mentions that the result is produced in two halves:

VDPPS (VEX.256 encoded version)
DEST[127:0] ← DP_Primitive(SRC1[127:0], SRC2[127:0]);
DEST[255:128] ← DP_Primitive(SRC1[255:128], SRC2[255:128]);

It's not the VMOVAPS that's wrong.
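
To make the two-lane behaviour concrete, here is a minimal scalar sketch in C of what the pseudocode above describes. It is a model, not the instruction itself: the function names (`dp_lane`, `vdpps256_model`, `dot8_model`) are made up for illustration, and the summation order inside a lane is simplified, so rounding could differ from real hardware for less friendly inputs.

```c
#include <assert.h>
#include <stddef.h>

/* One 128-bit lane of DPPS/VDPPS (simplified model):
   multiply the element pairs selected by imm8[7:4], sum the products,
   then write the sum to the destination elements selected by imm8[3:0].
   Unselected destination elements are set to 0. */
void dp_lane(const float *a, const float *b, unsigned imm8, float *dest)
{
    float sum = 0.0f;
    for (int i = 0; i < 4; i++)
        if (imm8 & (0x10u << i))
            sum += a[i] * b[i];
    for (int i = 0; i < 4; i++)
        dest[i] = (imm8 & (1u << i)) ? sum : 0.0f;
}

/* VEX.256 VDPPS: two independent lanes, nothing is summed across them. */
void vdpps256_model(const float src1[8], const float src2[8],
                    unsigned imm8, float dest[8])
{
    assert(src1 != NULL && src2 != NULL && dest != NULL);
    dp_lane(src1,     src2,     imm8, dest);      /* DEST[127:0]   */
    dp_lane(src1 + 4, src2 + 4, imm8, dest + 4);  /* DEST[255:128] */
}

/* Full 8-element dot product: add the low-lane and high-lane results. */
float dot8_model(const float a[8], const float b[8])
{
    float r[8];
    vdpps256_model(a, b, 0xFF, r);
    return r[0] + r[4];
}
```

With the arrays from the question (all 1.0f and all 2.0f), the model leaves 8.0 in element 0 and 8.0 in element 4, which matches the observed return value of 8.0. Adding the two lane results gives the full dot product, 16.0.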

answered Apr 22 '16 at 16:16

Jester


I read the registers while debugging. VMOVAPS only copied half of the values into YMM1. That is my problem. – A. Robinson Apr 22 '16 at 16:20
Precisely. IIRC when Intel moved to 256-bit wide registers they duplicated the 128-bit unit atop the old one, and you actually pay a _lane-crossing penalty_ for instructions (like shuffles) that cause a movement of data from the top half to bottom half and vice-versa. – Iwillnotexist Idonotexist Apr 22 '16 at 16:21
Are you saying that I need to use two copy instructions to get the eight FP values into YMM1? – A. Robinson Apr 22 '16 at 16:21
@A.Robinson The upper-half dot-product you want is in bits [159:128] of the YMM register, so you need to shuffle it or move it downwards, sum it into the lower-half dot-product, and put *that* into XMM0. – Iwillnotexist Idonotexist Apr 22 '16 at 16:27
You only return the lowest float, so you get `8.0` because `VDPPS` only produces half of the result there. As to what you see in the debugger, you didn't provide any detailed proof of that. – Jester Apr 22 '16 at 16:28
I read the YMM1 register directly using the Visual Studio 2015 debugger. Directly after executing VMOVAPS, it only had four SP floats in it. – A. Robinson Apr 22 '16 at 16:29
Then your debugger is broken. I can't check with visual studio, but rest assured that `vmovaps` works correctly and loads all 8: `(gdb) p $ymm1.v8_float $1 = {1, 1, 1, 1, 1, 1, 1, 1}` – Jester Apr 22 '16 at 16:45
I just ran a careful test where I used VMOVAPS to copy into YMM1. I then shifted the contents of YMM1 right by 16 bytes and printed the value of the lowest four bytes of YMM1 (as a float) to the screen. The value is zero, despite the array being passed equalling `_declspec(align(32)) float test[] = { 1.0f, 2.0f, 3.0f, 4.0f, 5.0f, 6.0f, 7.0f, 8.0f };` This test works fine when shifting by 0, 4, 8, or 12 bytes. When I do that I see 1, 2, 3 and 4 respectively. This is clearly not a debugger issue. – A. Robinson Apr 22 '16 at 17:13
I don't know, does your OS support AVX256? Maybe it's not saving and restoring registers properly on a context switch. – Jester Apr 22 '16 at 17:19
I'm using windows 10. It does support avx256. I just checked to see if there was an associated BIOS setting. There is not. Not sure what to do at this point. This is getting very frustrating. – A. Robinson Apr 22 '16 at 17:29
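
A side note on the 16-byte shift test in the comments above. The comment does not say which instruction did the shift. If it was the AVX2 byte shift `VPSRLDQ` on a YMM register (an assumption), then that instruction shifts each 128-bit lane separately, never moves data from the upper lane into the lower one, and clears the lane for any count above 15. A small C model of that behaviour, with made-up function names:

```c
#include <assert.h>
#include <string.h>

/* Model of VPSRLDQ ymm, ymm, imm8 (AVX2), assuming that is the shift used.
   Index 0 is the lowest byte of the register. Each 16-byte lane is shifted
   right on its own; counts above 15 leave the lane all zero.
   src and dest must not overlap. */
void vpsrldq256_model(const unsigned char src[32], unsigned count,
                      unsigned char dest[32])
{
    assert(src != NULL && dest != NULL);
    if (count > 16)
        count = 16;
    for (int lane = 0; lane < 2; lane++) {
        const unsigned char *s = src + 16 * lane;
        unsigned char *d = dest + 16 * lane;
        memset(d, 0, 16);                  /* zeros shift in from the top */
        memcpy(d, s + count, 16 - count);  /* surviving bytes move down   */
    }
}

/* Lowest float of the modelled register after the shift
   (assumes 4-byte floats, as on x86). */
float low_float_after_shift(const float v[8], unsigned count)
{
    unsigned char in[32], out[32];
    float low;
    memcpy(in, v, sizeof in);
    vpsrldq256_model(in, count, out);
    memcpy(&low, out, sizeof low);
    return low;
}
```

Under that assumption the model reproduces the reported readings for `{ 1.0f, ..., 8.0f }`: 1, 2, 3 and 4 for shifts of 0, 4, 8 and 12 bytes, and 0 for a shift of 16, even when all eight floats were loaded. Reading the upper lane would need something like `VEXTRACTF128` instead.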

score 1 · Answer 2 · answered Apr 22 '16 at 18:50

I just updated to Visual Studio 2015 Update 2, and now it is working properly. I have no idea why. My best guess is that MASM was converting my AVX256 code into AVX128 code for no good reason. Either way, problem solved.

answered Apr 22 '16 at 18:50

A. Robinson

If you have your old asm code, you could check with a disassembler. – Peter Cordes Apr 23 '16 at 01:58
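
Following up on the disassembler suggestion: whether an instruction was assembled as a 128-bit or 256-bit operation is visible in the L bit of its VEX prefix, so even a raw byte dump of the old object file is enough to check. A rough sketch in C, assuming 64-bit code and that `code` points directly at the VEX prefix. The function name `vex_vector_bits` is hypothetical and this is not a full decoder.

```c
#include <assert.h>
#include <stddef.h>

/* Returns 256 or 128 according to the VEX.L bit, or 0 if the bytes do not
   start with a VEX prefix or are too short. Assumes 64-bit mode, where a
   leading C4/C5 byte is always a VEX prefix. Note that scalar instructions
   may ignore L, so this is only meaningful for packed instructions. */
int vex_vector_bits(const unsigned char *code, size_t len)
{
    unsigned char payload;

    assert(code != NULL);
    if (len >= 2 && code[0] == 0xC5)
        payload = code[1];          /* 2-byte VEX: L is bit 2 of byte 1 */
    else if (len >= 3 && code[0] == 0xC4)
        payload = code[2];          /* 3-byte VEX: L is bit 2 of byte 2 */
    else
        return 0;

    return (payload & 0x04) ? 256 : 128;
}
```

For example, `VMOVAPS YMM1, ymmword ptr [RCX]` normally assembles to `C5 FC 28 09`, for which this returns 256, while the XMM form `C5 F8 28 09` gives 128.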
