SSE loop over array gets the wrong value (dot product of two arrays of doubles)

Question

I have a problem with my assembly code: I need to multiply two arrays element by element, sum the products, and take the square root of that sum. I've written the code and it looks like it works, but I expect 9.16 and I'm getting 9.0 instead.

I guess the problem is somewhere in the loop or in addpd, but I don't know how to fix it.

include /masm64/include64/masm64rt.inc
INCLUDELIB MSVCRT
option casemap:none 

.data
array1 dq  1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0
array2 dq  7.0, 6.0, 5.0, 4.0, 3.0, 2.0, 1.0
result dq  0.0
res dq 0.0
tit1 db "Result of using the SSE", 0
buf BYTE 260 dup(?)
bufp QWORD buf, 0
loop_count dq 7

.code
entry_point proc

    ; Load the two arrays into SSE registers
    movupd xmm1, [array1]
    movupd xmm2, [array2]
    mov rcx, loop_count ; Number of function iterations
    loop1:
    mulpd xmm1, xmm2
    addpd xmm3, xmm1
    movupd xmm1, [array1 + 8]
    movupd xmm2, [array2 + 8]
    loop loop1

    ; Add the result and store to xmm1
    addpd xmm1, xmm3

    ; Compute the square root of the sum of squares in xmm1
    sqrtpd xmm1, xmm1

    ; Move the result into a general-purpose register for output
    movsd res, xmm1

    invoke fptoa, res, bufp
    invoke MessageBox, 0, bufp, addr tit1, MB_OK
    invoke ExitProcess, 0
entry_point endp
end

I've also tried multiplying the two arrays without a loop, using just mulpd, but I suspect that is not the best approach.

`movupd xmm1, [array1 + 8]` loads from the same place every iteration. You need a pointer or index in a register. (e.g. in RCX if you count up towards `loop_count` instead of using the slow `loop` instruction). Also, why are you loading 2 elements at once with `pd` (packed double) instead of `sd` (scalar double) instructions? At the end you use `movsd` to store just the low `double` element, so the upper halves were useless. If you wanted to use SSE for SIMD instead of scalar, you'd advance a pointer by 16 bytes (2 elements), but you'd need scalar cleanup if the array length is odd. — Peter Cordes, Mar 11 '23 at 22:29
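A minimal sketch of the scalar version this comment describes, assuming the same masm64rt setup, data names, and output calls as the question (those are copied as-is and not checked here). RCX counts up as an element index and the `sd` instructions handle one double per iteration. The register choices (RDX, R8, R9) and the label `dot_loop` are illustrative, not anything the question or the SDK requires.

```asm
include /masm64/include64/masm64rt.inc   ; same include as in the question
INCLUDELIB MSVCRT
option casemap:none

.data
array1 dq  1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0
array2 dq  7.0, 6.0, 5.0, 4.0, 3.0, 2.0, 1.0
res dq 0.0
tit1 db "Result of using the SSE", 0
buf BYTE 260 dup(?)
bufp QWORD buf, 0
loop_count dq 7                  ; element count, assumed to be at least 1

.code
entry_point proc
    lea    rdx, array1           ; rdx = base address of array1
    lea    r8,  array2           ; r8  = base address of array2
    mov    r9,  loop_count       ; r9  = number of elements
    xor    ecx, ecx              ; rcx = index i = 0
    xorpd  xmm0, xmm0            ; xmm0 = running sum, zeroed explicitly

dot_loop:
    movsd  xmm1, qword ptr [rdx + rcx*8]   ; xmm1 = array1[i]
    mulsd  xmm1, qword ptr [r8 + rcx*8]    ; xmm1 = array1[i] * array2[i]
    addsd  xmm0, xmm1                      ; sum += product
    inc    rcx                             ; i++
    cmp    rcx, r9
    jb     dot_loop                        ; repeat while i < loop_count

    sqrtsd xmm0, xmm0            ; square root of the sum (84.0 for this data)
    movsd  res, xmm0             ; store the scalar result to memory

    invoke fptoa, res, bufp      ; output calls copied from the question
    invoke MessageBox, 0, bufp, addr tit1, MB_OK
    invoke ExitProcess, 0
entry_point endp
end
```

Because the index is scaled by 8 in the addressing mode, each iteration reads a different element, and the accumulator no longer depends on whatever happened to be in the register at entry.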
*looks like it works fine* - Look more closely with a debugger at the values getting loaded into XMM registers; they're the same every iteration. Also, your "software pipelining" forgets to multiply the last vector, instead just adding those vectors. — Peter Cordes, Mar 12 '23 at 04:02
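The 9.0 can be reproduced by hand from the two bugs described here. This trace follows only the low lane (the one `movsd` stores) and assumes `xmm3` happens to be zero at entry, which the code never guarantees. Iteration 1 multiplies elements 0 (1.0 and 7.0); every later iteration multiplies the reloaded elements at offset 8 (2.0 and 6.0); the last reload is added without being multiplied.

$$
\begin{aligned}
\text{xmm3}_{\text{low}} &= \underbrace{1 \cdot 7}_{\text{iteration 1}} + \underbrace{6 \cdot (2 \cdot 6)}_{\text{iterations 2 to 7}} = 79 \\
\text{xmm1}_{\text{low}} &= 79 + 2 = 81 \quad (\text{final addpd adds the unmultiplied } 2.0) \\
\sqrt{81} &= 9.0 \\
\text{intended: } \sum_{i=1}^{7} a_i b_i &= 7 + 12 + 15 + 16 + 15 + 12 + 7 = 84, \quad \sqrt{84} \approx 9.165
\end{aligned}
$$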
@PaulR : One reason I didn't tag [simd] on this question is that the loop iteration count matches the element count, and they're only using the low element of the result. Like they intended to use SSE scalar operations, but accidentally used `pd` instead for everything except the final `movsd` which only saves the low element. Since scalar SSE is the simplest and standard way to do FP math on x86-64, I don't think we should assume they intended SIMD, especially when the bugs are with even more basic things. — Peter Cordes, Mar 13 '23 at 10:02
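If SIMD was the goal after all, the packed variant mentioned in the first comment might look roughly like the sketch below: two elements per iteration, a horizontal add of the two partial sums, then scalar cleanup for the odd element. It assumes the same masm64rt setup and output calls as the question (copied, not checked here); the registers and the labels `pair_loop`, `tail_elem`, and `finish` are illustrative choices.

```asm
include /masm64/include64/masm64rt.inc   ; same include as in the question
INCLUDELIB MSVCRT
option casemap:none

.data
array1 dq  1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0
array2 dq  7.0, 6.0, 5.0, 4.0, 3.0, 2.0, 1.0
res dq 0.0
tit1 db "Result of using the SSE", 0
buf BYTE 260 dup(?)
bufp QWORD buf, 0
loop_count dq 7                  ; element count (odd here, so one leftover)

.code
entry_point proc
    lea    rdx, array1           ; rdx = base address of array1
    lea    r8,  array2           ; r8  = base address of array2
    mov    r9,  loop_count       ; r9  = number of elements
    mov    r10, r9
    and    r10, -2               ; r10 = count rounded down to an even number
    xor    ecx, ecx              ; rcx = index i = 0
    xorpd  xmm0, xmm0            ; xmm0 = two partial sums, both zero
    test   r10, r10
    jz     tail_elem             ; fewer than 2 elements: skip the packed loop

pair_loop:
    movupd xmm1, xmmword ptr [rdx + rcx*8] ; array1[i], array1[i+1]
    movupd xmm2, xmmword ptr [r8 + rcx*8]  ; array2[i], array2[i+1]
    mulpd  xmm1, xmm2                      ; two products at once
    addpd  xmm0, xmm1                      ; accumulate per lane
    add    rcx, 2                          ; advance by 2 elements (16 bytes)
    cmp    rcx, r10
    jb     pair_loop

    movapd   xmm1, xmm0
    unpckhpd xmm1, xmm1          ; low lane of xmm1 = high lane of xmm0
    addsd    xmm0, xmm1          ; low lane of xmm0 = sum of both lanes

tail_elem:
    cmp    rcx, r9
    jae    finish                ; even count: nothing left over
    movsd  xmm1, qword ptr [rdx + rcx*8]   ; the single leftover element
    mulsd  xmm1, qword ptr [r8 + rcx*8]
    addsd  xmm0, xmm1

finish:
    sqrtsd xmm0, xmm0            ; square root of the sum (84.0 for this data)
    movsd  res, xmm0             ; store the scalar result to memory

    invoke fptoa, res, bufp      ; output calls copied from the question
    invoke MessageBox, 0, bufp, addr tit1, MB_OK
    invoke ExitProcess, 0
entry_point endp
end
```

For seven elements this is hardly worth the extra code over the scalar loop; the shape only starts to pay off on much longer arrays.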
