I am seeking an explanation for results that I am seeing in a loop that moves 64 bytes per iteration from a source memory location to a destination memory location, using the x86 movdqu instruction (movdqu moves 16 bytes of data between an xmm register and a possibly unaligned memory location). This is part of code that implements a function similar to memcpy()/java.lang.System.arraycopy().
I tried implementing the copy with two different patterns:
Pattern1
0x30013f74: prefetchnta BYTE PTR [rsi]
0x30013f77: prefetchnta BYTE PTR [rdi]
0x30013f7a: movdqu xmm3, XMMWORD PTR [rsi+0x30]
0x30013f7f: movdqu xmm2, XMMWORD PTR [rsi+0x20]
0x30013f84: movdqu XMMWORD PTR [rdi+0x30],xmm3
0x30013f89: movdqu XMMWORD PTR [rdi+0x20],xmm2
0x30013f8e: movdqu xmm1, XMMWORD PTR [rsi+0x10]
0x30013f93: movdqu xmm0, XMMWORD PTR [rsi]
0x30013f97: movdqu XMMWORD PTR [rdi+0x10], xmm1
0x30013f9c: movdqu XMMWORD PTR [rdi], xmm0
In this pattern, rsi holds the source (src) address, rdi holds the destination (dst) address, and the xmm registers are used as temporaries. This block is iterated copylen_in_bytes/64 times. As you can see, the loads and stores follow a ld-ld-st-st-ld-ld-st-st pattern.
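For reference, the loop above can be sketched in C with SSE2 intrinsics, where _mm_loadu_si128/_mm_storeu_si128 compile to movdqu and _mm_prefetch with _MM_HINT_NTA compiles to prefetchnta. The function name is my own, and this assumes len is a multiple of 64:

```c
#include <emmintrin.h>  /* SSE2: _mm_loadu_si128 / _mm_storeu_si128 */
#include <stddef.h>

/* Hypothetical C rendering of Pattern1 (ld-ld-st-st-ld-ld-st-st);
   assumes len is a multiple of 64. */
static void copy64_pattern1(char *dst, const char *src, size_t len)
{
    for (size_t i = 0; i < len; i += 64) {
        _mm_prefetch(src + i, _MM_HINT_NTA);
        _mm_prefetch(dst + i, _MM_HINT_NTA);
        /* Two loads, then the two matching stores... */
        __m128i x3 = _mm_loadu_si128((const __m128i *)(src + i + 0x30));
        __m128i x2 = _mm_loadu_si128((const __m128i *)(src + i + 0x20));
        _mm_storeu_si128((__m128i *)(dst + i + 0x30), x3);
        _mm_storeu_si128((__m128i *)(dst + i + 0x20), x2);
        /* ...then the remaining two loads and two stores. */
        __m128i x1 = _mm_loadu_si128((const __m128i *)(src + i + 0x10));
        __m128i x0 = _mm_loadu_si128((const __m128i *)(src + i));
        _mm_storeu_si128((__m128i *)(dst + i + 0x10), x1);
        _mm_storeu_si128((__m128i *)(dst + i), x0);
    }
}
```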
Pattern2
0x30013f74: prefetchnta BYTE PTR [rsi]
0x30013f77: prefetchnta BYTE PTR [rdi]
0x30013f7a: movdqu xmm3, XMMWORD PTR [rsi+0x30]
0x30013f7f: movdqu XMMWORD PTR [rdi+0x30], xmm3
0x30013f84: movdqu xmm2, XMMWORD PTR [rsi+0x20]
0x30013f89: movdqu XMMWORD PTR [rdi+0x20], xmm2
0x30013f8e: movdqu xmm1, XMMWORD PTR [rsi+0x10]
0x30013f93: movdqu XMMWORD PTR [rdi+0x10], xmm1
0x30013f98: movdqu xmm0, XMMWORD PTR [rsi]
0x30013f9c: movdqu XMMWORD PTR [rdi], xmm0
In Pattern2, a ld-st-ld-st-ld-st-ld-st pattern is followed: each load is immediately followed by its matching store.
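The interleaved variant can be sketched the same way with SSE2 intrinsics (again, the function name is mine and len is assumed to be a multiple of 64):

```c
#include <emmintrin.h>  /* SSE2: _mm_loadu_si128 / _mm_storeu_si128 */
#include <stddef.h>

/* Hypothetical C rendering of Pattern2 (ld-st-ld-st-ld-st-ld-st);
   assumes len is a multiple of 64. */
static void copy64_pattern2(char *dst, const char *src, size_t len)
{
    for (size_t i = 0; i < len; i += 64) {
        _mm_prefetch(src + i, _MM_HINT_NTA);
        _mm_prefetch(dst + i, _MM_HINT_NTA);
        /* Each load is immediately followed by its matching store. */
        __m128i x3 = _mm_loadu_si128((const __m128i *)(src + i + 0x30));
        _mm_storeu_si128((__m128i *)(dst + i + 0x30), x3);
        __m128i x2 = _mm_loadu_si128((const __m128i *)(src + i + 0x20));
        _mm_storeu_si128((__m128i *)(dst + i + 0x20), x2);
        __m128i x1 = _mm_loadu_si128((const __m128i *)(src + i + 0x10));
        _mm_storeu_si128((__m128i *)(dst + i + 0x10), x1);
        __m128i x0 = _mm_loadu_si128((const __m128i *)(src + i));
        _mm_storeu_si128((__m128i *)(dst + i), x0);
    }
}
```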
Observations
On running this code a few hundred times, with src and dst aligned at various 8-byte boundaries, I observe the following:
On Westmere (Xeon X5690)
Pattern1 exhibits very high run-to-run variance.
Pattern2 exhibits almost no variance.
The min-time (fastest observed time) on Pattern2 is higher (by ~8%) than the min-time on Pattern1.
On Ivybridge (Xeon E5-2697 v2)
Pattern1 exhibits very high run-to-run variance.
Pattern2 exhibits almost no variance.
The min-time on Pattern2 is higher (by ~20%) than the min-time on Pattern1.
On Haswell (Core i7-4770)
Pattern1 DOES NOT exhibit high run-to-run variance.
Pattern2 exhibits almost no variance.
The min-time on Pattern2 is higher (by ~20%) than the min-time on Pattern1.
Strangely, on Westmere and Ivybridge there seems to be no correlation between the alignment of src/dst and the bad results (which cause the high variance): I see both good and bad numbers for the same src/dst alignment.
Questions
I understand that a cacheline-spanning movdqu will perform worse than a non-cacheline-spanning movdqu, but I don't understand the following:
1) Why does Pattern1 exhibit high variance on Westmere and Ivybridge? How does the order of the loads and stores make the difference?
2) Why are the min-times on Pattern2 slower than on Pattern1, across the different architectures?
Thanks for taking the time to read this long post.
Karthik