
I am seeking an explanation for results I am seeing in a loop that moves 64 bytes per iteration from a source memory location to a destination memory location, using the x86 movdqu instruction (movdqu moves 16 bytes of data between xmm registers and possibly unaligned memory locations). This is part of code that implements a function similar to memcpy()/java.lang.System.arraycopy().

I tried implementing the copy with two different patterns:

Pattern1

0x30013f74: prefetchnta BYTE PTR [rsi]
0x30013f77: prefetchnta BYTE PTR [rdi]
0x30013f7a: movdqu xmm3, XMMWORD PTR [rsi+0x30]
0x30013f7f: movdqu xmm2, XMMWORD PTR [rsi+0x20]
0x30013f84: movdqu XMMWORD PTR [rdi+0x30],xmm3
0x30013f89: movdqu XMMWORD PTR [rdi+0x20],xmm2
0x30013f8e: movdqu xmm1, XMMWORD PTR [rsi+0x10]
0x30013f93: movdqu xmm0, XMMWORD PTR [rsi]
0x30013f97: movdqu XMMWORD PTR [rdi+0x10], xmm1
0x30013f9c: movdqu XMMWORD PTR [rdi], xmm0

In this pattern, rsi holds the source (src) address, rdi holds the destination (dst) address, and the xmm registers are used as temporaries. This block is iterated copylen_in_bytes/64 times. As you can see, it follows a ld-ld-st-st-ld-ld-st-st load/store order.
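
For clarity, here is a rough C-intrinsics sketch of what the Pattern1 body does per 64-byte block. This is purely illustrative: the function name copy64_pattern1 and the assumption that len is a multiple of 64 are mine, and the real code is the hand-written assembly above.

#include <immintrin.h>  /* _mm_loadu_si128, _mm_storeu_si128, _mm_prefetch */
#include <stddef.h>

/* Illustrative only: ld-ld-st-st-ld-ld-st-st order, 64 bytes per iteration.
 * Assumes len is a multiple of 64. */
static void copy64_pattern1(unsigned char *dst, const unsigned char *src, size_t len)
{
    for (size_t i = 0; i < len; i += 64) {
        _mm_prefetch((const char *)(src + i), _MM_HINT_NTA);
        _mm_prefetch((const char *)(dst + i), _MM_HINT_NTA);

        __m128i x3 = _mm_loadu_si128((const __m128i *)(src + i + 0x30));
        __m128i x2 = _mm_loadu_si128((const __m128i *)(src + i + 0x20));
        _mm_storeu_si128((__m128i *)(dst + i + 0x30), x3);
        _mm_storeu_si128((__m128i *)(dst + i + 0x20), x2);

        __m128i x1 = _mm_loadu_si128((const __m128i *)(src + i + 0x10));
        __m128i x0 = _mm_loadu_si128((const __m128i *)(src + i));
        _mm_storeu_si128((__m128i *)(dst + i + 0x10), x1);
        _mm_storeu_si128((__m128i *)(dst + i), x0);
    }
}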

Pattern2

0x30013f74: prefetchnta BYTE PTR [rsi]
0x30013f77: prefetchnta BYTE PTR [rdi]
0x30013f7a: movdqu xmm3, XMMWORD PTR [rsi+0x30]
0x30013f7f: movdqu XMMWORD PTR [rdi+0x30], xmm3
0x30013f84: movdqu xmm2, XMMWORD PTR [rsi+0x20]
0x30013f89: movdqu XMMWORD PTR [rdi+0x20], xmm2
0x30013f8e: movdqu xmm1, XMMWORD PTR [rsi+0x10]
0x30013f93: movdqu XMMWORD PTR [rdi+0x10], xmm1
0x30013f98: movdqu xmm0, XMMWORD PTR [rsi]
0x30013f9c: movdqu XMMWORD PTR [rdi], xmm0

In Pattern2, a ld-st-ld-st-ld-st-ld-st order is followed.
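
The corresponding (again, purely illustrative) intrinsics form of the Pattern2 body uses the same headers as the Pattern1 sketch and differs only in the interleaving: each load is immediately followed by its store.

/* Illustrative only: ld-st-ld-st-ld-st-ld-st order, 64 bytes per iteration.
 * Assumes len is a multiple of 64. */
static void copy64_pattern2(unsigned char *dst, const unsigned char *src, size_t len)
{
    for (size_t i = 0; i < len; i += 64) {
        _mm_prefetch((const char *)(src + i), _MM_HINT_NTA);
        _mm_prefetch((const char *)(dst + i), _MM_HINT_NTA);

        _mm_storeu_si128((__m128i *)(dst + i + 0x30),
                         _mm_loadu_si128((const __m128i *)(src + i + 0x30)));
        _mm_storeu_si128((__m128i *)(dst + i + 0x20),
                         _mm_loadu_si128((const __m128i *)(src + i + 0x20)));
        _mm_storeu_si128((__m128i *)(dst + i + 0x10),
                         _mm_loadu_si128((const __m128i *)(src + i + 0x10)));
        _mm_storeu_si128((__m128i *)(dst + i),
                         _mm_loadu_si128((const __m128i *)(src + i)));
    }
}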

Observations

Running this code a few hundred times, with src and dst aligned at various 8-byte boundaries, I observe the following:

On Westmere (Xeon X5690)

- Pattern1 exhibits very high run-to-run variance.
- Pattern2 exhibits almost no variance.
- The min-time (fastest observed time) for Pattern2 is higher (by ~8%) than the min-time for Pattern1.

On Ivybridge (Xeon E5-2697 v2)

- Pattern1 exhibits very high run-to-run variance.
- Pattern2 exhibits almost no variance.
- The min-time for Pattern2 is higher (by ~20%) than the min-time for Pattern1.

On Haswell (Core i7-4770)

- Pattern1 does NOT exhibit high run-to-run variance.
- Pattern2 exhibits almost no variance.
- The min-time for Pattern2 is higher (by ~20%) than the min-time for Pattern1.

Strangely, on Westmere and Ivybridge there seems to be no correlation between the alignment of src/dst and the bad results (which cause the high variance). I see both good and bad numbers for the same src/dst alignment.
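
A simplified sketch of the kind of harness used to collect the min-times, reusing the illustrative copy64_pattern1/copy64_pattern2 functions from above and assuming clock_gettime() timing; the actual benchmark differs in details such as buffer sizes, warm-up, and how variance is computed.

#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

static uint64_t nsec_now(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (uint64_t)ts.tv_sec * 1000000000u + (uint64_t)ts.tv_nsec;
}

/* Time one copy variant many times at a given src/dst offset and
 * return the minimum observed time in nanoseconds. */
static uint64_t min_time_ns(void (*copy)(unsigned char *, const unsigned char *, size_t),
                            size_t bytes, size_t src_off, size_t dst_off, int runs)
{
    unsigned char *src_buf = malloc(bytes + 64);
    unsigned char *dst_buf = malloc(bytes + 64);
    uint64_t best = UINT64_MAX;

    memset(src_buf, 1, bytes + 64);   /* touch the pages up front */
    memset(dst_buf, 0, bytes + 64);

    for (int r = 0; r < runs; r++) {
        uint64_t t0 = nsec_now();
        copy(dst_buf + dst_off, src_buf + src_off, bytes);
        uint64_t t1 = nsec_now();
        if (t1 - t0 < best)
            best = t1 - t0;
    }
    free(src_buf);
    free(dst_buf);
    return best;
}

/* Example: sweep 8-byte offsets for both patterns over 1 MiB copies. */
int main(void)
{
    for (size_t off = 0; off < 64; off += 8)
        printf("off=%zu pattern1 min=%llu ns  pattern2 min=%llu ns\n", off,
               (unsigned long long)min_time_ns(copy64_pattern1, 1 << 20, off, off, 500),
               (unsigned long long)min_time_ns(copy64_pattern2, 1 << 20, off, off, 500));
    return 0;
}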

Questions

I understand that a cacheline-spanning movdqu will perform worse than a non-cacheline-spanning movdqu (see the small snippet after the questions for when a 16-byte access splits a 64-byte line), but I don't understand the following:

1) Why does Pattern1 exhibit high variance on Westmere and Ivybridge? How does the order of the loads and stores make the difference?

2) Why are the min-times for Pattern2 slower than those for Pattern1 across the different architectures?
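
To make the split-line case concrete: with 64-byte cache lines, a 16-byte access starting at byte offset o splits a line exactly when (o % 64) > 48. The tiny check below is illustrative only.

#include <stdio.h>

/* A 16-byte access starting at 'addr' spans two 64-byte cache lines
 * exactly when its starting offset within the line is greater than 48. */
static int splits_cache_line(unsigned long addr)
{
    return (addr % 64) > 64 - 16;   /* offsets 49..63 split */
}

int main(void)
{
    for (unsigned long off = 0; off < 64; off += 8)
        printf("offset %2lu: %s\n", off, splits_cache_line(off) ? "split" : "no split");
    return 0;
}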

Thanks for taking the time to read this long post.

Karthik

  • `movdqu` loads / stores 16 bytes, and is faster when used with addresses that are 16B-aligned, I think. Actually, I forget: other than cacheline splits, unaligned accesses may be the same speed as aligned, on more recent CPUs. It's been a while since I carefully read Agner Fog's docs. – Peter Cordes Jun 24 '15 at 14:18
  • What happens if you don't use `prefetchnta`? Is it having any effect on average speed, and/or variance? What about if you used `movnta` instead of `movdqu`? Err, that would probably fault on unaligned, and there's no `movntu` available. – Peter Cordes Jun 24 '15 at 14:21
  • Thanks for your interest. – Karthik M Jun 24 '15 at 22:13
  • Thanks for your interest. **16B alignment and MOVDQU** On pg 2-45 of [http://goo.gl/v5kDzZ], Table 2-25 shows that the cycle cost for movdqu went from {2, ~2, 20} (16B-aligned, 16B-unaligned, split cache line) pre-Haswell to {1, 1, 4} on Haswell. So there is no penalty for a 16B unaligned access compared to a 16B aligned one, and on Haswell the penalty for a split-cacheline access is smaller than on older architectures. This would explain the results if Pattern1 caused more split-cacheline accesses than Pattern2, but that can't be the case. – Karthik M Jun 24 '15 at 22:22
  • **Effect of prefetchnta** Haswell: Having prefetchnta has no effect on variance, and a big positive impact (~40%) on min-times and average. Westmere: prefetchnta has no effect on variance, and no effect on min-times and average. Nehalem: Not having prefetchnta causes large variance. Having prefetchnta causes about a 15% negative impact on min-times. – Karthik M Jun 24 '15 at 22:23
  • Just to be clear on positive and negative: On Haswell, it's 40% faster with `prefetchnta`, while on Nehalem including `prefetchnta` makes it 15% slower? – Peter Cordes Jun 25 '15 at 06:15
  • I thought that for `prefetchnta` to do much good, you should be prefetching a cacheline ahead of the one you're about to read/write. I tried to find stuff on intel's web site, but mostly I found stuff about using `movnt...`, not prefetch. https://software.intel.com/en-us/forums/topic/306094. I thought I saw a doc somewhere about using non-temporal accesses, and worrying about how many fill buffers were available, and so on, but it didn't turn up right away with google. – Peter Cordes Jun 25 '15 at 06:22
  • And no, I don't have any ideas that make sense for why pattern1 should get different results from pattern2. Your loop should fit in the 28uop loop buffer (present on Nehalem and later), and you're not going to get 4 uops per cycle anyway. More like 2, since every other uop is a store, and there's only one store port. – Peter Cordes Jun 25 '15 at 06:24
  • Have you tried using the performance counters? `perf` on linux, or vtune. Or possibly even Intel's code analyzer (IACA). – Peter Cordes Jun 25 '15 at 06:24
  • Your interpretation of my usage of positive and negative is correct. I will try using the perf counters and post an update. – Karthik M Jun 26 '15 at 16:04
