The theoretical maximum memory bandwidth for a Core 2 processor with dual-channel DDR3 memory is impressive: according to the Wikipedia article on the architecture, 10+ GB/s, perhaps even 20+ GB/s. However, stock memcpy() calls do not come close to this; 3 GB/s is the highest I've seen on such systems. Likely this is because an OS vendor cannot tune memcpy() for every processor line: a stock implementation has to perform reasonably across a wide range of brands and lines, so it cannot exploit the characteristics of any one processor.
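For reference, here's a minimal sketch of the kind of measurement loop that produces numbers like these (the 256 MB buffer size, iteration count, and clock()-based timing are arbitrary choices, not a definitive harness):

```c
/* Minimal memcpy bandwidth sketch. Buffers are much larger than the
   caches so the loop measures DRAM traffic rather than L2 hits. */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

int main(void)
{
    const size_t size  = 256u * 1024 * 1024;  /* 256 MB per buffer */
    const int    iters = 10;
    char *src = malloc(size);
    char *dst = malloc(size);
    if (!src || !dst)
        return 1;

    memset(src, 1, size);  /* touch every page up front so faults */
    memset(dst, 0, size);  /* don't get charged to the timed loop  */

    clock_t start = clock();
    for (int i = 0; i < iters; i++)
        memcpy(dst, src, size);
    double secs = (double)(clock() - start) / CLOCKS_PER_SEC;

    /* Bytes copied per second; the bus sees roughly twice this,
       since every byte is both read and written. */
    printf("memcpy: %.2f GB/s\n", (double)size * iters / 1e9 / secs);

    free(src);
    free(dst);
    return 0;
}
```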

My question: Is there a freely available, highly tuned memcpy() for Core 2 or Core i7 processors that can be used from a C program? I'm sure I'm not the only person who needs one, and it would be a big waste of effort for everyone to micro-optimize their own memcpy().

3 Answers

When measuring bandwidth, did you take into account that memcpy is both a read and a write? 3 GB/s of memory copied is actually 6 GB/s of bus bandwidth.

Remember that the quoted bandwidth is a theoretical maximum; real-world throughput will be much lower. For instance, one page fault and your bandwidth drops to MB/s.

memcpy/memmove are compiler intrinsics and will usually be inlined as rep movsd (or the appropriate SSE instructions if your compiler can target them). It may be impossible to improve on this codegen, since modern CPUs handle rep instructions very, very well.
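Purely as an illustration of what that inlining can reduce to, here is a rep movsb copy written with GCC-style inline assembly (x86/x86-64 only; a sketch, not something you'd normally write by hand):

```c
#include <stddef.h>

/* Sketch: the whole copy collapses to a single rep movsb.
   The "+D"/"+S"/"+c" constraints pin dst, src, and the count to
   (R/E)DI, (R/E)SI, and (R/E)CX, which is what rep movsb expects. */
static void copy_rep_movsb(void *dst, const void *src, size_t n)
{
    __asm__ volatile ("rep movsb"
                      : "+D" (dst), "+S" (src), "+c" (n)
                      :
                      : "memory");
}
```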

Michael

If you specify /ARCH:SSE2 to MSVC it should provide you with a tuned memcpy (at least, mine does).
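For instance (a hypothetical test file; whether MSVC expands this particular call inline depends on the compiler version and optimization settings):

```c
/* Compile with: cl /O2 /arch:SSE2 copytest.c */
#include <string.h>

void copy4k(void *dst, const void *src)
{
    /* A constant size makes it easy for the compiler to inline the copy
       with SSE2 moves instead of calling into the CRT. */
    memcpy(dst, src, 4096);
}
```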

Failing that, use the SSE aligned load/store intrinsics yourself to copy the memory in large chunks, with a Duff's-device-style loop of word copies where necessary to bring the head and tail of the data to an aligned boundary. You'll also need the cache-management intrinsics (prefetching and non-temporal stores) to get good performance.
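Here's a sketch of that approach, assuming SSE2. To keep it short it aligns only the destination and uses unaligned loads for the source, rather than a full double-alignment scheme, and the 256-byte prefetch distance is a guess that would need tuning:

```c
#include <emmintrin.h>  /* SSE2 intrinsics (pulls in the SSE1 ones too) */
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Sketch of an SSE2 bulk copy with streaming stores. The destination is
   brought to 16-byte alignment; movntdq then bypasses the cache so a huge
   copy doesn't evict the rest of the working set. */
static void copy_sse2_stream(void *dst, const void *src, size_t n)
{
    char       *d = dst;
    const char *s = src;

    /* Head: copy byte-wise until the destination is 16-byte aligned. */
    size_t head = (16 - ((uintptr_t)d & 15)) & 15;
    if (head > n) head = n;
    memcpy(d, s, head);
    d += head; s += head; n -= head;

    /* Body: one 64-byte cache line per iteration. */
    while (n >= 64) {
        _mm_prefetch(s + 256, _MM_HINT_NTA);  /* stay ahead of the loads */
        __m128i a = _mm_loadu_si128((const __m128i *)(s +  0));
        __m128i b = _mm_loadu_si128((const __m128i *)(s + 16));
        __m128i c = _mm_loadu_si128((const __m128i *)(s + 32));
        __m128i e = _mm_loadu_si128((const __m128i *)(s + 48));
        _mm_stream_si128((__m128i *)(d +  0), a);
        _mm_stream_si128((__m128i *)(d + 16), b);
        _mm_stream_si128((__m128i *)(d + 32), c);
        _mm_stream_si128((__m128i *)(d + 48), e);
        d += 64; s += 64; n -= 64;
    }
    _mm_sfence();  /* make the non-temporal stores globally visible */

    /* Tail: whatever is left over. */
    memcpy(d, s, n);
}
```

The non-temporal stores are the part that matters most for very large copies; for small copies that fit in cache, plain aligned stores will usually beat them.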

Your limiting factor is probably cache misses and northbridge (memory-controller) bandwidth rather than CPU cycles. Given that there's always going to be lots of other traffic on the memory bus, I'm usually happy to reach about 90% of the theoretical memory bandwidth in such operations.

Crashworks
  • The MSVC memcpy is vectorized when these conditions are met (roughly; I'm not an expert on this): both the source and destination addresses are at least 8-byte (64-bit) aligned, and the size is above a certain threshold. The 64-bit alignment comes from MSVC's guarantee that its own `malloc` returns 64-bit-aligned memory. On 32-bit builds, 128-bit SSE2 moves are then used (with a 64-bit shuffle if required); on 64-bit builds, 64-bit general-purpose registers are used (with Duff's device), because done properly that is "fast enough" compared to SSE2. – rwong Sep 13 '13 at 21:27
  • From Visual Studio 2013 Update 3's VC++ output: "/arch: minimum CPU architecture requirements, one of: SSE2 - (default) enable use of instructions available with SSE2 enabled CPUs". Since SSE2 is already the default, in my benchmarks /ARCH:SSE2 did not improve memcpy performance, and neither did /ARCH:AVX. – zhaorufei Dec 10 '15 at 02:22

You could write your own. Try using the Intel optimising compiler (ICC) to target the architecture directly.

Intel also produce a profiler called VTune (compiler- and language-independent) for optimising applications.

Here's an article on optimising a game engine.

Mitch Wheat