My question is based on another SO question: Why does _mm_stream_ps produce L1/LL cache misses?
After reading it and being intrigued, I tried to replicate the results and see for myself which was faster: a naive loop, an unrolled naive loop, _mm_stream_ps (unrolled), _mm_store_ps (unrolled) and, last but not least, memset_pattern4. (The last one takes a 4-byte pattern, such as a float, and plasters it all over the destination array, which should do the same as all the other functions; it's probably OS X exclusive though.)
I've made sure to align the start of my array on a cache line (64 bytes, I checked), to pass the array as an argument, and to apply the other performance tweaks that were mentioned in the previous question.
Somebody else wanted to know the same thing on gamedev: http://www.gamedev.net/topic/532112-fast-memset/
The conclusions of that thread mirror my own: when the destination array is smaller than the largest (L3) cache, _mm_store_ps is faster than _mm_stream_ps. When the destination array is larger, _mm_stream_ps is faster. I'm not entirely sure why _mm_store_ps is faster in the first case, since I never use those values in the cache, but I get why _mm_stream_ps wins out in the latter case. It's made for this situation: writing bytes to memory which you won't need immediately (or ever).
Here are some results with a destination array 256 times larger than L3 cache (in my case, 1.5GB), compiled with gcc 4.8:
gcc-4.8 stream.c -o stream -std=c11 -O3 -g3 -ftree-vectorize -march=native -minline-all-stringops && ./stream
bench L3-MASS, array 1610612736 bytes (402653184 floats, 0 remainder, 0x104803040 pointer)
warm up round...
6% ( 20.81148 ms) : MEMSET CHEAT
8% ( 28.49419 ms) : MEMSET PATTER
100% ( 371.40385 ms) : NAIVE NORMAL
54% ( 202.01147 ms) : NAIVE UNROLL
31% ( 113.53433 ms) : STREAM NORMAL
30% ( 111.41691 ms) : STREAM UNROLL
51% ( 190.70412 ms) : STORE NORMAL
51% ( 189.15338 ms) : STORE UNROLL
51% ( 189.36182 ms) : STORE PREFET
So what do we learn from this? memset_pattern4 is unbelievably fast. I included bog-standard memset for comparison, even though it only writes a 1-byte pattern. In essence, memset cheats, but memset_pattern4 does not, and it's still wicked fast.
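For reference, the two calls differ only in the pattern width; both take a length in bytes, which becomes relevant in EDIT 2 below. These are essentially func0 and funcB from the full code further down, shown here with made-up wrapper names:

#include <string.h>   /* memset; memset_pattern4 is OS X only */

/* both lengths are in BYTES, not floats */
void fill_byte_pattern(float *arr, size_t nbytes) {
    memset(arr, 0x05, nbytes);              /* repeats the single byte 0x05 */
}

void fill_float_pattern(float *arr, size_t nbytes) {
    const float val = 5.0f;
    memset_pattern4(arr, &val, nbytes);     /* repeats the 4-byte pattern of 5.0f */
}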
I've tried looking at the assembly for what I believe is the source code to memset_pattern4 in the OS X string library:
- Apple's libc, memset_pattern4: http://www.opensource.apple.com/source/Libc/Libc-825.25/string/memset_pattern.c?txt
- This references a so-called bcopy function. Let's dig for that in the string library: http://www.opensource.apple.com/source/Libc/Libc-763.13/x86_64/string/
- And most likely the SSE 4.2 version is being used in my case: http://www.opensource.apple.com/source/Libc/Libc-763.13/x86_64/string/bcopy_sse42.s
My knowledge of asm reaches (by now) far enough to see that they're using the movdqa instruction where it matters (in the LAlignedLoop section), which is basically an SSE move instruction for integers (not floats); the corresponding intrinsic is _mm_store_si128. Not that it should matter here: same bits and bytes either way, right?
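Just to spell out the "same bits" point, the integer-register version of my aligned-store fill would look roughly like this (my own sketch, not Apple's code; it assumes the length is a multiple of 4 floats and the pointer is 16-byte aligned):

#include <stdint.h>
#include <string.h>
#include <x86intrin.h>

/* broadcast the bit pattern of 5.0f and store it with movdqa instead of movaps */
void fill_si128(float * const restrict arr, size_t nfloats) {
    int32_t bits;
    const float val = 5.0f;
    memcpy(&bits, &val, sizeof(bits));        /* reinterpret 5.0f as its 32-bit pattern */
    const __m128i buf = _mm_set1_epi32(bits);
    for (size_t i = 0; i < nfloats; i += 4) {
        _mm_store_si128((__m128i *)&arr[i], buf);
    }
}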
- There also seems to be a pure asm implementation of memset_pattern4, which seems different as it doesn't call bcopy: http://www.opensource.apple.com/source/Libc/Libc-763.13/x86_64/string/memset.s (EDIT: this is the correct one, as verified by running under gdb)
...damn, this one seems to use non-temporal (_mm_stream_ps) stores for very long arrays => movntdq %xmm0,(%rdi,%rcx)... (look in the LVeryLong section of the function), which is exactly what I do! So how can it be that much faster? Maybe that's not the memset_pattern4 I'm looking for.
So, what is memset_pattern4 doing under the hood, and why is it 5x faster than my best attempt? Even though I've been trying to learn enough x86 assembly to dissect the function, I'm afraid debugging performance issues in optimized-to-death functions is a bit out of my league for now.
NOTE: for those curious, this microbenchmark also serves to illustrate the sheer awesomeness of clang and its advanced vectorization (-fslp-vectorize): it manages to make the naive loop the fastest one, save for memset, in almost all cases. It seems to be about as good as the best combination of _mm_store_ps and _mm_stream_ps.
CODE: here's the code I use to perform my benchmark (as gist: https://gist.github.com/6571379):
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <string.h>
#include <assert.h>
/**
* compile and run:
*
* OSX:
* clang stream.c -o stream -std=c11 -O3 -g -ftree-vectorize -fslp-vectorize -march=native -minline-all-stringops && ./stream
* gcc-4.8 stream.c -o stream -std=c11 -O3 -g3 -ftree-vectorize -march=native -minline-all-stringops && ./stream
*
* linux:
* clang stream.c -o stream -lrt -std=c11 -O3 -ftree-vectorize -fslp-vectorize -march=native && ./stream
* gcc-4.8 stream.c -o stream -lrt -std=c11 -O3 -ftree-vectorize -march=native && ./stream
*
* to generate the assembly:
* gcc-4.8 -S stream.c -o stream.s -std=c11 -O3 -g3 -ftree-vectorize -march=native -minline-all-stringops
* gobjdump -dS stream > stream.obj.s
*
* clang is the (very clear) winner here, the SLP vectorizer is absolutely killer, it even turns the
* plain naive loop into something hyper-performant
*/
/* posix headers */
#include <sys/time.h>
#include <time.h> /* clock_gettime / struct timespec on the non-Mach path */
/* intrinsics */
#include <x86intrin.h>
#define ARRAY_SIZE(x) ((sizeof(x)/sizeof(0[x])) / ((size_t)(!(sizeof(x) % sizeof(0[x])))))
/**
* some stats from my system
*
* sudo sysctl -a | grep cache
*
* hw.cachelinesize = 64
* hw.l1icachesize = 32768
* hw.l1dcachesize = 32768
* hw.l2cachesize = 262144
* hw.l3cachesize = 6291456
*/
/* most processors these days (2013) have a 64 byte cache line */
#define FACTOR 1024
#define CACHE_LINE 64
#define FLOATS_PER_LINE (CACHE_LINE / sizeof(float))
#define L1_CACHE_BYTES 32768
#define L2_CACHE_BYTES 262144
#define L3_CACHE_BYTES 6291456
#ifdef __MACH__
#include <mach/mach_time.h>
double ns_conversion_factor;
double us_conversion_factor;
double ms_conversion_factor;
void timeinit() {
mach_timebase_info_data_t timebase;
mach_timebase_info(&timebase);
ns_conversion_factor = (double)timebase.numer / (double)timebase.denom;
us_conversion_factor = (double)timebase.numer / (double)timebase.denom / 1000;
ms_conversion_factor = (double)timebase.numer / (double)timebase.denom / 1000000;
}
double nsticks() {
return mach_absolute_time() * ns_conversion_factor;
}
double msticks() {
return mach_absolute_time() * ms_conversion_factor;
}
#else
void timeinit() {
/* do nothing */
}
double nsticks() {
struct timespec ts;
clock_gettime(CLOCK_MONOTONIC, &ts);
return ((double)ts.tv_sec) * 1000000000.0 + ((double)ts.tv_nsec);
}
double msticks() {
struct timespec ts;
clock_gettime(CLOCK_MONOTONIC, &ts);
return ((double)ts.tv_sec) * 1000.0 + ((double)ts.tv_nsec) / 1000000.0;
}
#endif
void *aligned_malloc(size_t size, size_t alignment) {
void *pa, *ptr;
pa = malloc((size+alignment-1)+sizeof(void *));
if (!pa) return NULL;
ptr=(void*)( ((intptr_t)pa+sizeof(void *)+alignment-1)&~(alignment-1) );
*((void **)ptr-1)=pa;
return ptr;
}
void aligned_free(void *ptr) {
if (ptr) free(*((void **)ptr-1));
}
void pollute_cache(uint8_t volatile *arr, size_t length) {
for (int i = 0; i < length; ++i) {
arr[i] = (arr[i] > 0xFE) ? 0xAA : 0x55;
}
}
void pollute_cache_standalone() {
const size_t pollute_len = 2 * L3_CACHE_BYTES;
uint8_t *arr = aligned_malloc(pollute_len * sizeof(uint8_t), 64);
for (int i = 0; i < pollute_len; ++i) {
arr[i] = (arr[i] > 0xFE) ? 0xAA : 0x55;
}
aligned_free(arr);
}
/**
* returns the time passed, in milliseconds
*/
double tim(const char *name, double baseline, void (*pre)(void), void (*func)(float *, size_t), float * restrict arr, size_t length) {
struct timeval t1, t2;
if (pre) pre();
const double ms1 = msticks();
func(arr, length);
const double ms2 = msticks();
const double ms = (ms2 - ms1);
if (baseline == -2.0) return ms;
/* first run, equal to baseline (itself) by definition */
if (baseline == -1.0) baseline = ms;
if (baseline != 0.0) {
fprintf(stderr, "%7.0f%% (%10.5f ms) : %s\n", (ms / baseline) * 100, ms, name);
}
else {
fprintf(stderr, "%7.3f ms : %s\n", ms, name);
}
return ms;
}
void func0(float * const restrict arr, size_t length) {
/* note: memset takes a byte count, but bench() passes a float count here,
   so this only fills a quarter of the array (see EDIT 2 below) */
memset(arr, 0x05, length);
}
#ifdef __MACH__
void funcB(float * const restrict arr, size_t length) {
const float val = 5.0f;
/* same unit mix-up as func0: memset_pattern4 wants bytes but gets a float count */
memset_pattern4(arr, &val, length);
}
#endif
void func1(float * const restrict arr, size_t length) {
for (int i = 0; i < length; ++i) {
arr[i] = 5.0f;
}
}
void func2(float * const restrict arr, size_t length) {
for(int i = 0; i < length; i += 4) {
arr[i] = 5.0f;
arr[i+1] = 5.0f;
arr[i+2] = 5.0f;
arr[i+3] = 5.0f;
}
}
void func3(float * const restrict arr, size_t length) {
const __m128 buf = _mm_setr_ps(5.0f, 5.0f, 5.0f, 5.0f);
for (int i = 0; i < length; i += 4) {
_mm_stream_ps(&arr[i], buf);
}
_mm_mfence();
}
void func4(float * const restrict arr, size_t length) {
const __m128 buf = _mm_setr_ps(5.0f, 5.0f, 5.0f, 5.0f);
for (int i = 0; i < length; i += 16) {
_mm_stream_ps(&arr[i + 0], buf);
_mm_stream_ps(&arr[i + 4], buf);
_mm_stream_ps(&arr[i + 8], buf);
_mm_stream_ps(&arr[i + 12], buf);
}
_mm_mfence();
}
void func5(float * const restrict arr, size_t length) {
const __m128 buf = _mm_setr_ps(5.0f, 5.0f, 5.0f, 5.0f);
for (int i = 0; i < length; i += 4) {
_mm_store_ps(&arr[i], buf);
}
}
void fstore_prefetch(float * const restrict arr, size_t length) {
const __m128 buf = _mm_setr_ps(5.0f, 5.0f, 5.0f, 5.0f);
for (int i = 0; i < length; i += 16) {
__builtin_prefetch(&arr[i + FLOATS_PER_LINE * 32], 1, 0);
_mm_store_ps(&arr[i + 0], buf);
_mm_store_ps(&arr[i + 4], buf);
_mm_store_ps(&arr[i + 8], buf);
_mm_store_ps(&arr[i + 12], buf);
}
}
void func6(float * const restrict arr, size_t length) {
const __m128 buf = _mm_setr_ps(5.0f, 5.0f, 5.0f, 5.0f);
for (int i = 0; i < length; i += 16) {
_mm_store_ps(&arr[i + 0], buf);
_mm_store_ps(&arr[i + 4], buf);
_mm_store_ps(&arr[i + 8], buf);
_mm_store_ps(&arr[i + 12], buf);
}
}
#ifdef __AVX__
void func7(float * restrict arr, size_t length) {
const __m256 buf = _mm256_setr_ps(5.0f, 5.0f, 5.0f, 5.0f, 5.0f, 5.0f, 5.0f, 5.0f);
for (int i = 0; i < length; i += 8) {
_mm256_stream_ps(&arr[i], buf);
}
}
void func8(float * restrict arr, size_t length) {
const __m256 buf = _mm256_setr_ps(5.0f, 5.0f, 5.0f, 5.0f, 5.0f, 5.0f, 5.0f, 5.0f);
for (int i = 0; i < length; i += 32) {
_mm256_stream_ps(&arr[i + 0], buf);
_mm256_stream_ps(&arr[i + 8], buf);
_mm256_stream_ps(&arr[i + 16], buf);
_mm256_stream_ps(&arr[i + 24], buf);
}
}
void func9(float * restrict arr, size_t length) {
const __m256 buf = _mm256_setr_ps(5.0f, 5.0f, 5.0f, 5.0f, 5.0f, 5.0f, 5.0f, 5.0f);
for (int i = 0; i < length; i += 8) {
_mm256_store_ps(&arr[i], buf);
}
}
void funcA(float * restrict arr, size_t length) {
const __m256 buf = _mm256_setr_ps(5.0f, 5.0f, 5.0f, 5.0f, 5.0f, 5.0f, 5.0f, 5.0f);
for (int i = 0; i < length; i += 32) {
_mm256_store_ps(&arr[i + 0], buf);
_mm256_store_ps(&arr[i + 8], buf);
_mm256_store_ps(&arr[i + 16], buf);
_mm256_store_ps(&arr[i + 24], buf);
}
}
#endif
void bench(const char * restrict name, float * restrict arr, size_t length) {
fprintf(stderr, "bench %s, array %zu bytes (%zu floats, %zu remainder, %p pointer)\n", name, length, length / sizeof(float), length % sizeof(float), arr);
size_t nfloats = length / sizeof(float);
fprintf(stderr, "warm up round...");
func1(arr, nfloats);
fprintf(stderr, "done\n");
double baseline = tim("func1: NAIVE ", -2.0, NULL, func1, arr, nfloats);
tim("MEMSET CHEAT ", baseline, NULL, func0, arr, nfloats);
#ifdef __MACH__
tim("MEMSET PATTER", baseline, NULL, funcB, arr, nfloats);
#endif
tim("NAIVE NORMAL", -1.0, NULL, func1, arr, nfloats);
tim("NAIVE UNROLL", baseline, NULL, func2, arr, nfloats);
tim("STREAM NORMAL", baseline, NULL, func3, arr, nfloats);
tim("STREAM UNROLL", baseline, NULL, func4, arr, nfloats);
tim("STORE NORMAL", baseline, NULL, func5, arr, nfloats);
tim("STORE UNROLL", baseline, NULL, func6, arr, nfloats);
tim("STORE PREFET", baseline, NULL, fstore_prefetch, arr, nfloats);
// for (int i = 0; i < 1; ++i) {
// tim("func0: MEMSET (cache polluted)", NULL, func0, arr, nfloats);
// tim("func1: NAIVE (cache polluted)", pollute_cache_standalone, func1, arr, nfloats);
// tim("func2: UNROLL (cache polluted)", pollute_cache_standalone, func2, arr, nfloats);
// tim("func3: STREAM (cache polluted)", pollute_cache_standalone, func3, arr, nfloats);
// tim("func4: STRUN (cache polluted)", pollute_cache_standalone, func4, arr, nfloats);
// tim("func5: STORE (cache polluted)", pollute_cache_standalone, func5, arr, nfloats);
// tim("func6: STOUN (cache polluted)", pollute_cache_standalone, func6, arr, nfloats);
// }
}
int main() {
timeinit();
static const struct {
const char *name;
size_t bytes;
} sizes[] = {
{ "L1-HALF", L1_CACHE_BYTES / 2 },
{ "L1-FULL", L1_CACHE_BYTES },
{ "L2-HALF", L2_CACHE_BYTES / 2 },
{ "L2-FULL", L2_CACHE_BYTES },
{ "L3-HALF", L3_CACHE_BYTES / 2 },
{ "L3-FULL", L3_CACHE_BYTES },
{ "L3-DOUB", L3_CACHE_BYTES * 2 },
{ "L3-HUGE", L3_CACHE_BYTES * 64 },
{ "L3-MASS", L3_CACHE_BYTES * 256 }
};
for (int i = 0; i < ARRAY_SIZE(sizes); ++i) {
size_t bytes = sizes[i].bytes;
/* align to cache line */
float *arr = aligned_malloc(bytes, CACHE_LINE);
bench(sizes[i].name, arr, bytes);
aligned_free(arr);
}
return 0;
}
EDIT: I went digging a bit further and edited the assembly that gcc generates to make it more or less the same as the one Apple uses (memset.s, label LVeryLong, i.e.: 4 unrolled movntdq instructions in a tight loop). To my surprise, I get the same performance as my functions that use _mm_store_ps (movaps). This perplexes me, as I would have expected it to be either
- as fast as memset_pattern4 (presumably unrolled movntdq), or
- as fast as unrolled _mm_stream_ps (movntdq)
But no, it seems to be the same as _mm_store_ps. Imagine that; maybe I'm doing something wrong. Running objdump on the resulting binary confirms that it is using movntdq, which surprises me even more. What the hell is going on?
Because I hit a dead end there, I decided to step through the executable in a debugger and set a breakpoint at memset_pattern4. Stepping into the function, I noticed that it does exactly what I thought it would: a tight loop with four unrolled movntdq stores:
0x00007fff92a5f7d2 <+318>: jmp 0x7fff92a5f7e0 <memset_pattern4+332>
0x00007fff92a5f7d4 <+320>: nopw 0x0(%rax,%rax,1)
0x00007fff92a5f7da <+326>: nopw 0x0(%rax,%rax,1)
0x00007fff92a5f7e0 <+332>: movntdq %xmm0,(%rdi,%rcx,1)
0x00007fff92a5f7e5 <+337>: movntdq %xmm0,0x10(%rdi,%rcx,1)
0x00007fff92a5f7eb <+343>: movntdq %xmm0,0x20(%rdi,%rcx,1)
0x00007fff92a5f7f1 <+349>: movntdq %xmm0,0x30(%rdi,%rcx,1)
0x00007fff92a5f7f7 <+355>: add $0x40,%rcx
=> 0x00007fff92a5f7fb <+359>: jne 0x7fff92a5f7e0 <memset_pattern4+332>
0x00007fff92a5f7fd <+361>: sfence
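For those who don't read asm fluently: the loop counts a negative byte offset in %rcx up towards zero, so the add's zero flag doubles as the loop condition. In C with intrinsics it corresponds roughly to this (my own sketch, not Apple's code; it assumes the byte count is a multiple of 64 and the destination is 16-byte aligned):

#include <x86intrin.h>

/* end points just past the region to fill; off runs from -nbytes up to 0 */
static void stream_fill_backwards_offset(char *end, long nbytes, __m128i pat) {
    for (long off = -nbytes; off != 0; off += 64) {
        _mm_stream_si128((__m128i *)(end + off +  0), pat);   /* movntdq */
        _mm_stream_si128((__m128i *)(end + off + 16), pat);
        _mm_stream_si128((__m128i *)(end + off + 32), pat);
        _mm_stream_si128((__m128i *)(end + off + 48), pat);
    }
    _mm_sfence();   /* matches the sfence right after the loop */
}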
So, what makes Apple's sauce that much more magical than mine, I wonder...
EDIT 2: I was wrong twice here. Apple's magic sauce is not that magical: I was effectively filling an array 4x smaller than the one my own functions filled, because the memset/memset_pattern4 wrappers were given a float count where a byte count was expected. Credit to @PaulR for noticing! Secondly, I was editing the assembly of the function, but gcc had already inlined it, so I was editing a copy that was never used.
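For the record, the fix is just a unit conversion in the wrapper; a corrected funcB (hypothetical name, dropped in next to funcB in the listing above, given the nfloats argument that bench() actually passes) would look like this:

#ifdef __MACH__
void funcB_fixed(float * const restrict arr, size_t nfloats) {
    const float val = 5.0f;
    /* memset_pattern4 expects a byte count */
    memset_pattern4(arr, &val, nfloats * sizeof(float));
}
#endif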
CONCLUSION:
Some other things I found out:
- Clang and gcc are really good: with the right intrinsics they optimize a lot (and clang even does a great job without intrinsics when the SLP vectorizer is enabled). They will also inline function pointers.
- Clang will replace a naive loop that stores a constant with a call to memset, clearing up another confusing result I had.
- Non-temporal stores (i.e. streaming) are only beneficial for huge writes.
- memset is really well optimized: it automatically switches between regular stores and non-temporal (streaming) stores based on the length of the array to write (see the sketch after this list). I'm not sure how much of this is true on platforms other than OS X.
- When writing a benchmark, make absolutely sure the function does what you think it does and that the compiler is not outsmarting you. The former was my problem here: I had not supplied the correct arguments.
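To make the memset point concrete, here is a minimal sketch of the switch-on-length idea. The 1 MiB threshold and the function name are made up; Apple's real cutoff and code-path selection live in memset.s and will differ. It also assumes a 16-byte aligned pointer and a multiple of 4 floats:

#include <stddef.h>
#include <x86intrin.h>

/* small fills stay in cache with regular stores, huge fills bypass it */
#define STREAM_THRESHOLD (1 << 20)   /* made-up 1 MiB cutoff */

void fill_floats_adaptive(float * const restrict arr, size_t nfloats) {
    const __m128 buf = _mm_set1_ps(5.0f);
    if (nfloats * sizeof(float) < STREAM_THRESHOLD) {
        for (size_t i = 0; i < nfloats; i += 4)
            _mm_store_ps(&arr[i], buf);    /* cached stores (movaps) */
    } else {
        for (size_t i = 0; i < nfloats; i += 4)
            _mm_stream_ps(&arr[i], buf);   /* non-temporal stores (movntps) */
        _mm_sfence();                      /* order the streaming stores */
    }
}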
EDIT: I recently stumbled upon the Intel optimization guide; if you're at all interested in these things, read some parts of it first (start at section 3.7.6, perhaps).