
I have a memory performance puzzle. I'm trying to benchmark how long it takes to fetch a byte from main memory, and how various BIOS settings and memory hardware parameters influence it. I wrote the following Windows code that, in a loop, flushes the cache by reading a separate buffer and then reads a target buffer one byte at a time with varying strides. I figure that once the stride reaches the cache-line size, each read has to go to main memory, so that's the quantity I'm trying to measure. Here's the benchmark code (note that the buffer size is the stride × 1M reads, and that I pin the thread to core 1):

#include <windows.h>   // QueryPerformanceCounter, SetThreadAffinityMask
#include <stdio.h>
#include <stdlib.h>    // malloc, free, exit
#include <memory.h>

#define NREAD       (1024*1024)
#define CACHE_SIZE  (50*1024*1024)

char readTest(int stride) {
    LARGE_INTEGER frequency;
    LARGE_INTEGER start;
    LARGE_INTEGER end;
    int rep, i,ofs;
    double time, min_time=1e100, max_time=0.0, mean_time=0.0;
    char *buf = (char *)malloc(NREAD*stride);
    char *flusher = (char *)malloc(CACHE_SIZE);
    char jnk=0;
    if (buf == NULL || flusher == NULL) exit(-4);  // up to 1.5 GB; allocation can fail
    for(rep=0; rep<255; rep++) {
        // read the flusher to flush the cache
        for(ofs = 0; ofs<CACHE_SIZE; ofs+=64) jnk+=flusher[ofs];
        if (QueryPerformanceFrequency(&frequency) == FALSE) exit(-1);
        if (QueryPerformanceCounter(&start) == FALSE) exit(-2);

        // here's the timed loop
        for(ofs=0; ofs<NREAD*stride; ofs+=stride) jnk += buf[ofs];

        if (QueryPerformanceCounter(&end) == FALSE) exit(-3);
        time = (double)(end.QuadPart - start.QuadPart) / (double)frequency.QuadPart*1e6;
        max_time = time > max_time ? time : max_time;
        min_time = time < min_time ? time : min_time;
        mean_time += time;
    }
    mean_time /= 255;
    printf("Stride = %4i, Max: %6.0f us, Min: %6.0f us, Mean: %6.0f us, B/W: %4.0f MB/s\n", stride, max_time, min_time, mean_time, NREAD/min_time);
    free(buf);
    free(flusher);
    return jnk;
}

int main(int argc, char* argv[]) {
    SetThreadAffinityMask(GetCurrentThread(), 1);  // pin to core 1 to avoid weirdness
    // run the tests
    readTest(1);    readTest(2);    readTest(4);    readTest(6);    readTest(8);
    readTest(12);   readTest(16);   readTest(24);   readTest(32);   readTest(48);
    readTest(64);   readTest(96);   readTest(128);  readTest(192);  readTest(256);
    readTest(384);  readTest(512);  readTest(768);  readTest(1024); readTest(1536);
    return 0;
}

The inner loop that is timed assembles as:

        // here's the timed loop
        for(ofs=0; ofs<NREAD*stride; ofs+=stride) jnk += buf[ofs];
00F410AF  xor         eax,eax  
00F410B1  test        edi,edi  
00F410B3  jle         readTest+0C2h (0F410C2h)  
00F410B5  mov         edx,dword ptr [buf]  
00F410B8  add         bl,byte ptr [eax+edx]  
00F410BB  add         eax,dword ptr [stride]  
00F410BE  cmp         eax,edi  
00F410C0  jl          readTest+0B5h (0F410B5h)  

I ran this on a dual-socket Xeon E5-2609 (Sandy Bridge) machine, and here are the results:

Stride =    1, Max:   2362 us, Min:    937 us, Mean:    950 us, B/W: 1119 MB/s
Stride =    2, Max:   1389 us, Min:    968 us, Mean:    978 us, B/W: 1083 MB/s
Stride =    4, Max:   1694 us, Min:   1026 us, Mean:   1037 us, B/W: 1022 MB/s
Stride =    6, Max:   2418 us, Min:   1098 us, Mean:   1124 us, B/W:  955 MB/s
Stride =    8, Max:   2835 us, Min:   1234 us, Mean:   1252 us, B/W:  850 MB/s
Stride =   12, Max:   4203 us, Min:   1527 us, Mean:   1559 us, B/W:  687 MB/s
Stride =   16, Max:   5130 us, Min:   1816 us, Mean:   1849 us, B/W:  577 MB/s
Stride =   24, Max:   7370 us, Min:   2408 us, Mean:   2449 us, B/W:  435 MB/s
Stride =   32, Max:  10039 us, Min:   2901 us, Mean:   3014 us, B/W:  361 MB/s
Stride =   48, Max:  14248 us, Min:   4652 us, Mean:   4731 us, B/W:  225 MB/s
Stride =   64, Max:  19149 us, Min:   6340 us, Mean:   6447 us, B/W:  165 MB/s
Stride =   96, Max:  28848 us, Min:   8475 us, Mean:   8615 us, B/W:  124 MB/s
Stride =  128, Max:  37449 us, Min:   9900 us, Mean:  10160 us, B/W:  106 MB/s
Stride =  192, Max:  51718 us, Min:  11282 us, Mean:  11563 us, B/W:   93 MB/s
Stride =  256, Max:  62193 us, Min:  11558 us, Mean:  11924 us, B/W:   91 MB/s
Stride =  384, Max:  86943 us, Min:  11829 us, Mean:  12260 us, B/W:   89 MB/s
Stride =  512, Max: 108661 us, Min:  11847 us, Mean:  12401 us, B/W:   89 MB/s
Stride =  768, Max: 167951 us, Min:  11797 us, Mean:  12946 us, B/W:   89 MB/s
Stride = 1024, Max: 211700 us, Min:  12893 us, Mean:  13979 us, B/W:   81 MB/s
Stride = 1536, Max: 332214 us, Min:  12967 us, Mean:  15077 us, B/W:   81 MB/s

Here are my questions:

  • Why does the performance continue to degrade after the stride is larger than the cache-line size (64 bytes for Sandy Bridge)? I would assume that the worst performance would occur once the stride is large enough to require a cache-line transfer for every read, but even after that the time increases by a factor of two... What am I missing?
  • Why is the max time (which happens on the first iteration of the loop) 2-4x longer than the minimum time? I'm flushing the cache every iteration...
Andrew
  • Could you just post all timing results? I'm curious to see them. – usr Nov 05 '13 at 18:46
  • You could remove the flushing by vastly increasing the size of the buffer under test (to hundreds of MBs). That removes complexity and causes of bugs. You could also use large pages to reduce TLB usage. You could also increase thread priority to HIGH. – usr Nov 05 '13 at 18:48
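
A minimal sketch of those two suggestions, assuming a Windows build (the buffer size and helper below are illustrative additions, not usr's code):

#include <windows.h>

// Hypothetical size, far past the 10 MB LLC of an E5-2609, so every
// timed pass is naturally cold and no explicit flush pass is needed.
#define HUGE_NREAD (256*1024*1024)

void setupBenchmark(void) {
    SetThreadAffinityMask(GetCurrentThread(), 1);                    // pin as before
    SetThreadPriority(GetCurrentThread(), THREAD_PRIORITY_HIGHEST);  // reduce scheduling noise
}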

2 Answers


Cache lines are not the only granularity at which memory is tracked. Translation from virtual to physical addresses happens at page granularity, and your system is almost certainly using 4 KiB pages.

At a stride of 64, you perform 64 reads in each 4 KiB page, and the 64 MB buffer spans 16384 pages. The L2 TLB can only track 512 of those pages, so you take an L2 TLB miss on each new page (every 64th access).

At a stride of 1024, you perform only 4 reads per page, and the 1 GB buffer spans 262144 pages. Now you take an L2 TLB miss on every 4th access.
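
To make that arithmetic concrete, here is a small standalone sketch (an addition, not part of the answer) that tabulates the expected L2 TLB miss rate per access for a few strides, assuming 4 KiB pages, a 512-entry L2 TLB, and a buffer much larger than the resulting 2 MiB TLB reach:

#include <stdio.h>

int main(void) {
    int strides[] = { 64, 256, 1024, 4096 };
    int i, s, reads_per_page;
    for (i = 0; i < 4; i++) {
        s = strides[i];
        reads_per_page = (4096 / s) > 0 ? (4096 / s) : 1;  // one visit per page once stride >= 4096
        printf("stride %4d: ~1 L2 TLB miss per %2d reads (%.1f%% of accesses)\n",
               s, reads_per_page, 100.0 / reads_per_page);
    }
    return 0;
}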

tl;dr: TLB misses are killing you. You can use the perf counters to observe this directly instead of having Stack Overflow read the tea leaves for you. You can also get your system to allocate the buffer using one or more “superpages” to extend your TLB reach (though different systems have varying degrees of support for this feature).
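
On Windows, the superpage route looks roughly like this (a sketch under stated assumptions, not the answerer's code; it requires the "Lock pages in memory" privilege, SeLockMemoryPrivilege, to be granted to the user):

#include <windows.h>

// Allocate a buffer backed by large (typically 2 MiB) pages so each
// TLB entry covers 512x more memory than a 4 KiB page would.
char *allocLargePages(SIZE_T bytes) {
    SIZE_T large = GetLargePageMinimum();        // 0 if large pages are unsupported
    if (large == 0) return NULL;
    bytes = (bytes + large - 1) & ~(large - 1);  // size must be a multiple of the large page
    return (char *)VirtualAlloc(NULL, bytes,
                                MEM_RESERVE | MEM_COMMIT | MEM_LARGE_PAGES,
                                PAGE_READWRITE);
}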

Stephen Canon
  • Okay, thanks for the tip... I'll have to do some more reading. Follow up question: should the floor be at a TLB miss per access, or a stride of 4K? Performance keeps degrading up to a stride of 2MB. – Andrew Nov 05 '13 at 20:10
  • @Andrew, A stride of 4k or above means a TLB miss on every access if your pages are 4k. It's possible to build your page map with 2M pages, which might move the degradation there instead (although it would still be gradual, as Stephen noted, since growing the stride leaves fewer reuses per translation). – Leeor Nov 06 '13 at 15:43
  1. The degradation continues past the cache-line size because of prefetching: as long as you walk over cache lines in a steady stream (even a strided one), the HW prefetchers keep bringing in the next few lines ahead of you. The L2 streamer is especially useful here, as it can run faster than your stream of accesses.
    However, once your stride goes past 128 bytes, you start running ahead of the streamer and incur the full latency on every access.
    To make sure this is indeed the case, disable prefetching (hopefully your system allows this in the BIOS). EDIT: Stephen raises a very good point about the ratio of accesses to TLB lookups as well; this would account for the large strides. If you plot the time per stride, I'm willing to wager you'd see a strong trend from the TLB miss rate, and on top of it a jump between 64- and 128-byte strides.

  2. I believe your first iteration is longer due to a cold TLB. You can test this by trying to flush the TLB as well (hard...), or by running a warmup iteration and measuring only from the second one; a sketch follows.
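
Here is what that warmup pass could look like (a sketch of the suggestion, reusing the question's variable names):

// Untimed pass over the buffer: faults every page in and populates the
// TLB so the first timed repetition is no longer cold. Call once in
// readTest, just before the rep loop.
static char warmup(char *buf, int stride, int nread) {
    char jnk = 0;
    int ofs;
    for (ofs = 0; ofs < nread * stride; ofs += stride)
        jnk += buf[ofs];
    return jnk;  // returned so the compiler can't drop the loop
}

Comparing the max times with and without warmup(buf, stride, NREAD) would confirm whether the first-iteration penalty is indeed a cold-TLB effect.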

Leeor