
In a self-educational project I measure memory bandwidth with the help of the following code (paraphrased here; the complete code follows at the end of the question):

unsigned int doit(const std::vector<unsigned int> &mem){
   const size_t BLOCK_SIZE=16;
   size_t n = mem.size();
   unsigned int result=0;
   for(size_t i=0;i<n;i+=BLOCK_SIZE){           
             result+=mem[i];
   }
   return result;
}

//... initialize mem, result and so on
int NITER = 200; 
//... measure time of
   for(int i=0;i<NITER;i++)
       result+=doit(mem);

BLOCK_SIZE is chosen in such a way that a whole 64-byte cache line is fetched per single integer addition. My machine (an Intel Broadwell) needs about 0.35 nanoseconds per integer addition, so the code above could saturate a bandwidth as high as 182 GB/s (this value is just an upper bound and is probably quite off; what is important is the ratio of the bandwidths for different sizes). The code is compiled with g++ and -O3.
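
(For reference, that upper bound is just 64 B per cache line divided by 0.35 ns per addition: 64 B / 0.35 ns ≈ 182 GB/s.)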

Varying the size of the vector, I can observe the expected bandwidths for the L1(*), L2 and L3 caches and for RAM:

[plot: size(kB) vs. bandwidth(GB/s), showing the expected plateaus for L1, L2, L3 and RAM]

However, there is an effect I'm really struggling to explain: the collapse of the measured bandwidth of the L1 cache for sizes around 2 kB, shown here in somewhat higher resolution:

[plot: size(kB) vs. bandwidth(GB/s), zoomed in on the region around 2 kB where the bandwidth collapses]

I could reproduce the results on all machines I have access to (which have Intel Broadwell and Intel Haswell processors).

My question: What is the reason for the performance collapse at memory sizes around 2 kB?

(*) I hope I understand correctly that for the L1 cache not 64 bytes but only 4 bytes are read/transferred per addition (there is no further, faster cache for which a whole cache line would have to be filled), so the plotted bandwidth for L1 is only an upper limit and not the bandwidth itself.
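
(If that is right, the data actually read from L1 in this loop is at most 4 B / 0.35 ns ≈ 11 GB/s; the plot counts the full 64-byte block per addition and therefore credits 16 times as many bytes.)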

Edit: When the step size in the inner for-loop is chosen to be

  • 8 (instead of 16), the collapse happens at 1 kB
  • 4 (instead of 16), the collapse happens at 0.5 kB

i.e. when the inner loop consists of about 31-35 steps/reads. That means the collapse isn't due to the memory size but due to the number of steps in the inner loop.
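
A variant along these lines is enough to reproduce this (a sketch with the step size as a parameter, not the exact code I benchmarked):

#include <cstddef>
#include <vector>

//same reduction as doit(), but with the step size as a parameter,
//so that strides of 4, 8 and 16 can be compared directly:
unsigned int doit_step(const std::vector<unsigned int> &mem, size_t step){
   unsigned int result=0;
   for(size_t i=0;i<mem.size();i+=step){
        result+=mem[i];
   }
   return result;
}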

It can be explained with branch misses as shown in @user10605163's great answer.


Listing for reproducing the results

bandwidth.cpp:

#include <vector>
#include <chrono>
#include <iostream>
#include <algorithm>


//returns minimal time needed for one execution in seconds:
template<typename Fun>
double timeit(Fun&& stmt, int repeat, int number)
{  
   std::vector<double> times;
   for(int i=0;i<repeat;i++){
       auto begin = std::chrono::high_resolution_clock::now();
       for(int i=0;i<number;i++){
          stmt();
       }
       auto end = std::chrono::high_resolution_clock::now();
       double time = std::chrono::duration_cast<std::chrono::nanoseconds>(end-begin).count()/1e9/number;
       times.push_back(time);
   }
   return *std::min_element(times.begin(), times.end());
}


const int NITER=200;
const int NTRIES=5;
const size_t BLOCK_SIZE=16;


struct Worker{
   std::vector<unsigned int> &mem;
   size_t n;
   unsigned int result;
   void operator()(){
        for(size_t i=0;i<n;i+=BLOCK_SIZE){           
             result+=mem[i];
        }
   }

   Worker(std::vector<unsigned int> &mem_):
       mem(mem_), n(mem.size()), result(1)
   {}
};

double PREVENT_OPTIMIZATION=0.0;


double get_size_in_kB(int SIZE){
   return SIZE*sizeof(int)/(1024.0);
}

double get_speed_in_GB_per_sec(int SIZE){
   std::vector<unsigned int> vals(SIZE, 42);
   Worker worker(vals);
   double time=timeit(worker, NTRIES, NITER);
   PREVENT_OPTIMIZATION+=worker.result;
   return get_size_in_kB(SIZE)/(1024*1024)/time;
}


int main(){

   int size=BLOCK_SIZE*16;
   std::cout<<"size(kB),bandwidth(GB/s)\n";
   while(size<10e3){
       std::cout<<get_size_in_kB(size)<<","<<get_speed_in_GB_per_sec(size)<<"\n";
       size=(static_cast<int>(size+BLOCK_SIZE)/BLOCK_SIZE)*BLOCK_SIZE;  //advance size by one block (keeps it a multiple of BLOCK_SIZE)
   }

   //ensure that nothing is optimized away:
   std::cerr<<"Sum: "<<PREVENT_OPTIMIZATION<<"\n";
}

create_report.py:

import sys
import pandas as pd
import matplotlib.pyplot as plt

input_file=sys.argv[1]
output_file=input_file[0:-3]+'png'
data=pd.read_csv(input_file)

labels=list(data)    
plt.plot(data[labels[0]], data[labels[1]], label="my laptop")
plt.xlabel(labels[0])
plt.ylabel(labels[1])   
plt.savefig(output_file)
plt.close()

Building/running/creating report:

>>> g++ -O3 -std=c++11 bandwidth.cpp -o bandwidth
>>> ./bandwidth > report.txt
>>> python create_report.py report.txt
# image is in report.png
  • What is your operating system? – jxh Dec 12 '18 at 21:14
  • @jxh Linux, not sure it depends on OS, rather on the intel-processor – ead Dec 12 '18 at 21:17
  • See [this article](https://danluu.com/3c-conflict/) on the effect of page alignment. It looks like the effect you are observing is related. – K. Nielson Dec 12 '18 at 21:30
  • The `void doit(...)` function at the top of your question doesn't do anything with `result`. gcc and clang both optimize it away completely. https://godbolt.org/z/6P4sqh. gcc optimizes the whole function to just `ret`, while clang still loops the right number of times but doesn't read memory. If you returned the result, it couldn't be optimized away. (Assuming that the caller also did something with it to defeat optimization after inlining.) https://godbolt.org/z/4vjC9v – Peter Cordes Dec 13 '18 at 03:10
  • When you're getting L1 hits, it's not really accurate to claim 64 bytes "transferred" when you're only reading 4 bytes. You need a CPU with AVX512 to actually read 64 bytes from L1d with a single uop, and using 512-bit vectors reduces the max turbo clock on current Intel CPUs. (Also, without loop unrolling with multiple accumulators, you'll bottleneck on the 1 cycle latency of `add` instead of the 2 load per clock L1d throughput.) – Peter Cordes Dec 13 '18 at 03:16

2 Answers


I changed the values slightly: NITER = 100000 and NTRIES=1 to get a less noisy result.

I don't have a Broadwell available right now; however, I tried your code on my Coffee Lake and got a performance drop, not at 2 KB, but around 4.5 KB. In addition, I find erratic behavior of the throughput slightly above 2 KB.

The blue line in the graph corresponds to your measurement (left axis):

The red line here is the result from perf stat -e branch-instructions,branch-misses, giving the fraction of branches that were not correctly predicted (in percent, right axis). As you can see there is a clear anti-correlation between the two.

Looking into the more detailed perf report, I found that basically all of these branch mispredictions happen in the innermost loop in Worker::operator(). If the taken/not-taken pattern for the loop branch becomes too long, the branch predictor will not be able to keep track of it, so the exit branch of the inner loop will be mispredicted, leading to the sharp drop in throughput. With a further increasing number of iterations, the impact of this single mispredict becomes less significant, leading to the slow recovery of the throughput.
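
To put numbers to it: at the ~2 KB where your Broadwell collapses, the inner loop runs 2048 B / (4 B per int × 16 ints per step) = 32 iterations, i.e. the loop branch is taken 31 times and then not taken once, and this pattern repeats on every call of Worker::operator().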

For further information on the erratic behavior before the drop see the comments made by @PeterCordes below.

In any case, the best way to avoid branch mispredictions is to avoid branches, so I manually unrolled the loop in Worker::operator(), for example:

void operator()(){
    for(size_t i=0;i+3*BLOCK_SIZE<n;i+=BLOCK_SIZE*4){
         result+=mem[i];
         result+=mem[i+BLOCK_SIZE];
         result+=mem[i+2*BLOCK_SIZE];
         result+=mem[i+3*BLOCK_SIZE];
    }
}

Unrolling 2, 3, 4, 6 or 8 iterations gives the results below. Note that I did not correct for the blocks at the end of the vector which are ignored due to the unrolling (a remainder loop, sketched after the plots, would handle them). Therefore the periodic peaks in the blue line should be ignored; the lower-bound baseline of the periodic pattern is the actual bandwidth.

[plots: bandwidth (blue, left axis) and branch-miss fraction (red, right axis) versus size, for unroll factors 2, 3, 4, 6 and 8]
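
If one wanted to account for the skipped tail blocks, a remainder loop along these lines would do it (a sketch, not used for the measurements above):

void operator()(){
    size_t i=0;
    for(;i+3*BLOCK_SIZE<n;i+=BLOCK_SIZE*4){    //unrolled main loop
         result+=mem[i];
         result+=mem[i+BLOCK_SIZE];
         result+=mem[i+2*BLOCK_SIZE];
         result+=mem[i+3*BLOCK_SIZE];
    }
    for(;i<n;i+=BLOCK_SIZE){                   //at most three remaining blocks
         result+=mem[i];
    }
}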

As you can see, the fraction of branch mispredictions didn't really change, but because the total number of branches is reduced by the factor of unrolled iterations, they no longer contribute strongly to the performance.

There is also the additional benefit that the processor is freer to do the calculations out of order if the loop is unrolled.
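
Picking up @PeterCordes' point about the 1-cycle latency of add: with a single result variable the unrolled additions still form one dependency chain. A variant with independent accumulators (a sketch, I did not measure it) could look like this:

void operator()(){
    unsigned int r0=0, r1=0, r2=0, r3=0;       //independent accumulators
    size_t i=0;
    for(;i+3*BLOCK_SIZE<n;i+=BLOCK_SIZE*4){
         r0+=mem[i];
         r1+=mem[i+BLOCK_SIZE];
         r2+=mem[i+2*BLOCK_SIZE];
         r3+=mem[i+3*BLOCK_SIZE];
    }
    for(;i<n;i+=BLOCK_SIZE){                   //remaining blocks, as in the sketch above
         r0+=mem[i];
    }
    result+=(r0+r1)+(r2+r3);
}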

If this is supposed to have a practical application, I would suggest trying to give the hot loop a compile-time-fixed number of iterations or some guarantee on divisibility, so that (maybe with some extra hints) the compiler can decide on the optimal number of iterations to unroll.
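
For example, if the number of blocks were known at compile time, a sketch like the following (hypothetical sum_blocks helper, reusing BLOCK_SIZE from the listing) lets the compiler choose the unrolling itself:

template<size_t NBLOCKS>                       //trip count fixed at compile time
unsigned int sum_blocks(const unsigned int *mem){
    unsigned int result=0;
    for(size_t b=0;b<NBLOCKS;b++){             //constant trip count: the compiler
         result+=mem[b*BLOCK_SIZE];            //is free to unroll it fully
    }
    return result;
}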

  • For a low enough iteration count, branch prediction correctly predicts the loop-exit branch of the inner-most loop. e.g. a 21 taken, 1 not-taken, 21 taken, 1 not-taken pattern. Beyond about 22 or 23 iterations on Skylake, the loop-exit branch will mispredict, so you have one mispredict (on the inner loop exit) per iteration of the outer loop. The more inner loop iterations you have, the less of the total cost this represents. And yes, unrolling will help a lot, letting out-of-order exec run ahead and see the mispredict, and not requiring 4 uops per clock throughput to keep up. – Peter Cordes Dec 13 '18 at 03:20
  • @PeterCordes Thanks for the info. I am mainly surprised by the seemingly random peaks around 2.5 in the non-unrolled loop case. For example for `size = 35*BLOCK_SIZE` there seems to be one misprediction for the inner loop, while there is none for `34*BLOCK_SIZE` or `36*BLOCK_SIZE`. –  Dec 13 '18 at 03:30
  • TAGE predictors use recent branch history (taken/not-taken) as part of the *index* into the branch-prediction cache. This is the kind of weirdness you should expect when you're on the cusp of running out of BTB space or aliasing: one pattern predicts very well, a similar pattern runs into aliasing between important branches and predicts poorly. (The actual index function is usually a hash of branch history and branch address.) e.g. https://www.irisa.fr/caps/people/seznec/L-TAGE.pdf and https://comparch.net/2013/06/30/why-tage-is-the-best/ – Peter Cordes Dec 13 '18 at 03:39
  • @user10605163 it really looks like branch misses are the reason. It seems I don't really understand what's going on here, because it comes as a complete surprise to me that it plays a role at all... – ead Dec 13 '18 at 06:11
  • @ead: nested loops where the inner-most loop has a small iteration count are a problem. It creates a pattern like every 25th execution of the same branch is not-taken, and the rest are taken. A mispredict stalls the pipeline and results in discarding a lot of work that was in flight. (And with a not-unrolled loop, out-of-order execution isn't helping as much as it could without a front-end bottleneck.) The last edit to this answer added a nice summary of that effect. See Agner Fog's cpu microarchitecture guide (https://agner.org/optimize/) and https://stackoverflow.com/tags/x86/info for more. – Peter Cordes Dec 13 '18 at 06:29

This might be unrelated, but your Linux machine might be playing with the CPU frequency. I know Ubuntu 18 has a governor that is balanced between power and performance. You also want to play with the process affinity to make sure it does not get migrated to a different core while running.
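
For example, on Linux the process can be pinned to one core before measuring (a minimal sketch using the glibc call sched_setaffinity; running the benchmark under taskset -c 0 does the same from the shell), and the frequency governor can be switched to performance, e.g. with cpupower frequency-set -g performance:

#include <sched.h>     //sched_setaffinity, CPU_ZERO, CPU_SET (Linux/glibc only)
#include <cstdio>

//pin the calling process to the given core so it is not migrated while measuring:
bool pin_to_core(int core){
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(core, &set);
    return sched_setaffinity(0, sizeof(set), &set)==0;   //pid 0 = calling process
}

int main(){
    if(!pin_to_core(0))
        std::perror("sched_setaffinity");
    //... run the bandwidth measurement as before ...
}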

Mario