As part of an academic research project, I performed the following experiment:
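    buff = mmap(NULL, BUFFSIZE, PROT_READ | PROT_WRITE,
                MAP_ANONYMOUS | MAP_PRIVATE | HUGEPAGES, -1, 0);
    lineAddr = buff;

    /* Flush every byte address of the buffer from the cache hierarchy. */
    for (int i = 0; i < BUFFSIZE; i++)
        clflush(&(buff[i]));

    /* Time one access per iteration, then advance by a random number of lines. */
    for (int i = 0; i < LINES; i++) {
        srand(rdtscp());
        result = memaccesstime(lineAddr);
        /* Step forward by 3..22 units of 8*sizeof(void*) bytes (64 bytes on a 64-bit build). */
        lineAddr = (void*)((uint64_t)lineAddr + (rand()%20+3)*(8*sizeof(void*)));
        resultArr[i] = result;
    }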
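The loop above calls clflush() and rdtscp(), whose definitions are not part of the post. A minimal sketch of what such helpers might look like, assuming they are thin wrappers around the x86 instructions of the same name and that GCC or Clang intrinsics are available (the bodies below are an illustration, not the code that was actually run):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <x86intrin.h>   /* _mm_clflush, __rdtscp (GCC/Clang, x86 only) */

/* Assumed helper: evict the cache line containing p from every cache level. */
static inline void clflush(const void *p)
{
    assert(p != NULL);
    _mm_clflush(p);
}

/* Assumed helper: read the time-stamp counter. RDTSCP waits until all
 * earlier instructions have executed before it samples the counter. */
static inline uint64_t rdtscp(void)
{
    unsigned int aux;            /* receives IA32_TSC_AUX; unused here */
    return __rdtscp(&aux);
}
```

With wrappers like these, `srand(rdtscp())` simply truncates the 64-bit counter to the `unsigned` seed that srand() expects.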
The memaccesstime function returns the access time in CPU ticks:
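    static inline uint32_t memaccesstime(void *v) {
        uint32_t rv;
        asm volatile (
            "mfence\n"               /* order earlier loads and stores */
            "lfence\n"
            "rdtscp\n"               /* start timestamp: low 32 bits in eax */
            "mov %%eax, %%esi\n"     /* save the start timestamp */
            "mov (%1), %%eax\n"      /* the timed load from v */
            "rdtscp\n"               /* end timestamp */
            "sub %%esi, %%eax\n"     /* elapsed ticks = end - start (low 32 bits) */
            : "=&a" (rv) : "r" (v) : "ecx", "edx", "esi");
        return rv;
    }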
So the steps are:
- Allocate a large range of memory (with mmap()).
- clflush() every line of it (with a for loop).
- Walk over random lines (with steps of 3 to 22 lines) and measure the response time of each access.
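The step arithmetic can be checked in isolation. The following is a minimal sketch, assuming 64-byte lines (which is what 8*sizeof(void*) evaluates to on a 64-bit build); the names LINE_BYTES, step_bytes and worst_case_span are hypothetical and do not appear in the experiment code:

```c
#include <assert.h>
#include <stddef.h>

#define LINE_BYTES 64u   /* assumed line size: 8*sizeof(void*) on a 64-bit build */
#define MIN_STEP   3u    /* smallest value of rand()%20+3 */
#define MAX_STEP   22u   /* largest value of rand()%20+3 */

/* Byte distance covered by one step, given a raw rand() value r. */
static size_t step_bytes(unsigned r)
{
    return (size_t)(r % 20u + MIN_STEP) * LINE_BYTES;
}

/* Worst-case number of bytes spanned by `lines` timed accesses: the first
 * access is at offset 0 and each later one advances by at most MAX_STEP
 * lines; the final line itself is counted in full. */
static size_t worst_case_span(size_t lines)
{
    assert(lines > 0);
    return (lines - 1) * MAX_STEP * LINE_BYTES + LINE_BYTES;
}
```

For instance, with a hypothetical LINES of 1000, worst_case_span(1000) is 1,406,656 bytes, so BUFFSIZE would need to be at least that large for every timed access to stay inside the mapping.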
The results: [plot: Results]
Please help me understand the results better. Why does the response time plunge after a small number of samples?
Notes: MSR 0x1a4 is set to 0xF (the behavior is the same with 0x0). I chose random steps to avoid the stride prefetcher. Is there any other hardware (or software) prefetcher that could be responsible for these results?
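To make the MSR note concrete, here is a minimal sketch of how the low four bits of 0x1a4 are commonly decoded. The bit layout is an assumption based on Intel's published description of this register for several Core-family generations, where a set bit disables the corresponding prefetcher; it may differ on other models. The enum and function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed bit layout of MSR 0x1a4: a SET bit DISABLES the prefetcher. */
enum {
    PF_L2_STREAMER   = 1u << 0,  /* L2 hardware (streamer) prefetcher */
    PF_L2_ADJACENT   = 1u << 1,  /* L2 adjacent cache line prefetcher */
    PF_DCU_NEXT_LINE = 1u << 2,  /* L1D (DCU) next-line prefetcher */
    PF_DCU_IP        = 1u << 3   /* L1D (DCU) IP-based stride prefetcher */
};

/* Returns 1 if the prefetcher selected by `bit` is enabled in `msr`. */
static int prefetcher_enabled(uint64_t msr, unsigned bit)
{
    assert(bit != 0 && (bit & ~0xFu) == 0);  /* must be one of the four flags */
    return (msr & bit) == 0;
}

/* Returns 1 if all four prefetchers covered by this layout are disabled. */
static int all_prefetchers_disabled(uint64_t msr)
{
    return (msr & 0xFu) == 0xFu;
}
```

Under this layout, the value 0xF corresponds to all four prefetchers being disabled and 0x0 to all four being enabled.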