
I'm using OProfile to profile the following function on a Raspberry Pi 3B+. (I'm using gcc version 10.2 on the Raspberry Pi itself (no cross-compilation) with the following compiler flags: -O1 -mfpu=neon -mneon-for-64bits. The generated assembly code is included at the end.)

void do_stuff_u32(const uint32_t* a, const uint32_t* b, uint32_t* c, size_t array_size)
{
  for (int i = 0; i < array_size; i++)
  {

    uint32_t tmp1 = b[i];
    uint32_t tmp2 = a[i];
    c[i] = tmp1 * tmp2;
  }
}
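
A minimal driver along these lines (just a sketch with the same 16777216-element array size as the measurements below, not the exact benchmark harness) exercises the loop:

    #include <stdint.h>
    #include <stdlib.h>
    #include <stddef.h>

    void do_stuff_u32(const uint32_t* a, const uint32_t* b, uint32_t* c, size_t array_size);

    int main(void)
    {
        const size_t array_size = 16777216;       /* same size as in the measurements */
        uint32_t* a = malloc(array_size * sizeof *a);
        uint32_t* b = malloc(array_size * sizeof *b);
        uint32_t* c = malloc(array_size * sizeof *c);
        if (!a || !b || !c)
            return 1;

        /* arbitrary input data */
        for (size_t i = 0; i < array_size; i++)
        {
            a[i] = (uint32_t)i;
            b[i] = (uint32_t)(i * 3);
        }

        do_stuff_u32(a, b, c, array_size);        /* the loop being profiled */

        free(a);
        free(b);
        free(c);
        return 0;
    }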

I'm looking at two CPU events: L1D_CACHE_REFILL and PREFETCH_LINEFILL. Per the documentation, PREFETCH_LINEFILL counts the number of cache line fills caused by the prefetcher, and L1D_CACHE_REFILL counts the number of cache line refills caused by cache misses. I got the following results for the above loop:

| array_size | array_size / L1D_CACHE_REFILL | array_size / PREFETCH_LINEFILL |
| ---------- | ----------------------------- | ------------------------------ |
| 16777216   | 18.24                         | 8.366                          |

I would imagine the above loop is memory bound, which seems consistent with the value 8.366: every loop iteration needs 3 x uint32_t, i.e. 12 B, so 8.366 iterations need roughly 100 B of data from memory. But the prefetcher can only fill one cache line into L1 every 8.366 iterations, and a cache line is 64 B per the Cortex-A53 manual. So the rest of the cache accesses show up as cache misses, which is the 18.24. Combining the two rates, 1 / (1/18.24 + 1/8.366) ≈ 5.7, i.e. one cache line is filled, either by the prefetcher or by a miss refill, every 5.7 iterations. And 5.7 iterations need 5.7 x 3 x 4 ≈ 68 B, more or less consistent with the cache line size.
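
(The combination above is just the harmonic sum of the two measured rates; here is a quick back-of-envelope check, my own illustration rather than anything produced by the profiler:)

    #include <stdio.h>

    int main(void)
    {
        /* measured iterations per line fill, from the table above */
        double iters_per_refill   = 18.24;  /* array_size / L1D_CACHE_REFILL  */
        double iters_per_prefetch = 8.366;  /* array_size / PREFETCH_LINEFILL */

        /* iterations per line fill from either source */
        double iters_per_linefill =
            1.0 / (1.0 / iters_per_refill + 1.0 / iters_per_prefetch);

        /* each iteration touches 3 x uint32_t = 12 bytes */
        printf("%.2f iterations per line fill, %.1f bytes per line fill\n",
               iters_per_linefill, iters_per_linefill * 12.0);
        return 0;
    }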

Then I added more arithmetic to the loop, so it became the following:

void do_more_stuff_u32(const uint32_t* a, const uint32_t* b, uint32_t* c, size_t array_size)
{
  for (int i = 0; i < array_size; i++)
  {

    uint32_t tmp1 = b[i];
    uint32_t tmp2 = a[i];
    tmp1 = tmp1 * 17;
    tmp1 = tmp1 + 59;
    tmp1 = tmp1 / 2;
    tmp2 = tmp2 * 27;
    tmp2 = tmp2 + 41;
    tmp2 = tmp2 / 11;
    tmp2 = tmp2 + tmp2;
    c[i] = tmp1 * tmp2;
  }
}

And the profiling data for the CPU events is something I don't understand:

| array_size | array_size / L1D_CACHE_REFILL | array_size / PREFETCH_LINEFILL |
| ---------- | ----------------------------- | ------------------------------ |
| 16777216   | 11.24                         | 7.034                          |

Since the loop takes longer to execute, the prefetcher now needs only 7.034 loop iterations to fill one cache line. But what I don't understand is why cache misses also happen more frequently, reflected by the number 11.24 compared to 18.24 before. Can someone please shed some light on how all of this fits together?


Update to include the generated assembly

Loop1:

    cbz x3, .L178         // skip the loop if array_size == 0
    lsl x6, x3, 2         // x6 = array_size * 4, the end byte offset
    mov x3, 0             // x3 = byte offset i * 4
.L180:
    ldr w4, [x1, x3]      // tmp1 = b[i]
    ldr w5, [x0, x3]      // tmp2 = a[i]
    mul w4, w4, w5        // tmp1 * tmp2
    lsl w4, w4, 1         // result << 1
    str w4, [x2, x3]      // c[i] = result
    add x3, x3, 4         // advance to the next element
    cmp x3, x6
    bne .L180
.L178:

Loop2:

    cbz x3, .L178              // skip the loop if array_size == 0
    lsl x6, x3, 2              // x6 = array_size * 4, the end byte offset
    mov x5, 0                  // x5 = byte offset i * 4
    mov w8, 27                 // constant for tmp2 * 27
    mov w7, 35747              // w7 = 0xba2e8ba3, reciprocal
    movk    w7, 0xba2e, lsl 16 //   constant for the division by 11
.L180:
    ldr w3, [x1, x5]           // tmp1 = b[i]
    ldr w4, [x0, x5]           // tmp2 = a[i]
    add w3, w3, w3, lsl 4      // tmp1 = tmp1 * 17
    add w3, w3, 59             // tmp1 = tmp1 + 59
    mul w4, w4, w8             // tmp2 = tmp2 * 27
    add w4, w4, 41             // tmp2 = tmp2 + 41
    lsr w3, w3, 1              // tmp1 = tmp1 / 2
    umull   x4, w4, w7         // tmp2 / 11 via reciprocal
    lsr x4, x4, 35             //   multiply and shift
    mul w3, w3, w4             // tmp1 * tmp2
    lsl w3, w3, 1              // << 1 (tmp2 = tmp2 + tmp2)
    str w3, [x2, x5]           // c[i] = result
    add x5, x5, 4              // advance to the next element
    cmp x5, x6
    bne .L180
.L178:
  • @artlessnoise thanks for the suggestions and for taking the time to look into this. I've updated the post to include the compiler flags used and the generated assembly – Da Teng Jan 17 '21 at 15:23
  • One issue is how you account for a context switch. There are other processes running (such as interrupts and kernel tasks) that will also execute on the CPU and can cause L1 values to be evicted. Indeed, this may be desired as it is an absolute performance measure and not a 'debug aid' as you are trying to use it. I suggest you test this by loading the system (man stress, etc.) while profiling and see if the numbers change. The generated assembler has similar `ldr` and `str` patterns so it should not be due to direct program behaviour. However, you are not on a bare metal loop but a complex system. – artless noise Jan 17 '21 at 15:36
  • You may get more consistent results if you run the test multiple times (but this prefills the cache), as you can get metrics over a much larger period which may be influenced by other sporadic system behavior. Conceptually, this is how Spectre/Meltdown work, by examining cache behavior to leak information from other processes. – artless noise Jan 17 '21 at 15:43
  • @artlessnoise I tried running it over a longer timespan by running it repeatedly, over and over again, and the results are similar. – Da Teng Jan 17 '21 at 21:26
  • I think your analysis is incorrect. For instance, *Every loop instance needs 3 x uint32_t which is 12B.* is not correct. You have two 32-bit reads and one write. The writes do not need prefetching. Best case, the 2 x 4 bytes will cause a cache read every 8 loop iterations. It is not clear if prefetch is I/D only or both. You will also have some code prefetches. Contrary to prefetch, the refill will involve writes as you update memory. So refills **may** be higher as the write **can** cause a cache access per loop, depending on cache configuration (write through or write back). – artless noise Jan 18 '21 at 12:43
  • @artlessnoise I agree the manual is not super clear on whether prefetch is I/D, and some experiments done by others suggest it could count both (https://falstaff.agner.ch/2015/10/26/using-the-perf-utility-on-arm/, although it's for the Cortex-A5). Could you please elaborate more on "writes do not need prefetching"? Because the Cortex-A53 manual, section 6.6.2, says "data cache implements an automatic prefetcher that monitors cache misses in the core. When a pattern is detected, the automatic prefetcher starts..." and doesn't restrict that to just read cache misses. And it still doesn't explain the increase in refills... – Da Teng Jan 18 '21 at 16:23
  • ... because the amount of reads and writes is the same in both cases. In manual section 6.2.5, it says "The L1 Data cache supports only a Write-Back policy." – Da Teng Jan 18 '21 at 16:28
  • If you write a complete cache line, it doesn't need to read the original data. If you write a single byte in a cache line, then it needs a read-modify-write. It may be possible that in one case, it doesn't need to read the memory at all. Some CPUs will realize that if you 'sequential write' fast enough, there is no need for a prefetch to read the data that will be overwritten. – artless noise Jan 18 '21 at 17:15
  • @artlessnoise I think you are talking about Read Allocate mode: https://developer.arm.com/documentation/100236/0002/functional-description/cache-behavior-and-cache-protection/about-read-allocate-mode. But it doesn't say anything about prefetch. Based on our discussion, I will try to post an answer that may explain the observation. – Da Teng Jan 22 '21 at 16:33

1 Answer


I'll try to answer my own question based on further measurements and the discussion with @artlessnoise.

I further measured the READ_ALLOC_ENTER event for the two loops above and got the following data:

Loop 1

| Array Size | READ_ALLOC_ENTER |
| ---------- | ---------------- |
| 16777216   | 12494            |

Loop 2

| Array Size | READ_ALLOC_ENTER |
| ---------- | ---------------- |
| 16777216   | 1933             |

So apparently the small loop (the 1st) enters Read Allocate Mode a lot more often than the big one (the 2nd), which could be because the CPU was able to detect the consecutive write pattern more easily. In Read Allocate Mode, stores that miss in L1 go directly to L2. That's why L1D_CACHE_REFILL is lower for the 1st loop: it involves L1 less. The 2nd loop has to involve L1 to update c[] more often than the 1st one, so there can be more refills due to cache misses. Moreover, in the second case, since L1 is more often occupied with cache lines for c[], it hurts the cache hit rates for a[] and b[], causing more L1D_CACHE_REFILL.

  • Yes, this is the 'feature' I was alluding to. Great investigation and glad it solved your problem (or answered your question). – artless noise Jan 22 '21 at 17:12