I'm using OProfile to profile the following function on a Raspberry Pi 3B+. (I'm using GCC 10.2 natively on the Raspberry Pi, not cross-compiling, with the following compiler flags: -O1 -mfpu-neon -mneon-for-64bits. The generated assembly is included at the end.)
void do_stuff_u32(const uint32_t* a, const uint32_t* b, uint32_t* c, size_t array_size)
{
    for (int i = 0; i < array_size; i++)
    {
        uint32_t tmp1 = b[i];
        uint32_t tmp2 = a[i];
        c[i] = tmp1 * tmp2;
    }
}
I'm looking at two CPU events, L1D_CACHE_REFILL and PREFETCH_LINEFILL. According to the documentation, PREFETCH_LINEFILL counts the number of cache line fills caused by prefetching, and L1D_CACHE_REFILL counts the number of cache line refills caused by cache misses. I got the following results for the above loop:
array_size | array_size / L1D_CACHE_REFILL | array_size / PREFETCH_LINEFILL |
---|---|---|
16777216 | 18.24 | 8.366 |
I would expect the above loop to be memory bound, which the value 8.366 seems to confirm. Every loop iteration touches 3 x uint32_t, which is 12 B, so 8.366 iterations need ~100 B of data from memory. But the prefetcher only fills one cache line into L1 every 8.366 iterations, and a line is 64 B according to the Cortex-A53 manual. The remaining cache accesses would therefore show up as cache misses, which is the 18.24. Combining the two numbers as 1 / (1/18.24 + 1/8.366) gives ~5.7, i.e. one cache line fill, from either a prefetch or a miss refill, every 5.7 iterations. And 5.7 iterations need 5.7 x 3 x 4 = 68 B, more or less consistent with the cache line size.
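The arithmetic in the paragraph above can be written down as a small helper. This is only a sketch: the function names are made up for illustration, and it assumes the two counters count disjoint events, so that their rates can simply be added.

```c
#include <stddef.h>

/* Iterations per L1D line fill when miss refills and prefetch fills are
 * counted together. Assumes (not confirmed by the manual) that the two
 * events never count the same fill twice, so their rates add up. */
double combined_iters_per_fill(double iters_per_refill, double iters_per_prefetch)
{
    return 1.0 / (1.0 / iters_per_refill + 1.0 / iters_per_prefetch);
}

/* Bytes the loop consumes between two consecutive line fills. */
double bytes_per_fill(double iters_per_fill, size_t bytes_per_iter)
{
    return iters_per_fill * (double)bytes_per_iter;
}
```

With the values from the table, combined_iters_per_fill(18.24, 8.366) comes out at about 5.7, and multiplying by the 12 B per iteration lands close to the 68 B mentioned above.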
Then I added more work to the loop body, so it becomes the following:
void do_more_stuff_u32(const uint32_t* a, const uint32_t* b, uint32_t* c, size_t array_size)
{
    for (int i = 0; i < array_size; i++)
    {
        uint32_t tmp1 = b[i];
        uint32_t tmp2 = a[i];
        tmp1 = tmp1 * 17;
        tmp1 = tmp1 + 59;
        tmp1 = tmp1 / 2;
        tmp2 = tmp2 * 27;
        tmp2 = tmp2 + 41;
        tmp2 = tmp2 / 11;
        tmp2 = tmp2 + tmp2;
        c[i] = tmp1 * tmp2;
    }
}
The profiling data for the CPU events is something I don't understand:
array_size | array_size / L1D_CACHE_REFILL | array_size / PREFETCH_LINEFILL |
---|---|---|
16777216 | 11.24 | 7.034 |
Since the loop takes longer to execute, the prefetcher now needs only 7.034 loop iterations to fill one cache line. What I don't understand is why cache misses also happen more frequently, as reflected by the number 11.24 compared to 18.24 before. Can someone please shed some light on how all of this fits together?
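To compare the two runs on the same footing, the ratios can be converted back into raw event counts and set against the number of distinct cache lines the loop touches. The sketch below does that. The function names are hypothetical, and it assumes 64 B lines, three separate arrays, and that ideally each line would be filled exactly once.

```c
#include <stddef.h>
#include <stdint.h>

/* Number of distinct cache lines covered by n_arrays arrays of
 * array_size elements each (rounded up per array). */
uint64_t lines_touched(size_t array_size, size_t elem_size,
                       size_t n_arrays, size_t line_size)
{
    uint64_t bytes_per_array = (uint64_t)array_size * elem_size;
    uint64_t lines_per_array = (bytes_per_array + line_size - 1) / line_size;
    return lines_per_array * n_arrays;
}

/* Total line fills (miss refills plus prefetch fills) per touched line,
 * reconstructed from the array_size / event_count ratios in the tables. */
double fills_per_line(size_t array_size, double iters_per_refill,
                      double iters_per_prefetch, uint64_t lines)
{
    double refills = (double)array_size / iters_per_refill;
    double prefetches = (double)array_size / iters_per_prefetch;
    return (refills + prefetches) / (double)lines;
}
```

For array_size = 16777216 this gives 3145728 touched lines. Under the assumptions above, the first loop works out to roughly 0.93 fills per line and the second to roughly 1.23. A value above 1 would mean that some lines are filled more than once or that the two counters overlap; the numbers alone don't say which.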
Update to include the generated assembly
Loop1:
        cbz     x3, .L178
        lsl     x6, x3, 2
        mov     x3, 0
.L180:
        ldr     w4, [x1, x3]
        ldr     w5, [x0, x3]
        mul     w4, w4, w5
        lsl     w4, w4, 1
        str     w4, [x2, x3]
        add     x3, x3, 4
        cmp     x3, x6
        bne     .L180
.L178:
Loop2:
        cbz     x3, .L178
        lsl     x6, x3, 2
        mov     x5, 0
        mov     w8, 27
        mov     w7, 35747
        movk    w7, 0xba2e, lsl 16
.L180:
        ldr     w3, [x1, x5]
        ldr     w4, [x0, x5]
        add     w3, w3, w3, lsl 4
        add     w3, w3, 59
        mul     w4, w4, w8
        add     w4, w4, 41
        lsr     w3, w3, 1
        umull   x4, w4, w7
        lsr     x4, x4, 35
        mul     w3, w3, w4
        lsl     w3, w3, 1
        str     w3, [x2, x5]
        add     x5, x5, 4
        cmp     x5, x6
        bne     .L180
.L178: