
I ran into a performance issue with a function similar to the following:

pub fn attacked(&self, sq: usize) -> bool {
    self.lut1[sq] || self.lut2[sq] || self.lut3[sq] || self.lut4[sq] || self.lut5[sq]
}

A number of look-up tables (i.e. arrays, [u64; N]) are queried, and if any of them returns true, the result is true. It is part of a larger program running a simulation; this lookup comprises about 5% of the logic. I measure the throughput of the simulation to gauge performance. I use codegen-units = 1 and opt-level = 3.
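For reference, here is a self-contained sketch of the setup described above that actually compiles, assuming the tables are plain `[bool; N]` arrays (the struct, field names, and sizes are illustrative, not the actual code):

```rust
// Illustrative reconstruction of the setup: several boolean look-up
// tables held by value in one struct, queried with short-circuiting ||.
const N: usize = 64;

struct Tables {
    lut1: [bool; N],
    lut2: [bool; N],
    lut3: [bool; N],
    lut4: [bool; N],
    lut5: [bool; N],
}

impl Tables {
    // Short-circuiting version: stops at the first table that hits,
    // so each `||` compiles to a conditional branch.
    pub fn attacked(&self, sq: usize) -> bool {
        self.lut1[sq] || self.lut2[sq] || self.lut3[sq]
            || self.lut4[sq] || self.lut5[sq]
    }
}

fn main() {
    let mut t = Tables {
        lut1: [false; N],
        lut2: [false; N],
        lut3: [false; N],
        lut4: [false; N],
        lut5: [false; N],
    };
    t.lut3[10] = true;
    assert!(t.attacked(10));
    assert!(!t.attacked(11));
    println!("ok");
}
```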

It turns out that different orderings of the luts result in significantly different performance. For certain orderings I get 65 Mnodes/s, while for others I get 50 Mnodes/s.

My data ensures that hitting any of the luts is equally likely, so although evaluation is left-to-right, there shouldn't be a difference due to short-circuiting. The tables are held in a struct on the stack and their total size is about 1 MB, so they should be able to remain in CPU cache and not be affected by memory latency.

I'm not able to create a minimal reproducible example: when I start taking out seemingly unrelated code, the performance behavior changes or the slowdown disappears.

I'm left baffled. Could the compiler be generating suboptimal code due to the instruction ordering? What other causes could there be?

  • 1
    Can you attach the compiled assembly? Also, what CPU? – Chayim Friedman Dec 14 '22 at 02:50
  • 3
    You keep saying "table", which I'm hearing as `HashMap`. That's going to be heap allocated and thus the size of your cache is going to be irrelevant. I think it may help to see the actual struct in question. Also, if your struct's size (excluding heap allocations, just the *actual* size of the first layer of data) is 1MB (which is terrifying, by the way), then there's no way that's fitting into an L1 cache. On my laptop right now, my L1 is 128KB. Maybe some of the larger caches, but not L1, so that could be part of the problem. – Silvio Mayolo Dec 14 '22 at 02:51
  • It certainly impacts the performance possibilities greatly depending on what your LUTs are (in-place arrays, vectors, hashmaps, etc.) – kmdreko Dec 14 '22 at 02:55
  • 2
I'm curious if replacing all the `||`s with `|`s will help. If all your LUTs are equally likely, then you could be giving the CPU branch predictor a headache, and that change would reduce the branches. Though that assumes that your lookups are cheap and again depends on what type the `self.lut*`s are. – kmdreko Dec 14 '22 at 02:59
  • @SilvioMayolo: What CPU has 128 KiB L1 cache? Are you summing up L1 cache sizes across cores? Some system-info tools like to do that, but L1d and L1i caches are per-core private (like L2 on Intel since the first i7), so that's only meaningful for data footprint if you have different threads working on different parts of the same array. Oh, I just checked and [Apple M2](https://en.wikipedia.org/wiki/Apple_M2) *does* have very large 128 KiB per-core L1d caches in the P-cores, so hopefully that's what you meant. (The performance cores share a 16M L2 cache) – Peter Cordes Dec 14 '22 at 04:19
  • @SilvioMayolo Table as in array. –  Dec 14 '22 at 04:22
  • Anyway, yeah, 1MiB is pretty big, although some CPUs have 1MiB of fast per-core private L2 cache. (e.g. Zen 4). Alder Lake has 1.25 MiB L2 per P-core with about 12 cycle hit latency. So if there isn't much other memory access, a 1MiB LUT could stay hot in L2. If there is, expect evictions. – Peter Cordes Dec 14 '22 at 04:22
  • @kmdreko Interesting thought. Replacing `||` with `|` ended up dropping the performance. It makes me wonder if one of these tables is causing the slowdown somehow. –  Dec 14 '22 at 04:25
  • Perhaps different orderings create more predictable patterns for the branch predictor. (And/or was short-circuiting sooner, leading to fewer memory accesses in other arrays evicting from the first LUT). `|` instead of `||` was worth a try, but probably the extra memory accesses are leading to more evictions; maybe not everything managing to stay hot even in L2. If the arrays are all aligned the same relative to 4K or something, they might be aliasing in L1d or L2 when accessing the same offset in all of them, increasing evictions in L1d. – Peter Cordes Dec 14 '22 at 04:28
  • @PeterCordes I didn't think about L1 vs L2. I suppose 1MB of data is sufficiently big to be kicked out of low latency cache. –  Dec 14 '22 at 04:28
  • Yes, that's what @SilvioMayolo correctly pointed out earlier. – Peter Cordes Dec 14 '22 at 04:29
  • 1
    What types are your `lut*`s? Because if they're `[u64; N]`, your code [doesn't compile](https://play.rust-lang.org/?version=stable&mode=debug&edition=2021&gist=ef09a140fee9f2fb7ae6cd9b5c6d9d20). If they're actually `[bool; N]` as your code seems to suggest, you could try whether re-encoding them as bitmaps of `[u64; N]` and doing lookups like `lutN[sq / 64] >> (sq % 64)` will give you better cache behavior (because the tables would be ⅛ in size). – Caesar Dec 14 '22 at 07:19
  • 2
    If you are under Linux, you can run your program through `perf stat -d mycommand` to get statistics on CPU cycles, branch mispredictions and cache access patterns. – Jmb Dec 14 '22 at 08:17
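The branchless variant suggested in the comments would look like this (a sketch with illustrative names; whether it helps depends on the LUT types and how predictable the hits are, and the asker reports it was actually slower here):

```rust
// Bitwise | on bools evaluates every table unconditionally: it trades
// extra loads (and potential cache evictions) for fewer branches.
const N: usize = 64;

struct Tables {
    lut1: [bool; N],
    lut2: [bool; N],
    lut3: [bool; N],
}

impl Tables {
    pub fn attacked(&self, sq: usize) -> bool {
        // No short-circuiting, so no conditional branches per table.
        self.lut1[sq] | self.lut2[sq] | self.lut3[sq]
    }
}

fn main() {
    let mut t = Tables {
        lut1: [false; N],
        lut2: [false; N],
        lut3: [false; N],
    };
    t.lut2[5] = true;
    assert!(t.attacked(5));
    assert!(!t.attacked(6));
    println!("ok");
}
```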
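Caesar's bitmap suggestion can be sketched like this: each `[bool; N]` table is re-encoded as `u64` words, one bit per entry, shrinking it to ⅛ the size for better cache behavior (names and the table size are illustrative):

```rust
// One table stored as a bitmap: entry `sq` lives in bit (sq % 64) of
// word (sq / 64). Lookup is a shift and mask instead of a byte load.
const N: usize = 1024; // number of entries, a multiple of 64

struct Bitmap {
    words: [u64; N / 64],
}

impl Bitmap {
    fn set(&mut self, sq: usize) {
        self.words[sq / 64] |= 1u64 << (sq % 64);
    }

    fn get(&self, sq: usize) -> bool {
        (self.words[sq / 64] >> (sq % 64)) & 1 != 0
    }
}

fn main() {
    let mut b = Bitmap { words: [0; N / 64] };
    b.set(70);
    assert!(b.get(70));
    assert!(!b.get(71));
    println!("ok");
}
```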

0 Answers