I ran into a performance issue with a function similar to the following:
pub fn attacked(&self, sq: usize) -> bool {
    self.lut1[sq] || self.lut2[sq] || self.lut3[sq] || self.lut4[sq] || self.lut5[sq]
}
A number of look-up tables (i.e. arrays, `[u64; N]`) are queried, and if any of them returns true, the result is true. This is part of a larger program running a simulation, and the lookup accounts for about 5% of the logic. I measure the simulation's throughput to judge performance. I build with `codegen-units = 1` and `opt-level = 3`.
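For context, these settings correspond to a release profile like the following (a minimal `Cargo.toml` fragment reflecting only the two options I mentioned):

```toml
[profile.release]
codegen-units = 1
opt-level = 3
```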
It turns out that different orderings of the luts result in significantly different performance: some orderings give 65 Mnodes/s, while others give 50 Mnodes/s.
My data ensures that a hit in any of the luts is equally likely, so although evaluation is left-to-right, short-circuiting shouldn't cause a difference. The tables are held in a struct on the stack, and their total size is about 1 MB, so they should fit in CPU cache and not be affected by memory latency.
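To illustrate the short-circuiting angle, here is a stripped-down sketch of the pattern (the struct, field names, and sizes are made up for illustration). One variation I've considered is the non-short-circuiting `|` operator, which always evaluates every table and so lets the compiler emit branch-free code regardless of ordering:

```rust
// Hypothetical minimal version of the lookup struct; names/sizes are made up.
pub struct Tables {
    pub lut1: [bool; 64],
    pub lut2: [bool; 64],
    pub lut3: [bool; 64],
}

impl Tables {
    // Short-circuiting version: each `||` introduces a conditional branch,
    // and later tables are only loaded when earlier ones miss.
    pub fn attacked(&self, sq: usize) -> bool {
        self.lut1[sq] || self.lut2[sq] || self.lut3[sq]
    }

    // Non-short-circuiting version: `|` on bool evaluates both sides
    // unconditionally, so all loads happen and the branches disappear.
    pub fn attacked_branchless(&self, sq: usize) -> bool {
        self.lut1[sq] | self.lut2[sq] | self.lut3[sq]
    }
}
```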
I'm unable to produce a minimal reproducible example: when I start removing seemingly unrelated code, the performance behavior changes or the slowdown disappears entirely.
I'm left baffled. Could the compiler generate suboptimal code depending on the ordering of these lookups? What other causes could there be?
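For anyone wanting to experiment, the lookup pattern can be isolated roughly like this (a self-contained sketch; the helper names and table contents are hypothetical, and the real simulation is much larger):

```rust
use std::time::Instant;

// Stand-in for one batch of simulation steps: count squares where either
// table reports a hit, mirroring the ||-chain in `attacked`.
fn count_hits(lut_a: &[u64], lut_b: &[u64]) -> u64 {
    let mut hits = 0u64;
    for i in 0..lut_a.len().min(lut_b.len()) {
        // Left-to-right, short-circuiting, just like the original chain.
        if lut_a[i] != 0 || lut_b[i] != 0 {
            hits += 1;
        }
    }
    hits
}

// Time many iterations of one ordering; swap the arguments to time the other.
fn bench(lut_a: &[u64], lut_b: &[u64], iters: u32) -> (u64, std::time::Duration) {
    let start = Instant::now();
    let mut total = 0u64;
    for _ in 0..iters {
        total += count_hits(lut_a, lut_b);
    }
    (total, start.elapsed())
}
```

Comparing `bench(&a, &b, iters)` against `bench(&b, &a, iters)` would show whether argument ordering alone reproduces the effect in isolation.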