I'm trying to reproduce example 6 of the Gallery of Processor Cache Effects.
The article gives this function (in C#) as an example of how to demonstrate false sharing:
private static int[] s_counter = new int[1024];

private void UpdateCounter(int position)
{
    for (int j = 0; j < 100000000; j++)
    {
        s_counter[position] = s_counter[position] + 3;
    }
}
If we create four threads and pass 0, 1, 2, 3 as the arguments, the computation takes a long time to finish (the author got 4.3 seconds) because the counters sit next to each other on the same cache line. If we instead pass, for instance, 16, 32, 48, 64, we get much better results (0.28 seconds).
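To make the arithmetic concrete, here is a small sketch of my own (not from the article) that maps an `i32` array index to the 64-byte cache line it occupies, assuming the array starts on a cache-line boundary:

```rust
// Map an i32 array index to the 64-byte cache line it falls on,
// assuming the array itself is aligned to a cache-line boundary.
fn line_of(index: usize) -> usize {
    index * std::mem::size_of::<i32>() / 64
}

fn main() {
    // Indices 0..4 all land on line 0 -> the four counters share a line.
    let close: Vec<usize> = (0..4).map(line_of).collect();
    // Indices 0, 16, 32, 48 land on lines 0, 1, 2, 3 -> no sharing.
    let spread: Vec<usize> = (0..4).map(|t| line_of(t * 16)).collect();
    println!("{:?} {:?}", close, spread);
}
```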
I've come up with the following function in Rust:
use std::sync::Arc;
use std::thread;

pub fn cache_line_sharing(arr: [i32; 128], pos: usize) -> (i32, i32) {
    let arr = Arc::new(arr);
    let handles: Vec<_> = (0..4)
        .map(|thread_number| {
            let arr = arr.clone();
            let pos = thread_number * pos;
            // Deliberately racy: each thread bumps its own slot through a
            // raw pointer, mimicking the C# example.
            thread::spawn(move || unsafe {
                let p = (arr.as_ptr() as *mut i32).offset(pos as isize);
                for _ in 0..1_000_000 {
                    *p = (*p).wrapping_add(3);
                }
            })
        })
        .collect();
    for handle in handles {
        handle.join().unwrap();
    }
    (arr[0], arr[1])
}
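While debugging this, I also sketched a safe variant using `AtomicI32` with relaxed ordering, so that every iteration is a real read-modify-write on memory that the optimizer cannot collapse into a single store (a plain counted loop of non-volatile writes can legally be folded away). The function name and the fixed thread count are my own choices, not from the article:

```rust
use std::sync::atomic::{AtomicI32, Ordering};
use std::sync::Arc;
use std::thread;

// Hypothetical atomic variant: `stride` is the index spacing between the
// four threads' counters (1 -> same cache line, 16 -> separate lines).
pub fn cache_line_sharing_atomic(stride: usize) -> i32 {
    let arr: Arc<Vec<AtomicI32>> =
        Arc::new((0..128).map(|_| AtomicI32::new(0)).collect());
    let handles: Vec<_> = (0..4)
        .map(|thread_number| {
            let arr = Arc::clone(&arr);
            let idx = thread_number * stride;
            thread::spawn(move || {
                for _ in 0..1_000_000 {
                    // Relaxed is enough: we only need the store to actually
                    // happen, not any cross-thread ordering guarantees.
                    arr[idx].fetch_add(3, Ordering::Relaxed);
                }
            })
        })
        .collect();
    for handle in handles {
        handle.join().unwrap();
    }
    arr[0].load(Ordering::Relaxed)
}
```

With either stride, thread 0 is the only writer of index 0, so the function returns 3,000,000; only the wall-clock time should differ between the two strides.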
Benchmarking it with two stride arguments (producing positions 0, 1, 2, 3 and 0, 16, 32, 48) gives me almost identical results: 108.34 and 105.07 microseconds.
I use the criterion crate for the benchmarks. I have a MacBook Pro (2015) with an Intel i5-5257U CPU (2.70 GHz). My system reports a 64-byte cache line size.
If anybody wants to see my full benchmarking code, here are the links:
- lib.rs
- cache_lines.rs
I want to understand the problem and find a way to reproduce results similar to those in the article.