I'm trying to reproduce example 6 of the Gallery of Processor Cache Effects.
The article gives this function (in C#) as an example of how to demonstrate false sharing:
private static int[] s_counter = new int[1024];

private void UpdateCounter(int position)
{
    for (int j = 0; j < 100000000; j++)
    {
        s_counter[position] = s_counter[position] + 3;
    }
}
If we create four threads and pass 0, 1, 2, 3 as the arguments, the computation takes a long time to finish (the author got 4.3 seconds) because the counters sit next to each other on the same cache line. If we instead pass, for instance, 16, 32, 48, 64, we get much better results (0.28 seconds).
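To make the arithmetic concrete, here is a small sketch of my own (not from the article) that maps an `i32` array index to the 64-byte cache line it occupies, assuming the array starts on a cache-line boundary:

```rust
// Map an i32 array index to the 64-byte cache line it falls on,
// assuming the array itself is aligned to a cache-line boundary.
fn line_of(index: usize) -> usize {
    index * std::mem::size_of::<i32>() / 64
}

fn main() {
    // Indices 0..4 all land on line 0 -> the four counters share a line.
    let close: Vec<usize> = (0..4).map(line_of).collect();
    // Indices 0, 16, 32, 48 land on lines 0, 1, 2, 3 -> no sharing.
    let spread: Vec<usize> = (0..4).map(|t| line_of(t * 16)).collect();
    println!("{:?} {:?}", close, spread);
}
```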
I've come up with the following function in Rust:
use std::sync::Arc;
use std::thread;

pub fn cache_line_sharing(arr: [i32; 128], pos: usize) -> (i32, i32) {
    let arr = Arc::new(arr);
    let handles: Vec<_> = (0..4)
        .map(|thread_number| {
            let arr = arr.clone();
            let pos = thread_number * pos;
            // Deliberately racy: each thread bumps its own slot through a
            // raw pointer, mimicking the C# example.
            thread::spawn(move || unsafe {
                let p = (arr.as_ptr() as *mut i32).offset(pos as isize);
                for _ in 0..1_000_000 {
                    *p = (*p).wrapping_add(3);
                }
            })
        })
        .collect();
    for handle in handles {
        handle.join().unwrap();
    }
    (arr[0], arr[1])
}
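While debugging this, I also sketched a safe variant using `AtomicI32` with relaxed ordering, so that every iteration is a real read-modify-write on memory that the optimizer cannot collapse into a single store (a plain counted loop of non-volatile writes can legally be folded away). The function name and the fixed thread count are my own choices, not from the article:

```rust
use std::sync::atomic::{AtomicI32, Ordering};
use std::sync::Arc;
use std::thread;

// Hypothetical atomic variant: `stride` is the index spacing between the
// four threads' counters (1 -> same cache line, 16 -> separate lines).
pub fn cache_line_sharing_atomic(stride: usize) -> i32 {
    let arr: Arc<Vec<AtomicI32>> =
        Arc::new((0..128).map(|_| AtomicI32::new(0)).collect());
    let handles: Vec<_> = (0..4)
        .map(|thread_number| {
            let arr = Arc::clone(&arr);
            let idx = thread_number * stride;
            thread::spawn(move || {
                for _ in 0..1_000_000 {
                    // Relaxed is enough: we only need the store to actually
                    // happen, not any cross-thread ordering guarantees.
                    arr[idx].fetch_add(3, Ordering::Relaxed);
                }
            })
        })
        .collect();
    for handle in handles {
        handle.join().unwrap();
    }
    arr[0].load(Ordering::Relaxed)
}
```

With either stride, thread 0 is the only writer of index 0, so the function returns 3,000,000; only the wall-clock time should differ between the two strides.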
Benchmarking it with two stride arguments (producing positions 0, 1, 2, 3 and 0, 16, 32, 48) gives me almost identical results: 108.34 and 105.07 microseconds.
I use the criterion crate for the benchmarks. I have a MacBook Pro (2015) with an Intel i5-5257U CPU (2.70 GHz). My system reports a 64-byte cache line size.
If anybody wants to see my full benchmarking code, here are the links:
- lib.rs
- cache_lines.rs
I want to understand the problem and find a way to reproduce results similar to those in the article.