False sharing is easiest to observe when two atomic variables in the same cache line are incremented by different threads using a read-modify-write (RMW) operation.
To perform such an increment, each CPU has to flush its store buffer and lock the cache line for the duration of the operation, that is:
- lock cache line
- read value from L1 cache into register
- increment value inside register
- write back to L1 cache
- unlock cache line
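A minimal sketch of such a benchmark (the struct name, iteration count, and memory order are my own assumptions, not taken from the original test):

```cpp
#include <atomic>
#include <thread>

// Two atomic counters as adjacent members: on typical hardware both end
// up in the same 64-byte cache line, so the threads below falsely share it.
struct Counters {
    std::atomic<int> a{0};
    std::atomic<int> b{0};
};

int main() {
    Counters c;
    const int iters = 100000000; // hypothetical iteration count

    // Each thread performs an atomic RMW (fetch_add) on its own counter,
    // but every increment must lock the cache line they share.
    std::thread t1([&] { for (int i = 0; i < iters; ++i) c.a.fetch_add(1, std::memory_order_relaxed); });
    std::thread t2([&] { for (int i = 0; i < iters; ++i) c.b.fetch_add(1, std::memory_order_relaxed); });
    t1.join();
    t2.join();
}
```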
The effect of a single cache line constantly bouncing between CPUs is noticeable, even with full compiler optimization.
Forcing the two variables into different cache lines (by adding padding) can yield a significant performance increase, since each CPU then has its own cache line to itself.
Locking a cache line is still necessary for each increment, but no time is wasted waiting to regain read-write access to it.
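One way to add that padding is to over-align each member; this variant of the struct above assumes a 64-byte cache line (typical for x86):

```cpp
#include <atomic>

// Each counter is forced onto its own 64-byte cache line. C++17 also
// offers std::hardware_destructive_interference_size in <new> where
// supported, which can replace the hard-coded 64.
struct PaddedCounters {
    alignas(64) std::atomic<int> a{0};
    alignas(64) std::atomic<int> b{0};
};
```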
If both variables are plain integers, the situation is different, because incrementing a plain integer is an ordinary load, add, and store (i.e. not an atomic RMW operation).
Without padding, the effects of the cache line bouncing between cores may still be noticeable, but on a much smaller scale, since cache-line locking is no longer involved.
If you compile with full optimization, the entire while-loop is probably going to be replaced with a single increment, and there will be no difference anymore.
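A plain-int version of the same sketch (again with assumed names and counts) shows why: there is no lock prefix on the increments, and the optimizer is free to keep each counter in a register:

```cpp
#include <thread>

// Non-atomic variant: each increment is a plain load, add, and store,
// so the cache line is never locked (only transferred between cores).
struct PlainCounters {
    int a = 0;
    int b = 0;
};

int main() {
    PlainCounters c;
    const int iters = 100000000; // hypothetical iteration count

    std::thread t1([&] { for (int i = 0; i < iters; ++i) ++c.a; });
    std::thread t2([&] { for (int i = 0; i < iters; ++i) ++c.b; });
    t1.join();
    t2.join();
    // With full optimization the compiler can hoist each counter into a
    // register and collapse the whole loop into a single store of iters.
}
```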
On my 4-core x86 machine, I get the following numbers:
| variables  | padding | optimization | real    | user     |
|------------|---------|--------------|---------|----------|
| atomic int | no      | none         | 57.960s | 114.495s |
| atomic int | yes     | none         | 10.514s | 20.793s  |
| atomic int | no      | full         | 55.732s | 110.178s |
| atomic int | yes     | full         | 8.712s  | 17.214s  |
| int        | no      | none         | 2.206s  | 4.348s   |
| int        | yes     | none         | 1.951s  | 3.853s   |
| int        | no      | full         | 0.002s  | 0.000s   |
| int        | yes     | full         | 0.002s  | 0.000s   |