
I wrote the following test class in Java to reproduce the performance penalty introduced by "false sharing".

Basically, you can tweak the "size" field from 4 to a much larger value (e.g. 10000) to turn the false-sharing phenomenon on or off. To be specific, when size = 4, different threads are much more likely to update values within the same cache line, causing much more frequent cache misses. In theory, the test program should run much faster with size = 10000 than with size = 4.
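
(To put numbers on it: assuming 4-byte Java ints and a typical 64-byte cache line, size = 4 gives interval = 1, so the four threads write indices 5000 to 5003, which span only 16 bytes and almost certainly sit in the same cache line; size = 10000 gives interval = 2500, so the written cells are 10000 bytes apart and cannot share a line.)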

I ran the same test on two different machines multiple times:

Machine A: Lenovo X230 laptop w/ Intel® Core™ i5-3210M Processor (2 cores, 4 threads), Windows 7 64-bit

size = 4 => 5.5 seconds

size = 10000 => 5.4 seconds

Machine B: Dell OptiPlex 780 w/ Intel® Core™2 Duo Processor E8400 (2 cores), Windows XP 32-bit

size = 4 => 14.5 seconds

size = 10000 => 7.2 seconds

I later ran the tests on a few other machines, and quite obviously false sharing only becomes noticeable on certain machines; I couldn't figure out the decisive factor that makes the difference.

Can anyone kindly take a look at this problem and explain why the false sharing introduced in this test class only becomes noticeable on certain machines?

public class FalseSharing {

interface Oper {
    int eval(int value);
}

// try tweaking the size: 4 provokes false sharing, 10000 avoids it
static int size = 4;

// try tweaking the operation
static Oper op = new Oper() {
    @Override
    public int eval(int value) {
        return value + 2;
    }
};

static int[] array = new int[10000 + size];

// with size = 4 the interval is 1, so the threads update adjacent int cells;
// with size = 10000 it is 2500, so the updated cells are 10000 bytes apart
static final int interval = (size / 4);

public static void main(String args[]) throws InterruptedException {

    long start = System.currentTimeMillis();
    Thread t1 = new Thread(new Runnable() {
        @Override
        public void run() {

            System.out.println("Array index:" + 5000);

            for (int j = 0; j < 30; j++) {
                for (int i = 0; i < 1000000000; i++) {
                    array[5000] = op.eval(array[5000]);
                }
            }
        }
    });
    Thread t2 = new Thread(new Runnable() {
        @Override
        public void run() {

            System.out.println("Array index:" + (5000 + interval));

            for (int j = 0; j < 30; j++) {
                for (int i = 0; i < 1000000000; i++) {
                    array[5000 + interval] = op.eval(array[5000 + interval]);
                }
            }
        }
    });
    Thread t3 = new Thread(new Runnable() {
        @Override
        public void run() {

            System.out.println("Array index:" + (5000 + interval * 2));

            for (int j = 0; j < 30; j++) {
                for (int i = 0; i < 1000000000; i++) {
                    array[5000 + interval * 2] = op.eval(array[5000 + interval * 2]);
                }
            }
        }
    });
    Thread t4 = new Thread(new Runnable() {
        @Override
        public void run() {

            System.out.println("Array index:" + (5000 + interval * 3));

            for (int j = 0; j < 30; j++) {
                for (int i = 0; i < 1000000000; i++) {
                    array[5000 + interval * 3] = op.eval(array[5000 + interval * 3]);
                }
            }
        }
    });
    t1.start();
    t2.start();
    t3.start();
    t4.start();
    t1.join();
    t2.join();
    t3.join();
    t4.join();
    System.out.println("Finished!" + (System.currentTimeMillis() - start));
}

}

njzhxf
  • Architectural differences can change the result. Mobile processors are designed differently to save power and aren't as fast in the first place. A drop of 2x is pretty small; you can get a much worse slowdown due to false sharing. – Peter Lawrey Sep 13 '13 at 10:48
  • @PeterLawrey In this case the mobile processor (i5-3210M) shows no drop in performance :) – njzhxf Sep 13 '13 at 10:56

2 Answers


False sharing only occurs within 64-byte blocks. You need to be accessing the same 64-byte block in all four threads. I suggest you create an object or a long[8] array and update different cells of this array from all four threads, then compare with the four threads accessing independent arrays.
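
For illustration (this is not code from the answer; the class and field names are made up), a minimal sketch of the suggested comparison: four threads incrementing adjacent cells of one shared long[8], which very likely share a single 64-byte cache line, versus each thread incrementing a cell in its own separate array.

// Hypothetical sketch: compares four threads updating adjacent cells of one
// shared long[8] with four threads updating independent arrays.
// Assumes a typical 64-byte cache line.
public class SharingComparison {

    static final int THREADS = 4;
    static final long ITERATIONS = 200_000_000L;

    // all four hot counters live in one long[8], i.e. within 64 bytes of each other
    static final long[] shared = new long[8];

    // one separate long[64] per thread, so the hot cells sit on different cache lines
    static final long[][] independent = new long[THREADS][64];

    static long run(final boolean useShared) throws InterruptedException {
        Thread[] threads = new Thread[THREADS];
        long start = System.currentTimeMillis();
        for (int t = 0; t < THREADS; t++) {
            final int slot = t;
            threads[t] = new Thread(new Runnable() {
                @Override
                public void run() {
                    long[] target = useShared ? shared : independent[slot];
                    int index = useShared ? slot : 0; // adjacent cells vs. own array
                    for (long i = 0; i < ITERATIONS; i++) {
                        target[index]++;
                    }
                }
            });
            threads[t].start();
        }
        for (Thread t : threads) {
            t.join();
        }
        return System.currentTimeMillis() - start;
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println("same cache line    : " + run(true) + " ms");
        System.out.println("independent arrays : " + run(false) + " ms");
    }
}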

Peter Lawrey
  • I think I AM accessing the same array in this test. – njzhxf Sep 14 '13 at 01:49
  • @njzhxf it doesn't matter if it's the same array or object. What matters is whether you are in the same 64-byte cache line, i.e. the same 64 bytes of address space. Your object is so large that it is spread across many cache lines, so there is very little sharing. – Peter Lawrey Sep 14 '13 at 06:45
  • when size is 4, the 4 threads are accessing 4 contiguous cells in the integer array, and these 4 cells are very likely to be placed in the same cache line. @PeterLawrey – njzhxf Sep 14 '13 at 08:19
  • @njzhxf Correct, you don't have control over whether you are using one cache line or two unless you use direct memory, in which case you can examine the address (see the sketch below). – Peter Lawrey Sep 14 '13 at 18:04
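
Not from the answer, but to illustrate the "examine the address" point in the last comment: a rough sketch using sun.misc.Unsafe (obtained via the well-known reflection hack; HotSpot-specific and for illustration only) that allocates native memory and prints which 64-byte line each 4-byte slot falls on.

import java.lang.reflect.Field;
import sun.misc.Unsafe;

// Rough sketch of the "examine the address" idea; sun.misc.Unsafe is not a
// public API and is fetched here via reflection purely for illustration.
public class CacheLineAddresses {
    public static void main(String[] args) throws Exception {
        Field f = Unsafe.class.getDeclaredField("theUnsafe");
        f.setAccessible(true);
        Unsafe unsafe = (Unsafe) f.get(null);

        long base = unsafe.allocateMemory(4 * 4); // room for four ints
        try {
            for (int slot = 0; slot < 4; slot++) {
                long addr = base + slot * 4L;     // address of this int slot
                unsafe.putInt(addr, 0);
                // assuming 64-byte cache lines, addr / 64 identifies the line
                System.out.println("slot " + slot + " @ " + addr
                        + " -> cache line " + (addr / 64));
            }
        } finally {
            unsafe.freeMemory(base);
        }
    }
}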

Your code is probably fine. Here is a simpler version, with results:

import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;


public class TestFalseSharing {
    static long T0 = System.currentTimeMillis();

    static void p(Object msg) {
        System.out.format("%09.3f %-10s %s%n", new Double(0.001*(System.currentTimeMillis()-T0)), Thread.currentThread().getName(), " : "+msg);
    }

    public static void main(String args[]) throws InterruptedException {
        int NT = Runtime.getRuntime().availableProcessors();
        p("Available processors: "+NT);

        int MAXSPAN = 0x1000; //4kB
        final byte[] array = new byte[NT*MAXSPAN];

        for(int i=1; i<=MAXSPAN; i<<=1) {
            testFalseSharing(NT, i, array);
        }
    }

    static void testFalseSharing(final int NT, final int span, final byte[] array) throws InterruptedException {
        final int L1 = 10;
        final int L2 = 10_000_000;

        final CountDownLatch cl = new CountDownLatch(NT*L1);

        long t0 = System.nanoTime();

        for (int i = 0; i < NT; i++) {   // start one updater thread per available processor
            final int startOffset = i*span;

            Thread t = new Thread(new Runnable() {
                @Override
                public void run() {
                    //p("Offset:" + startOffset);
                    for (int j = 0; j < L1; j++) {
                        for (int k = 0; k < L2; k++) {
                            array[startOffset] += 1;
                        }
                        cl.countDown();
                    }
                }
            });
            t.start();

        }

        while(!cl.await(10, TimeUnit.SECONDS)) {
            p(""+cl.getCount()+" left");
        }

        long d = System.nanoTime() - t0;
        p("Duration: " + 1e-9*d + " seconds, Span="+span+" bytes");
    }
}

Results:

00000.000 main        : Available processors: 4
00002.843 main        : Duration: 2.837645384 seconds, Span=1 bytes
00005.689 main        : Duration: 2.8454065760000002 seconds, Span=2 bytes
00008.659 main        : Duration: 2.9697156340000004 seconds, Span=4 bytes
00011.640 main        : Duration: 2.979306959 seconds, Span=8 bytes
00013.780 main        : Duration: 2.140246744 seconds, Span=16 bytes
00015.387 main        : Duration: 1.6061148440000002 seconds, Span=32 bytes
00016.729 main        : Duration: 1.34128957 seconds, Span=64 bytes
00017.944 main        : Duration: 1.215005455 seconds, Span=128 bytes
00019.208 main        : Duration: 1.263007368 seconds, Span=256 bytes
00020.477 main        : Duration: 1.269272208 seconds, Span=512 bytes
00021.719 main        : Duration: 1.241061631 seconds, Span=1024 bytes
00022.975 main        : Duration: 1.256024242 seconds, Span=2048 bytes
00024.171 main        : Duration: 1.195086858 seconds, Span=4096 bytes

So, to answer: this confirms the 64-byte cache-line theory, at least on my Core i5 laptop.

user2023577