43

I had a small dispute over the performance of synchronized blocks in Java. This is a theoretical question that does not affect real-life applications.

Consider a single-threaded application that uses locks and synchronized sections. Does this code run slower than the same code without synchronized sections? If so, why? We are not discussing concurrency, since it is a single-threaded application.

Update

I found an interesting benchmark testing it, but it's from 2001. Things could have changed dramatically in the latest versions of the JDK.

Hearen
Anton
  • Nice as that article is, things have evolved **a lot** since it was written ten years ago. – NPE Dec 15 '11 at 14:49
  • Long answer: yes. The JVM will always need to resolve whether or not the object's key is available, regardless of how evolved Java becomes. – arkon Jun 03 '13 at 01:30
  • Note that the article publishes ratios: it takes 2.4 times as long to call an empty method. This is unhelpful because locking isn't a proportional thing. `synchronized (lock)` will take a roughly *fixed* amount of time per call to check that the lock is available and then acquire it, even if there are no other threads running. The amount of time will depend on your installation and hardware. Mine (Linux, Oracle Java 8 JRE, Intel i7) took 27.25 nanoseconds on average to acquire a lock. – Terrible Tadpole Mar 03 '18 at 08:37

6 Answers

58

Single-threaded code will still run slower when using synchronized blocks. Obviously you will not have other threads stalled while waiting for other threads to finish, however you will have to deal with the other effects of synchronization, namely cache coherency.

Synchronized blocks are not only used for mutual exclusion, but also for visibility. Every synchronized block is a memory barrier: the JVM is free to work on variables in registers, instead of main memory, on the assumption that multiple threads will not access that variable. Without synchronized blocks, this data could be stored in a CPU's cache, and different threads on different CPUs would not see the same data. By using a synchronized block, you force the JVM to write this data to main memory for visibility to other threads.

So even though you're free from lock contention, the JVM will still have to do housekeeping in flushing data to main memory.

In addition, synchronization constrains optimization. The JVM is free to reorder instructions in order to optimize; consider a simple example:

foo++;
bar++;

versus:

foo++;
synchronized(obj)
{
    bar++;
}

In the first example, the compiler is free to load foo and bar at the same time, then increment them both, then save them both. In the second example, the compiler must perform the load/add/save on foo, then perform the load/add/save on bar. Thus, synchronization may impact the ability of the JRE to optimize instructions.
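
As a runnable sketch of the two fragments above (the surrounding class, `main` method, and loop counts are scaffolding of mine, added for illustration), both variants produce the same result in a single thread; only the optimization freedom differs:

```java
// Runnable version of the foo/bar example. Whether the JIT actually combines
// the two increments in plain() is JVM-dependent and not directly observable
// from this program; the point is what the JVM is *allowed* to do.
public class ReorderDemo {
    static int foo = 0;
    static int bar = 0;
    static final Object obj = new Object();

    static void plain() {
        foo++;   // the JIT may load both fields, increment both, then store both
        bar++;
    }

    static void withLock() {
        foo++;
        synchronized (obj) {
            bar++;   // may not be moved out of the synchronized region
        }
    }

    public static void main(String[] args) {
        for (int i = 0; i < 1_000_000; i++) plain();
        for (int i = 0; i < 1_000_000; i++) withLock();
        System.out.println(foo + " " + bar);   // prints "2000000 2000000"
    }
}
```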

(An excellent book on the Java Memory Model is Brian Goetz's Java Concurrency In Practice.)

Edward Thomson
47

There are three types of locking in HotSpot:

  1. Fat: the JVM relies on OS mutexes to acquire the lock.
  2. Thin: the JVM uses a CAS algorithm.
  3. Biased: CAS is a rather expensive operation on some architectures. Biased locking is a special type of locking optimized for the scenario where only one thread is working on the object.

By default the JVM uses thin locking. Later, if the JVM determines that there is no contention, thin locking is converted to biased locking. The operation that changes the type of the lock is rather expensive, so the JVM does not apply this optimization immediately. There is a special JVM option, -XX:BiasedLockingStartupDelay=delay, which tells the JVM when this kind of optimization should be applied.

Once biased, that thread can subsequently lock and unlock the object without resorting to expensive atomic instructions.

The answer to the question: it depends. But with biased locking, single-threaded code with and without locking has, on average, the same performance.
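
As a sketch of how one would exercise this (flag spellings as documented for HotSpot; note that biased locking was deprecated in JDK 15 by JEP 374 and later removed, so this applies to older JVMs only; `MyApp` is a placeholder):

```shell
# HotSpot, JDK 6-14: enable biased locking with no startup delay.
# Biased locking was deprecated in JDK 15 (JEP 374) and later removed.
# "MyApp" is a placeholder main class.
java -XX:+UseBiasedLocking -XX:BiasedLockingStartupDelay=0 MyApp
```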

NamshubWriter
Anton
20

There is some overhead in acquiring a non-contested lock, but on modern JVMs it is very small.

A key run-time optimization that's relevant to this case is called "Biased Locking" and is explained in the Java SE 6 Performance White Paper.

If you wanted to have some performance numbers that are relevant to your JVM and hardware, you could construct a micro-benchmark to try and measure this overhead.
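
A minimal sketch of such a micro-benchmark (my own, not from the white paper; the absolute numbers depend entirely on your JVM and hardware, and a hand-rolled loop like this is no substitute for a proper harness such as JMH):

```java
// Crude micro-benchmark comparing an uncontended synchronized increment
// to a plain one. Results are only indicative; use JMH for real measurements.
public class LockOverhead {
    static long counter = 0;
    static final Object lock = new Object();

    static long timePlain(int n) {
        long start = System.nanoTime();
        for (int i = 0; i < n; i++) counter++;
        return System.nanoTime() - start;
    }

    static long timeLocked(int n) {
        long start = System.nanoTime();
        for (int i = 0; i < n; i++) {
            synchronized (lock) { counter++; }   // uncontended lock acquire/release
        }
        return System.nanoTime() - start;
    }

    public static void main(String[] args) {
        int n = 10_000_000;
        timePlain(n);    // warm-up runs so the JIT compiles both loops
        timeLocked(n);
        System.out.printf("plain:  %.2f ns/op%n", (double) timePlain(n) / n);
        System.out.printf("locked: %.2f ns/op%n", (double) timeLocked(n) / n);
    }
}
```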

NPE
  • 6
    I tested this. It is so small that you cannot measure the effect at all. They say that the effect was much more significant for older versions of JVM. – AlexR Dec 15 '11 at 14:53
  • 1
    @AlexR: Nice, thanks for sharing. It does not surprise me that the effect used to be more significant, since Biased Locking optimizations only got added in Java 6. – NPE Dec 15 '11 at 14:56
  • 4
    *so small that you cannot measure the effect at all*: such a claim cannot be made lightly. When testing something in a tight loop, the JVM can do great magic, but that doesn't represent "real world" apps. The JVM gets stupid really fast when execution becomes complex. – irreputable Dec 16 '11 at 04:04
  • 1
    @AlexR I agree with irreputable's comment, in fact, your statement contradicts itself. Besides, unless you claim to be an expert in [designing reliable and proper microbenchmarks in Java](http://www.ibm.com/developerworks/java/library/j-jtp02225/index.html), then anyone should be taking that **bold** claim with a grain of salt. – arkon Jun 03 '13 at 01:09
10

Using locks when you don't need to will slow down your application. It could be too small to measure or it could be surprisingly high.

IMHO, the best approach is often to use lock-free code in a single-threaded program, to make it clear that the code is not intended to be shared across threads. This can be more important for maintenance than any performance issue.

public static void main(String... args) throws IOException {
    for (int i = 0; i < 3; i++) {
        perfTest(new Vector<Integer>());
        perfTest(new ArrayList<Integer>());
    }
}

private static void perfTest(List<Integer> objects) {
    long start = System.nanoTime();
    final int runs = 100000000;
    for (int i = 0; i < runs; i += 20) {
        // add items.
        for (int j = 0; j < 20; j+=2)
            objects.add(i);
        // remove from the end.
        while (!objects.isEmpty())
            objects.remove(objects.size() - 1);
    }
    long time = System.nanoTime() - start;
    System.out.printf("%s each add/remove took an average of %.1f ns%n", objects.getClass().getSimpleName(),  (double) time/runs);
}

prints

Vector each add/remove took an average of 38.9 ns
ArrayList each add/remove took an average of 6.4 ns
Vector each add/remove took an average of 10.5 ns
ArrayList each add/remove took an average of 6.2 ns
Vector each add/remove took an average of 10.4 ns
ArrayList each add/remove took an average of 5.7 ns

From a performance point of view, if 4 ns is important to you, you have to use the non-synchronized version.

For 99% of use cases, the clarity of the code is more important than performance. Clear, simple code often performs reasonably well too.

BTW: I am using a 4.6 GHz i7 2600 with Oracle Java 7u1.


For comparison, if I do the following, where perfTest1, perfTest2 and perfTest3 are identical:

    perfTest1(new ArrayList<Integer>());
    perfTest2(new Vector<Integer>());
    perfTest3(Collections.synchronizedList(new ArrayList<Integer>()));

I get

ArrayList each add/remove took an average of 2.6 ns
Vector each add/remove took an average of 7.5 ns
SynchronizedRandomAccessList each add/remove took an average of 8.9 ns

If I use a common perfTest method, the JIT cannot inline the code as optimally and they are all slower:

ArrayList each add/remove took an average of 9.3 ns
Vector each add/remove took an average of 12.4 ns
SynchronizedRandomAccessList each add/remove took an average of 13.9 ns

Swapping the order of tests

ArrayList each add/remove took an average of 3.0 ns
Vector each add/remove took an average of 39.7 ns
ArrayList each add/remove took an average of 2.0 ns
Vector each add/remove took an average of 4.6 ns
ArrayList each add/remove took an average of 2.3 ns
Vector each add/remove took an average of 4.5 ns
ArrayList each add/remove took an average of 2.3 ns
Vector each add/remove took an average of 4.4 ns
ArrayList each add/remove took an average of 2.4 ns
Vector each add/remove took an average of 4.6 ns

one at a time

ArrayList each add/remove took an average of 3.0 ns
ArrayList each add/remove took an average of 3.0 ns
ArrayList each add/remove took an average of 2.3 ns
ArrayList each add/remove took an average of 2.2 ns
ArrayList each add/remove took an average of 2.4 ns

and

Vector each add/remove took an average of 28.4 ns
Vector each add/remove took an average of 37.4 ns
Vector each add/remove took an average of 7.6 ns
Vector each add/remove took an average of 7.6 ns
Vector each add/remove took an average of 7.6 ns
Peter Lawrey
  • 1
    I tested it on an IBM JDK and, except for the first run, Vector and ArrayList have about a 10% performance difference on my machine (54ns vs 48-50ns). I also tested it with Collections.synchronizedList and was surprised by its bad performance. It was about twice as slow as Vector/ArrayList (110ns). – Stefan Dec 15 '11 at 15:36
  • This is another reason to be concerned about micro-tuning. Using a different system, hardware, JVM can give you a different result. – Peter Lawrey Dec 15 '11 at 15:39
  • btw, the code like that is first optimized for Vector, then deoptimized and optimized again, since the call target (List) changes. Since can't be sure about the proper deoptimization (could be just guarded call to vector + trap) ArrayList case would suffer. Can you swap the test, i.e. ArrayList then Vector. Mostly curious. OTOH the case is bloody perfect bias locking test too. Also the CAS is quite cheap on your CPU, on older architectures CAS is quite an expensive call (if biased locking is disabled) – bestsss Dec 15 '11 at 15:42
  • 1
    @bestsss Check http://ideone.com/8jpoZ. Didnt make a difference. It looks like Vector needs a while to warmup, see last results. – Stefan Dec 15 '11 at 15:46
  • @bestsss I think the tests show its best to measure rather than guess, because I wouldn't have thought that running ArrayList first would be faster for both collections, or that running Vector alone would be slower than with ArrayList. – Peter Lawrey Dec 15 '11 at 15:49
  • The results appear to be platform dependant. Are you using Java 7u1? – Peter Lawrey Dec 15 '11 at 15:50
  • @Peter Lawrey Ideone is using sun-jdk-1.6.0.17 at work we use the IBM JDK version 1.6.0 – Stefan Dec 15 '11 at 15:56
  • @Stefan, it is curious that its 2.3 ns on my test vs 50 ns in your test. It suggests that other factors are far more important. ;) – Peter Lawrey Dec 15 '11 at 15:59
  • Your machine is probably better than my 5 year old (steam powered) computer. Btw i ran the test with more warmup and i consistenly get(time factor) AL 1x(~26ns), Vec 2x(~52ns), Sync >3x(~90ns) – Stefan Dec 15 '11 at 16:04
  • @Stefan can you try the code calling three copies of the `perfTest` method as I have above? – Peter Lawrey Dec 15 '11 at 16:07
  • @Peter Lawrey Ah i forgot i made some changes to it. But the numbers are the same after some warmup(~26ns vs ~52ns). – Stefan Dec 15 '11 at 16:13
  • another side note for better warm up and microbenchmark, spread the big loop into 2 different methods. something like `static void invokeTest (){for i=0;i<100000;i++) perfTest(objects);}` and reduce runs to something like 1000, time calculation should be outside the inner loop, ofc. this allows the JIT to avoid direct OSR in the very hot loop. Some nice info: http://www.azulsystems.com/blog/cliff/2011-11-22-what-the-heck-is-osr-and-why-is-it-bad-or-good – bestsss Dec 15 '11 at 23:27
0

This sample code (with 100 threads, each making 1,000,000 iterations) demonstrates the performance difference between avoiding and not avoiding a synchronized block.

Output:

Total time(Avoid Sync Block): 630ms
Total time(NOT Avoid Sync Block): 6360ms
Total time(Avoid Sync Block): 427ms
Total time(NOT Avoid Sync Block): 6636ms
Total time(Avoid Sync Block): 481ms
Total time(NOT Avoid Sync Block): 5882ms

Code:

import org.apache.commons.lang.time.StopWatch;

public class App {
    public static int countTheads = 100;
    public static int loopsPerThead = 1000000;
    public static int sleepOfFirst = 10;

    public static volatile int runningCount = 0;    // volatile: read without a lock in the wait loop below
    public static volatile Boolean flagSync = null; // volatile: checked outside the synchronized block in AvoidSyncBlockRunner

    public static void main( String[] args )
    {        
        for (int j = 0; j < 3; j++) {     
            App.startAll(new App.AvoidSyncBlockRunner(), "(Avoid Sync Block)");
            App.startAll(new App.NotAvoidSyncBlockRunner(), "(NOT Avoid Sync Block)");
        }
    }

    public static void startAll(Runnable runnable, String description) {
        App.runningCount = 0;
        App.flagSync = null;
        Thread[] threads = new Thread[App.countTheads];

        StopWatch sw = new StopWatch();
        sw.start();
        for (int i = 0; i < threads.length; i++) {
            threads[i] = new Thread(runnable);
        }
        for (int i = 0; i < threads.length; i++) {
            threads[i].start();
        }
        do {
            try {
                Thread.sleep(10);
            } catch (InterruptedException e) {
                e.printStackTrace();
            }
        } while (runningCount != 0);
        System.out.println("Total time"+description+": " + (sw.getTime() - App.sleepOfFirst) + "ms");
    }

    public static void commonBlock() {
        String a = "foo";
        a += "Baa";
    }

    public static synchronized void incrementCountRunning(int inc) {
        runningCount = runningCount + inc;
    }

    public static class NotAvoidSyncBlockRunner implements Runnable {

        public void run() {
            App.incrementCountRunning(1);
            for (int i = 0; i < App.loopsPerThead; i++) {
                synchronized (App.class) {
                    if (App.flagSync == null) {
                        try {
                            Thread.sleep(App.sleepOfFirst);
                        } catch (InterruptedException e) {
                            e.printStackTrace();
                        }
                        App.flagSync = true;
                    }
                }
                App.commonBlock();
            }
            App.incrementCountRunning(-1);
        }
    }

    public static class AvoidSyncBlockRunner implements Runnable {

        public void run() {
            App.incrementCountRunning(1);
            for (int i = 0; i < App.loopsPerThead; i++) {
                // THIS "IF" MAY SEEM POINTLESS, BUT IT AVOIDS THE NEXT 
                //ITERATION OF ENTERING INTO THE SYNCHRONIZED BLOCK
                if (App.flagSync == null) {
                    synchronized (App.class) {
                        if (App.flagSync == null) {
                            try {
                                Thread.sleep(App.sleepOfFirst);
                            } catch (InterruptedException e) {
                                e.printStackTrace();
                            }
                            App.flagSync = true;
                        }
                    }
                }
                App.commonBlock();
            }
            App.incrementCountRunning(-1);
        }
    }
}
Hailton
  • 1
    The OP is asking what happens for single-thread applications, ie, if unnecessary synchronisation makes a difference. The differences you obtain in your case are obvious, cause you're synchronising many actual threads upon the same object all the time, forcing them to almost run one at a time. – zakmck Apr 11 '19 at 13:05
0

Assuming you're using the HotSpot VM, I believe the JVM is able to recognize that there is no contention for any resources within the synchronized block and treat it as "normal" code.
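
A related optimization that HotSpot does perform is lock elision via escape analysis: when the lock object provably never escapes the method, the monitor operations can be removed entirely (controlled by the `-XX:+DoEscapeAnalysis` and `-XX:+EliminateLocks` flags, both on by default in modern HotSpot). A hedged sketch, with class and method names of my own; whether elision actually happens depends on the JVM and on JIT compilation:

```java
// Sketch: the monitor on 'local' is a candidate for lock elision, because
// escape analysis can prove the object never leaves this method. The program
// behaves identically either way; only the synchronization cost may vanish.
public class ElisionDemo {
    static int sum(int n) {
        int total = 0;
        for (int i = 0; i < n; i++) {
            Object local = new Object();   // never escapes sum()
            synchronized (local) {         // may be elided entirely by the JIT
                total += i;
            }
        }
        return total;
    }

    public static void main(String[] args) {
        System.out.println(sum(1000));     // prints 499500
    }
}
```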

sworisbreathing
  • 3
    Citation, please. I don't think that the JVM can eliminate monitor entries and exits entirely. – erickson Dec 15 '11 at 14:48
  • I've read that somewhere as well. If the Hotspot compiler is sure that the code is only reachable from one thread, it should skip the synchronization completely. I'm not quite sure about the "is sure ..." part though and I've never actually managed to get the VM to do it. Even in a single-thread application, synchronization overhead should not be underestimated. – jarnbjo Dec 15 '11 at 14:49
  • Not sure, that it is possible for JVM to make this optimization – Anton Dec 15 '11 at 14:49
  • 2
    It can coalesce synchronized blocks when inlining. This combines the locking of several methods into one to reduce its overhead. – Peter Lawrey Dec 15 '11 at 16:09
  • @erickson, lock coarsening is a standard procedure + biased locked, you can look up how it works. although biased locking can be hit and miss depending on the underlying architecture. – bestsss Dec 15 '11 at 23:32