I am using BlockingQueues (trying both ArrayBlockingQueue and LinkedBlockingQueue) to pass objects between different threads in an application I'm currently working on. Performance and latency are relatively important in this application, so I was curious how much time it takes to pass objects between two threads using a BlockingQueue. In order to measure this, I wrote a simple program with two threads (one consumer and one producer), where I let the producer pass a timestamp (taken using System.nanoTime()) to the consumer; see the code below.

I recall reading somewhere on some forum that it took about 10 microseconds for someone else who tried this (don't know on what OS and hardware that was), so I was not too surprised when it took ~30 microseconds for me on my Windows 7 box (Intel E7500 Core 2 Duo CPU, 2.93 GHz), whilst running a lot of other applications in the background. However, I was quite surprised when I did the same test on our much faster Linux server (two Intel X5677 3.46 GHz quad-core CPUs, running Debian 5 with kernel 2.6.26-2-amd64). I expected the latency to be lower than on my Windows box, but on the contrary it was much higher - ~75-100 microseconds! Both tests were done with Sun's Hotspot JVM version 1.6.0-23.

Has anyone else done any similar tests with similar results on Linux? Or does anyone know why it is so much slower on Linux (with better hardware)? Could it be that thread switching simply is this much slower on Linux compared to Windows? If that's the case, it seems like Windows is actually much better suited for some kinds of applications. Any help in understanding the relatively high figures is much appreciated.

Edit:
After a comment from DaveC, I also did a test where I restricted the JVM (on the Linux machine) to a single core (i.e. all threads running on the same core). This changed the results dramatically - the latency went down to below 20 microseconds, i.e. better than the results on the Windows machine. I also did some tests where I restricted the producer thread to one core and the consumer thread to another (trying both to have them on the same socket and on different sockets), but this did not seem to help - the latency was still ~75 microseconds. Btw, this test application is pretty much all I'm running on the machine while performing the test.

Does anyone know if these results make sense? Should it really be that much slower if the producer and the consumer are running on different cores? Any input is really appreciated.

Edited again (6 January):
I experimented with different changes to the code and running environment:

  1. I upgraded the Linux kernel to 2.6.36.2 (from 2.6.26.2). After the kernel upgrade, the measured time changed to 60 microseconds with very small variations, from 75-100 before the upgrade. Setting CPU affinity for the producer and consumer threads had no effect, except when restricting them to the same core. When running on the same core, the latency measured was 13 microseconds.

  2. In the original code, I had the producer go to sleep for 1 second between every iteration, in order to give the consumer enough time to calculate the elapsed time and print it to the console. If I remove the call to Thread.sleep() and instead let both the producer and consumer call barrier.await() in every iteration (the consumer calls it after having printed the elapsed time to the console), the measured latency is reduced from 60 microseconds to below 10 microseconds. If running the threads on the same core, the latency gets below 1 microsecond. Can anyone explain why this reduced the latency so significantly? My first guess was that the change had the effect that the producer called queue.put() before the consumer called queue.take(), so the consumer never had to block, but after playing around with a modified version of ArrayBlockingQueue, I found this guess to be false – the consumer did in fact block. If you have some other guess, please let me know. (Btw, if I let the producer call both Thread.sleep() and barrier.await(), the latency remains at 60 microseconds).

  3. I also tried another approach – instead of calling queue.take(), I called queue.poll() with a timeout of 100 micros. This reduced the average latency to below 10 microseconds, but is of course much more CPU intensive (but probably less CPU intensive than busy waiting?).
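A consumer loop along the lines of point 3 can be sketched as follows (a sketch only, not my exact code; the class and method names are illustrative):

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.TimeUnit;

// Sketch of the poll()-with-timeout variant from point 3 above: instead of
// blocking indefinitely in take(), the consumer re-polls every 100
// microseconds. The frequent wake-ups keep the thread (and CPU) from
// settling into a deep sleep, at the cost of extra CPU time.
public class PollingConsumer {

    // Spin on poll() until an element arrives, then return it.
    static Long pollTake(ArrayBlockingQueue<Long> queue) throws InterruptedException {
        Long t;
        do {
            t = queue.poll(100, TimeUnit.MICROSECONDS);
        } while (t == null);
        return t;
    }
}
```

The thread still parks between polls, so in CPU cost this sits somewhere between a single blocking take() and a pure busy spin.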

Edited again (10 January) - Problem solved:
ninjalj suggested that the latency of ~60 microseconds was due to the CPU having to wake up from deeper sleep states - and he was completely right! After disabling C-states in BIOS, the latency was reduced to <10 microseconds. This explains why I got so much better latency under point 2 above - when I sent objects more frequently the CPU was kept busy enough not to go to the deeper sleep states. Many thanks to everyone who has taken time to read my question and shared your thoughts here!

...

import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.CyclicBarrier;

public class QueueTest {

    ArrayBlockingQueue<Long> queue = new ArrayBlockingQueue<Long>(10);
    Thread consumerThread;
    CyclicBarrier barrier = new CyclicBarrier(2);
    static final int RUNS = 500000;
    volatile int sleep = 1000; // producer sleep between iterations, in ms

    public void start() {
        consumerThread = new Thread(new Runnable() {
            @Override
            public void run() {
                try {
                    barrier.await(); // rendezvous with the producer before starting
                    for (int i = 0; i < RUNS; i++) {
                        consume();
                    }
                } catch (Exception e) {
                    e.printStackTrace();
                }
            }
        });
        consumerThread.start();

        try {
            barrier.await(); // rendezvous with the consumer before producing
        } catch (Exception e) { e.printStackTrace(); }

        for (int i = 0; i < RUNS; i++) {
            try {
                if (sleep > 0)
                    Thread.sleep(sleep);
                produce();
            } catch (Exception e) {
                e.printStackTrace();
            }
        }
    }

    public void produce() {
        try {
            queue.put(System.nanoTime());
        } catch (InterruptedException e) {
        }
    }

    public void consume() {
        try {
            long t = queue.take();
            long now = System.nanoTime();
            long time = (now - t) / 1000; // divide by 1000 to get the result in microseconds
            if (sleep > 0) {
                System.out.println("Time: " + time);
            }
        } catch (Exception e) {
            e.printStackTrace();
        }
    }

    public static void main(String[] args) {
        QueueTest test = new QueueTest();
        System.out.println("Starting...");
        // Run once first, ignoring the results (warm-up)
        test.sleep = 0;
        test.start();
        // Run again, printing the results
        System.out.println("Starting again...");
        test.sleep = 1000;
        test.start();
    }
}
Johan
  • have you tried the test on the linux box restricting the jvm to using only one cpu? might help identify where the time is going – DaveC Jan 03 '11 at 12:07
  • Interesting - I tried to restrict it to a specific CPU by starting the application with the command 'taskset 0x00000001 java QueueTest', and the latency was reduced from around 75-100 to ~20 microseconds! I'm not sure I understand why though... – Johan Jan 03 '11 at 13:08
  • @Johan: Are these times you report the same over many iterations? CyclicBarrier is used to coordinate threads working on independent tasks. Your tasks though are not independent. You have both producer and consumer wait on a barrier and then (once both threads reach the barrier point) they essentially start synchronising on the blocking queue. You could see interleavings of all sorts of combinations of scheduling, reporting various delays. – Cratylus Jan 03 '11 at 13:22
  • Yes, I print out every iteration, and even though the results vary, the range is 75-100 microseconds. I used the CyclicBarrier differently at first, but in the code above it's not really necessary (the long sleep kind of ensures that the producer won't take a timestamp and try to put it on the queue before the consumer is ready, at least not after the first iteration) – Johan Jan 03 '11 at 13:34
  • Not sure of what use this information is, but I ran lmbench on the system (which I have no experience of from before), which apparently can be used to measure context switch latency, and the figures I got were ~1 to 4 microseconds. – Johan Jan 03 '11 at 13:36
  • @Johan: Once you reach the for loop, it is certain that the consumer is ready, since you do a `barrier.await` before that. This returns only if both threads reached the barrier. So the sleep is not helping at all. Also, you call sleep in every iteration of `NRUNS` in the producer code. – Cratylus Jan 03 '11 at 13:40
  • Well, the sleep is helping because it gives the consumer enough time to take the timestamp from the queue, print it on the console and go back to the queue to wait for a new object. Instead of the sleep, I could of course have used the CyclicBarrier within the loop to ensure that the consumer is actually ready to take a new object from the queue when the producer puts it there... but the sleep is sufficient; the queue size doesn't go above 1. – Johan Jan 03 '11 at 13:57
  • @Johan: I lost you. Are you using sleep to synchronize producer-consumer? But BlockingQueue is already a synchronized collection. If the queue is empty, the retrieval (consumer) blocks until there is an element available. If the queue is full, the insertion (producer) blocks until there is space free. – Cratylus Jan 03 '11 at 14:06
  • Hehe ok, I'll see if this helps clarify - if I wouldn't use sleep to let the producer wait while the consumer is processing the object that it takes from the queue (in this case calculating the time difference and printing it to the console), the producer would be much faster than the consumer in this example, and there would be a risk that the producer takes a timestamp and tries to put it on the queue but blocks because the queue is full, and consequently the timestamp couldn't be used to measure the actual queue latency. – Johan Jan 03 '11 at 14:15
  • Running this on RHEL 5.5 on a Dual Core E6750 2.66 GHz using the Sun JVM 1.6.0-21 (32 bit) gives 18-25 microseconds. Are you running a 64 bit JVM in both cases? – nos Jan 03 '11 at 19:13
  • Does this simulation accurately model your problem? Are you expecting to get 1 message per second or 1000s or more? Because if you are looking to get a lot, then you are not testing throughput here, which could be a lot better than your indicated latency shows. I would suggest taking out the sleep and checking the average diff times of a few thousand queued objects between a number of consumer threads. – DaveC Jan 03 '11 at 19:21
  • @nos: Thanks for testing. I am running 64 bit on the Linux machine and 32 bit on the Windows machine. Your results look similar to what I got when I restricted the JVM to only one core... but unfortunately I don't understand why there is such a huge increase in latency when I'm utilizing more cores. – Johan Jan 03 '11 at 19:56
  • @DaveC: I wouldn't say that it fully reflects my "real" application, but messages could arrive at any rate for a given consumer (different consumers will handle different messages), so it is realistic that the consumer thread could have been blocking for a second when a new message arrives. – Johan Jan 03 '11 at 20:01
  • CPU affinity reduces latency because it means you are controlling which cores are involved in the transfer. By pinning the whole process to a single core, in the absence of any other activity on the box, your code/data stays local to that CPU (broadly speaking) – Matt Jan 03 '11 at 20:34
  • Try removing the sleep altogether (the queue boundary will take care of it by itself) as well as the printing in the consumer thread; instead, sum the elapsed times and, when the consumer ends, print out the avg and/or median of the results. You should see 4-6 micros on avg if you don't switch the context away with sleep each time. – bestsss Jan 06 '11 at 15:58
  • @bestsss: Yes, if I removed the sleep and set the queue boundary to 1, the queue boundary would take care of it in the sense that the producer would block, but the timestamp put on the queue would then be old. I tried another approach (described under Edit again above): I let both the producer and consumer call barrier.await() in every iteration. The producer calls it before taking a timestamp and putting it on the queue, and the consumer calls it after printing the result to the console. This should ensure that the consumer is ready when the producer calls put(). Please see above.. – Johan Jan 06 '11 at 16:13
  • posted some snippet. Check it out, don't alter the CPU affinity. Also use -server when you test any micro benchmarks. – bestsss Jan 06 '11 at 18:12
  • The new Linux kernel 2.6.37 does not depend on Big Kernel Lock (BKL) anymore which should improve multi processor performance. Does upgrading the kernel to that version improve the times? – DerMike Jan 06 '11 at 19:18
  • @Dermike: no, that doesn't affect this microbenchmark at all. As Matt said, increased latency on multi-CPU is probably due to CPU locality, i.e. cache-line bouncing between CPUs. – ninjalj Jan 06 '11 at 19:50

4 Answers

Your test is not a good measure of queue handoff latency because you have a single thread reading off the queue which writes synchronously to System.out (doing a String and long concatenation while it is at it) before it takes again. To measure this properly you need to move this activity out of this thread and do as little work as possible in the taking thread.

You'd be better off just doing the calculation (then - now) in the taker and adding the result to some other collection which is periodically drained by another thread that outputs the results. I tend to do this by adding to an appropriately presized array-backed structure accessed via an AtomicReference (hence the reporting thread just has to getAndSet on that reference with another instance of that storage structure in order to grab the latest batch of results; e.g. make 2 lists, set one as active, and every x seconds a thread wakes up and swaps the active and the passive ones). You can then report some distribution instead of every single result (e.g. a decile range), which means you don't generate vast log files with every run and still get useful information printed for you.
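A minimal sketch of that two-list swap (the class and method names here are mine, purely for illustration):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.AtomicReference;

// Sketch of the batching idea above: the taking thread appends each result
// to the currently active list; a reporting thread periodically swaps in a
// fresh list via getAndSet and drains the old one. Note: a sample recorded
// exactly during the swap may land in the batch being drained; for latency
// logging that is usually acceptable.
public class LatencyRecorder {

    private final AtomicReference<List<Long>> active =
            new AtomicReference<List<Long>>(new ArrayList<Long>(100000));

    // Hot path: called by the taking thread after each take().
    public void record(long micros) {
        active.get().add(micros);
    }

    // Cold path: called every x seconds by the reporting thread.
    public List<Long> drain() {
        return active.getAndSet(new ArrayList<Long>(100000));
    }
}
```

The hot path never touches the console or disk; only the reporting thread does, so I/O stays off the latency-critical path.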

FWIW I concur with the times Peter Lawrey stated, and if latency is really critical then you need to think about busy waiting with appropriate CPU affinity (i.e. dedicate a core to that thread).

EDIT after Jan 6

If I remove the call to Thread.sleep() and instead let both the producer and consumer call barrier.await() in every iteration (the consumer calls it after having printed the elapsed time to the console), the measured latency is reduced from 60 microseconds to below 10 microseconds. If running the threads on the same core, the latency gets below 1 microsecond. Can anyone explain why this reduced the latency so significantly?

You're looking at the difference between java.util.concurrent.locks.LockSupport#park (and the corresponding unpark) and Thread#sleep. Most j.u.c. stuff is built on LockSupport (often via an AbstractQueuedSynchronizer that ReentrantLock provides, or directly), and this (in Hotspot) resolves down to sun.misc.Unsafe#park (and unpark), which tends to end up in the hands of the pthread (POSIX threads) lib: typically pthread_cond_broadcast to wake up, and pthread_cond_wait or pthread_cond_timedwait for things like BlockingQueue#take.
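The primitive in question can be seen in isolation with a toy example (a demonstration sketch, not production code):

```java
import java.util.concurrent.locks.LockSupport;

// Toy demonstration of the park/unpark primitive that j.u.c. blocking
// calls (including ArrayBlockingQueue.take) bottom out in. park() is
// allowed to return spuriously, so real code always re-checks its
// condition in a loop, which is why the flag is checked here.
public class ParkDemo {

    static volatile boolean released = false;

    public static boolean demo() throws InterruptedException {
        Thread waiter = new Thread(new Runnable() {
            public void run() {
                while (!released) {
                    LockSupport.park(); // typically a pthread_cond_wait under the hood
                }
            }
        });
        waiter.start();
        Thread.sleep(50);           // give the waiter time to park
        released = true;
        LockSupport.unpark(waiter); // wake it (pthread_cond_signal-style)
        waiter.join(1000);
        return !waiter.isAlive();   // true if the waiter woke up and finished
    }
}
```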

I can't say I've ever looked at how Thread#sleep is actually implemented (because I've never come across something low latency that isn't a condition-based wait), but I would imagine that it causes the thread to be demoted by the scheduler more aggressively than the pthread signalling mechanism does, and that is what accounts for the latency difference.

Matt
  • Thank you for your input. In this particular test I don't think the synchronous write to System.out should be a problem though since I am having the producer thread waiting for 2 seconds before it puts a new timestamp on the queue.... unless I'm missing something here? Your solution for logging timestamps with two lists and AtomicReferences sounds like a great way to log latencies in my "real" application. – Johan Jan 03 '11 at 20:23
  • yes fair point, I missed the sleep before produce. The time is dominated by wakeup time then. You should be able to see this by adding the ability to vary the rate at which you offer timestamps to the queue, latency will tend down as you increase the offer rate unless something weird is going on on the box. – Matt Jan 03 '11 at 20:32
  • @Matt: Your points are very good. But nevertheless, why the same code varies so much between Windows and Linux is still unexplained – Cratylus Jan 03 '11 at 22:20
  • this is largely a CPU question, if you run this across n hardware platforms just varying the OS (and using "equivalent" config for each OS) then you should see the CPU as the key factor. In this case there is just so much that varies across the benchmark runs; OS, hardware, server activity to name 3. The linux numbers are, IMO, clearly anomalous (no idea whether the windows ones are reasonable). There is no mention of the activity on the debian box or how scheduling is configured & the kernel is pretty old. These can all be massive factors when measuring latencies of this magnitude. – Matt Jan 03 '11 at 23:17
  • Two other things to bear in mind: 1) the comparison is a workstation vs a server, and power management can have a massive impact here; for low latency you need to run the server at full pelt all the time (i.e. tweak the BIOS to turn off every single piece of power management you can). 2) this is a 2 core non hyper-threaded CPU in a single socket vs a 4 core hyper-threaded CPU in 2 sockets. These are not remotely comparable platforms for a context switching test. – Matt Jan 03 '11 at 23:22
  • @Matt: Many thanks for the input. I think you are right; since the latency was actually lower on the Linux machine when I restricted all JVM threads to one single core, I don't think this is a question about performance in Linux vs Windows anymore. However, I still think it's odd that the latency is so much higher (<20 vs ~75 microseconds) when running both threads on the same core vs separate cores, or what do you think? Thanks – Johan Jan 04 '11 at 13:19
  • I'd expect there to be a bit of a difference between the 2 cases but not that much though tbh both your numbers are curiously high. Worst case you have a QPI hop to contend with, next worst is going via the shared L3 cache only and best case is same L1/L2 (iirc westmere is dedicated L1/L2 but shared L3 per socket). Clearly there is "further" to go unless you've bound both threads to the same core. I'd expect more like <3 for the same core case and <10 for the separate sockets assuming nothing else is going on. These things have a habit of confounding expectations mind you. – Matt Jan 04 '11 at 16:05
  • @Johan: It's not odd that the latency increases when going from one core to separate cores, synchronization between CPUs is not free, cache lines need to bounce between CPUs. For a measurement of that effect, you could try profiling cache misses using oprofile/vtune. – ninjalj Jan 06 '11 at 19:56
  • @ninjalj: Thank you for your comment. I do realize now that it's not odd that the latency increases when the two threads are running on different cores, but what I don't understand is: when the producer puts objects on the queue with a one second interval, it takes about 50 microseconds longer for the consumer to wake up and receive the object compared to if the producer puts the object on the queue just after the consumer has blocked on queue.take() – Johan Jan 07 '11 at 09:36
  • Well, you probably have hundreds of threads in the system (across all processes) and most (all) of them are waiting for some event and scheduling time. So it takes time to wake up some threads and do the context switch. The context switch is what causes the latency. After 1 sec of delay, probably none of the threads will be in the cache, and cache misses are what slows down systems nowadays. – bestsss Jan 07 '11 at 14:31
  • @Johan: maybe it's due to the CPU having to wake up from deeper sleep states. Latencies for C2/C3 states seem to be in that range (0/1/64/96 µs for C0/C1/C2/C3 on a ProLiant DL360). – ninjalj Jan 07 '11 at 19:23
  • @Johan: on Linux, powertop shows you the percent of time the CPU spends on each sleep state. – ninjalj Jan 07 '11 at 19:24
  • @ninjalj: You are completely right! After disabling C-states, the latency I got was reduced to <10 micro (compared to the ~60 I had before). This also explained why I got so much better latency when not calling Thread.sleep but instead waiting for the consumer to become ready using a barrier. Many thanks for this! (I must admit that I did not know about C-states before) – Johan Jan 10 '11 at 08:56

I would use just an ArrayBlockingQueue if you can. When I have used it, the latency was between 8-18 microseconds on Linux. Some points of note:

  • The cost is largely the time it takes to wake up the thread. When you wake up a thread, its data/code won't be in cache, so you will find that if you time what happens after a thread has woken, it can take 2-5x longer than if you were to run the same thing repeatedly.
  • Certain operations use OS calls (such as locking/cyclic barriers); these are often more expensive in a low latency scenario than busy waiting. I suggest trying to busy wait your producer rather than use a CyclicBarrier. You could busy wait your consumer as well, but this could be unreasonably expensive on a real system.
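A busy-wait pacer for the producer, along the lines of the second point, could be as simple as this sketch (class and method names are mine, purely illustrative):

```java
// Sketch of busy-waiting the producer instead of using Thread.sleep or a
// barrier: spin on System.nanoTime() until the next send time, so the
// thread never parks and its code/data stay hot in cache. (On Java 9+
// the loop body could additionally call Thread.onSpinWait() as a hint.)
public class BusyWaitPacer {

    // Burn CPU until the given deadline (in System.nanoTime() terms).
    // The subtraction makes the comparison safe against nanoTime overflow.
    static void spinUntil(long deadlineNanos) {
        while (System.nanoTime() - deadlineNanos < 0) {
            // busy spin; deliberately empty
        }
    }
}
```

Usage would be e.g. spinUntil(lastSendNanos + intervalNanos) in place of Thread.sleep(), trading a fully occupied core for the wake-up latency.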
Peter Lawrey
  • Thank you for your answer. I realize that most of the time is spent to wake up the consumer thread, but I thought a context switch in Linux would be much cheaper than the figures I get. I'm not sure if I fully understand your second point - the CyclicBarrier is only used once here (not really necessary), not in every iteration when a new timestamp is sent. – Johan Jan 03 '11 at 13:10

@Peter Lawrey

Certain operations use OS calls (such as locking/cyclic barriers)

Those are NOT OS (kernel) calls. Implemented via simple CAS (which on x86 comes w/ free memory fence as well)

One more: don't use ArrayBlockingQueue unless you know why (you use it).

@OP: Look at ThreadPoolExecutor, it offers an excellent producer/consumer framework.
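As an illustration of that suggestion (a sketch only; the single-thread pool and the names are my choice, not from the answer):

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Sketch of the executor-based alternative: the producer submits tasks
// carrying a timestamp, and the pool's single worker thread plays the
// consumer, so the handoff happens on the executor's internal work queue
// (a LinkedBlockingQueue for newSingleThreadExecutor).
public class ExecutorHandoff {

    public static long measureOnce() throws Exception {
        ExecutorService pool = Executors.newSingleThreadExecutor();
        try {
            final long sent = System.nanoTime();
            Future<Long> f = pool.submit(new Callable<Long>() {
                public Long call() {
                    return System.nanoTime() - sent; // handoff latency in nanos
                }
            });
            return f.get();
        } finally {
            pool.shutdown();
        }
    }
}
```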

Edit below:

to reduce the latency (barring the busy wait), change the queue to a SynchronousQueue and add the following line before starting the consumer:

...
consumerThread.setPriority(Thread.MAX_PRIORITY);
consumerThread.start();

This is the best you can get.


Edit2: Here w/ sync. queue. And not printing the results.

package t1;

import java.math.BigDecimal;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.SynchronousQueue;

public class QueueTest {

    static final int RUNS = 250000;

    final SynchronousQueue<Long> queue = new SynchronousQueue<Long>();

    int sleep = 1000;

    long[] results  = new long[0];
    public void start(final int runs) throws Exception {
        results = new long[runs];
        final CountDownLatch barrier = new CountDownLatch(1);
        Thread consumerThread = new Thread(new Runnable() {
            @Override
            public void run() {
                barrier.countDown();
                try {

                    for(int i = 0; i < runs; i++) {                        
                        results[i] = consume(); 

                    }
                } catch (Exception e) {
                    return;
                } 
            }
        });
        consumerThread.setPriority(Thread.MAX_PRIORITY);
        consumerThread.start();


        barrier.await();
        final long sleep = this.sleep;
        for(int i = 0; i < runs; i++) {
            try {                
                doProduce(sleep);

            } catch (Exception e) {
                return;
            }
        }
    }

    private void doProduce(final long sleep) throws InterruptedException {
        produce();
    }

    public void produce() throws InterruptedException {
        queue.put(new Long(System.nanoTime())); // new Long() is faster than valueOf()
    }

    public long consume() throws InterruptedException {
        long t = queue.take();
        long now = System.nanoTime();
        return now-t;
    }

    public static void main(String[] args) throws Throwable {           
        QueueTest test = new QueueTest();
        System.out.println("Starting + warming up...");
        // Run first once, ignoring results
        test.sleep = 0;
        test.start(15000);//10k is the normal warm-up for -server hotspot
        // Run again, printing the results
        System.gc();
        System.out.println("Starting again...");
        test.sleep = 1000;//ignored now
        Thread.yield();
        test.start(RUNS);
        long sum = 0;
        for (long elapsed: test.results){
            sum+=elapsed;
        }
        BigDecimal elapsed = BigDecimal.valueOf(sum, 3).divide(BigDecimal.valueOf(test.results.length), BigDecimal.ROUND_HALF_UP);        
        System.out.printf("Avg: %1.3f micros%n", elapsed); 
    }
}
bestsss

If latency is critical and you do not require strict FIFO semantics, then you may want to consider JSR-166's LinkedTransferQueue. It enables elimination so that opposing operations can exchange values instead of synchronizing on the queue data structure. This approach helps reduce contention, enables parallel exchanges, and avoids thread sleep/wake-up penalties.
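A sketch of such a handoff (LinkedTransferQueue was later folded into java.util.concurrent in Java 7; the class below is illustrative, not code from the question):

```java
import java.util.concurrent.LinkedTransferQueue;

// Sketch of a direct handoff with LinkedTransferQueue: transfer() blocks
// the producer until a consumer has received the element, so a waiting
// consumer gets the value handed to it rather than the element resting
// in the queue.
public class TransferDemo {

    public static long handoffOnce() throws InterruptedException {
        final LinkedTransferQueue<Long> queue = new LinkedTransferQueue<Long>();
        final long[] latencyMicros = new long[1];
        Thread consumer = new Thread(new Runnable() {
            public void run() {
                try {
                    long t = queue.take();
                    latencyMicros[0] = (System.nanoTime() - t) / 1000;
                } catch (InterruptedException ignored) { }
            }
        });
        consumer.start();
        queue.transfer(System.nanoTime()); // blocks until the consumer takes it
        consumer.join(); // join() also makes the consumer's write visible here
        return latencyMicros[0];
    }
}
```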

Ben Manes
  • Thank you, I will look into LinkedTransferQueue and see if it's suitable for my application. – Johan Jan 04 '11 at 12:48