
I have written three simple programs to test the performance advantage of coroutines over threads. Each program does a lot of common simple computations. All programs were run separately from each other. Besides execution time, I measured CPU usage with the VisualVM IDE plugin.

  1. The first program does all computations on a 1000-thread pool. This code shows the worst result (64326 ms) compared to the others because of frequent context switches:

    val executor = Executors.newFixedThreadPool(1000)
    time = generateSequence {
      measureTimeMillis {
        val comps = mutableListOf<Future<Int>>()
        for (i in 1..1_000_000) {
          comps += executor.submit<Int> { computation2(); 15 }
        }
        comps.map { it.get() }.sum()
      }
    }.take(100).sum()
    println("Completed in $time ms")
    executor.shutdownNow()
    

[VisualVM screenshot: first program]

  2. The second program has the same logic, but instead of a 1000-thread pool it uses an n-thread pool (where n equals the number of cores on the machine). It shows a much better result (43939 ms) and uses fewer threads, which is good too.

    val executor2 = Executors.newFixedThreadPool(4)
    time = generateSequence {
      measureTimeMillis {
        val comps = mutableListOf<Future<Int>>()
        for (i in 1..1_000_000) {
          comps += executor2.submit<Int> { computation2(); 15 }
        }
        comps.map { it.get() }.sum()
      }
    }.take(100).sum()
    println("Completed in $time ms")
    executor2.shutdownNow()
    

[VisualVM screenshot: second program]

  3. The third program is written with coroutines and shows a large variance in the results (from 41784 ms to 81101 ms). I am very confused and don't quite understand why they differ so much, and why coroutines are sometimes slower than threads (considering that small async computations are supposed to be a forte of coroutines). Here is the code:

    time = generateSequence {
      runBlocking {
        measureTimeMillis {
          val comps = mutableListOf<Deferred<Int>>()
          for (i in 1..1_000_000) {
            comps += async { computation2(); 15 }
          }
          comps.map { it.await() }.sum()
        }
      }
    }.take(100).sum()
    println("Completed in $time ms")
    

[VisualVM screenshot: third program]

I have actually read a lot about coroutines and how they are implemented in Kotlin, but in practice I don't see them working as intended. Am I doing my benchmarking wrong? Or maybe I'm using coroutines wrong?

Praytic

  • You are using the default coroutine dispatcher (which is the `CommonPool`) in your coroutine example. Try using the same kind of threadpool as you use in your other tests. – marstran Jan 05 '18 at 08:46
  • Please publish the code of `computation2()`. The results kinda depend on what you are doing, to put it mildly – voddan Jan 05 '18 at 10:41
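Following marstran's suggestion, the coroutine benchmark can be pinned to the same kind of thread pool as the executor tests. A minimal sketch, assuming kotlinx.coroutines is on the classpath (since the code of `computation2()` wasn't published, each task here just returns 15):

```kotlin
import kotlinx.coroutines.*
import java.util.concurrent.Executors
import kotlin.system.measureTimeMillis

fun main() {
    // Back the coroutines with an explicit 4-thread pool,
    // matching the executor2 benchmark above.
    val pool = Executors.newFixedThreadPool(4)
    val dispatcher = pool.asCoroutineDispatcher()
    val time = measureTimeMillis {
        runBlocking {
            val comps = mutableListOf<Deferred<Int>>()
            for (i in 1..1_000_000) {
                // async(dispatcher) schedules each task onto the
                // 4-thread pool instead of the default dispatcher.
                comps += async(dispatcher) { 15 }
            }
            comps.map { it.await() }.sum()
        }
    }
    println("Completed in $time ms")
    pool.shutdown()
}
```

This isolates the dispatcher as the variable under test, so the coroutine run and the executor run compete on the same threads.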

3 Answers


The way you've set up your problem, you shouldn't expect any benefit from coroutines. In all cases you submit a non-divisible block of computation to an executor. You are not leveraging the idea of coroutine suspension, where you can write sequential code that actually gets chopped up and executed piecewise, possibly on different threads.

Most use cases of coroutines revolve around blocking code: avoiding the scenario where you hog a thread to do nothing but wait for a response. They may also be used to interleave CPU-intensive tasks, but this is a more special-cased scenario.

I would suggest benchmarking 1,000,000 tasks that involve several sequential blocking steps, like in Roman Elizarov's KotlinConf 2017 talk:

suspend fun postItem(item: Item) {
    val token = requestToken()
    val post = createPost(token, item)
    processPost(post)
}

where all of requestToken(), createPost() and processPost() involve network calls.

If you have two implementations of this, one with suspend funs and another with regular blocking functions, for example:

fun requestToken(): String {
    Thread.sleep(1000)
    return "token"
}

vs.

suspend fun requestToken(): String {
    delay(1000)
    return "token"
}

you'll find that you can't even set up 1,000,000 concurrent invocations of the first version, and if you lower the number to what you can actually achieve without `OutOfMemoryError: unable to create new native thread`, the performance advantage of coroutines should be evident.
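The contrast can be sketched like this, assuming kotlinx.coroutines (the thread-per-task variant is only described in a comment, since actually running it would crash):

```kotlin
import kotlinx.coroutines.*
import kotlin.system.measureTimeMillis

fun main() = runBlocking {
    val time = measureTimeMillis {
        // 1,000,000 coroutines, each suspending for a second: feasible,
        // because a suspended coroutine holds no thread, only a small
        // heap object recording where to resume.
        val jobs = List(1_000_000) {
            launch { delay(1000) }
        }
        jobs.forEach { it.join() }
    }
    println("1,000,000 suspending tasks completed in $time ms")

    // The blocking equivalent -- one Thread per Thread.sleep(1000) --
    // typically dies with OutOfMemoryError: unable to create new native thread.
}
```

All million tasks overlap their waits, so the wall-clock time is roughly one second plus launch overhead, on a handful of threads.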

If you want to explore possible advantages of coroutines for CPU-bound tasks, you need a use case where it's not irrelevant whether you execute them sequentially or in parallel. In your examples above, this is treated as an irrelevant internal detail: in one version you run 1,000 concurrent tasks and in the other one you use just four, so it's almost sequential execution.

Hazelcast Jet is an example of such a use case because its computation tasks are co-dependent: one's output is another one's input. In this case you can't just run a few of them to completion on a small thread pool; you actually have to interleave them so the buffered output doesn't explode. If you try to set up such a scenario with and without coroutines, you'll once again find that you're either allocating as many threads as there are tasks, or you're using suspendable coroutines, and the latter approach wins. Hazelcast Jet implements the spirit of coroutines in a plain Java API. Its approach would hugely benefit from the coroutine programming model, but currently it's pure Java.
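The interleaving idea can be illustrated with a bounded channel. A minimal sketch, assuming kotlinx.coroutines (the stage bodies are made trivial for illustration): the producer suspends whenever the downstream buffer is full, so neither stage needs a dedicated thread.

```kotlin
import kotlinx.coroutines.*
import kotlinx.coroutines.channels.Channel

fun main() = runBlocking {
    // One stage's output is the next stage's input;
    // the capacity bounds the in-flight buffer.
    val channel = Channel<Int>(capacity = 16)
    val producer = launch {
        for (i in 1..100) {
            channel.send(i * i)   // suspends when 16 items are buffered
        }
        channel.close()
    }
    var sum = 0
    for (x in channel) sum += x   // consumer drains the channel
    producer.join()
    println("sum = $sum")         // prints sum = 338350
}
```

Both stages share one thread here, yet the producer can never run unboundedly ahead of the consumer, which is exactly the back-pressure property described above.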

Disclosure: the author of this post belongs to the Jet engineering team.

Marko Topolnik
  • Isn't `Thread.sleep(1000)` a better example of a network call? If I am right, a network call is always blocking a thread. Suspension is just a fancy way to offload a task to another thread while freeing the calling thread. Am I right in my understanding? – Mangat Rai Modi Jan 02 '23 at 23:26
  • I understand the idea of the divisible work, but does that mean coroutines only provide gains when working with libs that provide the locations from where code can be suspended? In any other lib, the task is going to be one big blocking call, which is the case with most of the libraries in the ecosystem. – Mangat Rai Modi Jan 02 '23 at 23:31
  • @MangatRaiModi That depends on what you're using Kotlin for. On Android, all the commonly used libraries are suspendable. In the enterprise ecosystem I guess things are different. Network calls aren't fundamentally blocking (actually, on the low level they are fundamentally non-blocking), and the JDK has long had non-blocking networking in NIO, used mostly through the Netty library. – Marko Topolnik Jan 03 '23 at 08:03
  • Thanks a lot. If I use Ktor, then I am golden. I wish more of the community adopted Kotlin. – Mangat Rai Modi Jan 03 '23 at 08:41
  • Again, If I use Netty, Kotlin still doesn't know where to suspend? Am I right? Unless I use some Kotlin implementation. – Mangat Rai Modi Jan 03 '23 at 08:42
  • Yes, you'd need a Kotlin wrapper over Netty or, better, a higher-level library that uses Netty internally -- just like Ktor. – Marko Topolnik Jan 03 '23 at 09:00
  • That will give us further interleaving. Although I wonder if that would lead to better throughput than simply using a threadpool. – Mangat Rai Modi Jan 03 '23 at 12:12

Coroutines are not designed to be faster than threads; they are designed for lower RAM consumption and better syntax for asynchronous calls.

garywzh

  • but coroutines aren't designed to be slower than threads, and the fact that coroutines are designed to be more lightweight than threads should qualify them for being faster too - although that particular benchmark shows they aren't necessarily – msrd0 Jan 05 '18 at 09:07
  • No one said that "coroutines are designed to be slower than threads". It is just a side effect. "Lightweight" does not mean "it should qualify them for being faster"; "lightweight" means it uses less memory. – garywzh Jan 05 '18 at 10:21
  • Lightweight can apply to memory or to CPU or to both - and I don't say that lightweight always means being faster, but I often see that as a side effect – msrd0 Jan 05 '18 at 10:30
  • And still you got no explanation why coroutines are slower than threads - you're just telling us not to expect that – msrd0 Jan 05 '18 at 10:31
  • Just as a side note, coroutines actually use threads, but they are set up in such a way that you can spread a workload over multiple threads while still being thread-safe, because coroutines can wait for other coroutines to finish without blocking the thread they are running on. So they are neither slower nor faster than threads; some workloads will benefit a lot from this concept, which makes them faster, while other workloads are slower because of the inherent overhead. – Mihai Jun 18 '18 at 12:05
  • I think the reason it could get slightly slower is the additional user-level machinery needed to support the implementation of coroutines: more instructions, more CPU cycles – Liu Dec 14 '22 at 11:47

Coroutines are designed to be lightweight threads. They use less RAM because executing 1,000,000 concurrent routines doesn't require creating 1,000,000 threads. Coroutines can help you optimize thread usage and make execution more efficient, so you don't need to care about threads anymore. You can think of a coroutine as a runnable or task, which you can post to a handler to be executed on a thread or thread pool.
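That analogy can be made concrete. A minimal sketch, assuming kotlinx.coroutines: handing a coroutine to a dispatcher looks much like submitting a Runnable to an executor, except the coroutine may additionally suspend without blocking a pool thread.

```kotlin
import kotlinx.coroutines.*
import java.util.concurrent.Executors

fun main() {
    val pool = Executors.newFixedThreadPool(2)

    // Posting a plain task to the pool:
    pool.submit { println("runnable on " + Thread.currentThread().name) }

    runBlocking {
        // Posting a coroutine onto the same pool; unlike the runnable,
        // its body could also call delay() or await() and suspend
        // without occupying a pool thread while it waits.
        val dispatcher = pool.asCoroutineDispatcher()
        launch(dispatcher) {
            println("coroutine on " + Thread.currentThread().name)
        }.join()
    }
    pool.shutdown()
}
```

Both messages print from the same pool's threads; the difference is only in what the task is allowed to do while it runs.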

Weidian Huang