
I was trying to write a parallel map extension function to perform a map operation over a List in parallel using coroutines. However, there is significant overhead in my solution and I can't figure out why.

This is my implementation of the pmap extension function:

fun <T, U> List<T>.pmap(scope: CoroutineScope = GlobalScope,
                    transform: suspend (T) -> U): List<U> {
    return map { i -> scope.async { transform(i) } }.map { runBlocking { it.await() } }
}

However, compared to doing the exact same operation directly (not through the extension function), pmap takes up to an extra 100ms, which is a lot. I tried using inline but it had no effect.

Here is the full test I wrote to demonstrate this behavior:

import kotlinx.coroutines.*
import kotlin.system.measureTimeMillis

fun main() {
    test()
}

fun <T, U> List<T>.pmap(scope: CoroutineScope = GlobalScope,
                    transform: suspend (T) -> U): List<U> {
    return this.map { i -> scope.async { transform(i) } }.map { runBlocking { it.await() } }
}

fun test() {
    val list = listOf<Long>(100,200,300)

    val transform: suspend (Long) -> Long = { long: Long ->
        delay(long)
        long*2
    }

    val timeTakenPmap = measureTimeMillis {
        list.pmap(GlobalScope) { transform(it) }
    }

    val manualpmap = measureTimeMillis {
        list.map { GlobalScope.async { transform(it) } }
            .map { runBlocking { it.await() } }
    }

    val timeTakenMap = measureTimeMillis {
        list.map { runBlocking { transform(it) } }
    }

    println("pmapTime: $timeTakenPmap - mapTime: $timeTakenMap - manualpmap: $manualpmap")
}

It can be run in the Kotlin Playground: https://pl.kotl.in/CIXVqezg3

In the playground it prints this result: pmapTime: 411 - mapTime: 602 - manualpmap: 302

mapTime and manualpmap give reasonable results, with only ~2ms of overhead beyond the delays. But pmapTime is way off, even though the code in pmap looks exactly the same to me as in manualpmap.

On my own machine it runs a little faster; pmap takes around 350ms.

Does anyone know why this happens?

Fedelway
    This is not a valid way to benchmark code. You're ignoring warmup time, for one thing. And the first time you start creating coroutines, threads have to be created. Subsequent coroutines can reuse thread instances from the pools. And your job size of only three items is insignificantly small. Look into benchmarking libraries. – Tenfour04 Dec 02 '21 at 14:08
  • Oh, thanks. I totally forgot about the thread creation time. I'm completely sure that's the reason. Thanks. – Fedelway Dec 02 '21 at 14:38

1 Answer


First of all, manual benchmarks like this are usually of very little significance. There are many things that can be optimized away by the compiler or the JIT and any conclusion can be quite wrong. If you really want to compare things, you should instead use benchmarking libraries which take into account JVM warmup etc.
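
For example, here is a minimal JMH skeleton (a sketch only: it assumes JMH and kotlinx-coroutines are on the classpath, and the PmapBenchmark class and its workload are made up for illustration):

import kotlinx.coroutines.*
import org.openjdk.jmh.annotations.*
import java.util.concurrent.TimeUnit

// JMH runs separate warmup and measurement iterations, so one-time costs such as
// thread creation and JIT compilation are not charged to a single invocation.
@State(Scope.Benchmark)
@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.MILLISECONDS)
@Warmup(iterations = 5)
@Measurement(iterations = 10)
open class PmapBenchmark {
    private val list = listOf<Long>(100, 200, 300)

    @Benchmark
    fun pmap(): List<Long> = runBlocking {
        // Measures whichever pmap variant is under test (e.g. the suspend version further down).
        list.pmap { delay(it); it * 2 }
    }
}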

Now, the overhead you see (if you could confirm there was an actual overhead) might be caused by the fact that your higher-order extension is not marked inline, so instances of the lambda you pass need to be created - but as @Tenfour04 noted there are many other possible reasons: thread pool lazy initialization, significance of the list size, etc.
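
As a quick sanity check (not a real benchmark), a throwaway warm-up round before the timed sections should remove most of that one-time cost. The warmUp helper below is hypothetical, not part of the original test:

import kotlinx.coroutines.*

// Hypothetical warm-up: forces the default dispatcher's worker threads to be created
// (and lets the JIT see the coroutine machinery) before anything is timed.
fun warmUp() = runBlocking {
    List(8) { GlobalScope.async { delay(1) } }.awaitAll()
}

Calling warmUp() at the start of test() should bring the first measured block much closer to the others.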

That being said, this is really not an appropriate way to write parallel map, for several reasons:

  • GlobalScope is a pretty bad default in general, and should be used in very specific situations only. But don't worry about it because of the next point.
  • You don't need an externally provided CoroutineScope if the coroutines you launch do not outlive your method. Instead, make your function suspend and use coroutineScope { ... }; the caller can then choose the context if they need to.
  • map { it.await() } is inefficient in case of errors: if the last element's transformation fails immediately, map will still wait for all previous elements to finish before failing. Prefer awaitAll, which takes care of this (see the short sketch after this list).
  • runBlocking should be avoided inside coroutines (and blocking threads is risky in general, especially when you don't control which thread you're blocking), so using it in deep, library-like functions such as this one is dangerous: they will likely be called from coroutines at some point.
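
Here is a minimal sketch of the awaitAll point (the delays and the supervisorScope setup are made up for illustration; supervisorScope just keeps the failing child from cancelling the waiter itself):

import kotlinx.coroutines.*
import kotlin.system.measureTimeMillis

fun main() = runBlocking {
    supervisorScope {
        val deferreds = listOf(1000L, 500L, 0L).map { ms ->
            async {
                delay(ms)
                if (ms == 0L) error("boom") else ms
            }
        }
        val elapsed = measureTimeMillis {
            runCatching { deferreds.awaitAll() }            // fails after ~0 ms
            // runCatching { deferreds.map { it.await() } } // would fail only after ~1000 ms
        }
        println("failure observed after ~$elapsed ms")
        deferreds.forEach { it.cancel() } // don't keep waiting for the remaining children
    }
}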

Applying those points gives:

suspend fun <T, U> List<T>.pmap(transform: suspend (T) -> U): List<U> {
    return coroutineScope {
        map { async { transform(it) } }.awaitAll()
    }
}
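
For completeness, a sketch of how the test from the question could call this suspend version (assuming the pmap above is in the same file; the timing caveats from the beginning of this answer still apply):

import kotlinx.coroutines.*
import kotlin.system.measureTimeMillis

fun main() = runBlocking {
    val list = listOf<Long>(100, 200, 300)
    val time = measureTimeMillis {
        // The three transformations run concurrently, so this completes in roughly
        // the time of the longest delay (~300 ms) rather than the sum of all three.
        val result = list.pmap { delay(it); it * 2 }
        println(result) // [200, 400, 600]
    }
    println("took ~$time ms")
}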
Joffrey