How to correctly join all jobs launched in a CoroutineScope

Question

I'm refactoring some Kotlin code that currently launches coroutines on GlobalScope so that it uses structured concurrency instead. I need to join all of the jobs launched by my code before the JVM exits. My class can be reduced to the following interface:

interface AsyncTasker {
    fun spawnJob(arg: Long)
    suspend fun joinAll()
}

Usage:

fun main(args: Array<String>) {
    val asyncTasker = createAsyncTasker()

    asyncTasker.spawnJob(100)
    asyncTasker.spawnJob(200)
    asyncTasker.spawnJob(300)
    asyncTasker.spawnJob(500)

    // join all jobs as they'd be killed when the JVM exits
    runBlocking {
        asyncTasker.joinAll()
    }
}

My GlobalScope-based implementation looks as follows:

class GlobalScopeAsyncTasker : AsyncTasker {
    private val pendingJobs = mutableSetOf<Job>()

    override fun spawnJob(arg: Long) {
        var job: Job? = null
        job = GlobalScope.launch(Dispatchers.IO) {
            someSuspendFun(arg)
            pendingJobs.remove(job)
        }
        pendingJobs.add(job)
    }

    override suspend fun joinAll() {
        // iterate over a copy of the set as the
        // jobs remove themselves from the set when we join them
        pendingJobs.toSet().joinAll()
    }
}

Clearly, this is not ideal: keeping track of every pending job isn't very elegant and is a remnant of old thread-based coding paradigms.

As a better approach, I'm creating my own CoroutineScope, backed by a SupervisorJob, and using it to launch all children.

class StructuredConcurrencyAsyncTasker : AsyncTasker {

    private val parentJob = SupervisorJob()
    private val scope = CoroutineScope(Dispatchers.IO + parentJob)

    override fun spawnJob(arg: Long) {
        scope.launch {
            someSuspendFun(arg)
        }
    }

    override suspend fun joinAll() {
        parentJob.complete() // <-- why is this needed??
        parentJob.join()
    }
}

When initially developing this solution, I omitted the call to parentJob.complete(), which caused join() to suspend indefinitely. This feels very unintuitive, so I'm looking for confirmation on whether this is the correct way to solve this kind of problem. Why do I have to manually complete() the parent job? Is there an even simpler way to solve this?

Kotlin playground with the code

Does this answer your question? [How to join a Kotlin SupervisorJob](https://stackoverflow.com/questions/53916377/how-to-join-a-kotlin-supervisorjob) — Roland, Feb 02 '21 at 16:01
if you do not say explicitly that the parent job or its children are complete, it will run forever... i.e. if you just call `join` on the parent, it will wait until all the children coroutines are completed (that's also stated in the documentation). The launched coroutine jobs however are still active (or at least not completed) and that is why the parent job hangs there... However I do not know why this was designed this way... — Roland, Feb 02 '21 at 16:06
@Roland from the `Job` documentation: "Coroutine job is created with launch coroutine builder. It runs a specified block of code and completes on completion of this block." Since `someSuspendFun` has already returned when I join, the child jobs are completed. In fact, they are not even in the `parentJob`'s sequence of `children` anymore. So if I understand correctly, `join` on the parent job simply hangs because it waits for the parent job itself to be completed, which I have to initiate manually? Weird design, but I guess it makes sense. — CrushedPixel, Feb 02 '21 at 17:25
I figured out why this behaviour makes sense. See my own answer to this question. Thanks for your input, @Roland! — CrushedPixel, Feb 02 '21 at 17:36

score 2 · Answer 1 · answered Feb 03 '21 at 16:59

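I wonder whether this behaviour will change in the future. For now, the answer in the linked question still holds: parentJob.join() on its own does not return once the children have finished, because a job created with SupervisorJob() stays active until it is explicitly completed or cancelled. For me, the following part of the Job#join() documentation was the reason to dig deeper: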

Note that the job becomes complete only when all its children are complete.

Note that the launched coroutine jobs may still be in a state other than completed. You can verify this with something like parentJob.children.forEach { println(it) } (or inspect whatever information you need in a debugger ;-)) right before your parentJob.join() statement.
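As a rough sketch of that kind of inspection, the snippet below launches a few short children under a fresh SupervisorJob, waits for them, and then looks at the parent. The helper names `describe` and `inspectAfterChildrenFinish`, the use of Dispatchers.Default, and the 50 ms delay are assumptions made for illustration and are not taken from the question.

```kotlin
import kotlinx.coroutines.*

// Hypothetical helper: summarizes a job's state flags for printing.
fun describe(job: Job): String =
    "active=${job.isActive}, completed=${job.isCompleted}, cancelled=${job.isCancelled}"

// Launches `count` short children under a fresh SupervisorJob, waits for them,
// and returns (number of children still attached, parent state summary).
fun inspectAfterChildrenFinish(count: Int): Pair<Int, String> = runBlocking {
    val parentJob = SupervisorJob()
    val scope = CoroutineScope(Dispatchers.Default + parentJob)

    val children = List(count) { scope.launch { delay(50) } }
    children.joinAll() // every launched child has completed at this point

    val remaining = parentJob.children.count()
    val state = describe(parentJob)
    parentJob.complete() // let the parent finish so nothing is left dangling
    remaining to state
}

fun main() {
    val (remaining, state) = inspectAfterChildrenFinish(3)
    println("children still attached: $remaining")
    println("parent: $state")
}
```

Even with every child finished, the parent still reports itself as active, which is the state that a bare parentJob.join() keeps waiting on.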

There are (at least?) two ways to ensure that all launched children coroutine jobs are completed, so that it doesn't hang at that point or complete too early:

1. Wait for all child jobs to complete first (as also stated in the linked answer in the comments), i.e.:
```
parentJob.children.forEach { it.join() }
```
This doesn't require an additional parentJob.join() or parentJob.complete() and is therefore probably preferable. Note that it only waits for the children attached at the time of the call, and that parentJob itself remains active afterwards.
2. Call complete() before calling join(), i.e.:
```
parentJob.complete()
parentJob.join()
```
Calling complete() here only moves the job to the completing state while children are still running, as stated in the Job documentation. In that state it waits for its children to complete as well. If you call complete() without join(), the program will probably exit before your launched coroutines have finished, possibly before they even start. If you only call join(), it suspends indefinitely, as you already experienced.
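The two options above can be compared side by side in a minimal sketch. The function names `joinChildren`, `completeAndJoin` and `runDemo`, the Dispatchers.Default dispatcher and the delays are hypothetical choices for this example, not part of the original code.

```kotlin
import kotlinx.coroutines.*
import java.util.concurrent.atomic.AtomicInteger

// Option 1: join the children that are currently attached to the parent.
// The parent stays active, so the scope can still launch more work afterwards.
suspend fun joinChildren(parentJob: Job) {
    parentJob.children.forEach { it.join() }
}

// Option 2: mark the parent as finished, then wait for it and its children.
// After this, the scope can no longer be used to run new coroutines.
suspend fun completeAndJoin(parentJob: CompletableJob) {
    parentJob.complete()
    parentJob.join()
}

// Hypothetical driver: runs `count` children and joins them with the chosen option.
// Returns (number of children that finished, whether the parent is still active).
fun runDemo(count: Int, useComplete: Boolean): Pair<Int, Boolean> = runBlocking {
    val parentJob = SupervisorJob()
    val scope = CoroutineScope(Dispatchers.Default + parentJob)
    val finished = AtomicInteger()

    repeat(count) { i ->
        scope.launch {
            delay(20L * (i + 1))
            finished.incrementAndGet()
        }
    }

    if (useComplete) completeAndJoin(parentJob) else joinChildren(parentJob)
    val stillActive = parentJob.isActive
    parentJob.complete() // no effect after option 2; releases the parent after option 1
    finished.get() to stillActive
}

fun main() {
    println("join children:     ${runDemo(3, useComplete = false)}")
    println("complete and join: ${runDemo(3, useComplete = true)}")
}
```

Both variants wait for all launched work. The difference is whether the parent job remains usable afterwards, which is likely the deciding factor when choosing between them.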

score 1 · Answer 2 · answered Feb 02 '21 at 17:35

From the documentation of Job#join():

This invocation resumes [...] when the job is complete for any reason

Since I've never marked the parent job as Completed, join never returns, even if all of the job's children are Completed.

This makes sense considering that a job can't ever switch state from Completed back to Active, so if it automatically switched state to Completed when all children are Completed, it wouldn't be possible to add more child jobs at a later point in time.
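A small sketch of that reasoning, assuming a SupervisorJob-backed scope like the one in the question: the parent remains active after a child finishes, so a later child can still be attached, and it only reaches its final state after complete(). The function name `observeParentLifecycle`, the dispatcher and the delays are illustrative assumptions.

```kotlin
import kotlinx.coroutines.*

// Returns the parent's isActive flag observed at three points: after a first child
// finished, after a later child finished, and after complete() followed by join().
fun observeParentLifecycle(): List<Boolean> = runBlocking {
    val parentJob = SupervisorJob()
    val scope = CoroutineScope(Dispatchers.Default + parentJob)
    val observed = mutableListOf<Boolean>()

    scope.launch { delay(20) }.join()
    observed += parentJob.isActive // first child is done, parent is still active

    scope.launch { delay(20) }.join() // a later child can still be added and joined
    observed += parentJob.isActive

    parentJob.complete()
    parentJob.join() // returns because the parent has now been told to finish
    observed += parentJob.isActive // the parent has reached its final state
    observed
}

fun main() {
    println(observeParentLifecycle())
}
```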

Thanks to Roland for pointing me in the right direction.

CrushedPixel

I wanted to comment, but it got longer than expected, so I added an answer. You may want to verify the actual state of the jobs before calling `parentJob.join()`: the state may not have been `completed` yet, which is why it got stuck, as it needs to wait for the job to complete. Maybe it's a race condition that will be fixed in the future? Or maybe the supervisor job will join all its children in the future? – Roland Feb 03 '21 at 17:01
