
Based on the following it looks like coroutines in C++20 will be stackless.

https://en.cppreference.com/w/cpp/language/coroutines

I'm concerned for many reasons:

  1. On embedded systems heap allocation is often not acceptable.
  2. When in low-level code, nesting of co_await would be useful (I don't believe stackless coroutines allow this).

With a stackless coroutine, only the top-level routine may be suspended. Any routine called by that top-level routine may not itself suspend. This prohibits providing suspend/resume operations in routines within a general-purpose library.

https://www.boost.org/doc/libs/1_57_0/libs/coroutine/doc/html/coroutine/intro.html#coroutine.intro.stackfulness

  3. More verbose code because of the need for custom allocators and memory pooling.

  4. Slower if the task waits for the operating system to allocate it some memory (without memory pooling).

Given these reasons, I'm really hoping I'm very wrong about what the current coroutines are.

The question has three parts:

  1. Why would C++ choose to use stackless coroutines?
  2. Regarding allocations to save state in stackless coroutines: can I use alloca() to avoid any heap allocations that would normally be used for the coroutine creation?

coroutine state is allocated on the heap via non-array operator new. https://en.cppreference.com/w/cpp/language/coroutines

  3. Are my assumptions about the C++ coroutines wrong? If so, why?

EDIT:

I'm going through the CppCon talks on coroutines now; if I find any answers to my own question I'll post them (nothing so far).

CppCon 2014: Gor Nishanov "await 2.0: Stackless Resumable Functions"

https://www.youtube.com/watch?v=KUhSjfSbINE

CppCon 2016: James McNellis "Introduction to C++ Coroutines"

https://www.youtube.com/watch?v=ZTqHjjm86Bw

David Ledger
  • Have a look at the coroutine papers in the post-Kona mailing of this year. They explain the rationale for the current design and its drawbacks. – Rakete1111 Jul 23 '19 at 11:57
  • This one? http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2019/p1493r0.pdf – David Ledger Jul 23 '19 at 11:59
  • Stackful coroutines means "I allocate an entire thread-style stack", not "I use the caller's stack space". You are confusing two separate issues: stackful vs. stackless, and the ability to store coroutine state in automatic storage. The degree to which you are confusing it makes the question incoherent, as most stackful coroutines cannot live in someone else's stack. Meanwhile, for stackless, living in someone else's automatic storage is plausible. – Yakk - Adam Nevraumont Jul 23 '19 at 12:05
  • The section "Embedded (no-alloc) generators" looks to me, naively, like it might be of practical interest – jwimberley Jul 23 '19 at 12:17
  • After reviewing all the information related to coroutines, my take-away is that whether they're stackless or not, coroutines are a solution in search of a problem. – Sam Varshavchik Jul 23 '19 at 12:20
  • @jwimberley That looked good to me too, looks like it won't make it to C++20. :| – David Ledger Jul 23 '19 at 12:29
  • Looking at their pros and cons lists, it looks a bit one-sided. – David Ledger Jul 23 '19 at 12:33
  • @Yakk-AdamNevraumont Hmm, still not fully understanding what you mean. I'm confused because cppreference says "coroutine state is allocated on the heap via non-array operator new", which is pretty clearly on the heap. If I overloaded the allocator, could I put it on the stack with alloca? – David Ledger Jul 23 '19 at 12:56
  • @DavidLedger So, imagine someone complaining about gun control. And their complaint mixes "people who cannot control their shooting" with "regulations to control who can own guns". You are mixing **two different things** using the same term (stackless). It is true that your two different things are both valid issues we could discuss, but when you use **one term** to refer to both and don't seem to understand they are **two issues** it is really hard to communicate about it. – Yakk - Adam Nevraumont Jul 23 '19 at 13:24
  • What's more, the two different "stackful" issues you are talking about are **opposed** to each other. An on-stack coroutine (one stored in the automatic storage of the creator) is **not going to be stackful**, because there generally isn't room for the coroutine to have its own stack. **stackful coroutines** means **the coroutine has a stack**. Almost any coroutine implementation that lives in automatic storage of its creator (an on-stack coroutine) is going to be **stackless**. – Yakk - Adam Nevraumont Jul 23 '19 at 13:26
  • @Yakk-AdamNevraumont Ok thanks, I'm doing some research now to understand the differences. I'll improve the wording when I understand a bit more. Appreciate the clarification; I think my questions still stand (although my wording may need work). – David Ledger Jul 23 '19 at 13:26
  • I say "generally isn't going to be stackful" because I've seen setjmp/longjmp coroutines that divided the parent stack up into pieces and shared it. But that is a horrid hack that doesn't really save any resources and made other problems; it just was a way to hack coroutines into a language that didn't support them. – Yakk - Adam Nevraumont Jul 23 '19 at 13:28
  • I think Golang does that. The talk "await 2.0: Stackless Resumable Functions" sort of goes into that at the beginning. – David Ledger Jul 23 '19 at 13:31
  • @Yakk - Adam Nevraumont I have improved part 2 of the question. – David Ledger Jul 23 '19 at 13:35
  • Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/196874/discussion-between-david-ledger-and-yakk-adam-nevraumont). – David Ledger Jul 23 '19 at 14:01
  • Maybe this has the info you are looking for: CppCon 2016: Gor Nishanov "C++ Coroutines: Under the covers" https://www.youtube.com/watch?v=8C8NnE1Dg4A – Robert Andrzejuk Jul 23 '19 at 14:07

3 Answers


I use stackless coroutines on small, hard-realtime ARM Cortex-M0 targets with 32 KB of RAM, where there's no heap allocator present at all: all memory is statically preallocated. Stackless coroutines are make-or-break for me, and the stackful coroutines I had previously used were a pain to get right, essentially a hack wholly based on implementation-specific behavior. Going from that mess to standards-compliant, portable C++ was wonderful. I shudder to think that someone might suggest going back.

  • Stackless coroutines don't imply heap use: you have full control over how the coroutine frame is allocated (via a void* operator new(size_t) member in the promise type).

  • co_await can be nested just fine, in fact it's a common use case.

  • Stackful coroutines have to allocate those stacks somewhere as well, and it's perhaps ironic that they can't use the thread's primary stack for that. These stacks are allocated on the heap, perhaps via a pool allocator that gets a block from heap and then subdivides it.

  • Stackless coroutine implementations can elide frame allocation, such that the promise's operator new is not called at all, whereas stackful coroutines always allocate the stack for the coroutine, whether needed or not, because the compiler can't help the coroutine runtime with eliding it (at least not in C/C++).

  • The allocations can be elided precisely by using the stack where the compiler can prove that the life of the coroutine doesn't leave the scope of the caller. And that's the only way you can use alloca. So, the compiler already takes care of it for you. How cool is that!

    Now, there's no requirement that the compilers actually do this elision, but AFAIK all implementations out there do this, with some sane limits on how complex that "proof" can be - in some cases it's not a decidable problem (IIRC). Plus, it's easy to check whether the compiler did as you expected: if you know that all coroutines with a particular promise type are nested-only (reasonable in small embedded projects but not only!), you can declare operator new in the promise type but not define it, and then the code won't link if the compiler "goofed up".

    A pragma could be added to a particular compiler implementation to declare that a particular coroutine frame doesn't escape even if the compiler isn't clever enough to prove it - I didn't check if anyone bothered to write these yet, because my use cases are reasonable enough that the compiler always does the right thing.

    Memory allocated with alloca cannot be used after the function that called alloca returns. The use case for alloca, in practice, is to be a slightly more portable way of expressing gcc's variable-size automatic array extension.

In essentially all implementations of stackful coroutines in C-like languages, the one and only supposed "benefit" of stackfulness is that the frame is accessed using the usual base-pointer-relative addressing, with push and pop where appropriate, so "plain" C code can run on this made-up stack with no changes to the code generator. No benchmarks support this mode of thinking, though, if you have lots of coroutines active - it's a fine strategy if there's a limited number of them, and you have the memory to waste to start with.

The stack has to be overallocated, decreasing locality of reference: a typical stackful coroutine uses a full page for the stack at the minimum, and the cost of making this page available is not shared with anything else: the single coroutine has to bear it all. That's why it was worthwhile to develop Stackless Python for multiplayer game servers.

If there are only a couple of coroutines - no problem. If you've got thousands of network requests all handled by stackful coroutines, with a light networking stack that doesn't impose overhead that monopolizes the performance, the performance counters for cache misses will make you cry. As Nicol has stated in the other answer, this becomes somewhat less relevant the more layers there are between the coroutine and whatever asynchronous operation it's handling.

It has been long since any 32+-bit CPU had performance benefits inherent to memory access via any particular addressing mode. What matters is cache-friendly access patterns and leveraging prefetch, branch prediction and speculative execution. Paged memory and its backing store are just two further levels of cache (L4 and L5 on desktop CPUs).

  1. Why would C++ choose to use stackless coroutines? Because they perform better, and no worse. On the performance side, there can be only benefits to them. So it's a no-brainer, performance-wise, to just use them.

  2. Can I use alloca() to avoid any heap allocations that would normally be used for the coroutine creation? No. It'd be a solution to a nonexistent problem. Stackful coroutines don't actually allocate on the existing stack: they create new stacks, and those are allocated on the heap by default, just as C++ coroutine frames would be (by default).

  3. Are my assumptions about the C++ coroutines wrong? If so, why? See above.

  4. More verbose code because of the need for custom allocators and memory pooling. If you want stackful coroutines to perform well, you'll be doing the same thing to manage the memory areas for the stacks, and it turns out that it's even harder. You need to minimize memory waste, and thus you need to minimally overallocate the stack for the 99.9% use case, and deal somehow with coroutines that exhaust this stack.

    One way I have dealt with it in C++ was by doing stack checks at branch points where code analysis indicates more stack may be needed; then, if the stack would overflow, an exception was thrown, the coroutine's work undone (the design of the system had to support it!), and the work restarted with more stack. It's an easy way to quickly lose the benefits of tightly packed stackfuls. Oh, and I had to provide my own __cxa_allocate_exception for that to work. Fun, eh?

One more anecdote: I'm playing with using coroutines inside Windows kernel-mode drivers, and there the stacklessness does matter - to the extent that if the hardware allows, you can allocate the packet buffer and the coroutine's frame together, and these pages are pinned when they are submitted to the network hardware for execution. When the interrupt handler resumes the coroutine, the page is there, and if the network card allows, it could even prefetch it for you so it'll be in the cache. So that works well - it's just one use case, but since you wanted embedded - I've got embedded :).

It's perhaps not common to think of drivers on desktop platforms as "embedded" code, but I see lots of similarities, and an embedded mindset is needed. Last thing you want is kernel code that allocates too much, especially if it would add per-thread overhead. A typical desktop PC has a few thousand threads present, and a lot of them are there to handle I/O. Now imagine a diskless system that uses iSCSI storage. On such a system, anything I/O bound that's not bound to USB or GPU will be bound to the network hardware and the networking stack.

Finally: Trust benchmarks, not me, and read Nicol's answer too! My perspective is shaped by my use cases - I can generalize, but I claim no first-hand experience with coroutines in "generalist" code where performance is of less concern. Heap allocations for stackless coroutines are very often hardly noticeable in performance traces. In general-purpose application code, it's rarely going to be a problem. It does get "interesting" in library code, and some patterns have to be developed to allow the library user to customize this behavior. These patterns will be found and popularized as more libraries use C++ coroutines.

Kuba hasn't forgotten Monica
  • I think it's important to distinguish 3 categories of coroutine implementations: stackful, stackless-on-heap, stackless-as-struct. Your answer quite thoroughly covers the stackful vs stackless-on-heap case, however it leaves out the potential stackless-as-struct approach. Stackless-as-struct is essentially creating an anonymous type (ala lambda), used to save the data across suspension points. This is the approach taken by Rust for its async/await implementation for example. I believe there were declaration/definition and ABI concerns about this approach for C++. – Matthieu M. Jul 24 '19 at 15:24
  • As I understand it the problem with what you're calling stackless-as-struct is that you don't know what size the struct is until after all the compiler optimisation passes, and those optimisation passes need to know how big every type is. The compiler guys said fixing that is impractical. So you'd need to make it big enough to store every local variable in the coroutine, recursively, even if most or all of those variables aren't needed across suspension points after optimisation. The pessimisation of making every coroutine much bigger than necessary was considered worse than maybe-allocating. – patstew Jul 24 '19 at 17:23
  • @patstew: AFAIK, Rust uses stackless-as-struct for its async/await implementation, and its compiler (1) is not magical and (2) mostly delegates optimizations to LLVM, so running optimizations is unnecessary; although it may be desirable. I remembered I had [P1492R0](http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2019/p1492r0.pdf) tucked away for reading and dug back in. stackless-as-struct is named *Early Split* in the paper, and at the bottom of page 8 we can read... cont. – Matthieu M. Jul 24 '19 at 19:07
  • @patstew: ... "While it is theoretically not an insurmountable challenge​, it might be a ​major re-engineering​ effort to the front-end structure and experts on two compiler front ends (Clang, EDG) have indicated that this is not a practical approach." So the short answer seems to be that due to technical debt on the part of existing C++ compilers and their "straight" pipeline, the front-end cannot anticipate the size necessary for some book-keeping information traditionally handled by the code-generator. I wonder why lambdas do not have the issue, but I'll take Richard Smith's word on Clang. – Matthieu M. Jul 24 '19 at 19:09
  • Lambdas only store the captures in the object, and you never want to optimise those away (e.g. you may want to extend an object's lifetime even if it isn't directly referenced in the lambda). A coroutine has to allocate space for all local variables that are referenced across suspension points, and you want to store as few as possible. I expect that all the lifetime checking stuff in the rust frontend is very useful for that (and not particularly easy to bolt onto C++). – patstew Jul 25 '19 at 12:51
  • "the *one and only*"? Being able to suspend from any function, much as any function can perform thread synchronization, seems (in some cases) to be a bigger conceptual advantage than the details of code generation. – Davis Herring Jul 26 '19 at 03:31
  • "...stackless coroutines on small, hard-realtime ARM Cortex-M0..." sounds real interesting! Can I ask what compiler you're using? Based on a quick look GCC doesn't have coroutines even in GCC10, and arm-none-eabi-gcc seems to be about one to two major versions behind the mainline most times... – Timo Feb 24 '20 at 07:40
  • @Timo Coroutines are, in a way, target-independent, so it's not that big of a problem to use whatever is the most recent compiler even if the runtime library support lags behind. I'm using "trunk" clang and it works fine. For a bleeding edge it's surprisingly bloodless. Disclaimer: all of the runtime library binaries are custom for those builds, and I compile libc++ as well (only parts of it). I'm not reusing any existing binaries, and I use llvm and lld as essentially stand-alone tools, in the most "bare" fashion possible. I'm not using ARM system abstractions - wrote my own in pure C++ :) – Kuba hasn't forgotten Monica Mar 17 '20 at 15:38
  • "AFAIK all implementations out there do this" GCC, at least as of version 12.2, does not implement elision whatsoever. – rdb Dec 28 '22 at 02:28

Foreword: When this post says just "coroutines", I am referring to the concept of a coroutine, not the specific C++20 feature. When talking about this feature, I will refer to it as "co_await" or "co_await coroutines".

On dynamic allocation

Cppreference sometimes uses looser terminology than the standard. co_await as a feature "requires" dynamic allocation; whether this allocation comes from the heap or from a static block of memory or whatever is a matter for the provider of the allocation. Such allocations can be elided in arbitrary circumstances, but since the standard does not spell them out, you still have to assume that any co_await coroutine may dynamically allocate memory.

co_await coroutines do have mechanisms for users to provide allocation for the coroutine's state. So you can substitute the heap/free store allocation for any particular pool of memory you prefer.

co_await as a feature is well-designed to remove verbosity from the point of use for any co_await-able objects and functionality. The co_await machinery is incredibly complicated and intricate, with lots of interactions between objects of several types. But at the suspend/resume point, it always looks like co_await <some expression>. Adding allocator support to your awaitable objects and promises requires some verbosity, but that verbosity lives outside of the place where those things get used.

Using alloca for a coroutine would be... highly inappropriate for most uses of co_await. While the discussion around this feature tries to hide it, the fact of the matter is that co_await as a feature is designed for asynchronous use. That's its intended purpose: to halt the execution of a function and schedule that function's resumption on potentially another thread, then shepherding any eventually generated value to some receiving code which may be somewhat distant from the code which invoked the coroutine.

alloca is not appropriate for that particular use case, since the caller of the coroutine is allowed/encouraged to go do whatever so that the value can be generated by some other thread. The space allocated by alloca would therefore no longer exist, and that is kind of bad for the coroutine that lives in it.

Also note that allocation performance in such a scenario will generally be dwarfed by other considerations: thread scheduling, mutexes, and other things will often be needed to properly schedule the coroutine's resumption, not to mention the time it takes to get the value from whatever asynchronous process is providing it. So the fact that a dynamic allocation is needed is not really a substantial consideration in this case.

Now, there are circumstances where in-situ allocation would be appropriate. Generator use cases are for when you want to essentially pause a function and return a value, then pick up where the function left off and potentially return a new value. In these scenarios, the stack for the function which invokes the coroutine will certainly still be around.

co_await supports such scenarios (through co_yield), but it does so in a less-than-optimal way, at least in terms of the standard. Because the feature is designed for up-and-out suspension, turning it into a suspend-down coroutine has the effect of having this dynamic allocation that doesn't need to be dynamic.

This is why the standard does not require dynamic allocation; if a compiler is smart enough to detect a generator pattern of usage, then it can remove the dynamic allocation and just allocate the space on the local stack. But again, this is what a compiler can do, not must do.

In this case, alloca-based allocation would be appropriate.

How it got into the standard

The short version is that it got into the standard because the people behind it put in the work, and the people behind the alternatives did not.

Any coroutine idea is complicated, and there will always be questions about implementability with regard to them. For example, the "resumable functions" proposal looked great, and I would have loved to see it in the standard. But nobody actually implemented it in a compiler. So nobody could prove that it was actually a thing you could do. Oh sure, it sounds implementable, but that doesn't mean it is implementable.

Remember what happened the last time "sounds implementable" was used as the basis for adopting a feature.

You don't want to standardize something if you don't know it can be implemented. And you don't want to standardize something if you don't know whether it actually solves the intended problem.

Gor Nishanov and his team at Microsoft put in the work to implement co_await. They did this for years, refining their implementation and the like. Other people used their implementation in actual production code and seemed quite satisfied with its functionality. Clang even implemented it. As much as I personally don't like it, it is undeniable that co_await is a mature feature.

By contrast, the "core coroutines" alternatives that were brought up a year ago as competing ideas with co_await failed to gain traction in part because they were difficult to implement. That's why co_await was adopted: because it was a proven, mature, and sound tool that people wanted and had the demonstrated ability to improve their code.

co_await is not for everyone. Personally, I will likely not use it much, as fibers work much better for my use cases. But it is very good for its specific use case: up-and-out suspension.

Nicol Bolas
  • Curious - why do you say fibers work much better for your usecases? Is it that you usually preallocate them so it's much cheaper to use local state within the fiber with all the optimizations around function calls in C++ compilers? (like RVO, NRVO, stack variable locality, etc) – Curious Aug 23 '19 at 06:54
  • @Curious: My usecases involve task-level programming, rather than single functions. It is useful to be able to suspend the entire task, from anywhere within the task, and resume it later. All without any code in the middle having to know about the suspension. It's not about allocation; it's about what is being done. – Nicol Bolas Aug 23 '19 at 13:28

stackless coroutines

  • stackless coroutines (C++20) do code transformation (state machine)
  • stackless in this case means that the application stack is not used to store local variables (for instance, variables in your algorithm)
  • otherwise the local variables of the stackless coroutine would be overwritten by invocations of ordinary functions after suspending the stackless coroutine
  • stackless coroutines do need memory to store local variables too, especially if the coroutine gets suspended the local variables need to be preserved
  • for this purpose stackless coroutines allocate and use a so-called activation record (equivalent to a stack frame)
  • suspending from a deep call stack is only possible if all functions in between are stackless coroutines too (viral; otherwise you would get a corrupted stack)
  • some clang developers are sceptical that the Heap Allocation eLision Optimization (HALO) can always be applied

stackful coroutines

  • in its essence, a stackful coroutine simply switches stack and instruction pointer
  • allocate a side-stack that works like an ordinary stack (storing local variables, advancing the stack pointer for called functions)
  • the side-stack needs to be allocated only once (it can also be pooled) and all subsequent function calls are fast (because they only advance the stack pointer)
  • each stackless coroutine requires its own activation record -> when called in a deep call chain, a lot of activation records have to be created/allocated
  • stackful coroutines allow suspending from a deep call chain while the functions in between can be ordinary functions (not viral)
  • a stackful coroutine can outlive its caller/creator
  • one version of the skynet benchmark spawns 1 million stackful coroutines and shows that stackful coroutines are very efficient (outperforming the version using threads)
  • a version of the skynet benchmark using stackless coroutines has not been implemented yet
  • boost.context represents the thread's primary stack as a stackful coroutine/fiber - even on ARM
  • boost.context supports on-demand growing stacks (GCC split stacks)
xlrg
  • "*a stackless coroutine can not outlive its caller/creator*" Yes they can. That's the point of them. – Nicol Bolas Jul 24 '19 at 13:47
  • "*the activation record must not reside on the thread's primary stack*" This is also untrue. Depending on how the stackless coroutine is used, the "activation record" may indeed live on the stack for that thread. – Nicol Bolas Jul 24 '19 at 13:52
  • @Nicol Bolas: only if the compiler can prove that the activation record on the thread's stack could not be overwritten – xlrg Jul 24 '19 at 13:54
  • You said "must not", which means that it is forbidden. If it can be allowed in certain circumstances, then "must not" is wrong. – Nicol Bolas Jul 24 '19 at 13:55
  • right - but it is only possible in rare circumstances – xlrg Jul 24 '19 at 13:56
  • How rare that is depends on the user's codebase and the quality of the compiler. You have no basis for suggesting that this would be sufficiently rare to give people the impression that it is *impossible*. And if you do have a basis for that, then put it in your answer. Otherwise, you're just spreading disinformation. – Nicol Bolas Jul 24 '19 at 14:03
  • it's based on a statement of a clang developer – xlrg Jul 24 '19 at 14:04
  • Then put that statement in your answer. And that statement had *better* be something more than "can't be done in every case". Because that doesn't mean it can't be done in many cases, or simple cases, or generator cases, or whatever. Unless the Clang developer says that it *cannot happen* or that it can only happen in vanishingly rare circumstances, your statement is wrong. – Nicol Bolas Jul 24 '19 at 14:04
  • Also, what about my first point, where you make a patently untrue statement about stackless coroutines not being able to outlive their caller/creator? – Nicol Bolas Jul 24 '19 at 14:10
  • for HALO to work the compiler has to prove that no handle to the stack/frame leaves the scope of the caller/coroutine-creator. this is only applicable if the compiler is able to inline all coroutine methods – xlrg Jul 24 '19 at 14:17
  • Right. What's your point? If you're only using one coroutine, and it's inline (because it's a simple generator), and you're not passing it around anywhere... then your code satisfies this condition. And this is *exactly* the condition where elision is warranted and useful. The point of HALO is not to elide *every* coroutine; it's to allow elision where it is useful. In particular, generator scenarios. – Nicol Bolas Jul 24 '19 at 14:19