Practical Delimited Continuations in C / x64 ASM

Question

I've looked at a paper called A Primer on Scheduling Fork-Join Parallelism with Work Stealing. I want to implement continuation stealing, where the rest of the code after calling spawn is eligible to be stolen. Here's the code from the paper.

1 e();
2 spawn f(); 
3 g();
4 sync;
5 h();

An important design choice is which branch to offer to thief threads. Using Figure 1, the choices are:

Child Stealing:

f() is made available to thief threads.

The thread that executed e() executes g().

Continuation Stealing:

Also called “parent stealing”.

The thread that executed e() executes f().

The continuation (which will next call g()) becomes available to thief threads.

I hear that saving a continuation requires saving both sets of registers (volatile/non-volatile/FPU). In the fiber implementation I did, I ended up implementing child stealing. I read about the (theoretical) negatives of child stealing (unbounded number of runnable tasks, see the paper for more info), so I want to use continuations instead.

I'm thinking of two functions, shift and reset, where reset delimits the current continuation, and shift reifies the current continuation. Is what I'm asking even plausible in a C environment?

EDIT: I'm thinking of making reset save return address / NV GPRs for the current function call (= line 3), and making shift transfer control to the next continuation after returning a value to the caller of reset.

Absolutely it's possible. The article references some tools/libraries. See https://www.openmp.org/resources/openmp-compilers-tools/ — jwdonahue, Jun 28 '18 at 00:41
@jwdonahue Any whitepapers I should be looking at? I'd like to see what others have done already. I haven't found much. Putting a bounty on this for an authoritative answer/reference. — Jesse Lactin, Jul 01 '18 at 20:01
What is the point of all this? What do you really want or need to implement? — RbMm, Jul 01 '18 at 20:12

Ira Baxter · Accepted Answer · 2018-07-06T15:03:40.143

I've implemented work stealing for an HLL called PARLANSE, rather than C, on an x86. PARLANSE is used daily to build production symbolic parallel programs at the million-line scale.

In general, you have to preserve the registers of both the continuation and the "child". Consider that your compiler may see a computation in f() and see the same computation in g(), and might lift that computation to the point just before the spawn, and place the result in a register that both f() and g() use as an implied parameter. Yes, this assumes a sophisticated compiler, but if you are using a stupid compiler that doesn't optimize, why are you trying to go parallel for speed?

Specifically, however, your compiler could arrange for the registers to be empty before the call to spawn if it understood what spawn means. Then neither the continuation nor the child has to preserve registers. (The PARLANSE compiler in fact does this.)

So how much has to be saved depends on how much your compiler is willing to help, and that depends on whether it knows what spawn really does.

Your local friendly C compiler likely doesn't know about your implementation of spawn. So either you do something to force a register flush (don't ask me, it's your compiler) or you put up with the fact that you personally don't know what's in the registers, and your implementation preserves them all to be safe.

If the amount of work spawned is significant, arguably it wouldn't matter if you saved all the registers. However, the x86 (and other modern architectures) seem to have an enormous amount of state, mostly in the vector registers, that might be in use; last time I looked it was well in excess of 500 bytes, or roughly 100 writes to memory to save it all, and IMHO that's an excessive price. If you don't believe these registers are going to be passed from the parent thread to the spawned thread, then you can work on enforcing spawn with no registers.
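To put rough numbers on that, here is a small sketch of the arithmetic for x86-64. The register counts and widths are the architectural ones (16 general-purpose registers, 16 XMM/YMM registers, 32 ZMM registers with AVX-512); the helper names are made up for illustration, and the totals ignore x87 state, MXCSR and the AVX-512 mask registers, so they are lower bounds.

```c
#include <assert.h>
#include <stddef.h>

/* Bytes occupied by `count` registers that are each `bits` wide. */
static size_t reg_bytes(size_t count, size_t bits) {
    return count * (bits / 8);
}

/* Architectural x86-64 register files (hypothetical helper names). */
size_t gpr_bytes(void)    { return reg_bytes(16, 64);  }  /* rax..r15     */
size_t sse_bytes(void)    { return reg_bytes(16, 128); }  /* xmm0..xmm15  */
size_t avx_bytes(void)    { return reg_bytes(16, 256); }  /* ymm0..ymm15  */
size_t avx512_bytes(void) { return reg_bytes(32, 512); }  /* zmm0..zmm31  */

/* Number of stores needed to write `bytes` using stores of `store_bytes`. */
size_t writes_needed(size_t bytes, size_t store_bytes) {
    assert(store_bytes != 0);
    return (bytes + store_bytes - 1) / store_bytes;
}
```

With AVX in use, the integer plus vector state is gpr_bytes() + avx_bytes() = 640 bytes, and with AVX-512 the vector file alone is 2048 bytes. How many writes that costs depends on the store width you assume, which is why the figure in the answer is only approximate.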

If your spawn routine wakes up using a standard continuation mechanism you have invented, then you also have to worry about whether your continuations pass large register state or not. Same problem, same solutions as for spawn; the compiler has to help or you personally have to intervene.
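For experimenting with this before writing any assembly, one option is to borrow an existing continuation mechanism. The sketch below uses POSIX ucontext (assuming Linux with glibc); it is not how PARLANSE does it and it is not the shift/reset design from the question, just a stand-in. Because swapcontext is an ordinary function call, the compiler already treats call-clobbered registers as dead across it. The names spawn_first, note and run_fork_join are hypothetical, and everything runs on one thread, so nothing is actually stolen: the saved continuation is simply resumed when the child returns.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>
#include <ucontext.h>

#define CHILD_STACK_BYTES (64 * 1024)

static ucontext_t continuation;   /* everything after the spawn point */
static ucontext_t child;          /* the spawned call, on its own stack */
static char trace[8];             /* order in which e, f, g, h ran */
static size_t trace_len;

static void note(char c) {
    if (trace_len < sizeof trace - 1) trace[trace_len++] = c;
}

static void f(void) { note('f'); }

/* Hypothetical spawn: run fn right away and park the caller's continuation. */
static void spawn_first(void (*fn)(void)) {
    void *stack = malloc(CHILD_STACK_BYTES);
    int rc = getcontext(&child);
    assert(stack != NULL && rc == 0);
    (void)rc;
    child.uc_stack.ss_sp = stack;
    child.uc_stack.ss_size = CHILD_STACK_BYTES;
    child.uc_link = &continuation;   /* resumed when fn returns */
    makecontext(&child, fn, 0);
    /* Save the continuation and switch to the child. A real scheduler would
       make `continuation` visible to thieves once it has been saved. */
    rc = swapcontext(&continuation, &child);
    assert(rc == 0);
    (void)rc;
    free(stack);                     /* child is done; back on the original stack */
}

const char *run_fork_join(void) {
    memset(trace, 0, sizeof trace);
    trace_len = 0;
    note('e');
    spawn_first(f);                  /* the thread that ran e() runs f() */
    note('g');                       /* the continuation, unstolen here */
    /* sync would go here; trivially satisfied because f has returned */
    note('h');
    return trace;
}
```

Calling run_fork_join() records the order e, f, g, h, which is the continuation-stealing order when no thief shows up. Whether a saved context can safely be resumed on a different thread, and how much state swapcontext saves on your platform, are things to check before building a real work-stealer on it.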

You'll find this a lot of fun.

[If you want to make it really interesting, try timeslicing the threads in case they go into deep computation without an occasional yield and cause thread starvation. Now you surely have to save the entire state. I managed to get PARLANSE to realize spawning with no registers saved, yet have the time slicing save/restore full register state, by saving full state on a time slice, and continuing at a special place that refilled all the registers before it passed control to the time-sliced PC location.]

If you can make `spawn` look to your compiler like a non-inline function call, you can skip saving all the call-clobbered registers. (Like you'd do for a context-switch function in a kernel or user-space threads). That includes all the x87 regs, but on x64 Windows xmm6..15 are call-preserved along with many of the integer regs. (Only the xmm part, though, not the YMM/ZMM upper lanes. You can save/restore with non-VEX `movaps xmm`.) Or if you really need to save/restore the full FPU state, there's `xsaveopt` that can get MXCSR and the x87 status reg. — Peter Cordes, Jul 02 '18 at 11:20
Good point. So there are two issues: knowing at the spawn site which registers don't need to be saved (or as I have suggested, conning your compiler into making that set as small as possible by forcing it to spill) and knowing which registers to *restore* when the spawned function/child function starts running. If that set is function-call-site specific then your thread switcher gets more complicated; otherwise how does it know? Or you can settle on a scheme which defines a constant set of registers to save/restore, and implement with that. — Ira Baxter, Jul 02 '18 at 12:44
