Why is my for loop of cilk_spawn doing better than my cilk_for loop?

Question

I have

cilk_for (int i = 0; i < 100; i++)
   x = fib(35);

the above takes 6.151 seconds

and

for (int i = 0; i < 100; i++)
   x = cilk_spawn fib(35);

takes 5.703 seconds

fib(x) is the horrible naive recursive Fibonacci function. If I reduce the argument to fib, cilk_for does better than cilk_spawn, but it seems to me that cilk_for should do better than cilk_spawn regardless of how long fib(x) takes.

What don't I understand?

Something that has to do with recursion and stack overload I guess. — 101010, Apr 23 '14 at 21:31
Does putting "#pragma cilk grainsize = 1" in front of the cilk_for help? Without it, the heuristics in the cilk_for may be chunking the loop sub-optimally. How many cores does your machine have? — Arch D. Robison, Apr 24 '14 at 18:19
It's an i7, so 4 cores. The cilk_for runs at exactly the same speed with grainsize = 1. — d0m1n1c, Apr 24 '14 at 22:19
One trap I fell into while playing with this example is forgetting to put a cilk_sync after the for loop. Without that, the loop can appear to finish before all the work is finished. Though if you are timing the whole program, that shouldn't make a difference since main ends in an implicit cilk_sync. — Arch D. Robison, Apr 24 '14 at 22:44
Well, that was certainly the issue, trapped indeed. I added cilk_sync after the for loop and now they have near identical times. I ran it again with fib(20) and for i < 1000 and cilk_for performed about ten times better than the for of spawns. — d0m1n1c, Apr 24 '14 at 23:08

Arch D. Robison · Answer 1 · 2017-09-14T02:37:32.023

Per comments, the issue was a missing cilk_sync. I'll expand on that to point out exactly how the ratio of time can be predicted with surprising accuracy.

On a system with P hardware threads (typically 8 on an i7), the for/cilk_spawn code will execute as follows:

1. The initial thread will execute the iteration for i=0, and leave a continuation that is stolen by some other thread.
2. Each thief will steal an iteration and leave a continuation for the next iteration.
3. When each thief finishes an iteration, it goes back to step 2, unless there are no more iterations to steal.

Thus the threads will execute the loop hand-over-hand, and the loop exits at a point where P-1 threads are still working on iterations. So the loop can be expected to finish after only 100-(P-1) iterations' worth of time has elapsed.

So for 8 hardware threads, the for/cilk_spawn with missing cilk_sync should take about 93/100 of the time for the cilk_for, quite close to the observed ratio of about 5.703/6.151 = 0.927.

In contrast, in a "child steal" system such as TBB or PPL task_group, the loop will race to completion, generating 100 tasks, and then keep going until a call to task_group::wait. In that case, forgetting the synchronization would have led to a much more dramatic ratio of times.
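For reference, here is a minimal sketch of the spawn loop timed correctly, with the cilk_sync placed before the clock is stopped. The helper name `time_spawn_loop` and the `results` vector are my own additions (the question wrote every result into a single `x`; giving each iteration its own slot avoids all spawned calls writing the same variable). The `#if` fallback assumes the compiler defines `__cilk` when Cilk is enabled, as Intel Cilk Plus did; without it, the keywords are defined away and the code runs serially.

```cpp
#include <chrono>
#include <vector>

// Serial elision: with no Cilk support the keywords collapse to plain C++.
#if defined(__cilk)
#include <cilk/cilk.h>
#else
#define cilk_spawn
#define cilk_sync
#endif

// The deliberately slow recursive Fibonacci from the question.
long fib(int n) {
    return n < 2 ? n : fib(n - 1) + fib(n - 2);
}

// Hypothetical helper: spawns `iterations` calls of fib(n), waits for all
// of them, and returns the elapsed wall-clock time in seconds.
double time_spawn_loop(int iterations, int n, std::vector<long>& results) {
    results.assign(iterations, 0);
    auto start = std::chrono::steady_clock::now();
    for (int i = 0; i < iterations; i++)
        results[i] = cilk_spawn fib(n);
    cilk_sync;  // wait for every spawned iteration before stopping the clock
    auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double>(stop - start).count();
}
```

If the `cilk_sync` line is removed, the second clock reading can happen while up to P-1 iterations are still running, which is the effect described above.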

I can't understand "(typically 7 on a i7)"! Is it correct? — IndustProg, Sep 10 '17 at 05:15
