
Say I have code like this, where x is an RDD.

val a = x.map(...)
val b = x.map(...)
val c = x.map(...)
....
// a, b and c are used later on

For some reason, I would like b to execute only after the execution of a has completed, and c to execute only after b has completed. Is there a way to force this dependency in Spark?

Secondly, what is the default mechanism for Spark to execute such code? Would it perform the execution of a, b and c in parallel, since there is no dependency among them?

pythonic

1 Answer


What I generally do is restructure my code so that the action that forces evaluation of b is called after the action that forces a. I've not seen a way to manipulate the DAG directly.
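For example, here's a minimal sketch of that restructuring, assuming count() happens to be the action you need (any action would do):

// Each count() blocks until its job finishes, so the jobs run strictly in order:
// a's job before b's, b's before c's.
val aCount = a.count()
val bCount = b.count()
val cCount = c.count()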

As for your second question: a, b and c will not be executed at all until an action is called on them. If the actions are called in parallel, e.g. inside Futures, then Spark will run them in parallel. If they are all called in sequence, Spark will run them in sequence.

To clarify: if you call a.count(), that call blocks, so the actions forcing b and c cannot be submitted yet and the jobs will not run in parallel. However, if you call something like:

import scala.concurrent.Future
import scala.concurrent.ExecutionContext.Implicits.global

Future(a.count())
Future(b.count())
Future(c.count())

Then the maps will happen in parallel: each action is submitted immediately, which forces evaluation of its stages, and Spark processes tasks from each of those stages concurrently, depending on the total number of executor cores you have.
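Here is a self-contained sketch of that parallel variant, assuming you are on the driver, use the global ExecutionContext, and eventually want the counts back:

import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration._

// Each Future submits its Spark job immediately, so the three jobs
// (and the maps producing a, b and c) can run concurrently,
// limited by the total number of executor cores.
val fa = Future(a.count())
val fb = Future(b.count())
val fc = Future(c.count())

// Block on the driver until all three jobs finish.
val counts = (Await.result(fa, 10.minutes),
              Await.result(fb, 10.minutes),
              Await.result(fc, 10.minutes))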

A Spoty Spot
  • Of course I assume that a, b and c are used later on; in that case they will certainly be executed. But the question is, would they be executed in parallel? – pythonic Oct 24 '16 at 18:08