
Say I have code like this, where x is an RDD.

val a = x.map(...)
val b = x.map(...)
val c = x.map(...)
....
// a, b and c are used later on

For some reason, I would like b to execute only after the execution of a has completed, and c to execute only after b has completed. Is there a way to force this dependency in Spark?

Secondly, what is the default mechanism for Spark to execute such code? Would it perform the execution of a, b and c in parallel, since there is no dependency among them?

pythonic

1 Answer


What I generally do is restructure my code so that the action that forces evaluation of b is called after the action that forces a. I've not seen a way to manipulate the DAG directly.
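For example, here's a minimal sketch of that restructuring, assuming count() happens to be the action you need (any action would do):

// Each count() blocks until its job finishes, so the jobs run strictly in order:
// a's job before b's, b's before c's.
val aCount = a.count()
val bCount = b.count()
val cCount = c.count()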

As for your second question: a, b and c will not be executed at all until an action is called on them. If the actions are called in parallel, e.g. inside Futures, then Spark will run them in parallel. If they are all called in sequence, Spark will run them in sequence.

To clarify: if you call a.count(), that call blocks, so the actions forcing b and c cannot be submitted yet and the jobs will not run in parallel. However, if you call something like:

import scala.concurrent.Future
import scala.concurrent.ExecutionContext.Implicits.global

Future(a.count())
Future(b.count())
Future(c.count())

Then the maps will happen in parallel: each action is submitted immediately, which forces evaluation of its stages, and Spark processes tasks from each of those stages concurrently, depending on the total number of executor cores you have.
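Here is a self-contained sketch of that parallel variant, assuming you are on the driver, use the global ExecutionContext, and eventually want the counts back:

import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration._

// Each Future submits its Spark job immediately, so the three jobs
// (and the maps producing a, b and c) can run concurrently,
// limited by the total number of executor cores.
val fa = Future(a.count())
val fb = Future(b.count())
val fc = Future(c.count())

// Block on the driver until all three jobs finish.
val counts = (Await.result(fa, 10.minutes),
              Await.result(fb, 10.minutes),
              Await.result(fc, 10.minutes))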

A Spoty Spot
  • Of course I assume that a, b and c are used later on; in that case they will certainly be executed. But the question is, would they be executed in parallel? – pythonic Oct 24 '16 at 18:08