
We're performing some tests to evaluate the behavior of transformations and actions in Spark with Spark SQL. In our tests, we first build a simple dataflow with 2 transformations and 1 action:

LOAD (result: df_1) > SELECT ALL FROM df_1 (result: df_2) > COUNT(df_2) 

The execution time for this first dataflow was 10 seconds. Next, we added another action to our dataflow:

LOAD (result: df_1) > SELECT ALL FROM df_1 (result: df_2) > COUNT(df_2) > COUNT(df_2) 
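
For concreteness, here is a minimal Scala sketch of how such a test might be written (the input path, file format, and timing code are our assumptions, not part of the original dataflow):

    import org.apache.spark.sql.SparkSession

    object CountTwice {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().appName("count-twice-test").getOrCreate()

        // LOAD (result: df_1) -- hypothetical input file
        val df1 = spark.read.option("header", "true").csv("/data/input.csv")

        // SELECT ALL FROM df_1 (result: df_2)
        val df2 = df1.select("*")

        // First action: forces execution of the whole lineage (load + select)
        val t0 = System.nanoTime()
        df2.count()
        println(s"first count:  ${(System.nanoTime() - t0) / 1e9} s")

        // Second action on the same DataFrame
        val t1 = System.nanoTime()
        df2.count()
        println(s"second count: ${(System.nanoTime() - t1) / 1e9} s")

        spark.stop()
      }
    }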

Analyzing the second version of the dataflow: since all transformations are lazy and re-executed for each action (according to the documentation), executing the second count should require re-executing the two previous transformations (LOAD and SELECT ALL). Thus, we expected the execution time of this second version to be around 20 seconds. However, the execution time was 11 seconds. Apparently, the results of the transformations required by the first count were cached by Spark for the second count.

Does anyone know what is happening here?

David
Brccosta
  • Can you also provide the DAG from Spark UI? – Anton Okolnychyi Dec 09 '16 at 12:23
  • The TLDR is that if the DF wasn't garbage collected in time, Spark would be able to carry out the action on it, meaning you didn't need to recompute the entire DAG; so technically you would skip the stage of loading/selecting the data, as it is already materialised in memory. – Lyuben Todorov Dec 09 '16 at 12:34

2 Answers


Take a look at your jobs: you may see skipped stages, which is a good thing. Spark recognizes that it still has the shuffle output from the previous job and will reuse it rather than starting from the source data and re-shuffling the full dataset.
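
If you want that reuse to be explicit rather than dependent on shuffle files still being available, you can cache the DataFrame yourself. A small sketch, reusing the df_2 name from the question (written df2 here):

    // Mark the DataFrame for caching; nothing is computed yet (cache() is lazy).
    df2.cache()

    df2.count()   // first action: computes the lineage and populates the cache
    df2.count()   // second action: reads the cached blocks; upstream stages show as "skipped" in the UI

    // Release the cached blocks when no longer needed.
    df2.unpersist()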

Silvio

It is the Spark DAG scheduler that recognizes that the data will be used again after the first action. A Spark program implicitly builds a logical directed acyclic graph (DAG) of operations. When the driver program runs, it converts this logical graph into a physical execution plan.

Actions force translation of the DAG to an execution plan

When you call an action on an RDD, it must be computed. In your case, you perform one action and then another action on top of the same DataFrame. Computing an RDD requires computing its parent RDDs as well. Spark's scheduler submits a job to compute all needed RDDs. That job will have one or more stages, which are parallel waves of computation composed of tasks. Each stage corresponds to one or more RDDs in the DAG; a single stage can cover multiple RDDs due to pipelining.
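
To inspect the plan and lineage that the scheduler builds, you can print them from the driver. A short sketch, assuming the df_2 DataFrame from the question (written df2 here):

    // Print the logical and physical plans that Spark SQL generates for the DataFrame.
    df2.explain(true)

    // Print the lineage of the underlying RDD, including stage boundaries.
    println(df2.rdd.toDebugString)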

[Images: Spark UI job visualization and DAG]

Indrajit Swain