
I'm viewing my job in the Spark Application Master console and I can see in real time the various stages completing as Spark eats its way through my application's DAG. It all goes reasonably fast. Some stages take less than a second, others take a minute or two.


The final stage, at the top of the list, is `rdd.saveAsTextFile(path, classOf[GzipCodec])`. This stage takes a very long time.

I understand that transformations are executed zero, one or many times depending upon the execution plan created as a result of actions like saveAsTextFile or count.
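For concreteness, here is a minimal sketch of the general shape of such a job (the paths and the word-count logic are illustrative placeholders, not my actual pipeline): every call except the last only describes work, and the single `saveAsTextFile` action is what launches it.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.hadoop.io.compress.GzipCodec

object PipelineSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("pipeline-sketch"))

    // Each call below is a transformation: it only adds a step to the DAG.
    val counts = sc.textFile("s3://bucket/input")   // placeholder path
      .flatMap(_.split("\\s+"))
      .map(word => (word, 1L))
      .reduceByKey(_ + _)                           // introduces a shuffle boundary

    // The single action: only this call triggers execution of all the stages above.
    counts.map { case (w, n) => s"$w\t$n" }
      .saveAsTextFile("s3://bucket/output", classOf[GzipCodec])

    sc.stop()
  }
}
```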

As the job progresses, I can see the execution plan in the App Manager. Some stages are not present. Some are present more than once. This is expected. I can see the progress of each stage in real time (as long as I keep hitting F5 to refresh the page). The execution time is roughly commensurate with the data input size for each stage. Because of this, I'm certain that what the App Manager is showing me is the progress of the actual transformations, and not some meta-activity on the DAG.

So if the transformations are occurring in each of those stages, why is the final stage - a simple write to S3 from EMR - so slow?

If, as my colleague suggests, the transformation stages shown in the App Manager are not doing actual computation, what are they doing that consumes so much memory, CPU, and time?

Synesso

2 Answers


In Spark, lazy evaluation is a key concept, and one you need to become familiar with if you want to work with Spark.

The stages you see completing quickly are not doing any significant computation.

If they are not doing actual computation, what are they doing?

They are updating the DAG.

When an action is triggered, Spark then has the chance to consult the DAG and optimize the computation (something that wouldn't be possible without lazy evaluation).
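You can see this for yourself with a minimal sketch (assuming an existing `SparkContext` named `sc`): the transformation returns almost immediately, because it only extends the DAG, while the action is where the time is actually spent.

```scala
// Minimal sketch, assuming an existing SparkContext `sc`.
def timed[A](label: String)(body: => A): A = {
  val t0 = System.nanoTime()
  val result = body
  println(f"$label: ${(System.nanoTime() - t0) / 1e9}%.3f s")
  result
}

val data = sc.parallelize(1L to 1000000L)

// Returns almost instantly: nothing is computed, the DAG is just updated.
val squared = timed("map (transformation)") { data.map(x => x * x) }

// This is where the million multiplications actually happen.
val total = timed("reduce (action)") { squared.reduce(_ + _) }
```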

For more, read Spark Transformation - Why its lazy and what is the advantage?

Moreover, I think your colleague rushed to give you an answer, and mistakenly said:

transformation are cheap

The truth lies in the Spark documentation's section on RDD operations:

All transformations in Spark are lazy, in that they do not compute their results right away. Instead, they just remember the transformations applied to some base dataset (e.g. a file). The transformations are only computed when an action requires a result to be returned to the driver program.

Cheap is not the right word.

That explains why, at the end of the day, your final stage (the one that actually asks for the data and triggers the action) is so slow in comparison with the other stages.

In other words, none of the stages you mention seems to trigger an action. As a result, the final stage has to take all of the prior stages into account and do all the work needed, but, remember, in an optimized way, from Spark's viewpoint.

gsamaras
  • Thank you. What still confuses me is that stages are compiled Spark functions calling my compiled functions. Given that many of them take minutes to complete, is Spark simply doing optimisations? Or is there extra work going on? – Synesso Jun 01 '18 at 10:15
  • @Synesso I do not understand the question. Especially the first line. – gsamaras Jun 01 '18 at 10:16
  • For example, when App Manager shows a stage took 3 minutes to complete, what happened during that 3 minutes? Only optimisation? – Synesso Jun 01 '18 at 10:17
  • There is no optimization happening at that stage, I would say @Synesso. When a transformation is applied, the DAG is updated accordingly. When this is done, the stage is done. When another stage, one that applies an action, is executed, then Spark will check the DAG in order to see what is going on, to get an overview, and then do its optimizations. – gsamaras Jun 01 '18 at 10:20
  • That's my original understanding. I'm still confused. Perhaps I can add some code to the question. – Synesso Jun 01 '18 at 10:22
  • With what @Synesso? My last comment should explain why the stage that triggers the action is so much more computationally costly than the stages that apply transformations. Those stages just update a graph, while the action stage *actually* performs the computation that you asked for. – gsamaras Jun 01 '18 at 10:24
  • Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/172233/discussion-between-synesso-and-gsamaras). – Synesso Jun 01 '18 at 10:28

I guess the real confusion is here:

transformation are cheap

Transformations are lazy (most of the time), but nowhere near cheap. Lazy means a transformation won't be applied unless an eager descendant (an action) depends on it; it tells you nothing about its cost.

In general, transformations are where the real work happens. Output actions, excluding storage / network IO, are usually cheap compared to the logic executed in the transformations.
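To make that concrete, here is a small sketch (the slow parse function is a hypothetical stand-in, and `sc` is an existing `SparkContext`): the cost is defined in the transformation, but the time is attributed to the stages launched by the action, which is why the write stage looks like the culprit.

```scala
// Hypothetical stand-in for genuinely expensive per-record logic.
def expensiveParse(line: String): String = {
  Thread.sleep(1)          // simulate heavy parsing work per record
  line.toUpperCase
}

val out = sc.textFile("s3://bucket/input")  // placeholder path
  .map(expensiveParse)                      // the real work is defined here...

// ...but it is only executed, and billed, when the action runs.
out.saveAsTextFile("s3://bucket/output")
```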

Alper t. Turker
  • This is my understanding too. And yet, the write to S3 stage was heinously slow, despite running on EMR. – Synesso Jun 01 '18 at 09:58
  • S3 is in general incredibly slow, and there is, of course, the matter of the commit algorithm. – Alper t. Turker Jun 01 '18 at 11:25
  • 1
  • writing work to S3 is slow because that commit algorithm uses renames, which are slow and unreliable against S3. If you aren't using "consistent EMR", worry about the possibility of corrupt output rather than the performance. – stevel Jun 01 '18 at 18:21