26

Spark transformations are lazily evaluated: only when we call an action does Spark execute the transformations, based on the lineage graph.

What is the advantage of having transformations lazily evaluated?

Does it improve performance and reduce memory consumption compared to eager evaluation?

Is there any disadvantage to having transformations lazily evaluated?

jbmusso
Shankar

5 Answers

41

For transformations, Spark adds them to a DAG of computation, and only when the driver requests some data does this DAG actually get executed.

One advantage of this is that Spark can make many optimization decisions after it has had a chance to look at the DAG in its entirety. This would not be possible if it executed everything as soon as it received it.

For example -- if you executed every transformation eagerly, what would that mean? It would mean materializing that many intermediate datasets in memory. This is evidently not efficient -- for one, it will increase your GC costs. (Because you're really not interested in those intermediate results as such. Those are just convenient abstractions for you while writing the program.) So, what you do instead is tell Spark the eventual answer you're interested in, and it figures out the best way to get there.
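A minimal sketch in Scala (local SparkSession and made-up numbers, just to illustrate the point) of how a chain of transformations is only executed once an action is called, letting Spark pipeline the steps instead of materializing each intermediate result:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.appName("lazy-demo").master("local[*]").getOrCreate()
val sc = spark.sparkContext

val nums    = sc.parallelize(1 to 1000000)   // nothing is computed yet
val doubled = nums.map(_ * 2)                // just added to the DAG
val evens   = doubled.filter(_ % 4 == 0)     // still only recorded

// The action triggers execution; Spark can pipeline map and filter into a
// single pass, so no intermediate dataset is materialized in memory.
println(evens.count())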

Sachin Tyagi
12

Consider a 1 GB log file containing error, warning, and info messages, stored in HDFS as blocks of 64 or 128 MB (it doesn't matter in this context). You first create an RDD called "input" from this text file. Then you create another RDD called "errors" by applying a filter to the "input" RDD to fetch only the lines containing error messages, and then call the action first() on the "errors" RDD. Spark will optimize the processing of the log file here by stopping as soon as it finds the first occurrence of an error message in any of the partitions. If the same scenario were repeated with eager evaluation, Spark would have filtered all the partitions of the log file even though you were only interested in the first error message.
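A rough sketch of this scenario in Scala (the HDFS path and the "error" marker are assumptions for the example):

val input  = sc.textFile("hdfs:///logs/app.log")   // transformation: nothing is read yet
val errors = input.filter(_.contains("error"))     // transformation: still nothing is read

// first() is an action; Spark launches only as much work as is needed to
// return one element, so it can stop early instead of filtering every partition.
println(errors.first())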

Aniketh Jain
9

From https://www.mapr.com/blog/5-minute-guide-understanding-significance-apache-spark

Lazy evaluation means that if you tell Spark to operate on a set of data, it listens to what you ask it to do, writes down some shorthand for it so it doesn’t forget, and then does absolutely nothing. It will continue to do nothing, until you ask it for the final answer. [...]

It waits until you’re done giving it operators, and only when you ask it to give you the final answer does it evaluate, and it always looks to limit how much work it has to do.

It saves time and avoids unnecessary processing.
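As a small illustration in Scala (the path and the word-count pipeline are assumed for the example), you can even look at the "shorthand" Spark has written down, via toDebugString, before anything has run:

val words  = sc.textFile("hdfs:///data/words.txt")   // hypothetical path
val counts = words.flatMap(_.split(" "))
                  .map(w => (w, 1))
                  .reduceByKey(_ + _)

// Only the lineage has been recorded so far; no job has run yet.
println(counts.toDebugString)

// The action below finally triggers evaluation, and Spark does only the work
// needed to produce these five results.
counts.take(5).foreach(println)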

Tunaki
Sachin Sukumaran
1

Consider when Spark is not lazy:

For example: we have a 1 GB file in HDFS and write something like

val rdd1 = sc.textFile("hdfs:///data/file.txt")   // hypothetical 1 GB file
println(rdd1.first())                             // print the first line

In this case, when the first line is executed, an entry is made in the DAG and the whole 1 GB file is loaded into memory. So by the time the second line runs, the disaster is that the entire 1 GB file has been loaded into memory just to print the first line of the file.

Consider when Spark is lazy:

val rdd1 = sc.textFile("hdfs:///data/file.txt")   // hypothetical 1 GB file
println(rdd1.first())                             // print the first line

In this case the first line only adds an entry to the DAG, and the entire execution plan is built. Spark then does its internal optimization: instead of loading the entire 1 GB file, it reads only as much as it needs to return and print the first line.

This avoids unnecessary computation and makes way for optimization.

eshirvana
0

Advantages:

  • "Spark allows programmers to develop complex, multi-step data pipelines usind directed acyclic graph (DAG) pattern" - [Khan15]
  • "Since spark is based on DAG, it can follow a chain from child to parent to fetch any value like traversal" - [Khan15]
  • "DAG supports fault tolerance" - [Khan15]

Description:
(According to "Big Data Analytics on Apache Spark" [SA16] and [Khan15])

  • "Spark will not compute RDDs until an action is called." - [SA16]
  • Example of actions:
    reduce(func), collect(), count(), first(), take(n), ... [APACHE]

  • "Spark keeps track of the lineage graph of transformations, which is used to compute each RDD on demand and to recover lost data." - [SA16]
  • Example of transformations (see the sketch after this list):
    map(func), filter(func), flatMap(func), groupByKey([numPartitions]), reduceByKey(func, [numPartitions]), ... [APACHE]
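
A short sketch in Scala (made-up data) showing the split between the lazy transformations and the eager actions listed above:

val pairs   = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))

val summed  = pairs.reduceByKey(_ + _)                          // transformation: only added to the lineage graph
val renamed = summed.map { case (k, v) => (k.toUpperCase, v) }  // transformation

renamed.collect().foreach(println)   // action: the whole lineage is computed here
println(renamed.count())             // action: the lineage is walked again unless the RDD is persisted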

[Figure: Programming Model]

JAdel