
I'm trying to understand lazy evaluation in Apache Spark. Here is my understanding:

Let's say I have a text file on my hard drive.

Steps:

1) First I create RDD1, which is nothing but a data definition at this point (no data is loaded into memory yet).

2) I apply some transformation logic on RDD1 and create RDD2; RDD2 is still just a data definition (still no data loaded into memory).

3) Then I apply a filter on RDD2 and create RDD3 (still no data loaded into memory; RDD3 is also just a data definition).

4) I perform an action so that I get the RDD3 output written to a text file. The moment I perform this action, where I actually expect some output, Spark loads the data into memory, computes RDD1, RDD2, and RDD3, and produces the result.

So the laziness of RDDs in Spark means: just keep building the roadmap (the RDDs) until they get the approval to actually be materialized, as in the sketch below.
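To make it concrete, here is a minimal sketch of those four steps in Spark's Scala API. The paths, the `toUpperCase` mapping, and the `"ERROR"` filter are just made-up examples, not part of any real job:

```scala
// 1) Data definition only: nothing is read from disk yet (path is hypothetical)
val rdd1 = sc.textFile("/data/input.txt")

// 2) A transformation: still lazy, only the lineage is recorded
val rdd2 = rdd1.map(line => line.toUpperCase)

// 3) Another lazy transformation: still just a definition
val rdd3 = rdd2.filter(line => line.contains("ERROR"))

// 4) An action: only now does Spark read the file and run the whole pipeline
rdd3.saveAsTextFile("/data/output")
```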

Is my understanding correct up to here?

My second question: it is said that lazy evaluation is one of the reasons Spark is more powerful than Hadoop. Could you please explain how? I am not very familiar with Hadoop. What happens in Hadoop in this scenario?

Thanks :)

Squeez
  • Does this answer your question? [Spark Transformation - Why is it lazy and what is the advantage?](https://stackoverflow.com/questions/38027877/spark-transformation-why-is-it-lazy-and-what-is-the-advantage) – JAdel Jan 18 '23 at 18:46

1 Answer


Yes, your understanding is fine. A graph of transformations (a DAG) is built up, and they are all computed at once when an action is called. This is what is meant by lazy evaluation.
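If it helps, one way to see that only a plan exists before the action is to print the recorded lineage. This sketch assumes the `rdd3` definition from your question:

```scala
// Prints the chain of lazy transformations Spark has recorded so far;
// nothing has been read or computed at this point.
println(rdd3.toDebugString)

// Calling an action such as count() (or saveAsTextFile) is what triggers execution.
println(rdd3.count())
```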

Hadoop only provides a filesystem (HDFS), a resource manager (YARN), and the libraries that allow you to run MapReduce. Spark only concerns itself with being more efficient than the latter, given enough memory.

Apache Pig is another framework in the Hadoop ecosystem that allows for lazy evaluation, but it has its own scripting language, compared with the broad programmability of Spark in the languages it supports. Pig can run its computations on MapReduce, Tez, or Spark, whereas Spark only runs and optimizes its own code.

What happens in actual MapReduce code is that you need to procedurally write each stage's output out to disk (or memory) before the next stage can run, in order to accomplish relatively large tasks; a rough illustration follows.
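This is not real MapReduce code, just a Spark-syntax sketch (with made-up paths) of the same shape of workflow, to show the intermediate disk writes a chained MapReduce pipeline forces on you:

```scala
// "Job 1": materialize the full intermediate result to storage before anything else can run
sc.textFile("/data/input.txt")
  .map(line => line.toUpperCase)
  .saveAsTextFile("/tmp/stage1")

// "Job 2": re-read that intermediate output from disk, then write the final result
sc.textFile("/tmp/stage1")
  .filter(line => line.contains("ERROR"))
  .saveAsTextFile("/data/output")

// Spark's lazy pipeline (rdd1 -> rdd2 -> rdd3 -> saveAsTextFile) does the same work
// in a single pass, keeping intermediate records in memory instead of on disk.
```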

Spark is not a replacement for "Hadoop"; it's a complement.

OneCricketeer