The reason this happens is how caching works in Spark.
When you apply some kind of processing to a DataFrame, RDD or Dataset, the execution has a plan. See below:
import org.apache.spark.sql.functions.col

val df = sc.parallelize(1 to 10000).toDF("line")
df.withColumn("new_line", col("line") * 10).queryExecution
The queryExecution method returns the plan to you. See below the plans produced for this code:
== Parsed Logical Plan ==
'Project [*,('line * 10) AS new_line#7]
+- Project [_1#4 AS line#5]
+- LogicalRDD [_1#4], MapPartitionsRDD[9] at
== Analyzed Logical Plan ==
line: int, new_line: int
Project [line#5,(line#5 * 10) AS new_line#7]
+- Project [_1#4 AS line#5]
+- LogicalRDD [_1#4], MapPartitionsRDD[9] at
== Optimized Logical Plan ==
Project [_1#4 AS line#5,(_1#4 * 10) AS new_line#7]
+- LogicalRDD [_1#4], MapPartitionsRDD[9] at intRddToDataFrameHolder at
== Physical Plan ==
Project [_1#4 AS line#5,(_1#4 * 10) AS new_line#7]
+- Scan ExistingRDD[_1#4]
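If you want to reproduce these plans outside the spark-shell, here is a minimal sketch (the SparkSession setup and the explain(true) call are my own example, assuming Spark 2.x; in the shell, spark and sc already exist):

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

// Assumed standalone setup; in the spark-shell this is already done for you.
val spark = SparkSession.builder()
  .appName("cache-plan-demo")
  .master("local[*]")
  .getOrCreate()
import spark.implicits._

val df = spark.sparkContext.parallelize(1 to 10000).toDF("line")

// explain(true) prints the parsed, analyzed, optimized and physical plans,
// the same information exposed by queryExecution.
df.withColumn("new_line", col("line") * 10).explain(true)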
In this case you can see the whole process your code will go through. When you call the cache function like this:
df.withColumn("new_line", col("line") * 10).cache().queryExecution
The result will be like this:
== Parsed Logical Plan ==
'Project [*,('line * 10) AS new_line#8]
+- Project [_1#4 AS line#5]
+- LogicalRDD [_1#4], MapPartitionsRDD[9] at intRddToDataFrameHolder at <console>:34
== Analyzed Logical Plan ==
line: int, new_line: int
Project [line#5,(line#5 * 10) AS new_line#8]
+- Project [_1#4 AS line#5]
+- LogicalRDD [_1#4], MapPartitionsRDD[9] at intRddToDataFrameHolder at <console>:34
== Optimized Logical Plan ==
InMemoryRelation [line#5,new_line#8], true, 10000, StorageLevel(true, true, false, true, 1), Project [_1#4 AS line#5,(_1#4 * 10) AS new_line#8], None
== Physical Plan ==
InMemoryColumnarTableScan [line#5,new_line#8], InMemoryRelation [line#5,new_line#8], true, 10000, StorageLevel(true, true, false, true, 1), Pro...
This execution now contains an InMemoryRelation
in the optimized logical plan; it saves a data structure in memory, or spills it to disk if your data is really big.
Saving this across your cluster takes time, so the first execution will be a little slow, but when you need to access the same data again somewhere else, the DataFrame or RDD is already saved and Spark will not trigger the execution again.
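To make that concrete, here is a small sketch of how the cache is materialized and then reused (the count and filter are just example actions I picked; any action would materialize the cache):

import org.apache.spark.sql.functions.col

val enriched = df.withColumn("new_line", col("line") * 10).cache()

// First action: Spark runs the whole plan and stores the result in memory,
// spilling to disk if it does not fit (the default MEMORY_AND_DISK level).
enriched.count()

// Later actions read the InMemoryRelation instead of recomputing the plan.
enriched.filter(col("new_line") > 100).count()

// Free the cached data when you are done with it.
enriched.unpersist()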