15

I am implementing a Spark application, of which below is a sample snippet (not the exact same code):

val rdd1 = sc.textFile(HDFS_PATH)
val rdd2 = rdd1.map(func)
rdd2.persist(StorageLevel.MEMORY_AND_DISK)
println(rdd2.count)

On checking the performance of this code from the Spark Application Master UI, I see an entry for the count action, but not for the persist. The DAG for this count action also has a node for the 'map' transformation (line 2 of the above code).

Is it safe to conclude that the map transformation is executed when count (in the last line) is encountered, and not when persist is encountered?

Also, at what point is rdd2 actually persisted? I understand that only two types of operations can be called on RDDs - transformations and actions. If the RDD is persisted lazily when the count action is called, would persist be considered a transformation or an action or neither?

Ankit Khettry

2 Answers

47

A Dataset's cache and persist operators are lazy and have no effect until you call an action (you then have to wait until the caching has finished, which is the extra price you pay up front for better performance later on).
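For instance, here is a minimal sketch mirroring the question's snippet (assuming a spark-shell session so sc is in scope; the path and the mapping function are illustrative):

import org.apache.spark.storage.StorageLevel

val rdd1 = sc.textFile("hdfs:///some/path") // illustrative path; no job runs yet
val rdd2 = rdd1.map(_.toUpperCase)          // still no job
rdd2.persist(StorageLevel.MEMORY_AND_DISK)  // merely marks rdd2 for caching
println(rdd2.count)                         // first action: reads, maps, and caches in one pass
println(rdd2.count)                         // served from the cache; map does not run again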

From Spark's official documentation on RDD Persistence:

One of the most important capabilities in Spark is persisting (or caching) a dataset in memory across operations. When you persist an RDD, each node stores any partitions of it that it computes in memory and reuses them in other actions on that dataset (or datasets derived from it). This allows future actions to be much faster (often by more than 10x). Caching is a key tool for iterative algorithms and fast interactive use.

You can mark an RDD to be persisted using the persist() or cache() methods on it. The first time it is computed in an action, it will be kept in memory on the nodes. Spark’s cache is fault-tolerant – if any partition of an RDD is lost, it will automatically be recomputed using the transformations that originally created it.

That's exactly the reason why some people (and Spark SQL itself!) do the following trick:

rdd2.persist(StorageLevel.MEMORY_AND_DISK).count

to trigger the caching.

The count operator is fairly cheap, so the net effect is that the caching is executed almost immediately after that line (there may be a small delay before the caching has completed, as it executes asynchronously).

The benefits of this count after persist are as follows:

  1. No action (other than the count itself) will "suffer" the extra time needed for caching

  2. The time between this line and the place where the cached rdd2 is actually used may be enough to complete the caching fully, so later actions run at full speed, without any extra "slowdown" for caching (see the sketch below)
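To verify that the trick worked (a sketch, again assuming the rdd2 from above in a spark-shell session), you can inspect the storage level and the persistent-RDD registry right after the count:

rdd2.persist(StorageLevel.MEMORY_AND_DISK).count

println(rdd2.getStorageLevel)       // StorageLevel(disk, memory, deserialized, 1 replicas)
println(sc.getPersistentRDDs.size)  // rdd2 now appears in the persistent-RDD registry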

So when you asked:

would persist be considered a transformation or an action or neither?

I'd say it's neither, and would consider it an optimization hint (one that may or may not ever be acted upon).


Use the web UI's Storage tab to see which Datasets (via their underlying RDDs) have already been persisted.


You can also see the effect of the cache and persist operators in a query plan using explain (or simply QueryExecution.optimizedPlan).

val q1 = spark.range(10).groupBy('id % 5).agg(count("*") as "count").cache
scala> q1.explain
== Physical Plan ==
*(1) ColumnarToRow
+- InMemoryTableScan [(id % 5)#120L, count#119L]
      +- InMemoryRelation [(id % 5)#120L, count#119L], StorageLevel(disk, memory, deserialized, 1 replicas)
            +- *(2) HashAggregate(keys=[(id#0L % 5)#8L], functions=[count(1)])
               +- Exchange hashpartitioning((id#0L % 5)#8L, 200), true, [id=#13]
                  +- *(1) HashAggregate(keys=[(id#0L % 5) AS (id#0L % 5)#8L], functions=[partial_count(1)])
                     +- *(1) Range (0, 10, step=1, splits=16)

scala> println(q1.queryExecution.optimizedPlan.numberedTreeString)
00 InMemoryRelation [(id % 5)#5L, count#4L], StorageLevel(disk, memory, deserialized, 1 replicas)
01    +- *(2) HashAggregate(keys=[(id#0L % 5)#8L], functions=[count(1)], output=[(id % 5)#5L, count#4L])
02       +- Exchange hashpartitioning((id#0L % 5)#8L, 200), true, [id=#13]
03          +- *(1) HashAggregate(keys=[(id#0L % 5) AS (id#0L % 5)#8L], functions=[partial_count(1)], output=[(id#0L % 5)#8L, count#10L])
04             +- *(1) Range (0, 10, step=1, splits=16)

Please note that the count above is the standard aggregate function, not the Dataset action, so it triggers no job and no caching happens. It's just a coincidence that count is the name of both a standard function and a Dataset action.
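To make the coincidence concrete, here is a sketch (assuming spark-shell, where spark.implicits._ and the standard functions are already imported):

// count as a Dataset action: submits a Spark job and returns a Long
val n: Long = spark.range(10).count()

// count as a standard aggregate function: only builds a query plan, nothing runs
val q = spark.range(10).groupBy('id % 5).agg(count("*") as "count")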

You can cache a table using pure SQL (this is eager!)

// Registers range5 as a cached table holding the output of the range(5) function
spark.sql("CACHE TABLE range5 AS SELECT * FROM range(5)")
val q2 = spark.sql("SELECT * FROM range5")
scala> q2.explain
== Physical Plan ==
*(1) ColumnarToRow
+- Scan In-memory table `range5` [id#51L]
      +- InMemoryRelation [id#51L], StorageLevel(disk, memory, deserialized, 1 replicas)
            +- *(1) Range (0, 5, step=1, splits=16)

The InMemoryTableScan physical operator (together with its InMemoryRelation logical operator) is how you can tell that the query is cached in memory and will hence be reused.
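You can also check this programmatically with the Catalog API (a minimal sketch):

println(spark.catalog.isCached("range5")) // true once CACHE TABLE range5 has run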


Moreover, Spark SQL itself uses the same pattern to trigger DataFrame caching for SQL's CACHE TABLE query (which, unlike RDD caching, is by default eager):

if (!isLazy) {
  // Performs eager caching
  sparkSession.table(tableIdent).count()
}

That means that, depending on the operator, you may get a different result as far as caching is concerned: the cache and persist operators are lazy by default, while SQL's CACHE TABLE is eager.
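For completeness, SQL also lets you opt into the lazy behaviour with the LAZY keyword (a sketch; the table names are illustrative):

// Eager (the default): materializes the cache right away
spark.sql("CACHE TABLE eager5 AS SELECT * FROM range(5)")

// Lazy: behaves like cache/persist and defers the work until the first action
spark.sql("CACHE LAZY TABLE lazy5 AS SELECT * FROM range(5)")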

Jacek Laskowski
  • Thanks Jacek!! This gives great insights into the internals of Spark that are great to know for all kinds of users. Your answers on SO are amazing :) And so are your blogs!! – Ankit Khettry Sep 25 '19 at 07:57
  • @AnkitKhettry Thanks for the kind words. Please upvote the answer if you like it (to keep my ego...ekhm...motivation high :)) – Jacek Laskowski Sep 25 '19 at 10:28
  • Haha!! Did it long back :) Was revisiting your answer recently when I came across something else – Ankit Khettry Sep 25 '19 at 10:42
  • Ah, right. You're the OP! But, hey, why don't you accept my answer then (given 15 to 2 scores)? – Jacek Laskowski Sep 25 '19 at 11:45
  • At the time when I asked this question, I was fairly new to Spark and David's answer helped me solve the problem *quickly*. Your answer gives a lot of insight now that I have some intermediate knowledge; however, it seemed a bit verbose earlier. This is honest feedback. I have come across a lot of your videos, read a lot of your blogs and solved a lot of problems, thanks to these resources :) – Ankit Khettry Sep 25 '19 at 12:28
  • This could be a little unfair to David, and I would accept both answers if I could. But since it's clear that this answer has helped more people, I accept yours! – Ankit Khettry Sep 25 '19 at 12:33
  • @JacekLaskowski Can you explain the benefits of this trick: rdd2.persist(StorageLevel.MEMORY_AND_DISK).count ? Even if we don't use it and simply have multiple actions on an RDD/DataFrame/Dataset, caching will happen the first time an action is called. – white-hawk-73 Jan 20 '20 at 10:41
  • You mentioned that "count operator is fairly cheap so the net effect is that the caching is executed almost immediately after the line". How does it matter whether caching is executed immediately or not? Nevertheless, an excellent answer – white-hawk-73 Jan 20 '20 at 10:44
  • @ak0817 Benefits listed. They came to my mind immediately and it's not to say it's an exhaustive list of all the possible benefits. Ask away if you want to learn more! Thanks. – Jacek Laskowski Jan 20 '20 at 13:50
  • @JacekLaskowski Thanks for the great explanation, but this line: `val q1 = spark.range(10).groupBy('id % 5).count.cache` seems to contradict what you mentioned about executing count AFTER persist/cache? – Jan33 Jul 06 '20 at 02:33
  • @Jan33 Thanks a lot for spotting this. It's a coincidence that the name `count` has two meanings in Spark SQL. I tried to elaborate on this in my answer. Let me know if you've got more questions (and whether this answer requires more). – Jacek Laskowski Jul 06 '20 at 15:45
4

Is it safe to conclude that the map transformation is executed when count (in the last line) is encountered, and not when persist is encountered?

Yes

Also, at what point is rdd2 actually persisted?

The data is read, mapped, and persisted all in one pass while executing the count statement.

would persist be considered a transformation or an action or neither?

It's not really either, but in terms of the processing work done you can think of it like a transformation. Spark is lazy and only does work when you ask for a result. No result is required when you persist a data frame, so Spark does no work. In that way, persist behaves like a transformation.
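One way to observe this is a sketch with an illustrative input path, using an accumulator to count how many times the map function actually runs (note that accumulators inside transformations can over-count if tasks are retried):

import org.apache.spark.storage.StorageLevel

val calls = sc.longAccumulator("map calls")
val rdd2 = sc.textFile("hdfs:///some/path").map { line =>
  calls.add(1)   // incremented once per record processed
  line
}
rdd2.persist(StorageLevel.MEMORY_AND_DISK)
println(calls.value)  // 0: persist alone did no work
rdd2.count
println(calls.value)  // one per record: map ran during count and the results were cached
rdd2.count
println(calls.value)  // unchanged: the second count was served from the cache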

David