
Consider the following example:

val rdd1 = sc.textFile(...)
val rdd2 = sc.textFile(...)

val a = rdd1.doSomeTransformation
val b = rdd1.doAnotherTransformation 

val c = rdd2.doSomeTransformation
val d = rdd2.doAnotherTransformation 

//nonsense code, just to illustrate that it's all part of a big DAG (or so I think)
val vertices = a.join(b)

val edges = c.join(d) //corrected (thanks Justin!)

val graph = new Graph(vertices, edges) //or something like this 

graph.cache()

graph.triplets.collect() // first "materialization"

graph.triplets.collect() // second "materialization"

My questions are:

If I don't cache rdd1 and rdd2, will they be reloaded twice each during the "first materialization"?

If I do cache them, won't that kind of duplicate the data? Is there a way to cache the data only temporarily, e.g. keep a partition cached only until the graph is cached, and once the graph is fully cached, unpersist the RDDs that created it? Is that possible?
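
For reference, here is a minimal way to inspect the lineage yourself (a sketch only - the map calls and the path are placeholders standing in for doSomeTransformation / doAnotherTransformation above):

val rdd1 = sc.textFile("...")       // placeholder path
val a = rdd1.map(_.length)          // stands in for doSomeTransformation
val b = rdd1.map(_.toUpperCase)     // stands in for doAnotherTransformation
println(a.toDebugString)            // both lineages end at the same textFile scan,
println(b.toDebugString)            // but that alone doesn't say whether the scan runs once or twice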

EDIT: removed bloated verbosity and focused the question on a single topic.

Eran Medan
  • As to your second question, why would the RDDs be loaded twice each? What lines of code make you think that? – Justin Pihony Mar 09 '15 at 00:42
  • As @Justin said, the RDDs won't be reloaded twice because all transformations are *lazy*; Spark creates a DAG of all the transformations and only materializes it when an action is called. – Soumya Simanta Mar 09 '15 at 01:48
  • @JustinPihony these lines: `val a = rdd1.doSomeTransformation` and `val b = rdd1.doAnotherTransformation`. I'm not doing `rdd.transformation1(..).transformation2`, I'm doing `rdd.transformation1` and then `rdd.transformation2`. If `rdd` is not cached, it will have to reload, no? e.g. `val a = rdd.map(_ * 2); val b = rdd.map(_ * 3); a.collect(); b.collect()` - if `rdd` is not cached, it will be loaded twice. Or did I miss how Spark works at all? – Eran Medan Mar 09 '15 at 13:25
  • @SoumyaSimanta yes, I know it's lazy, but if you load a file into an RDD and access the RDD more than once without caching it, once you call the actions won't it be loaded twice? Isn't that what caching is for? Or am I missing something basic? e.g. `val rdd = sc.textFile("...") ; rdd.map(...).collect() ; rdd.filter(...).collect()` -> this will load the file twice, won't it? To avoid it I need to do: `val rdd = sc.textFile("...") ; rdd.cache(); rdd.map(...).collect() ; rdd.filter(...).collect()` -> this will only read the file once. Correct? – Eran Medan Mar 09 '15 at 13:30
  • @EranMedan if you do `val a = rdd.trans1` and then `val b = rdd.trans2` then Spark will not load `rdd` twice because both are *transformations*. Effectively Spark can pipeline these two transformations so that they look like `rdd.trans1.trans2` – Soumya Simanta Mar 09 '15 at 14:44
  • Sorry @SoumyaSimanta, but I don't understand how what you are saying is possible. An RDD is immutable, so how can `rdd.trans1; rdd.trans2` be pipelined into `rdd.trans1.trans2`? They are forked, not piped - am I missing something? `val a = rdd.map(_ * 2) ; val b = rdd.map(_ * 3)` is not the same as `rdd.map(_ * 2).map(_ * 3)`. The first gives you two different RDDs: if the original RDD contained 1,2,3,4,5 then I expect a to be 2,4,6,8,10 and b to be 3,6,9,12,15, while piping would mean a single RDD with 6,12,18,24,30 (see the sketch after these comments). You are saying that Spark can pipeline these? Can you please explain? – Eran Medan Mar 09 '15 at 19:24
  • Please see slides 14, 15 and 16 here, mainly about *narrow dependencies*: http://www.cs.cmu.edu/~pavlo/courses/fall2013/static/slides/spark.pdf. The Spark scheduler can *pipeline* these together on each worker node because they are lazy. When an *action* is performed, all these pipelined transformations are materialized. – Soumya Simanta Mar 09 '15 at 23:29
  • Let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/72624/discussion-between-eran-medan-and-soumya-simanta). – Eran Medan Mar 10 '15 at 00:33
  • @SoumyaSimanta I was really hoping you are right, and perhaps current Spark doesn't handle this specific case, but I understand your point. Please see my updated answer. – Eran Medan Mar 10 '15 at 01:38
  • @JustinPihony - please see my updated answer, would love to hear your feedback (and whether I missed anything, I'm pretty new to Spark) – Eran Medan Mar 10 '15 at 01:41
  • To the downvoter who removed the downvote - you are awesome, and I wish you long and prosperous life :) thanks for initially downvoting and forcing me to edit the question to be much less bloated... you have my respect, reversing a downvote is an honourable act. – Eran Medan Mar 10 '15 at 02:01
  • 1
    @EranMedan I will take a look later tonight. Thanks – Justin Pihony Mar 10 '15 at 02:26
  • 1
    I assume you meant c.join(d) – Justin Pihony Mar 10 '15 at 04:04
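
A minimal sketch of the forking case debated in these comments (local toy data, assuming a plain SparkContext sc; the values are only illustrative):

val rdd = sc.parallelize(Seq(1, 2, 3, 4, 5))
val a = rdd.map(_ * 2)   // one fork: 2, 4, 6, 8, 10
val b = rdd.map(_ * 3)   // another fork off the same rdd: 3, 6, 9, 12, 15
a.collect()              // first action: computes rdd's lineage
b.collect()              // second action: recomputes the same lineage unless rdd was cached
// calling rdd.cache() before the first collect() would let the second action reuse
// the in-memory partitions instead of recomputing (or re-reading) the source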

2 Answers


You are correct that this will run twice as the DAG would be something like this:

a = textFile1->doSomeTransformation
b = textFile1->doAnotherTransformation
c = textFile2->doSomeTransformation
d = textFile2->doAnotherTransformation
vertices = textFile1->doSomeTransformation | textFile1->doAnotherTransformation
edges = textFile2->doSomeTransformation | textFile2->doAnotherTransformation

Note that yes, there is commonality, but AFAIK Spark does not handle that when it comes to a join. Spark SQL might, in the Catalyst optimization portion, but I am very doubtful. Part of the reason is that implicit caching of data could mess up memory storage calculations and evict cached data you expected to be there. Your best bet would be to rewrite it as follows:

val rdd1 = sc.textFile(...)
             .cache()
val rdd2 = sc.textFile(...)
             .cache()

val a = rdd1.doSomeTransformation
val b = rdd1.doAnotherTransformation 

val c = rdd2.doSomeTransformation
val d = rdd2.doAnotherTransformation 


val vertices = a.join(b)
val edges = c.join(d)
val graph = new Graph(vertices, edges) //or something like this 
graph.cache()

graph.triplets.collect() // first "materialization"
graph.triplets.collect() // second "materialization"

rdd1.unpersist()
rdd2.unpersist()

I will double check, but there should not be the double caching you are worried about: the graph.cache will piggy-back off of the textFile caches.
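
If you want to check which RDDs actually end up pinned in memory, here is a quick sketch (using the same sc.getPersistentRDDs call as the experiment in the other answer):

graph.triplets.collect()   // materialize once
sc.getPersistentRDDs.values.foreach { r =>
  println(s"cached: id=${r.id}, name=${r.name}, storage=${r.getStorageLevel}")
}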

That said, now that I focus on the fact that you are NOT chaining but instead performing different calculations on the same RDD, automatic sharing is an interesting idea that could be turned on in a config or something. But there are a lot of corner cases to such a feature (does it persist only for that DAG, or should it anticipate that future calls might be made?). It would have to be something like spark.optimization.cacheDAGCommonalities.

All that being said, if an RDD is "hot" I have seen its load time drop dramatically on subsequent requests (i.e. textFile1 takes 10 min the first time, but only 3-4 min on the next iteration).

Justin Pihony
  • Thank you for this, Justin. What I would hope Spark to do in the future is, for each partition of rdd1 and rdd2, get rid of it the minute it has been transformed into the graph, but hold it until then. In your example, I had to first make sure that the graph is loaded before calling unpersist on the RDDs (if I did it before calling triplets.collect, I would most likely unpersist something that was not yet even cached, right?). What I really want is `graph.onPartitionCacheLoaded { p => p.dag.prev(recursive=true).unpersistPartition(p.partitionIndex) }` or something in that shape :) (a coarse approximation is sketched after these comments) – Eran Medan Mar 10 '15 at 05:10
  • p.s. from my little experience running on large clusters, doing these transformations, and watching executor memory in Spark's web view - it seems that, at least in 1.1.0 and earlier, it doesn't piggy-back on caches well. I added caching at each step of the way, just in case (I had a couple of TB of RAM on the EC2 cluster, I was lavish...), and it ended up, from what it seems, just stacking one cache on top of the other without much grace... but I hope I'm wrong... – Eran Medan Mar 10 '15 at 05:15
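
A coarse-grained approximation of that per-partition hand-off, using only the public API (a sketch: we can only act at whole-RDD granularity, so the source caches are held until the whole graph is materialized):

graph.vertices.count()           // force the graph's own caches to fill
graph.edges.count()
rdd1.unpersist(blocking = false) // only now is it safe to drop the source caches
rdd2.unpersist(blocking = false)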

Unfortunately, it does seem to load rdd1 and rdd2 twice. I was hoping it wouldn't (the commenters really got my hopes up - thanks Soumya for mentioning narrow dependencies; I'll try to see if I can refactor my code to take advantage of them somehow). I assume future versions of Spark may add optimizations that eliminate the dual loading, but currently it doesn't seem to do so.

Here is a simple experiment that proves it. (The TrieMap and the AtomicInteger are just for illustration purposes; since this runs locally, they won't work on a cluster AFAIK, even though both are Serializable :). In any case, the counter is just a cherry on top - even without it, the printlns plainly show each file RDD being computed twice.)

Explanation of what we see: this is just an elaboration of the code in the question. I create 2 file RDDs, do branching transformations on them (a map on each, then a join, etc.), then build a Graph on top of these RDDs and cache it (it's cached by default, but I added an explicit call just to make it more readable).

Then I call graph.triplets.collect which loads the entire RDD DAG.

(Env: Spark 1.2.1, Scala 2.11.5, Windows 7 64-bit)

The files I used were very small (only 2 partitions each), so the printlns show that each file was loaded twice (we see each path + partition index combination appearing twice):


**** read file: path: c:/temp/vertices.txt partition index: 1
**** read file: path: c:/temp/vertices.txt partition index: 0
**** read file: path: c:/temp/vertices.txt partition index: 1
**** read file: path: c:/temp/vertices.txt partition index: 0
**** read file: path: c:/temp/edges.txt partition index: 0
**** read file: path: c:/temp/edges.txt partition index: 1
**** read file: path: c:/temp/edges.txt partition index: 0
**** read file: path: c:/temp/edges.txt partition index: 1

It should have looked like this:


**** read file: path: c:/temp/vertices.txt partition index: 1
**** read file: path: c:/temp/vertices.txt partition index: 0
**** read file: path: c:/temp/edges.txt partition index: 0
**** read file: path: c:/temp/edges.txt partition index: 1


The full test code:

// scalastyle:off

import java.util.concurrent.atomic.AtomicInteger

import org.apache.hadoop.mapred.{FileInputFormat, InputFormat, JobConf}
import org.apache.log4j.{Level, Logger}
import org.apache.spark.SparkContext._
import org.apache.spark.graphx.{Edge, Graph}
import org.apache.spark.rdd.{HadoopRDD, RDD}
import org.apache.spark.{InterruptibleIterator, Partition, SerializableWritable, SparkContext, TaskContext}

import scala.collection.concurrent.TrieMap

object CacheTest {

  // I think this only works when running locally ;) but still helps prove the point
  val numFileWasRead = TrieMap[String, AtomicInteger]()

  def main(args: Array[String]) {
    Logger.getRootLogger.setLevel(Level.WARN)

    val sc = new SparkContext("local[4]", "Cache Test") {
      override def textFile(path: String, minPartitions: Int = defaultMinPartitions): RDD[String] =
        super.textFile(path, minPartitions)


      override def hadoopFile[K, V](
                                     path: String,
                                     inputFormatClass: Class[_ <: InputFormat[K, V]],
                                     keyClass: Class[K],
                                     valueClass: Class[V],
                                     minPartitions: Int = defaultMinPartitions
                                     ): RDD[(K, V)] = {

        // A Hadoop configuration can be about 10 KB, which is pretty big, so broadcast it.
        val confBroadcast = broadcast(new SerializableWritable(hadoopConfiguration))
        val setInputPathsFunc = (jobConf: JobConf) => FileInputFormat.setInputPaths(jobConf, path)
        new HadoopRDD(
          this,
          confBroadcast,
          Some(setInputPathsFunc),
          inputFormatClass,
          keyClass,
          valueClass,
          minPartitions) {
          override def compute(theSplit: Partition, context: TaskContext): InterruptibleIterator[(K, V)] = {

            val index = theSplit.index
            if(index == 0) {
              numFileWasRead.getOrElseUpdate(path, new AtomicInteger(0)).incrementAndGet()
            }
            println(s"\r**** read file: path: $path partition index: $index")
            // print the bytes read so far for this task, if input metrics are available
            context.taskMetrics().inputMetrics.foreach(metrics =>
              println(metrics.bytesRead))

            super.compute(theSplit, context)
          }
        }.setName(path)
      }
    }
    val vFileName = "c:/temp/vertices.txt"
    val eFileName = "c:/temp/edges.txt"
    val rdd1 = sc.textFile(vFileName)
    val rdd2 = sc.textFile(eFileName)

    val a = rdd1.map(x => {
      val xLong = x.toLong
      xLong -> xLong * 2
    })

    val b = rdd1.map(x => {
      val xLong = x.toLong
      xLong -> xLong * 2
    })

    val c = for {
      row <- rdd2
      Array(left, _) = row.split(" ")
    } yield {
      left.toLong
    }

    sc.setJobGroup("mapping rdd2 to d", "")

    val d = for {
      row <- rdd2
      Array(_, right) = row.split(" ")
    } yield {
      right.toLong
    }

    val vertices = a.join(b).map(x => x._1 -> "foo")

    val edges = c zip d map {
      case (left, right) => Edge(left, right, "N/A")
    }
    val graph = Graph(vertices, edges) // graph is automatically caching vertices and edges
    graph.cache() //this is a futile call, just in case you don't believe me (look at Graph's source...)

    val rdds = List[RDD[_]](rdd1,    rdd2,   a,   b,   c,   d,   vertices,   edges,   graph.vertices,   graph.edges,  graph.triplets)
    val rddsNames =    List("rdd1", "rdd2", "a", "b", "c", "d", "vertices", "edges", "graph.vertices", "graph.edges", "graph.triplets")
    val rddNameById = (rdds zip rddsNames).map(x => x._1.id -> x._2).toMap

    def printCachedInformation(intro: String): Unit = {
      println("\n\n" + intro.toUpperCase + "\n\n")
      def displayRDDName(id: Int): String = {
        rddNameById.getOrElse(id, "N/A") + s"(" + id + ")"
      }
      println("sc.getPersistentRDDs: \n" + sc.getPersistentRDDs.map(x => {
        val id = x._1
        displayRDDName(id) -> x._2
      }).mkString("\n"))
      val storageInfo = sc.getRDDStorageInfo
      val storageInfoString = if (storageInfo.isEmpty) " Empty "
      else storageInfo.map(x => {
        val id = x.id
        displayRDDName(id) -> x
      }).mkString("\n")
      println("sc.getRDDStorageInfo: \n" + storageInfoString)
    }

    printCachedInformation("before collect")
    println("\n\nCOLLECTING...\n\n")
    graph.triplets.collect()
    printCachedInformation("after collect")
    //subsequent calls to collect will take it from the Graph's cache so no point in continuing

    println("\n\nSUMMARY\n\n")

    for((file, timesRead) <- numFileWasRead) {
      println(s"file: $file was read ${timesRead.get()} times")
    }

  }
}

The output



BEFORE COLLECT


sc.getPersistentRDDs: 
(N/A(23),VertexRDD, VertexRDD ZippedPartitionsRDD2[23] at zipPartitions at VertexRDD.scala:296)
(N/A(26),EdgeRDD MapPartitionsRDD[26] at mapPartitions at EdgeRDDImpl.scala:108)
(N/A(16),EdgeRDD, EdgeRDD MapPartitionsRDD[16] at mapPartitionsWithIndex at EdgeRDD.scala:104)
sc.getRDDStorageInfo: 
 Empty 


COLLECTING...


**** read file: path: c:/temp/vertices.txt partition index: 1
**** read file: path: c:/temp/vertices.txt partition index: 0
**** read file: path: c:/temp/vertices.txt partition index: 1
**** read file: path: c:/temp/vertices.txt partition index: 0
**** read file: path: c:/temp/edges.txt partition index: 0
**** read file: path: c:/temp/edges.txt partition index: 1
**** read file: path: c:/temp/edges.txt partition index: 0
**** read file: path: c:/temp/edges.txt partition index: 1


AFTER COLLECT


sc.getPersistentRDDs: 
(N/A(23),VertexRDD, VertexRDD ZippedPartitionsRDD2[23] at zipPartitions at VertexRDD.scala:296)
(N/A(26),EdgeRDD MapPartitionsRDD[26] at mapPartitions at EdgeRDDImpl.scala:108)
(N/A(16),EdgeRDD, EdgeRDD MapPartitionsRDD[16] at mapPartitionsWithIndex at EdgeRDD.scala:104)
sc.getRDDStorageInfo: 
(N/A(23),RDD "VertexRDD, VertexRDD" (23) StorageLevel: StorageLevel(false, true, false, true, 1); CachedPartitions: 2; TotalPartitions: 2; MemorySize: 3.0 KB; TachyonSize: 0.0 B; DiskSize: 0.0 B)
(N/A(26),RDD "EdgeRDD" (26) StorageLevel: StorageLevel(false, true, false, true, 1); CachedPartitions: 2; TotalPartitions: 2; MemorySize: 5.5 KB; TachyonSize: 0.0 B; DiskSize: 0.0 B)
(N/A(16),RDD "EdgeRDD, EdgeRDD" (16) StorageLevel: StorageLevel(false, true, false, true, 1); CachedPartitions: 2; TotalPartitions: 2; MemorySize: 5.5 KB; TachyonSize: 0.0 B; DiskSize: 0.0 B)


SUMMARY


file: c:/temp/edges.txt was read 2 times
file: c:/temp/vertices.txt was read 2 times

Process finished with exit code 0

The input

edges.txt


1 2
2 3
3 4
4 1
2 5
5 6
1 3
3 6
6 1
1 7
7 8
8 4
8 9
9 10
10 11
11 12
12 9
12 13
13 14
14 15
15 16
16 17
17 18
18 19
19 20
20 18
19 17

vertices.txt

1
2
3
4
2
5
1
3
6
1
7
8
8
9
10
11
12
12
13
14
15
16
17
18
19
20
19
Eran Medan