Unfortunately, it does seem to load rdd1 and rdd2 twice. I was hoping it wouldn't (the commenters really got my hopes up; thanks Soumya for mentioning narrow dependencies, I'll see if I can refactor my code to take advantage of them somehow). I assume future versions of Spark may add optimizations that eliminate the dual loading, but currently it does not.
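One workaround I can apply on my side is to explicitly cache the source RDDs before branching on them, so the branches read from the cache rather than re-reading the files. A minimal sketch of that tweak against the test code below (caveat: if the two branch stages happen to run concurrently, some partitions may still be computed twice before the cache is populated):

// Hypothetical tweak to the test code below: cache the file RDDs themselves, so the
// branches built on top of them (a/b from rdd1, c/d from rdd2) reuse the loaded
// partitions instead of re-reading the files when the graph is materialized.
val rdd1 = sc.textFile("c:/temp/vertices.txt").cache()
val rdd2 = sc.textFile("c:/temp/edges.txt").cache()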
Here is a simple experiment that proves the double loading. (The TrieMap and the AtomicInteger are just for illustration purposes; since this runs locally they work, but AFAIK they wouldn't on a cluster, even though both are Serializable :). In any case they are just a cherry on top: even without them, the printlns alone show each file RDD being computed twice.)
Explanation of what we see: this is just an elaboration of the code in the question. I create two file RDDs, apply branching transformations to them (a map on each, then a join, etc.), then build a Graph on top of these RDDs and cache it (it's cached by default, but I added an explicit call just to make it more readable). Then I call graph.triplets.collect, which evaluates the entire RDD DAG.
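As a lighter sanity check that doesn't require instrumenting HadoopRDD, the lineage can also be inspected with RDD.toDebugString; a minimal sketch, assuming the same graph value built in the full test code below:

// Prints the DAG of the triplets RDD without evaluating it; the textFile
// branches appear as HadoopRDD / MapPartitionsRDD ancestors in the lineage.
println(graph.triplets.toDebugString)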
(Environment: Spark 1.2.1, Scala 2.11.5, Windows 7 64-bit)
The files I used were very small, only 2 partitions each, so the printlns show that each file was loaded twice (each path + partition index combination appears twice):
**** read file: path: c:/temp/vertices.txt partition index: 1
**** read file: path: c:/temp/vertices.txt partition index: 0
**** read file: path: c:/temp/vertices.txt partition index: 1
**** read file: path: c:/temp/vertices.txt partition index: 0
**** read file: path: c:/temp/edges.txt partition index: 0
**** read file: path: c:/temp/edges.txt partition index: 1
**** read file: path: c:/temp/edges.txt partition index: 0
**** read file: path: c:/temp/edges.txt partition index: 1
It should have looked like this:
**** read file: path: c:/temp/vertices.txt partition index: 1
**** read file: path: c:/temp/vertices.txt partition index: 0
**** read file: path: c:/temp/edges.txt partition index: 0
**** read file: path: c:/temp/edges.txt partition index: 1
The full test code:
// scalastyle:off
import java.util.concurrent.atomic.AtomicInteger
import org.apache.hadoop.mapred.{FileInputFormat, InputFormat, JobConf}
import org.apache.log4j.{Level, Logger}
import org.apache.spark.SparkContext._
import org.apache.spark.graphx.{Edge, Graph}
import org.apache.spark.rdd.{HadoopRDD, RDD}
import org.apache.spark.{InterruptibleIterator, Partition, SerializableWritable, SparkContext, TaskContext}
import scala.collection.concurrent.TrieMap
object CacheTest {
// I think this only works when running locally ;) but still helps prove the point
val numFileWasRead = TrieMap[String, AtomicInteger]()
def main(args: Array[String]) {
Logger.getRootLogger.setLevel(Level.WARN)
val sc = new SparkContext("local[4]", "Cache Test") {
override def textFile(path: String, minPartitions: Int = defaultMinPartitions): RDD[String] =
super.textFile(path, minPartitions)
override def hadoopFile[K, V](
path: String,
inputFormatClass: Class[_ <: InputFormat[K, V]],
keyClass: Class[K],
valueClass: Class[V],
minPartitions: Int = defaultMinPartitions
): RDD[(K, V)] = {
// A Hadoop configuration can be about 10 KB, which is pretty big, so broadcast it.
val confBroadcast = broadcast(new SerializableWritable(hadoopConfiguration))
val setInputPathsFunc = (jobConf: JobConf) => FileInputFormat.setInputPaths(jobConf, path)
new HadoopRDD(
this,
confBroadcast,
Some(setInputPathsFunc),
inputFormatClass,
keyClass,
valueClass,
minPartitions) {
override def compute(theSplit: Partition, context: TaskContext): InterruptibleIterator[(K, V)] = {
val index = theSplit.index
if(index == 0) {
numFileWasRead.getOrElseUpdate(path, new AtomicInteger(0)).incrementAndGet()
}
println(s"\r**** read file: path: $path partition index: $index")
// Log the bytes read for this split, if input metrics are available
context.taskMetrics().inputMetrics.foreach(metrics => println(metrics.bytesRead))
super.compute(theSplit, context)
}
}.setName(path)
}
}
val vFileName = "c:/temp/vertices.txt"
val eFileName = "c:/temp/edges.txt"
val rdd1 = sc.textFile(vFileName)
val rdd2 = sc.textFile(eFileName)
val a = rdd1.map(x => {
val xLong = x.toLong
xLong -> xLong * 2
})
val b = rdd1.map(x => {
val xLong = x.toLong
xLong -> xLong * 2
})
val c = for {
row <- rdd2
Array(left, _) = row.split(" ")
} yield {
left.toLong
}
sc.setJobGroup("mapping rdd2 to d", "")
val d = for {
row <- rdd2
Array(_, right) = row.split(" ")
} yield {
right.toLong
}
val vertices = a.join(b).map(x => x._1 -> "foo")
val edges = c zip d map {
case (left, right) => Edge(left, right, "N/A")
}
val graph = Graph(vertices, edges) // graph is automatically caching vertices and edges
graph.cache() // this is a futile call, just in case you don't believe me (look at Graph's source...)
val rdds = List[RDD[_]](rdd1, rdd2, a, b, c, d, vertices, edges, graph.vertices, graph.edges, graph.triplets)
val rddsNames = List("rdd1", "rdd2", "a", "b", "c", "d", "vertices", "edges", "graph.vertices", "graph.edges", "graph.triplets")
val rddNameById = (rdds zip rddsNames).map(x => x._1.id -> x._2).toMap
def printCachedInformation(intro: String): Unit = {
println("\n\n" + intro.toUpperCase + "\n\n")
def displayRDDName(id: Int): String = {
rddNameById.getOrElse(id, "N/A") + s"(" + id + ")"
}
println("sc.getPersistentRDDs: \n" + sc.getPersistentRDDs.map(x => {
val id = x._1
displayRDDName(id) -> x._2
}).mkString("\n"))
val storageInfo = sc.getRDDStorageInfo
val storageInfoString = if (storageInfo.isEmpty) " Empty "
else storageInfo.map(x => {
val id = x.id
displayRDDName(id) -> x
}).mkString("\n")
println("sc.getRDDStorageInfo: \n" + storageInfoString)
}
printCachedInformation("before collect")
println("\n\nCOLLECTING...\n\n")
graph.triplets.collect()
printCachedInformation("after collect")
//subsequent calls to collect will take it from the Graph's cache so no point in continuing
println("\n\nSUMMARY\n\n")
for((file, timesRead) <- numFileWasRead) {
println(s"file: $file was read ${timesRead.get()} times")
}
}
}
The output:
BEFORE COLLECT
sc.getPersistentRDDs:
(N/A(23),VertexRDD, VertexRDD ZippedPartitionsRDD2[23] at zipPartitions at VertexRDD.scala:296)
(N/A(26),EdgeRDD MapPartitionsRDD[26] at mapPartitions at EdgeRDDImpl.scala:108)
(N/A(16),EdgeRDD, EdgeRDD MapPartitionsRDD[16] at mapPartitionsWithIndex at EdgeRDD.scala:104)
sc.getRDDStorageInfo:
Empty
COLLECTING...
**** read file: path: c:/temp/vertices.txt partition index: 1
**** read file: path: c:/temp/vertices.txt partition index: 0
**** read file: path: c:/temp/vertices.txt partition index: 1
**** read file: path: c:/temp/vertices.txt partition index: 0
**** read file: path: c:/temp/edges.txt partition index: 0
**** read file: path: c:/temp/edges.txt partition index: 1
**** read file: path: c:/temp/edges.txt partition index: 0
**** read file: path: c:/temp/edges.txt partition index: 1
AFTER COLLECT
sc.getPersistentRDDs:
(N/A(23),VertexRDD, VertexRDD ZippedPartitionsRDD2[23] at zipPartitions at VertexRDD.scala:296)
(N/A(26),EdgeRDD MapPartitionsRDD[26] at mapPartitions at EdgeRDDImpl.scala:108)
(N/A(16),EdgeRDD, EdgeRDD MapPartitionsRDD[16] at mapPartitionsWithIndex at EdgeRDD.scala:104)
sc.getRDDStorageInfo:
(N/A(23),RDD "VertexRDD, VertexRDD" (23) StorageLevel: StorageLevel(false, true, false, true, 1); CachedPartitions: 2; TotalPartitions: 2; MemorySize: 3.0 KB; TachyonSize: 0.0 B; DiskSize: 0.0 B)
(N/A(26),RDD "EdgeRDD" (26) StorageLevel: StorageLevel(false, true, false, true, 1); CachedPartitions: 2; TotalPartitions: 2; MemorySize: 5.5 KB; TachyonSize: 0.0 B; DiskSize: 0.0 B)
(N/A(16),RDD "EdgeRDD, EdgeRDD" (16) StorageLevel: StorageLevel(false, true, false, true, 1); CachedPartitions: 2; TotalPartitions: 2; MemorySize: 5.5 KB; TachyonSize: 0.0 B; DiskSize: 0.0 B)
SUMMARY
file: c:/temp/edges.txt was read 2 times
file: c:/temp/vertices.txt was read 2 times
Process finished with exit code 0
The input:
edges.txt
1 2
2 3
3 4
4 1
2 5
5 6
1 3
3 6
6 1
1 7
7 8
8 4
8 9
9 10
10 11
11 12
12 9
12 13
13 14
14 15
15 16
16 17
17 18
18 19
19 20
20 18
19 17
vertices.txt
1
2
3
4
2
5
1
3
6
1
7
8
8
9
10
11
12
12
13
14
15
16
17
18
19
20
19