Memory usage for transformation on RDD's in alluxio/tachyon for spark

Question

Let's say we create an RDD from a file held in Alluxio memory:

rdd1 = sc.textFile("alluxio://.../file1.txt")
rdd2 = rdd1.map(...)

Does rdd2 reside in Alluxio or on Spark's heap?

Also, would an operation like pairRDD1.join(pairRDD2), with both pair RDDs in Alluxio, create the new RDD in Alluxio or on the Spark heap?

The reason for the second question is that I need to join two large RDDs, both in Alluxio. Would the join use Alluxio's memory, or would the RDDs be pulled into Spark memory for the join? And where would the resulting RDD reside?

The output of `map` is written to the OS buffer cache. The operating system then decides whether the data can stay in the buffer cache or should be spilled to disk. — RoyaumeIX, Jun 09 '16 at 08:06

score 2 · Accepted Answer · answered Jun 24 '16 at 14:16

Spark transformations are evaluated lazily. That means map() is not evaluated until a result is required, and until then it consumes no Spark memory. An RDD is only kept in Spark memory if you explicitly call cache() on it.
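To make the lazy-evaluation point concrete, here is a minimal pure-Python sketch. It is a toy model, not Spark's implementation: the `LazyDataset` class and the `parse` function are hypothetical names invented for this illustration, and a plain list stands in for the file in Alluxio.

```python
class LazyDataset:
    """Toy stand-in for an RDD: it stores how to compute rows, not the rows."""

    def __init__(self, compute):
        self._compute = compute        # zero-argument function returning an iterator
        self._cache_requested = False  # set by cache()
        self._cached = None            # filled by the first action after cache()

    def map(self, fn):
        # Transformation: only describes the new dataset, runs nothing yet.
        return LazyDataset(lambda: (fn(x) for x in self._iterate()))

    def cache(self):
        # Only marks the dataset; rows are stored when an action first runs.
        self._cache_requested = True
        return self

    def _iterate(self):
        if self._cached is not None:
            return iter(self._cached)
        rows = self._compute()
        if self._cache_requested:
            self._cached = list(rows)
            return iter(self._cached)
        return rows

    def collect(self):
        # Action: this is what actually triggers the computation.
        return list(self._iterate())


calls = []

def parse(line):
    calls.append(line)  # record every evaluation of the mapped function
    return line.upper()

source = LazyDataset(lambda: iter(["a", "b", "c"]))  # stands in for the source file

mapped = source.map(parse)
calls_after_map = len(calls)       # map() alone evaluated nothing

result = mapped.collect()
calls_after_collect = len(calls)   # the action evaluated every row

mapped.collect()
calls_after_recollect = len(calls) # without cache(), rows are recomputed

cached = source.map(parse).cache()
cached.collect()
cached.collect()
calls_after_cached = len(calls)    # with cache(), rows are computed only once more
```

In this model nothing is stored for `mapped` between actions, while `cached` keeps its rows after the first action, which mirrors the difference the answer describes.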

Therefore, when you are joining two RDDs from Alluxio, only the source data of the RDDs will be in memory, and that memory is Alluxio's. During the join, Spark will use whatever memory it needs to execute the join.

Where the resulting RDD resides depends on what you are doing with that RDD. If you are writing the resulting RDD out to a file, that RDD will not be fully materialized in Spark memory, but will be written out to the file. If that file is in Alluxio, it would be in Alluxio memory, and not Spark memory. The resulting RDD will only be in Spark memory if you explicitly call cache().
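The difference between memory used to execute a join and memory used to hold its result can be illustrated with a small pure-Python sketch. This is only an analogy under stated assumptions: `hash_join` is a hypothetical single-process helper, the two lists stand in for the pair RDDs, and a temporary local file stands in for the output file in Alluxio. Spark's real joins are distributed and work differently in detail.

```python
import os
import tempfile

def hash_join(left, right):
    """Yield (key, (left_value, right_value)) pairs one at a time."""
    # Working memory for the join itself: an index over one side.
    table = {}
    for key, value in left:
        table.setdefault(key, []).append(value)
    # The other side is streamed; joined rows are yielded, never collected.
    for key, other in right:
        for value in table.get(key, ()):
            yield (key, (value, other))

left = [("a", 1), ("b", 2)]
right = [("a", "x"), ("c", "y"), ("a", "z")]

joined = hash_join(left, right)  # a generator: no joined rows exist yet

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "joined.txt")
    with open(path, "w") as out:
        for key, (l_val, r_val) in joined:
            # Each row goes straight to the file and is then discarded.
            out.write(f"{key}\t{l_val}\t{r_val}\n")
    with open(path) as f:
        written = f.read().splitlines()
```

Here the process holds only the index while the join runs, and the full result exists only in the output file, which corresponds to writing the joined RDD out instead of calling cache() on it.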
