Spark persist MEMORY_AND_DISK vs. Tachyon

Question

I want to make sure I understand Tachyon. Is using Tachyon with HDFS under it more or less equivalent to persisting an RDD with MEMORY_AND_DISK? In both cases, when the data overruns the available memory, it gets bumped off to the hard drive.

I understand the performance difference due to JVM garbage collection. I am only asking about the spill-over behavior.

score 1 · Accepted Answer · answered Feb 20 '17 at 16:39

The recommended way to persist RDDs to disk is to use the local file system, not a distributed file system (see the SPARK_LOCAL_DIRS parameter). This is because Spark does not keep track of the data movements that a DFS performs. Also, the local FS is much faster than a DFS, since there is no replication and so on.
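For illustration, here is a minimal config sketch of pointing Spark's scratch space (which is where spilled MEMORY_AND_DISK blocks land) at local disks. The directory paths are hypothetical placeholders, and some cluster managers (for example YARN) override this setting with their own local directories.

```
# conf/spark-defaults.conf
# Comma-separated list of local directories; the paths below are assumptions.
spark.local.dir    /mnt/disk1/spark-tmp,/mnt/disk2/spark-tmp

# Equivalent environment variable, set in conf/spark-env.sh:
# export SPARK_LOCAL_DIRS=/mnt/disk1/spark-tmp,/mnt/disk2/spark-tmp
```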

In a cluster, Tachyon has the potential to use other nodes' memory for spill-over before writing the data to the (distributed) file system. So this is better if network + memory cost < disk cost.

On a single node, I don't think Tachyon will bring any performance improvement other than removing GC overhead.
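The inequality above can be made concrete with a rough back-of-the-envelope sketch. This is only a toy model: the function name, its parameters, and the throughput figures are all assumptions for illustration, and it ignores latency, contention, and serialization cost.

```python
def cheaper_spill_target(block_mb, network_mb_per_s, remote_mem_mb_per_s, disk_mb_per_s):
    """Compare estimated time to spill one block to remote memory vs. local disk.

    Toy model (assumption): cost is just size divided by throughput.
    """
    # Remote memory: the block crosses the network, then is written to RAM.
    remote_cost_s = block_mb / network_mb_per_s + block_mb / remote_mem_mb_per_s
    # Local disk: the block is written straight to the local file system.
    disk_cost_s = block_mb / disk_mb_per_s
    return "remote_memory" if remote_cost_s < disk_cost_s else "local_disk"


# Hypothetical numbers: fast network and slow spinning disk.
print(cheaper_spill_target(1000, 1000, 10000, 150))
# Hypothetical numbers: slow network and fast local SSD.
print(cheaper_spill_target(1000, 100, 10000, 500))
```

With the first set of numbers the remote-memory path wins; with the second the local disk does, which is the trade-off the answer describes.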
