
I want to make sure I understand Tachyon. Is using Tachyon with HDFS under it more or less equivalent to persisting an RDD with MEMORY_AND_DISK? In both cases, when the data overruns the available memory, it gets bumped off to the hard drive.

I understand the performance difference due to JVM garbage collection. I am only asking about the spill-over behavior.

dtolnay
bhomass

1 Answer


The recommended way to persist RDDs to disk is to use the local file system, not a DFS (check the SPARK_LOCAL_DIRS parameter). This is because Spark does not keep track of the data movements that the DFS performs. Also, the local file system is much faster than a DFS since there is no replication, etc.

In a cluster, Tachyon has the potential to use other nodes' memory for spill-over before writing the data to the (D)FS. So this is better if network + memory cost < disk cost.
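To make the "network + memory cost < disk cost" rule of thumb concrete, here is a rough back-of-the-envelope sketch in Python. It is not part of Spark or Tachyon: the function name `prefer_remote_memory` and all the bandwidth figures are hypothetical, and the model just divides the spilled bytes by a throughput for each hop, ignoring latency, contention, and serialization.

```python
def prefer_remote_memory(bytes_spilled, net_bw, mem_bw, disk_bw):
    """Estimate whether spilling to another node's memory beats local disk.

    All bandwidths are in bytes per second. This is a simplified,
    assumed cost model: transfer time = size / throughput.
    """
    # Cost of shipping the data over the network and writing it to remote RAM.
    remote_cost = bytes_spilled / net_bw + bytes_spilled / mem_bw
    # Cost of writing the same data to the local disk.
    disk_cost = bytes_spilled / disk_bw
    return remote_cost < disk_cost


# Hypothetical numbers: 1 GB spilled, ~10 GB/s memory, ~150 MB/s disk.
one_gb = 1e9
print(prefer_remote_memory(one_gb, net_bw=1.25e9, mem_bw=10e9, disk_bw=150e6))  # 10 GbE link: True
print(prefer_remote_memory(one_gb, net_bw=125e6, mem_bw=10e9, disk_bw=150e6))   # 1 GbE link: False
```

With these assumed figures, a fast network makes remote memory the cheaper target, while a slow one makes the local disk win, which is the trade-off described above.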

On a single node, I don't think Tachyon will bring any performance improvement other than removing GC overhead.

semihsahin