How to enable lineage-based fault tolerance for Spark-Tachyon integration?

Question

I am trying to implement RDD/DataFrame sharing using Tachyon. My understanding is that with an HDFS underFS, writes are asynchronous (replication to HDFS happens behind the scenes) and should therefore be faster. In my testing, however, Tachyon with an HDFS underFS is 2-6 times slower at writing.

From this Tachyon paper I see that:

"We made [lineage-based fault tolerance] configurable in our Spark and MapReduce integration"

How do you enable Spark to use lineage-based fault tolerance in Tachyon?

Note: I am using the Spark DataFrame method df.write.parquet and the RDD method rdd.saveAsObjectFile to save my DataFrames/RDDs to Tachyon.

zero323 · Answer 1 · 2015-12-11T14:23:24.083

You should set tachyon.user.lineage.enabled to true and adjust other lineage settings according to your preferences. Some of the most interesting settings (from the Master Configuration docs):

tachyon.master.lineage.checkpoint.interval.ms - The interval (in milliseconds) between Tachyon's checkpoint scheduling.

tachyon.master.lineage.checkpoint.class - The class name of the checkpoint strategy for lineage output files. The default strategy is to checkpoint the latest completed lineage, i.e. the lineage whose output files are completed.

tachyon.master.lineage.recompute.interval.ms - The interval (in milliseconds) between Tachyon's recompute executions. The executor scans all the lost files tracked by lineage and re-executes the corresponding jobs (every 10 minutes by default).

See Lineage API docs for more details.
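
As a rough sketch of the Spark side: one common way to hand a JVM system property to a client library running inside Spark is through Spark's extraJavaOptions settings. The fragment below assumes that the Tachyon client picks up tachyon.user.lineage.enabled from Java system properties. That is an assumption, not something confirmed above, so check it against the configuration docs for your Tachyon version.

```
# spark-defaults.conf (sketch; assumes the Tachyon client reads JVM system properties)
# Enable lineage on the driver-side Tachyon client.
spark.driver.extraJavaOptions    -Dtachyon.user.lineage.enabled=true
# Enable lineage on the Tachyon clients running inside each executor.
spark.executor.extraJavaOptions  -Dtachyon.user.lineage.enabled=true
```

The tachyon.master.lineage.* settings listed above are master-side properties, so they would go in the Tachyon master's own configuration rather than in Spark's.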
