I am trying to implement RDD/DataFrame sharing using Tachyon. My understanding is that with an HDFS underFS, writes are asynchronous (replication to HDFS happens behind the scenes) and should therefore be faster. In my testing, however, Tachyon with an HDFS underFS is 2-6 times slower at writing.
From this Tachyon paper I see that:
"We made [lineage-based fault tolerance] configurable in our Spark and MapReduce integration"
How do you enable Spark to use lineage-based fault tolerance in Tachyon?
Note: I am using the Spark DataFrame method df.write.parquet and the RDD method rdd.saveAsObjectFile to save my DataFrames/RDDs to Tachyon.
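For reference, here is a minimal sketch of how the two write calls could be timed side by side. It assumes the baseline for the comparison is a direct write to HDFS, which may not match every setup. The object name WriteTiming, the helper names, and the output file names are hypothetical and only there for illustration; df.write.parquet and rdd.saveAsObjectFile are the only Spark calls used.

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.DataFrame

// Hypothetical helper object, not part of Spark or Tachyon.
object WriteTiming {
  // Runs `block` once and returns its result together with the
  // elapsed wall-clock time in milliseconds.
  def timed[A](block: => A): (A, Double) = {
    val start = System.nanoTime()
    val result = block
    (result, (System.nanoTime() - start) / 1e6)
  }

  // Writes the same DataFrame and RDD under each base URI and prints the timings.
  // Both writes fail if the target path already exists, so use fresh
  // directories for every run.
  def compareWrites[T](df: DataFrame, rdd: RDD[T],
                       tachyonBase: String, hdfsBase: String): Unit = {
    for (base <- Seq(tachyonBase, hdfsBase)) {
      val (_, dfMs)  = timed { df.write.parquet(s"$base/df.parquet") }
      val (_, rddMs) = timed { rdd.saveAsObjectFile(s"$base/rdd.obj") }
      println(f"$base: parquet $dfMs%.0f ms, object file $rddMs%.0f ms")
    }
  }
}
```

It would be called along the lines of WriteTiming.compareWrites(df, rdd, "tachyon://localhost:19998/tmp/bench", "hdfs://localhost:9000/tmp/bench"), where both URIs are placeholders for the actual Tachyon master and HDFS namenode addresses. A single run like this only gives a rough indication; repeating it a few times would give steadier numbers.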