
I have a Spark process that writes Parquet files to HDFS. My guess is that, by default, if Spark has some failure and retries, it could write some files twice (am I wrong?).

But then, what should I do to get idempotence on the HDFS output?

I see two situations that should be considered separately (but please correct me or elaborate if you know better; a minimal sketch of the kind of job I have in mind is shown after the list):

  1. the failure happens while writing one item: I guess the write is restarted, so it could write twice if the write to HDFS is not "atomic" with respect to the Spark write call. What are the chances of that?
  2. the failure happens somewhere else, but due to how the execution DAG is built, the restart happens at a task that comes before several write tasks (I am thinking of having to restart before some groupBy, for example), and some of those write tasks were already done. Does Spark's execution guarantee that those tasks won't be called again?
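
For concreteness, here is a minimal sketch of the kind of job I have in mind (the path, schema and groupBy are made up and just stand in for my real job):

    import org.apache.spark.sql.SparkSession

    object WriteParquetJob {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().appName("write-parquet").getOrCreate()
        import spark.implicits._

        val df = Seq((1, "a"), (2, "b"), (3, "a")).toDF("id", "key")

        // The question: if a task (or a whole stage) fails after some output files
        // have already been written and is then retried, can duplicate files end up
        // in the target directory?
        df.groupBy("key").count()
          .write
          .parquet("hdfs:///data/output/example")

        spark.stop()
      }
    }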

1 Answer


I think it depends on what kind of committer you use for your job and whether that committer is able to undo the failed job or not. For example, when you use Apache Parquet-formatted output, Spark expects the Parquet committer to be a subclass of ParquetOutputCommitter, and if you use DirectParquetOutputCommitter (for example, to append data), it is not able to undo the job (see its code).
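
To make the committer choice concrete, here is a sketch of how it can be set when building the session (spark.sql.parquet.output.committer.class is the Spark SQL setting for this and expects a subclass of ParquetOutputCommitter; the class name below is the usual default, but check the exact names available in your Spark/Parquet versions). You can paste it into spark-shell or wrap it in your own main:

    import org.apache.spark.sql.SparkSession

    // Default, "undoable" behaviour: tasks write under a _temporary directory and
    // the committer renames files into place only when the whole job commits.
    val spark = SparkSession.builder()
      .appName("committer-example")
      .config("spark.sql.parquet.output.committer.class",
              "org.apache.parquet.hadoop.ParquetOutputCommitter")
      .getOrCreate()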

If you use ParquetOutputCommitter itself, you can see that it extends FileOutputCommitter and overrides the commitJob(JobContext jobContext) method a bit.
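
Just to show the shape of that, here is a hedged sketch of what subclassing ParquetOutputCommitter and overriding commitJob looks like (this is not the real ParquetOutputCommitter source, and the class name is made up):

    import org.apache.hadoop.fs.Path
    import org.apache.hadoop.mapreduce.{JobContext, TaskAttemptContext}
    import org.apache.parquet.hadoop.ParquetOutputCommitter

    // A committer that logs around the normal Parquet/FileOutputCommitter commit,
    // which renames files out of _temporary and writes the _SUCCESS marker.
    class LoggingParquetCommitter(outputPath: Path, context: TaskAttemptContext)
        extends ParquetOutputCommitter(outputPath, context) {

      override def commitJob(jobContext: JobContext): Unit = {
        println(s"Committing job ${jobContext.getJobID} to $outputPath")
        super.commitJob(jobContext)
      }
    }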

The following is quoted from Hadoop: The Definitive Guide:

OutputCommitter API: The setupJob() method is called before the job is run, and is typically used to perform initialization. For FileOutputCommitter, the method creates the final output directory, ${mapreduce.output.fileoutputformat.outputdir}, and a temporary working space for task output, _temporary, as a subdirectory underneath it. If the job succeeds, the commitJob() method is called, which in the default file-based implementation deletes the temporary working space and creates a hidden empty marker file in the output directory called _SUCCESS to indicate to filesystem clients that the job completed successfully. If the job did not succeed, abortJob() is called with a state object indicating whether the job failed or was killed (by a user, for example). In the default implementation, this will delete the job’s temporary working space.

The operations are similar at the task level. The setupTask() method is called before the task is run, and the default implementation doesn’t do anything, because temporary directories named for task outputs are created when the task outputs are written.

The commit phase for tasks is optional and may be disabled by returning false from needsTaskCommit(). This saves the framework from having to run the distributed commit protocol for the task, and neither commitTask() nor abortTask() is called. FileOutputCommitter will skip the commit phase when no output has been written by a task.

If a task succeeds, commitTask() is called, which in the default implementation moves the temporary task output directory (which has the task attempt ID in its name to avoid conflicts between task attempts) to the final output path, ${mapreduce.output.fileoutputformat.outputdir}. Otherwise, the framework calls abortTask(), which deletes the temporary task output directory.

The framework ensures that in the event of multiple task attempts for a particular task, only one will be committed; the others will be aborted. This situation may arise because the first attempt failed for some reason — in which case, it would be aborted, and a later, successful attempt would be committed. It can also occur if two task attempts were running concurrently as speculative duplicates; in this instance, the one that finished first would be committed, and the other would be aborted.
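
This commit protocol is also why downstream consumers often check for the _SUCCESS marker before reading a directory: its presence means commitJob() ran to completion and only committed task output is in place. A small sketch using the Hadoop FileSystem API (not from the book; the path is hypothetical, and you would run this from spark-shell or your own main):

    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.fs.{FileSystem, Path}

    val outputDir = new Path("hdfs:///data/output/example")
    val fs = FileSystem.get(outputDir.toUri, new Configuration())

    if (fs.exists(new Path(outputDir, "_SUCCESS"))) {
      // commitJob() finished: the directory contains only committed task output
      println(s"$outputDir is complete and safe to read")
    } else {
      // the job failed or is still running; files here may be partial
      println(s"$outputDir has no _SUCCESS marker")
    }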
