I think it depends on which committer you use for the job, and whether that committer is able to undo a failed job. For example, when you write Apache Parquet-formatted output, Spark expects the committer to be a subclass of ParquetOutputCommitter. And if you use DirectParquetOutputCommitter, for example to append data, it is not able to undo the job (see the code).
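To see why a direct committer cannot roll back, here is a rough sketch of the idea (my own illustration, not the actual Spark source; the class name is made up): tasks write straight into the final directory, so the abort hooks have nothing left to undo.

```java
// Rough sketch only: not the real DirectParquetOutputCommitter, but the same
// idea. Tasks write straight to the final directory, so there is no
// temporary working space to roll back when the job fails.
import java.io.IOException;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.JobStatus;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter;

class SketchDirectCommitter extends FileOutputCommitter {
  private final Path output;

  SketchDirectCommitter(Path output, TaskAttemptContext ctx) throws IOException {
    super(output, ctx);
    this.output = output;
  }

  @Override
  public Path getWorkPath() {
    return output;                                   // write directly to the final dir
  }

  @Override
  public void commitTask(TaskAttemptContext ctx) { } // nothing to move

  @Override
  public void abortTask(TaskAttemptContext ctx) { }  // nothing to delete

  @Override
  public void abortJob(JobContext ctx, JobStatus.State state) {
    // Files are already in place: a failed or killed job cannot be undone.
  }
}
```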
If you look at ParquetOutputCommitter itself, you can see that it extends FileOutputCommitter and overrides the commitJob(JobContext jobContext) method a bit.
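Roughly, the override looks like this (paraphrased and simplified from parquet-hadoop, details vary by version; MyParquetCommitter is a stand-in name): it delegates the normal commit to FileOutputCommitter and then writes Parquet's summary metadata.

```java
// Paraphrased and simplified; the real class is
// org.apache.parquet.hadoop.ParquetOutputCommitter.
import java.io.IOException;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.parquet.hadoop.ParquetOutputCommitter;

public class MyParquetCommitter extends FileOutputCommitter {
  public MyParquetCommitter(Path outputPath, TaskAttemptContext ctx) throws IOException {
    super(outputPath, ctx);
  }

  @Override
  public void commitJob(JobContext jobContext) throws IOException {
    // Usual FileOutputCommitter commit: promote task files, write _SUCCESS.
    super.commitJob(jobContext);
    // Then write Parquet's summary metadata next to the data files.
    Path outputPath = FileOutputFormat.getOutputPath(jobContext);
    ParquetOutputCommitter.writeMetaDataFile(jobContext.getConfiguration(), outputPath);
  }
}
```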
The following description of the OutputCommitter API is copied from Hadoop: The Definitive Guide:
The setupJob() method is called before the job is run, and is typically used to perform initialization. For FileOutputCommitter, the method creates the final output directory, ${mapreduce.output.fileoutputformat.outputdir}, and a temporary working space for task output, _temporary, as a subdirectory underneath it.
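(Not from the book: a minimal demo I would sketch to see this on a local filesystem; it assumes Hadoop 2.x on the classpath and a hypothetical output path /tmp/demo-out that does not exist yet.)

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.TaskAttemptID;
import org.apache.hadoop.mapreduce.TaskType;
import org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter;
import org.apache.hadoop.mapreduce.task.TaskAttemptContextImpl;

public class SetupJobDemo {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    conf.set("mapreduce.output.fileoutputformat.outputdir", "/tmp/demo-out");
    Path out = new Path("/tmp/demo-out");

    // A dummy task attempt context, just to drive the committer by hand.
    TaskAttemptID id = new TaskAttemptID("demo", 1, TaskType.MAP, 0, 0);
    TaskAttemptContext ctx = new TaskAttemptContextImpl(conf, id);

    FileOutputCommitter committer = new FileOutputCommitter(out, ctx);
    committer.setupJob(ctx);  // creates /tmp/demo-out and _temporary beneath it

    FileSystem fs = FileSystem.get(conf);
    System.out.println(fs.exists(new Path(out, "_temporary")));  // prints true
  }
}
```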
If the job succeeds, the commitJob() method is called, which in the default file-based
implementation deletes the temporary working space and creates a hidden empty marker
file in the output directory called _SUCCESS to indicate to filesystem clients that the job
completed successfully. If the job did not succeed, abortJob() is called with a state object
indicating whether the job failed or was killed (by a user, for example). In the default
implementation, this will delete the job’s temporary working space.
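(Not from the book: a small sketch of how a downstream client can use that _SUCCESS marker before reading the output; the path here is hypothetical.)

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class SuccessCheck {
  public static void main(String[] args) throws Exception {
    Path out = new Path("/data/job-output");  // hypothetical output directory
    FileSystem fs = out.getFileSystem(new Configuration());
    // commitJob() leaves this empty marker file behind on success.
    if (fs.exists(new Path(out, "_SUCCESS"))) {
      System.out.println("Job committed; safe to read " + out);
    } else {
      System.out.println("No _SUCCESS marker; output may be partial");
    }
  }
}
```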
The operations are similar at the task level. The setupTask() method is called before the
task is run, and the default implementation doesn’t do anything, because temporary
directories named for task outputs are created when the task outputs are written.
The commit phase for tasks is optional and may be disabled by returning false from needsTaskCommit(). This saves the framework from having to run the distributed commit protocol for the task, and neither commitTask() nor abortTask() is called. FileOutputCommitter will skip the commit phase when no output has been written by a task.
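(Not from the book: a sketch of that optimization, modeled on what FileOutputCommitter does; SkippingCommitter is an illustrative name.)

```java
import java.io.IOException;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter;

public class SkippingCommitter extends FileOutputCommitter {
  public SkippingCommitter(Path outputPath, TaskAttemptContext ctx) throws IOException {
    super(outputPath, ctx);
  }

  @Override
  public boolean needsTaskCommit(TaskAttemptContext ctx) throws IOException {
    // Only run the commit protocol if this attempt actually produced output;
    // otherwise neither commitTask() nor abortTask() will be invoked.
    Path attemptDir = getTaskAttemptPath(ctx);  // .../_temporary/.../<attempt ID>
    return attemptDir.getFileSystem(ctx.getConfiguration()).exists(attemptDir);
  }
}
```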
If a task succeeds, commitTask() is called, which in the default implementation moves the temporary task output directory (which has the task attempt ID in its name to avoid conflicts between task attempts) to the final output path, ${mapreduce.output.fileoutputformat.outputdir}. Otherwise, the framework calls abortTask(), which deletes the temporary task output directory.
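(Not from the book: a stripped-down committer showing the task-level protocol just described. The rename logic is illustrative only; the real FileOutputCommitter also handles job attempts, algorithm versions, and nested directories.)

```java
import java.io.IOException;

import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.OutputCommitter;
import org.apache.hadoop.mapreduce.TaskAttemptContext;

public class TinyCommitter extends OutputCommitter {
  private final Path finalDir;

  public TinyCommitter(Path finalDir) { this.finalDir = finalDir; }

  // Per-attempt directory: the attempt ID keeps concurrent (speculative)
  // attempts from clobbering each other.
  private Path attemptDir(TaskAttemptContext ctx) {
    return new Path(finalDir, "_temporary/" + ctx.getTaskAttemptID());
  }

  @Override
  public void setupJob(JobContext ctx) throws IOException {
    FileSystem.get(ctx.getConfiguration()).mkdirs(new Path(finalDir, "_temporary"));
  }

  @Override
  public void setupTask(TaskAttemptContext ctx) { /* dirs created lazily on write */ }

  @Override
  public boolean needsTaskCommit(TaskAttemptContext ctx) throws IOException {
    return FileSystem.get(ctx.getConfiguration()).exists(attemptDir(ctx));
  }

  @Override
  public void commitTask(TaskAttemptContext ctx) throws IOException {
    FileSystem fs = FileSystem.get(ctx.getConfiguration());
    // Promote this attempt's files into the final output directory.
    for (FileStatus f : fs.listStatus(attemptDir(ctx))) {
      fs.rename(f.getPath(), new Path(finalDir, f.getPath().getName()));
    }
    fs.delete(attemptDir(ctx), true);
  }

  @Override
  public void abortTask(TaskAttemptContext ctx) throws IOException {
    // Discard everything this attempt wrote.
    FileSystem.get(ctx.getConfiguration()).delete(attemptDir(ctx), true);
  }
}
```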
The framework ensures that in the event of multiple task attempts for a particular task,
only one will be committed; the others will be aborted. This situation may arise because
the first attempt failed for some reason — in which case, it would be aborted, and a later,
successful attempt would be committed. It can also occur if two task attempts were
running concurrently as speculative duplicates; in this instance, the one that finished first
would be committed, and the other would be aborted.