
I was reading the MapReduce paper here. The paper states that the reduce workers write their output to a temp file, which they then atomically rename to a reserved output file name to indicate that the task is done. This is mentioned in Section 3.3 (Semantics in the Presence of Failures).

But why does the rename need to be atomic? Here is my guess:

  1. Let's say two reduce workers A, B are executing the same task.
  2. Let the name of the final output file for this task be X.
  3. Worker A starts renaming its temp file to X.
  4. The rename is not atomic, so worker B starts renaming its own file while A's rename is still in progress.
  5. Worker B renames its temp file to X.
  6. Worker A finishes renaming temp file to X.
  7. Messed up state?

If this is why we need an atomic rename, then I would like to know how rename works. Otherwise, I would like to know why we need an atomic rename.
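
For intuition (this is my own illustration, not code from the paper), here is a minimal Java sketch of the commit-by-rename pattern on a local POSIX-style filesystem, where rename() atomically replaces an existing destination; the file names are made up:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

public class CommitByRename {
    // Each worker writes to its own private temp file and "commits" by renaming it
    // to the reserved final name X. Because the rename is one atomic step, a reader
    // of X only ever sees nothing or some worker's complete output -- never a mix
    // of A's and B's bytes, and never a half-written file.
    static void commit(Path privateTemp, Path finalX) throws IOException {
        Files.move(privateTemp, finalX, StandardCopyOption.ATOMIC_MOVE);
    }

    public static void main(String[] args) throws IOException {
        Path tempA = Files.writeString(Path.of("task-0-A.tmp"), "complete output of A");
        Path tempB = Files.writeString(Path.of("task-0-B.tmp"), "complete output of B");
        Path x = Path.of("part-00000"); // the reserved output file name X

        commit(tempA, x); // A commits first
        commit(tempB, x); // B commits later and atomically replaces A's file;
                          // X still holds exactly one complete output
    }
}
```

If the rename were not atomic, the interleaving in steps 3-6 could leave X holding a partially written or mixed result, which is the messed-up state in step 7.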

Arjun Nair

1 Answer


Not all filesystems provide an atomic rename. Some Hadoop-compatible filesystems implement the rename operation as a non-atomic cp + rm and are only eventually consistent, which creates complications when working with such filesystems.
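
Schematically (my own sketch, not any particular connector's code), such an emulated rename looks roughly like this, with the windows where things can go wrong marked:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

public class EmulatedRename {
    // A rename emulated as copy + delete is NOT one atomic step.
    static void renameAsCopyPlusDelete(Path src, Path dst) throws IOException {
        Files.copy(src, dst, StandardCopyOption.REPLACE_EXISTING);
        // window 1: both src and dst exist; a crash here leaves two copies, and a
        // concurrent "rename" of another worker's file to dst can interleave with this one
        Files.delete(src);
        // window 2: on an eventually consistent store, listings may still show src
        // (or not yet show dst) for some time after this call returns
    }
}
```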

GCS rename is not atomic:

Unlike the case with many file systems, the gsutil mv command does not perform a single atomic operation. Rather, it performs a copy from source to destination followed by removing the source for each object.

Rename in S3 is not atomic and not immediately consistent; read the Introduction to S3Guard:

When renaming directories, the listing may be incomplete or out of date, so the rename operation loses files. This is very dangerous as MapReduce, Hive, Spark and Tez all rely on rename to commit the output of workers to the final output of the job.

HDFS provides atomic and consistent delete and rename, but other Hadoop-compatible filesystems may not fully support them.
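
As a hedged sketch of what that commit step looks like against the Hadoop FileSystem API (paths and contents are made up): on HDFS the rename() below is atomic, while on an object-store-backed FileSystem the same call may be implemented as the copy + delete shown above.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class TaskCommit {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());

        Path temp = new Path("/tmp/job_1/attempt_0/part-r-00000"); // task's private file
        Path dest = new Path("/output/job_1/part-r-00000");        // final output name

        try (FSDataOutputStream out = fs.create(temp, true)) {     // overwrite = true
            out.writeBytes("reduce output\n");                     // write the task output
        }
        boolean committed = fs.rename(temp, dest);                 // atomic on HDFS
        System.out.println("committed: " + committed);
    }
}
```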

Read the Apache Hadoop requirements of a Hadoop-compatible filesystem.

The Atomicity section states that renaming a file or directory MUST be atomic, but at the very beginning, in the Introduction, you can read this:

The behaviour of other Hadoop filesystems are not as rigorously tested. The bundled S3 FileSystem makes Amazon’s S3 Object Store (“blobstore”) accessible through the FileSystem API. The Swift FileSystem driver provides similar functionality for the OpenStack Swift blobstore. The Azure object storage FileSystem in branch-1-win talks to Microsoft’s Azure equivalent. All of these bind to object stores, which do have different behaviors, especially regarding consistency guarantees, and atomicity of operations.

GCS, S3, and some other Hadoop-compatible filesystems do not provide atomic renames, which causes issues with Hive and Spark. These issues can be worked around more or less successfully with other tools or techniques, such as using S3Guard, or creating a new partition location (based on a timestamp/runId) each time a partition is rewritten and relying on the atomic partition mount in Hive, and so on.
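
For example (illustrative only; the table, bucket, and column names are invented), the timestamp/runId workaround writes each rewrite of a partition into a fresh directory and then repoints the partition in a single metastore operation, so readers never see a half-rewritten directory even though the S3 "rename" is not atomic:

```java
import java.util.UUID;

public class RunIdPartitionLocation {
    public static void main(String[] args) {
        // Write the new data for the partition into a unique, never-reused location...
        String runId = System.currentTimeMillis() + "-" + UUID.randomUUID();
        String newLocation =
            "s3a://my-bucket/warehouse/sales/dt=2021-02-08/run-" + runId + "/";
        System.out.println("write job output under: " + newLocation);

        // ...then switch the partition to it with one metadata operation, e.g.
        //   ALTER TABLE sales PARTITION (dt='2021-02-08')
        //   SET LOCATION 's3a://my-bucket/warehouse/sales/dt=2021-02-08/run-<runId>/';
        // Old run directories can be cleaned up later.
    }
}
```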

The real world is not ideal. Mappers in Hadoop MapReduce were originally meant to run, where possible, on the data nodes where the data sits in order to speed up processing, but companies like Amazon sell computation clusters and storage separately. You can shut down or resize one cluster, start another one, and access the same data in S3; the data and computation are completely separated.

leftjoin
  • S3Guard (was) needed to deal with s3 list inconsistencies; that's no longer an issue. But even with consistent AWS the fact that rename isn't atomic means that the commit algorithms are unsafe (along with being really slow). This is why emrfs and s3a fs both have special committers which rely on file upload being atomic, and it being possible to postpone completing the upload until job commit – stevel Feb 08 '21 at 19:11
  • @stevel What do you mean by saying that's no longer an issue? IMO delete is still not consistent, and therefore update is not consistent. Try to delete thousands of files and then list + read them all and you will see... a FileNotFound exception. Or am I missing something? – leftjoin Feb 08 '21 at 20:21
  • This doesn't explain why atomic renames are used in that section of the paper I described. As an aside, if I have a file with name X and some thread is renaming another file to X atomically, and I'm reading from the original X, is it guaranteed that I keep reading from the original X? – Arjun Nair Feb 08 '21 at 21:15
  • @ArjunNair What you are describing is not atomicity; it is transaction isolation. An atomic operation is one which can't be executed partially: if it fails, everything remains unchanged, and another transaction should not see partial changes. But without isolation, a second transaction will see the result of the atomic operation even if it started before that operation. To guarantee that you keep reading the same original file, some locking mechanism (or versioning) is necessary to support transaction isolation. – leftjoin Feb 09 '21 at 08:51
  • @ArjunNair HDFS provides atomic renames, other Hadoop compatible filesystem may not support atomicity, as a result, rename operation can be seen by other sessions as partially executed (old file deleted, new file does not exist). Isolation is another thing. Filesystem and MapReduce alone do not care about isolation, some other tools should be used to support isolation. Read also this answer: https://stackoverflow.com/a/63378038/2700344 – leftjoin Feb 09 '21 at 09:06
  • @leftjoin: I mean that it used to be that if you added a file, LIST might not see it; delete it and LIST might still show it. So the act of listing a directory and copying each file didn't always work. Look for Spark JIRAs on that topic – stevel Feb 10 '21 at 18:36
  • @leftjoin so what is "no longer an issue" w.r.t. S3Guard is that there is no need to worry about list inconsistencies or 404 caching, and nobody has to fear update inconsistencies (which we couldn't handle). There is still no atomicity for create(overwrite=false), rename file, rename dir, or delete dir tree, so any commit algorithm which relies on those to coordinate/exclude workers is doomed to fail. Fortunately, rename on S3 is so slow that people notice and complain about that first – stevel Feb 10 '21 at 18:39