Which HDFS operations are atomic?

Question

I am trying to write code to import files into HDFS for use as a Hive external table. I have found that using something like:

foo | ssh hostname "hdfs dfs -put - /destination/$FILENAME"

can cause a problem: the data is first written to a temporary file, which is renamed once the upload completes. This can cause a race condition for Hive between the directory listing and query execution.

One workaround is to copy to a temporary directory and "hdfs dfs -mv" the file into position.

The specific and general/academic questions are:

1. The "hdfs dfs -mv" command is atomic, right?
2. What other HDFS commands or operations are atomic?
3. Can two "hdfs dfs -mkdir" commands issued at approximately the same time believe they both succeeded?
4. Is there a better way to avoid race conditions with Hive when moving files into position?

score 14 · Accepted Answer · answered Jan 07 '16 at 15:53

In the Hadoop FS introduction you can find the requirements for atomicity:

Here are the core expectations of a Hadoop-compatible FileSystem. Some FileSystems do not meet all these expectations; as a result, some programs may not work as expected.

Atomicity

There are some operations that MUST be atomic. This is because they are often used to implement locking/exclusive access between processes in a cluster.

Creating a file. If the overwrite parameter is false, the check and creation MUST be atomic.

Deleting a file.

Renaming a file.

Renaming a directory.

Creating a single directory with mkdir().

...

Most other operations come with no requirements or guarantees of atomicity.

So to be sure, you must check the underlying filesystem. But based on those requirements, the answers are:

1. Yes.
2. The operations listed above.
3. No.
4. IMHO, renaming a file is a good choice for the job.
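The staging-then-rename approach from the question could be sketched roughly as follows. This is only an illustration under assumptions: `publish_to_hdfs` and `STAGING_DIR` are hypothetical names, the staging directory is assumed to sit on the same filesystem as the destination, and the final rename is only as atomic as the underlying filesystem makes it.

```shell
#!/usr/bin/env bash
# Sketch: stream stdin into a staging directory, then rename into place.
# STAGING_DIR and publish_to_hdfs are hypothetical names for illustration.

STAGING_DIR="${STAGING_DIR:-/tmp/hive_staging}"

publish_to_hdfs() {
    local dest_dir="$1" filename="$2"
    # Suffix with the shell's PID so concurrent uploads do not collide.
    local staged="${STAGING_DIR}/${filename}.$$"

    # Ensure both directories exist (-p: no error if already present).
    hdfs dfs -mkdir -p "$STAGING_DIR" "$dest_dir" || return 1

    # Upload from stdin. Any temporary file created by -put lives under
    # the staging directory, outside the table's location.
    if ! hdfs dfs -put - "$staged"; then
        hdfs dfs -rm -f "$staged" >/dev/null 2>&1
        return 1
    fi

    # A single rename makes the finished file visible in the table directory.
    hdfs dfs -mv "$staged" "${dest_dir}/${filename}"
}
```

Usage would then look like `foo | publish_to_hdfs /destination "$FILENAME"`, run on a host where the `hdfs` client is available.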

This does not seem to line up with [`FileSystem`'s documentation for `rename`](https://hadoop.apache.org/docs/current/api/org/apache/hadoop/fs/FileSystem.html#rename-org.apache.hadoop.fs.Path-org.apache.hadoop.fs.Path-org.apache.hadoop.fs.Options.Rename...-), which mentions it is not atomic by default. — Colonel Thirty Two, Sep 25 '19 at 20:50
It is dependent on the file system implementation, but yeah, by default it is not atomic — tworec, Sep 26 '19 at 21:23
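Since the guarantee depends on the filesystem implementation, one rough way to see which implementation a cluster uses by default is to read `fs.defaultFS` from the client configuration. A minimal sketch, assuming the `hdfs getconf` tool is on the PATH; `default_fs_is_hdfs` is a hypothetical helper name:

```shell
# Hypothetical helper: report whether the configured default filesystem
# uses the hdfs:// scheme. Returns 0 for HDFS, 1 for anything else,
# and 2 if the configuration could not be read.
default_fs_is_hdfs() {
    local fs
    fs="$(hdfs getconf -confKey fs.defaultFS)" || return 2
    case "$fs" in
        hdfs://*) return 0 ;;
        *)        return 1 ;;
    esac
}
```

This only inspects the default; a path given with an explicit scheme may resolve to a different filesystem, whose documentation would need to be checked separately.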
