I am trying to write code to import files into HDFS for use as a Hive external table. I have found that using something like:
foo | ssh hostname "hdfs dfs -put - /destination/$FILENAME"
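can cause problems, because the file is first written under a temporary name and then renamed when the copy completes. This creates a race condition for Hive between the directory listing and query execution: a query can list the temporary file and then fail to find it once it has been renamed.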
One workaround is to copy to a temporary directory and then "hdfs dfs -mv" the file into position.
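A minimal sketch of that workaround, assuming the hdfs client is on the PATH of the host running it (it could equally be wrapped in ssh as above). The function name stage_and_publish and the /staging directory are hypothetical, and the staging directory is assumed to sit outside the table location so that Hive never lists it:

```shell
#!/bin/sh
# Sketch only: upload stdin to a staging directory, then move the finished
# file into the table directory. Names and paths here are illustrative.

stage_and_publish() {
    staging_dir=$1   # e.g. /staging, assumed to be outside the Hive table location
    dest_dir=$2      # e.g. /destination, the external table location
    filename=$3

    # Write the data into the staging directory first.
    # "-put -" reads the file content from stdin.
    hdfs dfs -put - "$staging_dir/$filename" || return 1

    # Only after the upload has finished, move the file into position.
    hdfs dfs -mv "$staging_dir/$filename" "$dest_dir/$filename"
}

# Usage: foo | stage_and_publish /staging /destination "$FILENAME"
```

The move is skipped if the upload fails, so a partial file should never reach the table directory. Whether the move itself is atomic is exactly the first question below.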
The specific and general/academic questions are:
- The "hdfs dfs -mv" command is atomic, right?
- What other HDFS commands or operations are atomic?
- Can two "hdfs dfs -mkdir" commands issued at approximately the same time believe they both succeeded?
- Is there a better way to avoid race conditions with Hive when moving files into position?