
What are the consistency guarantees of Azure Data Lake Store? Has anyone found technical documentation describing them?

I am particularly interested in whether directory moves are atomic, whether directory listings are consistent, and whether files are read-after-write consistent.

1 Answer


In Azure Data Lake Store, files have read-after-write consistency (also sometimes referred to as strong consistency). Directory listings are also strongly consistent.

Directory and file rename operations are atomic. This includes moving directories/files to a different parent. The only caveat to this behavior is when the destination of the rename operation already exists and the OVERWRITE option is used; in that case the rename operation is not atomic. More information on rename with the OVERWRITE option is [located here](https://hadoop.apache.org/docs/r2.6.1/api/org/apache/hadoop/fs/FileSystem.html#rename(org.apache.hadoop.fs.Path, org.apache.hadoop.fs.Path, org.apache.hadoop.fs.Options.Rename...)).
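
As a rough illustration (not official ADLS documentation), here is a minimal Java sketch of how these two rename variants look through the Hadoop client API. The `adl://` account name and the paths are made up for the example, and the OVERWRITE variant is shown via `FileContext`, which exposes the `Options.Rename.OVERWRITE` flag publicly:

```java
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileContext;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Options;
import org.apache.hadoop.fs.Path;

public class AdlRenameSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Hypothetical store URI and layout; substitute your own account/paths.
        URI store = new URI("adl://example.azuredatalakestore.net");
        Path staging = new Path("/pipeline/_staging/run-42");
        Path finalDir = new Path("/pipeline/output/run-42");

        FileSystem fs = FileSystem.get(store, conf);
        if (!fs.exists(finalDir)) {
            // Plain rename: the whole directory moves atomically, so readers
            // see either the old tree or the new one, never a partial move.
            boolean moved = fs.rename(staging, finalDir);
            System.out.println("atomic move succeeded: " + moved);
        } else {
            // Rename with OVERWRITE replaces an existing destination. Per the
            // Hadoop semantics linked above, this is the one case that is NOT
            // atomic as a whole.
            FileContext fc = FileContext.getFileContext(store, conf);
            fc.rename(staging, finalDir, Options.Rename.OVERWRITE);
        }
    }
}
```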

-Azure Data Lake Store Team

Amit Kulkarni
  • Is there any public technical documentation on Azure Data Lake Store consistency, BTW? – Lars Albertsson Jan 19 '17 at 08:41
  • Not at the moment. Is there something specific besides the above that you were looking for? – Amit Kulkarni Jan 19 '17 at 18:05
  • I am interested in building batch job pipelines using a workflow manager, e.g. Luigi or Airflow. Storage consistency is relevant at the handoff between one job and the next in a pipeline. In a typical scenario, reducers in a cluster computation framework, e.g. Spark, each write one file to a temporary directory, and the master writes a marker file. The workflow manager then moves the directory to its final location and kicks off the next Spark job, which picks up the files written. – Lars Albertsson Jan 21 '17 at 20:49
  • In order for things to work reliably, the directory move needs to be atomic, the Spark master in the second job needs to get the full list of files when listing the directory, and the worker nodes need to be able to read the files (see the sketch after these comments). HDFS provides these guarantees, whereas cloud object storage services usually don't. Pipelines can survive rare glitches that cause jobs to fail, i.e. read-after-write occasionally failing, but not non-atomic directory moves or inconsistent listings, since those scenarios cause silent data corruption. – Lars Albertsson Jan 21 '17 at 20:56
  • Unlike many cloud object stores, ADLS is a file system. It is designed to have many of the same characteristics as HDFS, such as atomic directory moves, consistent directory listings, and read-after-write consistency for files. We have tested and certified all of the workloads that run in Azure HDInsight, including Spark, with ADLS, so it should work in all scenarios. – Amit Kulkarni Jan 22 '17 at 22:45
  • @AmitKulkarni does that mean ADLS breaks the CAP theorem and achieves a CA system? – Gadam May 24 '21 at 01:58
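
To make the handoff pattern described in the comments concrete, here is a minimal Java sketch of the temp-directory-plus-marker-file convention that relies on the guarantees discussed above. The store URI, directory layout, and marker-file name are assumptions for illustration, not details from the thread:

```java
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class PipelineHandoffSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Hypothetical store URI and directory layout.
        FileSystem fs = FileSystem.get(
                new URI("adl://example.azuredatalakestore.net"), conf);
        Path tmpDir = new Path("/jobs/daily/_tmp/2017-01-21");
        Path outDir = new Path("/jobs/daily/2017-01-21");

        // 1. Reducers have written their part files into tmpDir; the driver
        //    then writes an empty marker file to signal the output is complete.
        try (FSDataOutputStream marker = fs.create(new Path(tmpDir, "_SUCCESS"))) {
            // empty marker file
        }

        // 2. The workflow manager publishes the output with a single directory
        //    rename; with atomic renames, downstream jobs see all files or none.
        if (!fs.rename(tmpDir, outDir)) {
            throw new IllegalStateException("publish rename failed for " + outDir);
        }

        // 3. The next job lists the published directory; with consistent
        //    listings it sees every part file that was moved.
        for (FileStatus part : fs.listStatus(outDir)) {
            System.out.println("will read: " + part.getPath());
        }
    }
}
```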