
Delta file format:

Various file formats for data processing have become popular recently. One of them is the Delta format, developed and open-sourced by Databricks.

Its most important feature is ACID transactions (others being support for upsert/delete and data governance).

My question is:

Does this file system act like a server, i.e. does a process keep running and respond to requests?
[Analogous to the Hadoop file system, where the base file system is the Unix file system, HDFS operates on top of it, and the NameNode manages the HDFS files and responds to file system requests.]

But if this were true (i.e. the Delta format has processes running), I would expect a server process to be described somewhere, yet I don't see one mentioned in any article about the Delta file format.

So, what is responsible for its features (ACID, upsert, delete, data governance, etc.)?

In addition, it is said that many other tools can interact with the Delta file format.
For example, DBT (a SQL-based transformation tool) can read/write data. If this is the case, which process is responsible for providing the aforementioned features?

Also, it is mentioned that the Delta format supports only tables. If so, is it an RDBMS product?

I am just trying to understand at which level this file format operates.

For HDFS, it is very clear that it operates on top of the host OS file system and that different processes (NameNode, DataNode, etc.) are available to interact with it. I don't have the same clarity about the Delta format.

Any help will be much appreciated.

Thanks

user3103957
  • From a quick reading of the documentation it seems that Apache Spark is responsible for responding to requests. Delta is loaded as a dependency/extension in Spark, so if there is a request that should be handled by Delta then Spark will call Delta executables to perform the operations. The data is stored [as Parquet files](https://docs.delta.io/latest/delta-faq.html#id3) on the storage system. See also https://docs.delta.io/latest/delta-storage.html. – Marijn Jun 25 '23 at 12:46
  • Thanks Marijn! In that case, does Spark act as a server? It is also mentioned that other tools can interact with Delta Lake (e.g. Data Build Tool - DBT). So requests from DBT need to hit Spark, and then Spark responds back? This prompts more questions, like how requests from other tools get translated into Spark code, etc. – user3103957 Jun 25 '23 at 13:53

1 Answer


Delta Lake by itself is just a file format that allows many features to be built on top of it, and the data is stored in some storage (cloud or on-premises). It still requires some process to accept data processing commands and execute them. That can be done in different ways:

  • Apache Spark & tools built on top of it, like Databricks SQL Warehouse. That was the initial use case for Delta, and it is the most popular one as of right now. Often, 3rd-party tools are integrated with Apache Spark via ODBC/JDBC (see the first sketch after this list).

  • Specialized connectors, like those for Trino, PrestoDB, ..., allow working with Delta Lake tables.

  • Rust & Python APIs allow working with Delta tables without Apache Spark (see the second sketch after this list).
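
For illustration, here is a minimal PySpark sketch of the first option. It is not taken from the answer: the session settings and table path are assumptions, and it presumes PySpark plus the Delta Lake Spark connector are installed with the Delta jars on the classpath.

```python
# Minimal sketch, assuming PySpark and the Delta Lake Spark connector are installed;
# the table path is illustrative.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("delta-example")
    # Delta is loaded as an extension of Spark: there is no separate Delta
    # server process; the Spark driver/executors themselves run the Delta
    # transaction-log logic. The Delta jars must also be on the classpath,
    # e.g. via spark.jars.packages.
    .config("spark.sql.extensions",
            "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    .getOrCreate()
)

# Write a DataFrame in Delta format: Parquet data files plus a _delta_log/
# directory of commit files on the underlying storage.
df = spark.range(5)
df.write.format("delta").mode("overwrite").save("/tmp/spark_delta_table")

# Read it back; the ACID guarantees come from how commits are written to the
# transaction log, not from a running server.
spark.read.format("delta").load("/tmp/spark_delta_table").show()
```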
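
And a minimal sketch of the Rust/Python option, assuming the `deltalake` package (the delta-rs Python bindings) and pandas are installed; the table path is again illustrative.

```python
# Minimal sketch using the delta-rs Python bindings (the `deltalake` package)
# together with pandas. No Spark and no server: the library reads and writes
# the Parquet files and the _delta_log/ transaction log directly.
import pandas as pd
from deltalake import DeltaTable, write_deltalake

# Writing creates Parquet data files plus a _delta_log/ directory on disk.
df = pd.DataFrame({"id": [1, 2, 3], "value": ["a", "b", "c"]})
write_deltalake("/tmp/example_delta_table", df)

# Appending adds new data files and a new commit entry to the log.
write_deltalake("/tmp/example_delta_table", df, mode="append")

# Reading reconstructs the current table state from the log.
dt = DeltaTable("/tmp/example_delta_table")
print(dt.version())    # latest committed version
print(dt.files())      # Parquet files that make up this version
print(dt.to_pandas())  # full table contents
```
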

Alex Ott