Confusion About Delta Lake

Question

I have read a lot about Databricks Delta Lake. From what I understand, it adds ACID transactions to your data storage and accelerates query performance with the Delta Engine. If so, why do we need other data lakes that do not support ACID transactions? Delta Lake claims to combine the worlds of data lakes and data warehouses. We know that it cannot replace a traditional data warehouse yet because of the operations it currently supports. But should it replace data lakes? Why keep two copies of the data, one in a data lake and one in Delta Lake?

Delta Lake is a type of data lake. Do you mean some specific data lake product when you say "data lake"? — zsxwing, Aug 02 '20 at 00:14
Hi, yes. I mean: will Delta Lake replace other data lakes that lack capabilities such as ACID, like Amazon S3, Azure Blob Storage, etc.? — , Aug 02 '20 at 08:17
Some people call cloud storage services such as Amazon S3 and Azure Blob Storage data lakes. In my opinion, they are storage layers, closer to file systems in the single-machine world. Delta Lake is actually built on top of them and uses them to store the raw files and metadata. Questions like this usually get opinion-based answers and are discouraged on Stack Overflow. It's better to ask on the project's mailing list, such as https://groups.google.com/forum/#!forum/delta-users — zsxwing, Aug 02 '20 at 16:58

score 2 · Answer 1 · answered Dec 20 '22 at 00:01

Delta Lake is a type of lake house. Other examples of lake houses include Hudi and Iceberg.

A lake house is a tool that manages the data lake efficiently and supports ACID transactions and advanced features like data versioning.
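To make "data versioning" concrete, here is a minimal in-memory sketch of the general idea: table state is derived by replaying an ordered list of commits, so an older version can be read by replaying fewer of them. `ToyTable`, `commit`, and `files_as_of` are hypothetical names invented for illustration. This is not the API of Delta Lake, Hudi, or Iceberg, and the real formats are considerably more involved.

```python
from dataclasses import dataclass, field


@dataclass
class ToyTable:
    """Toy table whose state is an ordered list of commits (hypothetical, not a real API)."""

    # Each commit records which data files it adds and which it removes.
    commits: list = field(default_factory=list)

    def commit(self, add=(), remove=()):
        """Append one commit and return its version number."""
        self.commits.append({"add": list(add), "remove": list(remove)})
        return len(self.commits) - 1

    def files_as_of(self, version=None):
        """Replay commits 0..version to get the set of live data files."""
        if version is None:
            version = len(self.commits) - 1
        live = set()
        for entry in self.commits[: version + 1]:
            live |= set(entry["add"])
            live -= set(entry["remove"])
        return live


table = ToyTable()
v0 = table.commit(add=["part-0.parquet"])
v1 = table.commit(add=["part-1.parquet"])
# A rewrite: a new file replaces an old one, but the old version stays readable.
v2 = table.commit(add=["part-2.parquet"], remove=["part-0.parquet"])

print(table.files_as_of(v0))
print(table.files_as_of())
```

Because old commits are never modified, reading "as of" an earlier version only needs the log prefix up to that version. A plain object store gives you the files but not this bookkeeping.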

The question should be: "Is there any benefit to using a pure data lake over a lake house?"

I guess the main advantage of a pure data lake is that it works out of the box, so it is cheaper and less complex than a lake house, which provides some features that you don't always need.

score 0 · Accepted Answer · answered Aug 06 '20 at 20:21

Delta Lake is a product (like Redshift) rather than a concept/approach/theory (like dimensional modelling). As with any product in any walk of life, some of the claims made for the product will be true and some will be marketing spin. Whether the claimed benefits for a product actually make it superior to an alternative product will change from use case to use case.

Asking why there are other data lake solutions besides Delta Lake is a bit like asking why there is more than one DBMS in the world.

Actually, Delta Lake isn't a product; it's an open standard built on the Parquet open standard. Redshift can choose to support it. Databricks contributes to it, so Spark supports it. — Brian, Feb 02 '22 at 22:08

score 0 · Answer 3 · answered Sep 09 '20 at 00:46

In my case there was already a data lake, a Sybase IQ, but its performance is poor compared to the queries I can run through Spark against Delta. Speed is an important factor, and on partitioned tables the difference is remarkable.

Cristián Vargas Acevedo


score 0 · Answer 4 · answered Feb 02 '22 at 23:44

Delta Lake is an open standard, not a product. ACID transactions here refer to writes that fail midway: transactions are a safety mechanism, so a failed write does not leave the table in a partial state. Core support is in Spark, but other tools have added support for Delta Lake. There is also the lake house design, which again isn't a product but a way to approach building a data lake. If you follow its principles, you can use any technology.
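As a rough illustration of what "writes that fail midway" means, here is a toy sketch on a local file system. The idea is that readers only trust data files listed in a commit log, and a commit becomes visible in a single atomic rename, so a writer that crashes before committing leaves nothing visible. The `_log` directory name and the `write_batch` and `read_table` helpers are assumptions made up for this example. It is not Delta Lake's actual protocol, and it ignores concurrent writers and object-store semantics.

```python
import json
import os
import tempfile


def write_batch(table_dir, version, rows, fail_midway=False):
    """Write a data file, then publish it with a commit entry in the log."""
    log_dir = os.path.join(table_dir, "_log")  # hypothetical log directory name
    os.makedirs(log_dir, exist_ok=True)

    data_name = f"data-{version}.json"
    with open(os.path.join(table_dir, data_name), "w") as f:
        json.dump(rows, f)

    if fail_midway:
        # Simulate a crash after the data file exists but before the commit.
        raise RuntimeError("writer crashed before commit")

    # Write the commit under a temporary name, then rename it into place.
    # The rename is the commit point: the entry is either fully there or absent.
    tmp = os.path.join(log_dir, f".{version}.tmp")
    with open(tmp, "w") as f:
        json.dump({"add": data_name}, f)
    os.rename(tmp, os.path.join(log_dir, f"{version:020d}.json"))


def read_table(table_dir):
    """Return rows from committed data files only; uncommitted files are ignored."""
    log_dir = os.path.join(table_dir, "_log")
    rows = []
    if not os.path.isdir(log_dir):
        return rows
    for name in sorted(os.listdir(log_dir)):
        if not name.endswith(".json"):
            continue  # skip leftover temporary files
        with open(os.path.join(log_dir, name)) as f:
            entry = json.load(f)
        with open(os.path.join(table_dir, entry["add"])) as f:
            rows.extend(json.load(f))
    return rows


table_dir = tempfile.mkdtemp()
write_batch(table_dir, 0, [{"id": 1}, {"id": 2}])
try:
    write_batch(table_dir, 1, [{"id": 3}], fail_midway=True)
except RuntimeError:
    pass

rows = read_table(table_dir)
print(rows)
```

The orphaned `data-1.json` still sits in the directory, which is exactly what a reader listing raw files would trip over. Going through the log is what hides it.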
