
I have an existing case where the entire/full dataset is read daily from multiple Hive tables and then processed/transformed (joins, aggregations, filters, etc.) as specified in SQL queries.

These SQL queries are defined in a series of YAML files; let's say there are 3 YAML files.

Each result is saved to a temp view, and each subsequent SQL query uses the previously generated temp view, until the last SQL from the last YAML file is executed.

The final result is written to a Hive table named in the format hive_table_name_epoch, i.e. a new table is created each day.
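For context, a minimal sketch of how such a YAML-driven pipeline could look in PySpark (the YAML layout, view names, and file names are assumptions for illustration, not the actual production code):

```python
import time
import yaml  # assumption: the SQL steps live in YAML files as described
from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

# Hypothetical layout: each YAML file holds an ordered list of
# {"view": ..., "sql": ...} steps; later steps query the temp views
# registered by earlier ones.
yaml_files = ["step1.yaml", "step2.yaml", "step3.yaml"]

df = None
for path in yaml_files:
    with open(path) as f:
        steps = yaml.safe_load(f)
    for step in steps:
        df = spark.sql(step["sql"])               # run the transformation
        df.createOrReplaceTempView(step["view"])  # expose to the next step

# The final result lands in a brand-new Hive table suffixed with the
# epoch, i.e. the full dataset is rewritten every day.
epoch = int(time.time())
df.write.mode("overwrite").saveAsTable(f"hive_table_name_{epoch}")
```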


Data Volume: ~400,000,000 (400 Million)

Data Size: ~400 GiB

Execution Time: ~25 min

New use case: trying to integrate an open table format (Delta Lake or Apache Hudi). The aim is to maintain only one Hive table instead of creating a new table each day, and to avoid writing the entire dataset again and again; instead, write only the required updates, new inserts, and deletes.
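For example, with Delta Lake the daily write could become a MERGE into one long-lived table instead of a full rewrite. A minimal sketch, assuming a primary key column id and placeholder table/view names (not the actual pipeline code):

```python
from pyspark.sql import SparkSession
from delta.tables import DeltaTable

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

# The final DataFrame produced by the YAML-driven SQL steps.
daily_result = spark.table("final_temp_view")  # hypothetical view name

target = DeltaTable.forName(spark, "hive_table_name")  # the single table

(target.alias("t")
    .merge(daily_result.alias("s"), "t.id = s.id")  # assumed primary key: id
    .whenMatchedUpdateAll()     # required updates
    .whenNotMatchedInsertAll()  # new inserts
    .execute())
# Deletes need an extra signal, e.g. a tombstone column plus
# whenMatchedDelete(), or whenNotMatchedBySourceDelete() on Delta 2.3+.
```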

Case 1: Using Delta Lake

Data Volume: ~400,000,000 (400 Million)

Data Size: ~400 GiB

Execution Time: ~1.10 hr

Case 2: Using Apache Hudi

Data Volume: ~400,000,000 (400 Million)

Data Size: ~400 GiB

Execution Time: ~40 min

Question: The older approach of processing and writing the entire dataset daily still performs far better than using an open table format.

Is the existing case really feasible to integrate with an incremental approach?

Maybe if we somehow made our data CDC-based, then it would make sense.
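If the sources could expose their changes, for instance via Delta's Change Data Feed (available in Delta Lake 2.x, i.e. only after the version upgrade discussed in the comments below), the job could read just the changed rows instead of the full tables. A sketch with placeholder names:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

# One-time setup: enable the change feed on a (Delta) source table.
spark.sql("""
    ALTER TABLE source_table
    SET TBLPROPERTIES (delta.enableChangeDataFeed = true)
""")

last_processed_version = 42  # hypothetical checkpoint kept by the job

# Read only the rows changed since the last processed version.
changes = (spark.read.format("delta")
    .option("readChangeFeed", "true")
    .option("startingVersion", last_processed_version)
    .table("source_table"))

# _change_type is insert / update_preimage / update_postimage / delete.
incremental_input = changes.filter(
    changes["_change_type"].isin("insert", "update_postimage", "delete"))
```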

  • Getting to the bottom of this would be an interesting exercise in performance tuning. Do you also need to take read performance into account? It seems to me that Databricks is really taking Delta tables forward in leaps and bounds right now. Looking at this page, which is over 6 months old, it seems like a lot of the Delta table comments might not be valid anymore: https://www.onehouse.ai/blog/apache-hudi-vs-delta-lake-vs-apache-iceberg-lakehouse-feature-comparison – Nick.Mc Aug 14 '23 at 02:23
  • Yes, implementing CDC is a good idea, considering that both Delta tables and Hudi can do merges (not just "append only"). You can take advantage of the features that are slowing you down. – Nick.Mc Aug 14 '23 at 02:34
  • You don't mention which Delta version you are using, only that it's OSS. It may be worth looking at the 3.0.0 preview, given its merge and deletion vector improvements. The improvements made in each release aren't seemingly backported to older Spark versions, so you'd have to look at upgrading Spark as well... – Chris Aug 14 '23 at 05:45
  • @Chris Currently I'm using Spark v2.4.8, OS image 1.5 Debian 10, Delta Lake 0.6.0. I agree with your point, and I am upgrading the Spark and Delta versions to 3.x.x and 2.0.0. – Rituparno Behera Aug 14 '23 at 20:06

1 Answer


In general, those ACID libraries perform data modification by rewriting the underlying files, aka copy-on-write (the exceptions being Hudi's merge-on-read and Delta's deletion vectors).

Then an important aspect to consider is how much of the incoming data matches the target table's primary keys. If it's 50%, you will rewrite at least 50% of the files; in practice you will likely end up rewriting all of them, because the matching PKs are probably spread across every file.
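To see why, a quick back-of-the-envelope check: if matched keys are spread uniformly, the probability that a given file contains at least one match (and must therefore be rewritten under copy-on-write) is 1 - (1 - p)^n, where p is the matched fraction and n is the rows per file. A small sketch (the numbers are illustrative, not taken from the question):

```python
# Probability that a data file contains at least one matched key,
# and therefore gets rewritten by a copy-on-write merge.
rows_per_file = 1_000_000  # e.g. ~400 files for a 400M-row table

for p in (0.5, 0.01, 0.0001):  # fraction of rows matched by the merge
    touched = 1 - (1 - p) ** rows_per_file
    print(f"match fraction {p}: P(file rewritten) ~= {touched:.6f}")

# Even at a 0.01% match rate, virtually every file contains a match,
# so the merge ends up rewriting essentially the whole table.
```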

As a result, there is a threshold at which to choose between a merge design and replacing the whole table. In any case, those formats bring you ACID capabilities such as no-downtime writes and time travel, so they are beneficial either way.
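Past that threshold, one option is to keep the daily full rewrite but target a single Delta (or Hudi) table instead of a fresh hive_table_name_epoch table: you drop the table-per-day pattern while still getting the atomic swap and time travel. A minimal sketch with Delta (daily_result and spark as in the earlier sketches; the table name is a placeholder):

```python
# Full daily rewrite into one long-lived Delta table: the overwrite is
# atomic (readers never see a half-written table) and previous snapshots
# stay queryable via time travel.
(daily_result.write
    .format("delta")
    .mode("overwrite")
    .saveAsTable("hive_table_name"))

# Read yesterday's snapshot if needed:
previous = (spark.read.format("delta")
    .option("versionAsOf", 0)  # or .option("timestampAsOf", "2023-08-13")
    .table("hive_table_name"))
```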

As for Hudi, you can tune the target Parquet file size to speed up data modification, since smaller files mean less volume to rewrite in some cases. But smaller files also bring slower read queries.
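For reference, these are the Hudi sizing knobs involved; a hedged sketch of an upsert write (the record key, precombine field, path, and byte values are all assumptions to illustrate the trade-off, not recommendations):

```python
# Upsert with a smaller target Parquet file size, so each changed key
# forces less volume to be rewritten; benchmark against read performance.
(daily_result.write
    .format("hudi")
    .option("hoodie.table.name", "hive_table_name")
    .option("hoodie.datasource.write.recordkey.field", "id")   # assumed PK
    .option("hoodie.datasource.write.precombine.field", "ts")  # assumed ordering column
    .option("hoodie.datasource.write.operation", "upsert")
    .option("hoodie.parquet.max.file.size", str(64 * 1024 * 1024))     # 64 MB target
    .option("hoodie.parquet.small.file.limit", str(32 * 1024 * 1024))  # bin-packing floor
    .mode("append")
    .save("/warehouse/hive_table_name"))  # hypothetical base path
```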

Another aspect is definitely the target file size. These formats usually produce larger files than traditional Hive tables in order to optimize reads (see the small files problem). But again, this comes with a drop in write performance, since fewer files means less write parallelism.

In conclusion, while the lakehouse paradigm embraces SQL, it is a bit more complicated than traditional data warehouse technologies: you have to understand how it works and analyze your workload to choose the best path, otherwise you can suffer dramatic performance drops during merges.

parisni