
Recently I had a chance to get to know the flink-table-store project. I was attracted by the idea behind it at first glance.

After reading the docs, a question has been on my mind for a while. It's about the design of the file storage.

It looks like it could be implemented on top of other popular open-source libraries instead of creating a totally new (LSM-tree based) component. Hudi or Iceberg look like good choices, since they both support saving and querying change logs. If it were done that way, there would be no need to create a component for each related compute engine (Spark, Hive, or Trino), since those are already supported by Hudi and Iceberg. That looks like a better solution to me than reinventing the wheel.

So, here are my questions. Is there any issue with writing data as Hudi or Iceberg? Why weren't they chosen in the initial design?

Looking for a design explanation.

1 Answer


Flink Table Store is a new project created to natively support update/delete operations on DFS tables using data snapshots.

These features are already available in Apache Hudi (the first open lakehouse format), Delta Lake (the lakehouse format developed and maintained by Databricks), and Apache Iceberg, all of which evolve quickly.

Tables created with these tools can be queried from different tools/engines (Spark, Flink, Trino, Athena, Spectrum, Dremio, ...), but supporting all of those tools requires design compromises that can affect performance. Flink Table Store, by contrast, is created and optimized for Flink, so it gives you the best performance with Apache Flink compared with the other three projects.

Is there any issue with writing data as Hudi or Iceberg?

Not at all; a lot of companies use Hudi and Iceberg with Spark, Flink, and Trino in production, and they have no issues.

Why weren't they chosen in the initial design?

If you want to create tables readable by the other tools, you should avoid using Flink Table Store and choose one of the other options. But the main idea of Flink Table Store is to create internal tables used to transform your streaming data, similar to KTables in Kafka Streams: you write your streaming data to Flink Table Store tables, transform it in multiple stages, and at the end write the result to a Hudi or Iceberg table so it can be queried by the different tools.
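As a rough illustration of that pattern, here is a minimal Flink SQL sketch. The warehouse path, table names, schema, the upstream source table, and the Iceberg catalog are all hypothetical, and it assumes the Flink Table Store and Iceberg connectors are on the classpath; check the respective docs for the exact options.

    -- Catalog backed by Flink Table Store (hypothetical warehouse path)
    CREATE CATALOG table_store_catalog WITH (
      'type' = 'table-store',
      'warehouse' = 'hdfs:///warehouse/table_store'
    );

    USE CATALOG table_store_catalog;

    -- Internal staging table; the primary key lets the store absorb updates/deletes
    CREATE TABLE page_views_agg (
      page_id  BIGINT,
      view_cnt BIGINT,
      PRIMARY KEY (page_id) NOT ENFORCED
    );

    -- Stage 1: continuously aggregate a (hypothetical) streaming source
    -- into the internal Flink Table Store table
    INSERT INTO page_views_agg
    SELECT page_id, COUNT(*)
    FROM default_catalog.default_database.page_views_source
    GROUP BY page_id;

    -- Final stage: write the result to an Iceberg (or Hudi) table defined in
    -- another catalog so Spark/Trino/etc. can query it
    INSERT INTO iceberg_catalog.analytics.page_views_agg
    SELECT page_id, view_cnt FROM page_views_agg;

In a real pipeline you would typically add more transformation stages between the internal staging tables before writing the final result out to the externally readable format.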

Hussein Awala