
From what I understand, the Bronze table in the Delta Lake architecture represents the raw and (more or less) unmodified data in table format. Does this mean that I also shouldn't partition the data for the Bronze table? Partitioning could be seen as something that depends on the use case, which points to the Silver or even Gold table.

Look at this example:

def read():
    # Read the raw tab-separated source file, keeping the header row
    return spark.read\
        .format("csv")\
        .option("delimiter", "\t")\
        .option("header", True)\
        .load("file.tsv.gz")

table_name = "file"
location = f"/mnt/storage/{table_name}"

read().write.partitionBy("something").format("delta").save(location)

spark.sql(f"CREATE TABLE {table_name} USING DELTA LOCATION '{location}/'")

Notice the partitionBy("something"). Does this belong in a Bronze table?

trallnag

1 Answer


Generally speaking, I would recommend not partitioning by a predicate in the bronze layer. You should instead use OPTIMIZE to maintain 'right-sized' files for subsequent reads, without introducing additional bias in how the data is organized in storage.
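As a sketch of what that looks like in practice (the table name `file` is taken from the question's example; on a real cluster you would pass the statement to `spark.sql`, which is omitted here so the snippet stands alone):

```python
# Sketch: compact a bronze Delta table with OPTIMIZE instead of partitioning.
# The statement is built as a plain string; on a live cluster you would
# execute it with spark.sql(...).

def optimize_statement(table_name: str) -> str:
    """Build an OPTIMIZE statement that compacts small files into
    right-sized ones without imposing any partitioning scheme."""
    return f"OPTIMIZE {table_name}"

print(optimize_statement("file"))  # OPTIMIZE file
# On a cluster: spark.sql(optimize_statement("file"))
```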

Partitioning and Z-Ordering can speed up reads by improving data skipping. Implicit in your choice of predicate to partition by, however, is some business logic. This can introduce a form of bias to your data and can have unintended downstream effects in your pipelines. The concept of 'bronze' is to simply land the data in the lake as it is, with as little changed as possible.
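To make that bias concrete, here is a toy model (no Spark required; the layout and the client count are made up) of how the choice of partition column decides which queries can skip data:

```python
# Toy model of partition pruning: a partitioned table is laid out as one
# directory per distinct value of the partition column. An equality filter
# on that column reads a single directory; a filter on any other column
# must scan every directory.

def partitions_to_scan(layout: dict, filter_column: str, partition_column: str) -> int:
    """How many partition directories a query with an equality filter
    on `filter_column` must read."""
    if filter_column == partition_column:
        return 1           # pruning: only the matching partition is read
    return len(layout)     # no pruning: every partition is scanned

# Hypothetical bronze layout partitioned by clientID, with 1000 clients.
layout = {f"clientID={i}": f"part-{i:05d}.parquet" for i in range(1000)}

print(partitions_to_scan(layout, "clientID", "clientID"))  # 1
print(partitions_to_scan(layout, "date", "clientID"))      # 1000
```

The asymmetry is the point: whichever column you pick, every downstream query that filters on something else pays the full-scan cost.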

There are likely exceptions to this, but these are the general considerations.

Raphael K
  • I see, so this is similar to my understanding of it. – trallnag Dec 15 '20 at 13:04
  • I'm discussing this with teammates at Databricks right now so I'll likely update the answer with some other thoughts. Stay tuned. – Raphael K Dec 15 '20 at 13:28
  • Hi @RaphaelK, could you please elaborate a bit more? For example, what "unintended downstream effects" can be caused by partitioning the bronze layer? I am looking for suggestions and practices. – Spacez Feb 02 '22 at 10:01
  • Sure. If you partition by clientID vs. date, then the performance of downstream queries of the bronze layer will depend on which predicate they use in the where clause. For example, if all of your downstream queries have a date in the where clause, then partitioning by date will give you the best performance. Partitioning by client ID will result in poorer data skipping and thus longer query times. If you always query by date, then partitioning by date makes sense, but how do you know that you'll always query by date? Therefore, don't partition bronze; just rely on 'right-sized' files. – Raphael K Feb 09 '22 at 17:07