In my batch data pipeline, each transaction has a booking date and an accounting date. Transactions in the same time window share the same booking date and fall within a 2-minute window. The booking date is only a few minutes earlier than the processing time in my pipeline, while the accounting date can be earlier or later than the booking date.

When querying these transactions, the accounting date is always in the SQL condition, so I think the accounting date should be the partition key.

But when I think about the write side, I'm not sure anymore. Is it better to write to more partitions (fewer hotspots?) or to fewer partitions?

Is it better to use the booking date or the accounting date as the partition key, and why?

user1532146

1 Answer

For the read path, you are correct: partitioning on a column that your queries commonly filter on will speed up your reads.

Now for the write path (I assume you are upserting into a COW table), a general rule is that the smaller the volume of Parquet files you rewrite, the faster the write will be. So in your case, the fewer partitions you modify, the better.
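To make the "fewer partitions modified" rule concrete, here is a minimal sketch that counts how many partitions a single batch would touch under each candidate key. The sample rows and the `partitions_touched` helper are made up for illustration; they are not taken from the question's data or from any library.

```python
from datetime import date

# Hypothetical batch: every row shares one booking date (same time window),
# while accounting dates are spread over several days.
batch = [
    {"id": 1, "booking_date": date(2024, 3, 10), "accounting_date": date(2024, 3, 8)},
    {"id": 2, "booking_date": date(2024, 3, 10), "accounting_date": date(2024, 3, 10)},
    {"id": 3, "booking_date": date(2024, 3, 10), "accounting_date": date(2024, 3, 11)},
    {"id": 4, "booking_date": date(2024, 3, 10), "accounting_date": date(2024, 3, 10)},
]


def partitions_touched(rows, key):
    """Return the distinct partition values a batch would write to."""
    return {row[key] for row in rows}


by_booking = partitions_touched(batch, "booking_date")
by_accounting = partitions_touched(batch, "accounting_date")

print(f"booking_date partitions touched:    {len(by_booking)}")
print(f"accounting_date partitions touched: {len(by_accounting)}")
```

With this sample, the batch lands in one partition when keyed by booking date and in three when keyed by accounting date, which is the write-side cost to weigh against the read-side benefit of filtering on accounting date.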

As a side note, the target Parquet file size is another trade-off: reduce it for faster writes, increase it for faster reads. The reason is half explained by the rule above. The other half is parallelism: with small files, multiple executors can split the write work, which is not the case with large files.
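The file-size trade-off can be sketched with some rough arithmetic. This is a deliberately simplified model, assuming that a copy-on-write upsert rewrites every touched file in full and that the updated records fall into a fixed number of files; the `estimate` function and the sizes used are hypothetical, and real engines will behave differently.

```python
import math


def estimate(partition_mib, target_file_mib, files_touched=1):
    """Rough copy-on-write model for one partition.

    Returns (file_count, rewritten_mib):
    - file_count: files in the partition, a proxy for read overhead and
      for how many tasks could share the write work
    - rewritten_mib: data rewritten when `files_touched` files are updated
    """
    file_count = math.ceil(partition_mib / target_file_mib)
    touched = min(files_touched, file_count)
    rewritten_mib = touched * min(target_file_mib, partition_mib)
    return file_count, rewritten_mib


# Hypothetical 1 GiB partition with two candidate target file sizes.
small_files = estimate(partition_mib=1024, target_file_mib=128)
large_files = estimate(partition_mib=1024, target_file_mib=512)

print("128 MiB target:", small_files)
print("512 MiB target:", large_files)
```

Under these assumptions, the smaller target means more files to open on read but less data rewritten per update, and the larger target means the opposite.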

parisni