3

I want to merge a dataframe which contains incremental data to my base deltatable. I want to achieve partition pruning while doing it to avoid scanning a lot of files.

My data is partitioned based on a date field creating the year, month partitions. For the example we can imagine that in the base table I have partitions for the full 2021 and 2022 year and all months, but in the incremental dataframe I only have data for year=2022/month=2 and year=2022/month=3.

My first approach was performing the merge adding the partitioned fields in the on condition of the merge:

     deltaTable
     .alias("base")
     .merge(pagesBatch.alias('inc'), "(base.year=inc.year and base.month=inc.month) and base.id=inc.id")
     .whenNotMatchedInsertAll()
     .whenMatchedUpdate('base.updated_at <= inc.updated_at', set=updateSet)
     .execute()
     

But after doing some test I saw that after the merge all partitions files were being read. (I checked the last access file times in all the partitions)

However, if I first extract the partitions in the incremental dataframe and I add it using an in clause, it only access the needed files:

     # Partition list extraction code is omitted for simplification.
     deltaTable
     .alias("base")
     .merge(pagesBatch.alias('inc'), "concat(base.year, base.month) in ('202202,'202201') and base.id=inc.id")
     .whenNotMatchedInsertAll()
     .whenMatchedUpdate('base.updated_at <= inc.updated_at', set=updateSet)
     .execute()

The second approach adds some complexity to the code and it is always needed a previous step to extract the partitions to add in the in clause, which would be better to avoid. I also have seen some answers to similar questions which propose using the base.field = inc.field condition. For example here:

How to specify nested partitions in merge query while trying to merge incremental data with a base table?

Any help clarifying how to better approach it to achieve the partition pruning on both fields would be very appreciated.

mdoblado
  • 66
  • 4

0 Answers0