
Running Databricks to read CSV files and then save them as a partitioned Delta table.

There are 179,619,219 records in total. The data is being partitioned on column A (8,419 unique values), year (10 years), and month.

df.write.partitionBy("A", "year", "month").format("delta") \
  .mode("append").save(path)

The job gets stuck on the write step and aborts after running for 5-6 hours.


1 Answer


This is a very bad partitioning schema. You simply have too many unique values in column A, and the additional partitioning multiplies them further: 8,419 values of A × 10 years is already ~84k combinations, and adding month pushes the upper bound toward ~1M. Spark will need to create at least ~90k partitions, each of which requires its own (small) files, and small files harm performance.

For non-Delta tables, partitioning is primarily used for data skipping when reading the data. But for Delta Lake tables partitioning may not be as important, since Delta on Databricks already includes data skipping, you can apply ZORDER, etc.

I would recommend using a different partitioning schema, for example year + month only, and running OPTIMIZE with ZORDER BY on column A after the data is written. This will lead to only a few partitions (at most ~120 for 10 years of monthly data) with bigger files.
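
A minimal sketch of that approach, reusing df and path from the question (the OPTIMIZE statement via spark.sql uses the standard Databricks path syntax; adjust it if you write to a named table instead of a path):

# Write with far fewer partitions: year + month only (~120 directories)
df.write.partitionBy("year", "month").format("delta") \
  .mode("append").save(path)

# Compact small files and co-locate rows by A, so data skipping on A
# still works without a physical partition per value of A
spark.sql(f"OPTIMIZE delta.`{path}` ZORDER BY (A)")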

Alex Ott
  • Sorry to ask a basic question: when I repartition data based on some columns, how do I check how many partitions it created? I see a lot of skewness while writing data to the Delta table; how do I handle this? Any samples or documentation on this? – Shasu Sep 26 '22 at 14:53
  • Simplest method: do `df.dropDuplicates(["col1", "col2"])` and then count the result (see the sketch below). – Alex Ott Sep 26 '22 at 18:18
  • I did not understand the rationale behind de-duplication and the number of partitions. dropDuplicates only removes duplicate combinations of col1 and col2, right? Then how does that give the number of partitions? – Shasu Sep 26 '22 at 19:47
  • Do it on the columns that you select for partitioning. – Alex Ott Sep 26 '22 at 20:00
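
A small sketch of the counting tip from the comments, using the partition columns from the question (the per-group skew check at the end is an extra illustration, not something stated in the thread):

# Number of distinct partition-key combinations = number of directories
# that partitionBy("A", "year", "month") would create
num_partitions = df.dropDuplicates(["A", "year", "month"]).count()
print(num_partitions)

# Rows per partition-key combination, useful for spotting skew
df.groupBy("A", "year", "month").count().orderBy("count", ascending=False).show(20)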