I'm looking to improve the performance of some filtering logic. To accomplish this, the idea is to use Hive partitioning, setting the partition column to a column in the dataset (called splittable_column).
I checked and the cardinality of splittable_column is low; if I subset the dataset to a single value of splittable_column, the result is roughly an 800 MB parquet file.
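For context, the kind of downstream filtering I want to speed up looks roughly like this (the dataframe name and the literal value are just illustrative):

```python
from pyspark.sql import functions as F

# With Hive partitioning on splittable_column, Spark should be able to prune
# to a single directory here instead of scanning the whole dataset.
filtered_df = df.filter(F.col("splittable_column") == "Value A")
```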
If the cardinality of my partition column is 3, my goal is to have the data laid out like:
spark/splittable_column=Value A/part-00000-abc.c000.snappy.parquet
spark/splittable_column=Value B/part-00000-def.c000.snappy.parquet
spark/splittable_column=Value C/part-00000-ghi.c000.snappy.parquet
When I run my_output_df.write_dataframe(df_with_logic, partition_cols=["splittable_column"]) and look at the results, I see many files in the KB range within each directory, which will cause a large overhead during reading (see the sketch of my current transform after the listing below). For example, my output looks like:
spark/splittable_column=Value A/part-00000-abc.c000.snappy.parquet
spark/splittable_column=Value A/part-00001-abc.c000.snappy.parquet
spark/splittable_column=Value A/part-00002-abc.c000.snappy.parquet
...
spark/splittable_column=Value A/part-00033-abc.c000.snappy.parquet
spark/splittable_column=Value B/part-00000-def.c000.snappy.parquet
...
spark/splittable_column=Value B/part-00030-def.c000.snappy.parquet
spark/splittable_column=Value C/part-00000-ghi.c000.snappy.parquet
...
spark/splittable_column=Value C/part-00032-ghi.c000.snappy.parquet
etc.
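For reference, this is roughly what my transform looks like (dataset paths and the input name are placeholders):

```python
from transforms.api import transform, Input, Output


@transform(
    my_output_df=Output("/path/to/partitioned_output"),
    source_df=Input("/path/to/source"),
)
def compute(my_output_df, source_df):
    # df_with_logic is the source dataframe plus my filtering/derivation logic
    df_with_logic = source_df.dataframe()
    my_output_df.write_dataframe(
        df_with_logic,
        partition_cols=["splittable_column"],
    )
```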
From the documentation, I understand that "you will have at least one output file for each unique value in your partition column."
How do I configure the transform so that I get at most one output file per partition value during Hive partitioning?
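For what it's worth, I suspect I need to repartition by splittable_column before writing, so that all rows for a given value land in a single task, but I'm not sure this is the intended configuration:

```python
# My guess (not from the docs): hash-repartition on the partition column so
# each value's rows sit in one Spark partition, which should leave at most
# one file per splittable_column directory.
repartitioned = df_with_logic.repartition("splittable_column")
my_output_df.write_dataframe(repartitioned, partition_cols=["splittable_column"])
```

Is a repartition like this the right approach, or is there a transform-level setting for it?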