I have 30TB of data partitioned by date and hour, split into 300 files per hour. I do some data conversion and then want the data sorted and saved in sorted order for easy ingestion by a C++ program. I understand that when you serialize, the ordering is only correct within a file. I was hoping to circumvent this by better partitioning the data.
I want to sort by both sessionId and timestamp, and I don't want a sessionId split between different files. If I partition on the sessionId I will have far too many partitions, so I take it modulo N to generate N buckets, aiming for roughly 100-200MB of data per bucket:
from pyspark.sql import functions as F
df = df.withColumn("bucket", F.abs(F.col("sessionId")) % F.lit(50))
I then repartition by date, hour and bucket before sorting:
df = df.repartition(50,"dt","hr","bucket")
df = df.sortWithinPartitions("sessionId","timestamp")
df.write.option("compression","gzip").partitionBy("dt","hr","bucket").parquet(SAVE_PATH)
This saves the data to dt/hr/bucket with one file per bucket, but the ordering is lost. If I don't create buckets or repartition, I end up with 200 files and the data is ordered, but sessionIds are split across multiple files.
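For what it's worth, this is roughly how I check whether a single written file is still sorted (just a sketch; the part-file path is a placeholder and pandas needs a parquet engine such as pyarrow installed):

import pandas as pd

# Placeholder path to one part file under a dt=/hr=/bucket= directory
part_file = "dt=2020-01-01/hr=00/bucket=7/part-00000.gz.parquet"
part = pd.read_parquet(part_file, columns=["sessionId", "timestamp"])

# True only if the rows come back already ordered by (sessionId, timestamp)
resorted = part.sort_values(["sessionId", "timestamp"]).reset_index(drop=True)
print(part.equals(resorted))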
EDIT:
The issue seems to be the save with partitionBy("dt","hr","bucket"), which reshuffles the data so it is no longer sorted. If I save without partitionBy, I get exactly what I expect: N files for N buckets/partitions, each sessionId confined to a single file, all sorted correctly. So for now I have a non-Spark hack that manually iterates over all the date + hour directories.
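Roughly, the hack looks like this (just a sketch: spark is an existing SparkSession, INPUT_PATH and the enumeration of date/hour pairs are placeholders, and the actual conversion step is omitted):

from pyspark.sql import functions as F

for dt, hr in date_hour_pairs:  # placeholder: however the dt/hr directories are enumerated
    df = spark.read.parquet(f"{INPUT_PATH}/dt={dt}/hr={hr}")
    # ... data conversion happens here ...
    df = df.withColumn("bucket", F.abs(F.col("sessionId")) % F.lit(50))
    df = df.repartition(50, "bucket")
    df = df.sortWithinPartitions("sessionId", "timestamp")
    # No partitionBy: each partition is dumped as a single file, so the sort survives
    df.write.option("compression", "gzip").mode("overwrite").parquet(f"{SAVE_PATH}/dt={dt}/hr={hr}")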
This seems like a bug: if you partition by a column, sort, and then write with partitionBy on that same column, you would expect a direct dump of the sorted partitions, not a randomized re-shuffling of the data.