
I have 30 TB of data partitioned by date and hour, split into 300 files per hour. I do some data conversion and then want the data sorted and saved in sorted order for easy ingestion by a C++ program. I understand that when you serialize, the ordering is only guaranteed within a file. I was hoping to circumvent this with better partitioning of the data.

I want to sort by both sessionId and timestamp, and I don't want sessionIds split between different files. If I partition on sessionId I will have far too many partitions, so I take it modulo N to generate N buckets, aiming for roughly 100-200 MB of data per bucket:

import pyspark.sql.functions as F

df = df.withColumn("bucket", F.abs(F.col("sessionId")) % F.lit(50))

I then repartition by date, hour and bucket before sorting:

# 50 partitions keyed on (dt, hr, bucket), so all rows for one bucket land in the same partition
df = df.repartition(50,"dt","hr","bucket")
df = df.sortWithinPartitions("sessionId","timestamp")
df.write.option("compression","gzip").partitionBy("dt","hr","bucket").parquet(SAVE_PATH)

This saves the data to dt/hr/bucket with one file in each bucket, but the ordering is lost. If I don't create buckets and repartition, I end up with 200 files; the data is ordered, but the sessionIds are split across multiple files.

EDIT: The issue seems to be saving with partitionBy("dt","hr","bucket"), which reshuffles the data so it is no longer sorted. If I save without partitionBy, I get exactly what I expect: N files for N buckets/partitions, each sessionId confined to a single file, all sorted correctly. So for now I have a non-Spark hack that manually iterates over all the date + hour directories, roughly as sketched below.
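Here is only a minimal sketch of that per-directory workaround, assuming an active spark session and a hypothetical INPUT_PATH and date/hour list (only SAVE_PATH comes from the code above); each hour is read, bucketed and sorted on its own, then written without partitionBy so the sorted partitions are dumped as-is:

import pyspark.sql.functions as F

# Hypothetical list of (date, hour) directories to process
for dt, hr in [("2019-11-15", h) for h in range(24)]:
    hourly = (spark.read.parquet(f"{INPUT_PATH}/dt={dt}/hr={hr}")
                   .withColumn("bucket", F.abs(F.col("sessionId")) % F.lit(50))
                   .repartition(50, "bucket")
                   .sortWithinPartitions("sessionId", "timestamp"))
    # No partitionBy here, so each of the 50 sorted partitions is written
    # directly as one file under this hour's output directory
    hourly.write.option("compression", "gzip").parquet(f"{SAVE_PATH}/dt={dt}/hr={hr}")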

It seems like a bug: if you partition by a column, sort, and then write with partitionBy on that same column, you expect a direct dump of the sorted partitions, not a randomized re-shuffling of the data.

user1978816
    Try using [`sortWithinPartitions()`](http://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.DataFrame.sortWithinPartitions), for example like: `df.write.option("compression","gzip").partitionBy("dt","hr","bucket").sortWithinPartitions("sessionId", "timestamp").parquet(SAVE_PATH)` – pault Nov 15 '19 at 17:21
  • That is the kind of thing I have been trying, but I get this error: `AttributeError: 'DataFrameWriter' object has no attribute 'sortWithinPartitions'` It seems that when it comes to writing out the partitions you have very few options to control file sizes, buckets and ordering. As a beginner it seems very confusing that you can order 30TB of data and then, when it comes to saving it, you lose the expensive ordering – user1978816 Nov 15 '19 at 19:33
  • @syadav. https://quoteinvestigator.com/2012/04/28/shorter-letter/ I made some edits and improved the text, but I like to give enough context so people know what I intend to achieve because there might be a more holistic solution – user1978816 Nov 17 '19 at 07:18

1 Answer


Putting the partition columns in the sorted columns list might do the trick.

Full description here - https://stackoverflow.com/a/59161488/3061686
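For concreteness, a sketch of what that suggestion might look like applied to the code in the question, putting the partition columns ahead of the real sort keys (untested, reusing the question's column names and SAVE_PATH):

df = df.repartition(50, "dt", "hr", "bucket")
# Sort by the partition columns first, then by the keys you actually care about,
# so each file written per (dt, hr, bucket) stays ordered by sessionId, timestamp
df = df.sortWithinPartitions("dt", "hr", "bucket", "sessionId", "timestamp")
df.write.option("compression", "gzip").partitionBy("dt", "hr", "bucket").parquet(SAVE_PATH)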

akki