8

I am working with Apache Spark to generate parquet files. I can partition them by date with no problems, but internally I cannot seem to lay out the data in the correct order.

The order seems to get lost during processing, which means the parquet metadata is not right (specifically, I want to ensure that the parquet row groups reflect the sorted order so that queries specific to my use case can filter efficiently via the metadata).

Consider the following example:

// note: hbaseSource is a registered temp table generated from HBase
val transformed = sqlContext.sql(s"SELECT id, sampleTime, ... , toDate(sampleTime) as date FROM hbaseSource")

// Repartition the input set by the date column (in my source there should be 2 distinct dates)
val sorted = transformed.repartition($"date").sortWithinPartitions("id", "sampleTime")

sorted.coalesce(1).write.partitionBy("date").parquet(s"/outputFiles")

With this approach, I do get the right parquet partition structure (by date). Even better, within each date partition I see a single large parquet file.

 /outputFiles/date=2018-01-01/part-00000-4f14286c-6e2c-464a-bd96-612178868263.snappy.parquet

However, when I query the file I see the contents out of order. To be more specific, it looks as if several individually ordered data-frame partitions have been merged into the single file.

The parquet row group metadata shows that the sorted fields are actually overlapping (a specific id, for example, could be located in many row groups):

id:             :[min: 54, max: 65012, num_nulls: 0]
sampleTime:     :[min: 1514764810000000, max: 1514851190000000, num_nulls: 0]
id:             :[min: 827, max: 65470, num_nulls: 0]
sampleTime:     :[min: 1514764810000000, max: 1514851190000000, num_nulls: 0]
id:             :[min: 1629, max: 61412, num_nulls: 0]

I want the data to be properly ordered inside each file so the metadata min/max in each row group are non-overlapping.

For example, this is the pattern I want to see:

RG 0: id:             :[min: 54, max: 100, num_nulls: 0]
RG 1: id:             :[min: 100, max: 200, num_nulls: 0]

... where RG = "row group". If I wanted id = 75, the query could find it in one row group.
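For anyone who wants to inspect these statistics themselves, here is a rough sketch using parquet-mr's ParquetFileReader (the file path is a placeholder, and readFooter is deprecated in newer parquet-mr versions but works here):

import scala.collection.JavaConverters._
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.parquet.format.converter.ParquetMetadataConverter
import org.apache.parquet.hadoop.ParquetFileReader

// placeholder path to one of the written part files
val file = new Path("/outputFiles/date=2018-01-01/part-00000-xxxx.snappy.parquet")
val footer = ParquetFileReader.readFooter(new Configuration(), file, ParquetMetadataConverter.NO_FILTER)

// print min/max of the id column for every row group
footer.getBlocks.asScala.zipWithIndex.foreach { case (block, rg) =>
  block.getColumns.asScala
    .filter(_.getPath.toDotString == "id")
    .foreach { col =>
      val s = col.getStatistics
      println(s"RG $rg id: min=${s.minAsString}, max=${s.maxAsString}")
    }
}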

I have tried many variations of the above code: for example, with and without coalesce (I know coalesce is bad, but my idea was to use it to prevent shuffling), and sort instead of sortWithinPartitions (sort should create a totally ordered sort, but results in many partitions). For example:

val sorted = transformed.repartition($"date").sort("id", "sampleTime") 
sorted.write.partitionBy("date").parquet(s"/outputFiles")

This gives me 200 files, which is too many, and they are still not sorted correctly. I can reduce the file count by adjusting the shuffle size, but I would have expected the sort to be preserved through the write (I was under the impression that writes did not shuffle the input). The order I see is as follows (other fields omitted for brevity):

+----------+----------------+
|        id|      sampleTime|
+----------+----------------+
|     56868|1514840220000000|
|     57834|1514785180000000|
|     56868|1514840220000000|
|     57834|1514785180000000|
|     56868|1514840220000000|

This looks like interleaved sorted partitions. So I think repartition buys me nothing here, and sort seems incapable of preserving order through the write step.
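For completeness, the shuffle-size adjustment mentioned above would look roughly like this (assuming spark.sql.shuffle.partitions is the knob in question; the value 2 matches my two distinct dates):

// reduce the number of shuffle partitions so the global sort produces fewer output files
sqlContext.setConf("spark.sql.shuffle.partitions", "2")

val sorted = transformed.sort("id", "sampleTime")
sorted.write.partitionBy("date").parquet(s"/outputFiles")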

I've read that what I want to do should be possible. I've even tried the approach outlined in the presentation "Parquet performance tuning: The missing guide" by Ryan Blue (unfortunately it is behind the O'Reilly paywall), which involves using insertInto. In that case Spark seemed to use an old version of parquet-mr that corrupted the metadata, and I am not sure how to upgrade it.

I am not sure what I am doing wrong. My feeling is that I am misunderstanding the way repartition($"date") and sort work and/or interact.

I would appreciate any ideas. Apologies for the essay. :)

edit: Also note that if I do a show(n) on transformed.sort("id", "sampleTime") the data is sorted correctly. So it seems like the problem occurs during the write stage. As noted above, it does seem like the output of the sort is shuffled during the write.
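That sanity check is roughly this (with n = 20 as an example):

// sorting alone, with no write, shows correctly ordered output
transformed.sort("id", "sampleTime").show(20)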

Shaido
ZenMasterZed

2 Answers

10

The problem is that when saving to a file format, Spark requires a certain ordering, and if that ordering is not satisfied, Spark will sort the data itself during the save and will forget your sort. To be more specific, Spark requires this ordering (taken directly from the source code of Spark 2.4.4):

val requiredOrdering = partitionColumns ++ bucketIdExpression ++ sortColumns

where partitionColumns are the columns by which you partition the data. You are not using bucketing, so bucketIdExpression and sortColumns are not relevant in this example, and the requiredOrdering will be just the partitionColumns. So if this is your code:

val sorted = transformed.repartition($"date").sortWithinPartitions("id", "sampleTime")

sorted.write.partitionBy("date").parquet(s"/outputFiles")

Spark will check whether the data is sorted by date, which it is not, so Spark will forget your sort and will sort it by date itself. On the other hand, if you instead do it like this:

val sorted = transformed.repartition($"date").sortWithinPartitions("date", "id", "sampleTime")

sorted.write.partitionBy("date").parquet(s"/outputFiles")

Spark will again check whether the data is sorted by date, and this time it is (the requirement is satisfied), so Spark will preserve this order and will induce no additional sorts while saving the data. So I believe it should work this way.
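For context, the check that triggers this behavior lives in FileFormatWriter; paraphrased from the Spark 2.4.x source (not runnable on its own, shown only to illustrate that the comparison is a prefix match against the plan's output ordering):

val requiredOrdering = partitionColumns ++ bucketIdExpression ++ sortColumns
// only the expressions are compared; the sort direction does not matter
val actualOrdering = plan.outputOrdering.map(_.child)
val orderingMatched =
  if (requiredOrdering.length > actualOrdering.length) {
    false
  } else {
    requiredOrdering.zip(actualOrdering).forall {
      case (requiredOrder, childOutputOrder) =>
        requiredOrder.semanticEquals(childOutputOrder)
    }
  }
// when orderingMatched is false, Spark adds its own sort and the user's ordering is lost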

David Vrba
  • Nit: `sortColumns` are **relevant** in the example; the issue is that Spark ignores them (it) unless `partitionColumns` are also `sortWithinPartitions`-ed. – ijoseph Jul 05 '22 at 18:34
0

Just an idea: sort after the coalesce, i.e. ".coalesce(1).sortWithinPartitions()" (a sketch follows below). Also, the expected result looks strange: why is ordered data in parquet required at all? Sorting after reading seems more appropriate.
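A sketch of that idea with the column names from the question (untested):

// collapse to a single partition first, then sort inside it before writing
val singleSorted = transformed
  .coalesce(1)
  .sortWithinPartitions("id", "sampleTime")

singleSorted.write.partitionBy("date").parquet(s"/outputFiles")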

pasha701
  • Thanks for your response. As stated, sorting the data prior to saving it into parquet means that the parquet metadata will be optimal for my use case. This is a well-known parquet optimization for queries with specific filters, since the reader can skip entire parquet row groups (vs. having to inspect them). I have tried numerous variations of `coalesce` and `sortWithinPartitions` (including your suggestion). As per my final note, the result from a straight sort operation is ordered fine; the ordering seems to change on the final write. – ZenMasterZed Sep 05 '18 at 01:24