I am working with Apache Spark to generate parquet files. I can partition them by date with no problems, but internally I can not seem to lay out the data in the correct order.
The order seems to get lost during processing, which means the parquet metadata is not right (specifically I want to ensure that the parquet row groups are reflecting sorted order so that queries specific to my use case can filter efficiently via the metadata).
Consider the following example:
// note: hbase source is a registered temp table generated from hbase
val transformed = sqlContext.sql(s"SELECT id, sampleTime, ... , toDate(sampleTime) as date FROM hbaseSource")
// Repartion the input set by the date column (in my source there should be 2 distinct dates)
val sorted = transformed.repartition($"date").sortWithinPartitions("id", "sampleTime")
sorted.coalesce(1).write.partitionBy("date").parquet(s"/outputFiles")
With this approach, I do get the right parquet partition structure ( by date). And even better, for each date partition, I see a single large parquet file.
/outputFiles/date=2018-01-01/part-00000-4f14286c-6e2c-464a-bd96-612178868263.snappy.parquet
However, when I query the file I see the contents out of order. To be specific, "out of order" seems more like several ordered data-frame partitions have been merged into the file.
The parquet row group metadata shows that the sorted fields are actually overlapping ( a specific id, for example, could be located in many row groups ):
id: :[min: 54, max: 65012, num_nulls: 0]
sampleTime: :[min: 1514764810000000, max: 1514851190000000, num_nulls: 0]
id: :[min: 827, max: 65470, num_nulls: 0]
sampleTime: :[min: 1514764810000000, max: 1514851190000000, num_nulls: 0]
id: :[min: 1629, max: 61412, num_nulls: 0]
I want the data to be properly ordered inside each file so the metadata min/max in each row group are non-overlapping.
For example, this is the pattern I want to see:
RG 0: id: :[min: 54, max: 100, num_nulls: 0]
RG 1: id: :[min: 100, max: 200, num_nulls: 0]
... where RG = "row group". If I wanted id = 75
, the query could find it in one row group.
I have tried many variations of the above code. For example with and without coalesce
(I know coalesce is bad, but my idea was to use it to prevent shuffling). I have also tried sort
instead of sortWithinPartitions
(sort should create a total ordered sort, but will result in many partitions). For example:
val sorted = transformed.repartition($"date").sort("id", "sampleTime")
sorted.write.partitionBy("date").parquet(s"/outputFiles")
Gives me 200 files, which is too many, and they are still not sorted correctly. I can reduce the file count by adjusting the shuffle size, but I would have expected sort to be processed in order during the write (I was under the impression that writes did not shuffle the input). The order I see is as follows (other fields omitted for brevity):
+----------+----------------+
|id| sampleTime|
+----------+----------------+
| 56868|1514840220000000|
| 57834|1514785180000000|
| 56868|1514840220000000|
| 57834|1514785180000000|
| 56868|1514840220000000|
Which looks like it's interleaved sorted partitions. So I think repartition
buys me nothing here, and sort
seems to be incapable of preserving order on the write step.
I've read that what I want to do should be possible. I've even tried the approach outlined in the presentation "Parquet performance tuning:
The missing guide" by Ryan Blue ( unfortunately it is behind the OReily paywall). That involves using insertInto
. In that case, spark seemed to use an old version of parquet-mr which corrupted the metadata, and I am not sure how to upgrade it.
I am not sure what I am doing wrong. My feeling is that I am misunderstanding the way repartition($"date")
and sort
work and/or interact.
I would appreciate any ideas. Apologies for the essay. :)
edit:
Also note that if I do a show(n)
on transformed.sort("id", "sampleTime")
the data is sorted correctly. So it seems like the problem occurs during the write stage. As noted above, it does seem like the output of the sort is shuffled during the write.