I have a Spark job that writes data to Parquet files with Snappy compression. One of the columns in the Parquet schema is a repeated INT64.
After upgrading from Spark 2.2 with Parquet 1.8.2 to Spark 3.1.1 with Parquet 1.10.1, I saw a severe degradation in the compression ratio.
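Schematically, the write looks something like this (a minimal sketch, not the actual job; the column name and output path are placeholders):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("parquet-snappy-write").getOrCreate()
import spark.implicits._

// array<bigint> column, stored as a repeated INT64 in Parquet
val df = Seq(
  Seq(4L, 17L, 1967324L),
  Seq(42L, 99L)
).toDF("numbers")

df.write
  .option("compression", "snappy")   // snappy-compressed column chunks
  .parquet("/tmp/numbers_parquet")   // placeholder output path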
For example, for this file (saved with Spark 2.2) I have the following metadata:
creator: parquet-mr version 1.8.2 (build c6522788629e590a53eb79874b95f6c3ff11f16c)
extra: org.apache.spark.sql.parquet.row.metadata = {"type":"struct","fields":[{"name":"numbers","type":{"type":"array","elementType":"long","containsNull":true},"nullable":true,"metadata":{}}]}
file schema: spark_schema
--------------------------------------------------------------------------------
numbers: OPTIONAL F:1
.list: REPEATED F:1
..element: OPTIONAL INT64 R:1 D:3
row group 1: RC:186226 TS:163626010 OFFSET:4
--------------------------------------------------------------------------------
numbers:
.list:
..element: INT64 SNAPPY DO:0 FPO:4 SZ:79747617/163626010/2.05 VC:87158527 ENC:RLE,PLAIN_DICTIONARY ST:[min: 4, max: 1967324, num_nulls: 39883]
Reading it with Spark 3.1 and saving it again as Parquet, I get the following metadata, and the Parquet part size increases from 76 MB to 124 MB:
creator: parquet-mr version 1.10.1 (build a89df8f9932b6ef6633d06069e50c9b7970bebd1)
extra: org.apache.spark.version = 3.1.1
extra: org.apache.spark.sql.parquet.row.metadata = {"type":"struct","fields":[{"name":"numbers","type":{"type":"array","elementType":"long","containsNull":true},"nullable":true,"metadata":{}}]}
file schema: spark_schema
--------------------------------------------------------------------------------
numbers: OPTIONAL F:1
.list: REPEATED F:1
..element: OPTIONAL INT64 R:1 D:3
row group 1: RC:186226 TS:163655597 OFFSET:4
--------------------------------------------------------------------------------
numbers:
.list:
..element: INT64 SNAPPY DO:0 FPO:4 SZ:129657160/163655597/1.26 VC:87158527 ENC:RLE,PLAIN_DICTIONARY ST:[min: 4, max: 1967324, num_nulls: 39883]
Note that the compression ratio decreased from 2.05 to 1.26, even though the total uncompressed size (TS) is almost unchanged.
I tried looking for any configuration defaults that changed between these Spark or Parquet versions. The only thing I could find is parquet.writer.max-padding, which changed from 0 to 8 MB, but even after changing this configuration back to 0, I get the same results.
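For reference, one way to push that override down to the Parquet writer is through Spark's Hadoop configuration (a sketch of the mechanism, not my exact job code):

// Parquet's padding setting, read by the writer from the Hadoop configuration
spark.sparkContext.hadoopConfiguration.set("parquet.writer.max-padding", "0")

// Equivalent at submit time (spark.hadoop.* keys are copied into the Hadoop conf):
//   spark-submit --conf spark.hadoop.parquet.writer.max-padding=0 ...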
Below is the ParquetOutputFormat configuration I have with both setups:
Parquet block size to 134217728
Parquet page size to 1048576
Parquet dictionary page size to 1048576
Dictionary is on
Validation is off
Writer version is: PARQUET_1_0
Maximum row group padding size is 0 bytes
Page size checking is: estimated
Min row count for page size check is: 100
Max row count for page size check is: 10000
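These correspond to Parquet's Hadoop configuration keys, so they can also be pinned explicitly to rule out a default silently changing between versions (a sketch mirroring the values above):

// Pin the writer settings so both setups use identical values
val hadoopConf = spark.sparkContext.hadoopConfiguration
hadoopConf.set("parquet.block.size", "134217728")
hadoopConf.set("parquet.page.size", "1048576")
hadoopConf.set("parquet.dictionary.page.size", "1048576")
hadoopConf.set("parquet.enable.dictionary", "true")
hadoopConf.set("parquet.writer.version", "PARQUET_1_0")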
I would appreciate any guidance here.
Thanks!
UPDATE
I checked Spark 3 with snappy 1.1.2.6 (the version used by Spark 2.2), and the compression ratio looks good. I will look further into this issue and update with my findings.
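For anyone who wants to try the same comparison: assuming this refers to the org.xerial.snappy:snappy-java artifact, the older version can be forced in the build, e.g. in build.sbt (note that Spark bundles its own snappy jar under $SPARK_HOME/jars, which may also need to be swapped at runtime):

// Force the snappy-java version that Spark 2.2 shipped with
dependencyOverrides += "org.xerial.snappy" % "snappy-java" % "1.1.2.6"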