I'm generating Parquet files via two methods: a Kinesis Firehose and a Spark job. They are both written into the same partition structure on S3. Both sets of data can be queried using the same Athena table definition. Both use gzip compression.
However, I'm noticing that the Parquet files generated by Spark are about 3x as large as those from Firehose. Is there any reason this should be the case? I do see some schema and metadata differences when I load them using PyArrow:
>>> import pyarrow.parquet as pq
>>> spark = pq.ParquetFile('<spark object name>.gz.parquet')
>>> spark.metadata
<pyarrow._parquet.FileMetaData object at 0x101f2bf98>
created_by: parquet-mr version 1.8.3 (build aef7230e114214b7cc962a8f3fc5aeed6ce80828)
num_columns: 4
num_rows: 11
num_row_groups: 1
format_version: 1.0
serialized_size: 1558
>>> spark.schema
<pyarrow._parquet.ParquetSchema object at 0x101f2f438>
uri: BYTE_ARRAY UTF8
dfpts.list.element: BYTE_ARRAY UTF8
udids.list.element: BYTE_ARRAY UTF8
uuids.list.element: BYTE_ARRAY UTF8
>>> firehose = pq.ParquetFile('<firehose object name>.parquet')
>>> firehose.metadata
<pyarrow._parquet.FileMetaData object at 0x10fc63458>
created_by: parquet-mr version 1.8.1 (build 4aba4dae7bb0d4edbcf7923ae1339f28fd3f7fcf)
num_columns: 4
num_rows: 156
num_row_groups: 1
format_version: 1.0
serialized_size: 1017
>>> firehose.schema
<pyarrow._parquet.ParquetSchema object at 0x10fc5e7b8>
udids.bag.array_element: BYTE_ARRAY UTF8
dfpts.bag.array_element: BYTE_ARRAY UTF8
uuids.bag.array_element: BYTE_ARRAY UTF8
uri: BYTE_ARRAY UTF8
Is it likely that the schema difference is the culprit? Something else?
These two specific files don't contain the exact same data, but based on my Athena queries the total cardinality of all lists for all rows in the Firehose file is roughly 2.5x what's in the Spark file.
EDITED TO ADD:
I wrote the following to dump the contents of each Parquet file to stdout, one row per line:
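import sys
import pyarrow.parquet as pq

# Read the whole file named on the command line into memory.
table = pq.read_table(sys.argv[1])
# Convert to a dict of column name -> list of Python values.
pydict = table.to_pydict()
# Print one row per line: the uri followed by the three list columns.
for i in range(table.num_rows):
    print(f"{pydict['uri'][i]}, {pydict['dfpts'][i]}, {pydict['udids'][i]}, {pydict['uuids'][i]}")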
I then ran that against each Parquet file and redirected the output to a file. Here are the sizes of the original two files, the output of the above Python code for each file, and the gzipped version of that output:
-rw-r--r-- 1 myuser staff 1306337 Jun 28 16:19 firehose.parquet
-rw-r--r-- 1 myuser staff 8328156 Jul 2 15:09 firehose.printed
-rw-r--r-- 1 myuser staff 5009543 Jul 2 15:09 firehose.printed.gz
-rw-r--r-- 1 myuser staff 1233761 Jun 28 16:23 spark.parquet
-rw-r--r-- 1 myuser staff 3213528 Jul 2 15:09 spark.printed
-rw-r--r-- 1 myuser staff 1951058 Jul 2 15:09 spark.printed.gz
Notice that the two Parquet files are approximately the same size, but the "printed" content of the Firehose file is approximately 2.5x the size of the "printed" content from the Spark file, and the two printed dumps are about equally compressible with gzip.
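To put rough numbers on that comparison, here is a small sketch that only does arithmetic on the byte counts from the listing above. The dictionary layout and the helper names (`ratio`, `gzip_ratio`) are my own, made up for illustration.

```python
# Byte counts copied from the ls -l listing above.
sizes = {
    "firehose": {"parquet": 1306337, "printed": 8328156, "printed_gz": 5009543},
    "spark": {"parquet": 1233761, "printed": 3213528, "printed_gz": 1951058},
}


def ratio(kind):
    """Firehose size divided by Spark size for one kind of file."""
    return sizes["firehose"][kind] / sizes["spark"][kind]


def gzip_ratio(source):
    """How much gzip shrinks the printed dump for one source."""
    return sizes[source]["printed"] / sizes[source]["printed_gz"]


print(f"parquet size, firehose/spark: {ratio('parquet'):.2f}")
print(f"printed size, firehose/spark: {ratio('printed'):.2f}")
for source in sizes:
    print(f"gzip ratio of printed dump, {source}: {gzip_ratio(source):.2f}")
```

The printed dump is only a proxy for the raw data, since it includes Python's list formatting, so these ratios are approximate.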
So: what is taking up all the space in the Spark parquet file if it's not the raw data?
EDITED TO ADD:
Below is the output from "parquet-tools meta". The compression ratios for each column look similar, but the firehose file contains many more values per uncompressed byte. For the "dfpts" column:
firehose:
SZ:667849/904992/1.36 VC:161475
spark:
SZ:735561/1135861/1.54 VC:62643
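As a quick worked version of the "values per uncompressed byte" comparison, this sketch divides the sizes by the value counts for the dfpts column. It assumes SZ is compressed/uncompressed/ratio and VC is the value count, which matches the numbers shown; the names `dfpts` and `bytes_per_value` are just illustrative.

```python
# SZ (compressed/uncompressed) and VC values copied from the dfpts lines above.
dfpts = {
    "firehose": {"compressed": 667849, "uncompressed": 904992, "values": 161475},
    "spark": {"compressed": 735561, "uncompressed": 1135861, "values": 62643},
}


def bytes_per_value(stats, key="uncompressed"):
    """Average number of bytes the column chunk spends per value."""
    return stats[key] / stats["values"]


for source, stats in dfpts.items():
    print(
        f"{source}: "
        f"{bytes_per_value(stats):.1f} uncompressed, "
        f"{bytes_per_value(stats, 'compressed'):.1f} compressed bytes per value"
    )
```

By this measure the Spark file spends roughly three times as many bytes per dfpts value as the Firehose file, both before and after compression.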
parquet-tools meta output:
file: file:/Users/jh01792/Downloads/firehose.parquet
creator: parquet-mr version 1.8.1 (build 4aba4dae7bb0d4edbcf7923ae1339f28fd3f7fcf)
file schema: hive_schema
--------------------------------------------------------------------------------
udids: OPTIONAL F:1
.bag: REPEATED F:1
..array_element: OPTIONAL BINARY L:STRING R:1 D:3
dfpts: OPTIONAL F:1
.bag: REPEATED F:1
..array_element: OPTIONAL BINARY L:STRING R:1 D:3
uuids: OPTIONAL F:1
.bag: REPEATED F:1
..array_element: OPTIONAL BINARY L:STRING R:1 D:3
uri: OPTIONAL BINARY L:STRING R:0 D:1
row group 1: RC:156 TS:1905578 OFFSET:4
--------------------------------------------------------------------------------
udids:
.bag:
..array_element: BINARY GZIP DO:0 FPO:4 SZ:421990/662241/1.57 VC:60185 ENC:RLE,PLAIN_DICTIONARY ST:[num_nulls: 58, min/max not defined]
dfpts:
.bag:
..array_element: BINARY GZIP DO:0 FPO:421994 SZ:667849/904992/1.36 VC:161475 ENC:RLE,PLAIN_DICTIONARY ST:[num_nulls: 53, min/max not defined]
uuids:
.bag:
..array_element: BINARY GZIP DO:0 FPO:1089843 SZ:210072/308759/1.47 VC:39255 ENC:RLE,PLAIN_DICTIONARY ST:[num_nulls: 32, min/max not defined]
uri: BINARY GZIP DO:0 FPO:1299915 SZ:5397/29586/5.48 VC:156 ENC:BIT_PACKED,RLE,PLAIN_DICTIONARY ST:[num_nulls: 0, min/max not defined]
file: file:/Users/jh01792/Downloads/spark.parquet
creator: parquet-mr version 1.8.3 (build aef7230e114214b7cc962a8f3fc5aeed6ce80828)
extra: org.apache.spark.sql.parquet.row.metadata = {"type":"struct","fields":[{"name":"uri","type":"string","nullable":false,"metadata":{}},{"name":"dfpts","type":{"type":"array","elementType":"string","containsNull":true},"nullable":true,"metadata":{}},{"name":"udids","type":{"type":"array","elementType":"string","containsNull":true},"nullable":true,"metadata":{}},{"name":"uuids","type":{"type":"array","elementType":"string","containsNull":true},"nullable":true,"metadata":{}}]}
file schema: spark_schema
--------------------------------------------------------------------------------
uri: REQUIRED BINARY L:STRING R:0 D:0
dfpts: OPTIONAL F:1
.list: REPEATED F:1
..element: OPTIONAL BINARY L:STRING R:1 D:3
udids: OPTIONAL F:1
.list: REPEATED F:1
..element: OPTIONAL BINARY L:STRING R:1 D:3
uuids: OPTIONAL F:1
.list: REPEATED F:1
..element: OPTIONAL BINARY L:STRING R:1 D:3
row group 1: RC:11 TS:1943008 OFFSET:4
--------------------------------------------------------------------------------
uri: BINARY GZIP DO:0 FPO:4 SZ:847/2530/2.99 VC:11 ENC:PLAIN,BIT_PACKED ST:[num_nulls: 0, min/max not defined]
dfpts:
.list:
..element: BINARY GZIP DO:0 FPO:851 SZ:735561/1135861/1.54 VC:62643 ENC:RLE,PLAIN_DICTIONARY ST:[num_nulls: 0, min/max not defined]
udids:
.list:
..element: BINARY GZIP DO:0 FPO:736412 SZ:335289/555989/1.66 VC:23323 ENC:RLE,PLAIN_DICTIONARY ST:[num_nulls: 0, min/max not defined]
uuids:
.list:
..element: BINARY GZIP DO:0 FPO:1071701 SZ:160494/248628/1.55 VC:13305 ENC:RLE,PLAIN_DICTIONARY ST:[num_nulls: 0, min/max not defined]
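To make the same bytes-per-value comparison for every list column rather than just dfpts, here is a rough sketch that pulls the SZ and VC fields out of the column-chunk lines with a regular expression. It assumes the line format shown in the output above; the `CHUNK` pattern, the `chunk_stats` helper and the dictionary labels are hypothetical names of mine, and the lines are abbreviated to the fields that matter.

```python
import re

# Matches the SZ:compressed/uncompressed/ratio and VC:count fields
# as they appear in the "parquet-tools meta" column-chunk lines above.
CHUNK = re.compile(r"SZ:(\d+)/(\d+)/[\d.]+\s+VC:(\d+)")


def chunk_stats(line):
    """Return (compressed, uncompressed, values), or None if the line has no SZ/VC."""
    match = CHUNK.search(line)
    return tuple(int(g) for g in match.groups()) if match else None


# Abbreviated copies of the column-chunk lines above, keyed by labels I made up.
lines = {
    "firehose udids": "SZ:421990/662241/1.57 VC:60185",
    "firehose dfpts": "SZ:667849/904992/1.36 VC:161475",
    "firehose uuids": "SZ:210072/308759/1.47 VC:39255",
    "spark dfpts": "SZ:735561/1135861/1.54 VC:62643",
    "spark udids": "SZ:335289/555989/1.66 VC:23323",
    "spark uuids": "SZ:160494/248628/1.55 VC:13305",
}

for label, line in lines.items():
    compressed, uncompressed, values = chunk_stats(line)
    print(f"{label}: {uncompressed / values:.1f} uncompressed bytes per value")
```

For all three list columns the Spark file comes out at noticeably more uncompressed bytes per value than the Firehose file, which is the difference I'm trying to explain.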