When I save the same table to Parquet using Pandas and Dask, Pandas creates a 4k file, whereas Dask creates a 39M file.
Create the dataframe
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq
import dask.dataframe as dd
n = int(1e7)
# 10 million rows of the same 64-character string; this should dictionary-encode to almost nothing
df = pd.DataFrame({'col': ['a'*64]*n})
Save it in different ways
# Pandas: 4k
df.to_parquet('example-pandas.parquet')
# PyArrow: 4k
pq.write_table(pa.Table.from_pandas(df), 'example-pyarrow.parquet')
# Dask: 39M
dd.from_pandas(df, npartitions=1).to_parquet('example-dask.parquet', compression='snappy')
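For completeness, this is roughly how I measured the sizes (a small sketch; since Dask writes a directory of part files, I sum over everything inside it):

from pathlib import Path

def on_disk_size(path):
    # Single file (Pandas/PyArrow output) or directory of part files (Dask output)
    p = Path(path)
    if p.is_file():
        return p.stat().st_size
    return sum(f.stat().st_size for f in p.rglob('*') if f.is_file())

for name in ['example-pandas.parquet', 'example-pyarrow.parquet', 'example-dask.parquet']:
    print(name, on_disk_size(name))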
At first I thought that Dask doesn't utilize dictionary and run-length encoding, but that does not seem to be the case. I am not sure if I'm interpreting the metadata correctly, but at the very least it looks identical for the two files:
>>> pq.read_metadata('example-pandas.parquet').row_group(0).column(0)
<pyarrow._parquet.ColumnChunkMetaData object at 0x7fbee7d1a770>
  file_offset: 548
  file_path:
  physical_type: BYTE_ARRAY
  num_values: 10000000
  path_in_schema: col
  is_stats_set: True
  statistics:
    <pyarrow._parquet.Statistics object at 0x7fbee7d2cc70>
      has_min_max: True
      min: aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
      max: aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
      null_count: 0
      distinct_count: 0
      num_values: 10000000
      physical_type: BYTE_ARRAY
      logical_type: String
      converted_type (legacy): UTF8
  compression: SNAPPY
  encodings: ('PLAIN_DICTIONARY', 'PLAIN', 'RLE')
  has_dictionary_page: True
  dictionary_page_offset: 4
  data_page_offset: 29
  total_compressed_size: 544
  total_uncompressed_size: 596
>>> pq.read_metadata('example-dask.parquet/part.0.parquet').row_group(0).column(0)
<pyarrow._parquet.ColumnChunkMetaData object at 0x7fbee7d3d180>
  file_offset: 548
  file_path:
  physical_type: BYTE_ARRAY
  num_values: 10000000
  path_in_schema: col
  is_stats_set: True
  statistics:
    <pyarrow._parquet.Statistics object at 0x7fbee7d3d1d0>
      has_min_max: True
      min: aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
      max: aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
      null_count: 0
      distinct_count: 0
      num_values: 10000000
      physical_type: BYTE_ARRAY
      logical_type: String
      converted_type (legacy): UTF8
  compression: SNAPPY
  encodings: ('PLAIN_DICTIONARY', 'PLAIN', 'RLE')
  has_dictionary_page: True
  dictionary_page_offset: 4
  data_page_offset: 29
  total_compressed_size: 544
  total_uncompressed_size: 596
Why is the Dask-created Parquet file so much larger? Alternatively, how can I inspect the possible problems further?
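So far I have only looked at the first row group of each file. One more check I could do (a rough sketch, reusing only the paths from above) is to count the row groups and sum the compressed sizes across all of them, in case the size difference hides outside row group 0:

import pyarrow.parquet as pq

for path in ['example-pandas.parquet', 'example-dask.parquet/part.0.parquet']:
    meta = pq.read_metadata(path)
    # Sum compressed sizes of every column chunk in every row group
    total = sum(
        meta.row_group(rg).column(col).total_compressed_size
        for rg in range(meta.num_row_groups)
        for col in range(meta.num_columns)
    )
    print(path, 'row groups:', meta.num_row_groups, 'compressed bytes:', total)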