
When I save the same table to Parquet using Pandas and Dask, Pandas creates a 4k file, whereas Dask creates a 39M file.

Create the dataframe:

import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq
import dask.dataframe as dd

n = int(1e7)
df = pd.DataFrame({'col': ['a'*64]*n})

Save it in different ways:

# Pandas: 4k
df.to_parquet('example-pandas.parquet')

# PyArrow: 4k
pq.write_table(pa.Table.from_pandas(df), 'example-pyarrow.parquet')

# Dask: 39M
dd.from_pandas(df, npartitions=1).to_parquet('example-dask.parquet', compression='snappy')
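
For reference, here is how the on-disk sizes can be compared programmatically (a quick sketch; note that Dask writes a directory of part files, so its contents have to be summed):

import os

def disk_size(path):
    # Size of a single file, or the sum of all files under a directory
    if os.path.isfile(path):
        return os.path.getsize(path)
    return sum(
        os.path.getsize(os.path.join(root, name))
        for root, _, names in os.walk(path)
        for name in names
    )

for path in ['example-pandas.parquet', 'example-pyarrow.parquet', 'example-dask.parquet']:
    print(path, disk_size(path))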

At first I thought that Dask wasn't applying dictionary and run-length encoding, but that does not seem to be the case. I am not sure whether I'm interpreting the metadata correctly, but at the very least the data column's metadata appears to be identical in both files:

>>> pq.read_metadata('example-pandas.parquet').row_group(0).column(0)
<pyarrow._parquet.ColumnChunkMetaData object at 0x7fbee7d1a770>
  file_offset: 548
  file_path: 
  physical_type: BYTE_ARRAY
  num_values: 10000000
  path_in_schema: col
  is_stats_set: True
  statistics:
    <pyarrow._parquet.Statistics object at 0x7fbee7d2cc70>
      has_min_max: True
      min: aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
      max: aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
      null_count: 0
      distinct_count: 0
      num_values: 10000000
      physical_type: BYTE_ARRAY
      logical_type: String
      converted_type (legacy): UTF8
  compression: SNAPPY
  encodings: ('PLAIN_DICTIONARY', 'PLAIN', 'RLE')
  has_dictionary_page: True
  dictionary_page_offset: 4
  data_page_offset: 29
  total_compressed_size: 544
  total_uncompressed_size: 596

>>> pq.read_metadata('example-dask.parquet/part.0.parquet').row_group(0).column(0)
<pyarrow._parquet.ColumnChunkMetaData object at 0x7fbee7d3d180>
  file_offset: 548
  file_path: 
  physical_type: BYTE_ARRAY
  num_values: 10000000
  path_in_schema: col
  is_stats_set: True
  statistics:
    <pyarrow._parquet.Statistics object at 0x7fbee7d3d1d0>
      has_min_max: True
      min: aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
      max: aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
      null_count: 0
      distinct_count: 0
      num_values: 10000000
      physical_type: BYTE_ARRAY
      logical_type: String
      converted_type (legacy): UTF8
  compression: SNAPPY
  encodings: ('PLAIN_DICTIONARY', 'PLAIN', 'RLE')
  has_dictionary_page: True
  dictionary_page_offset: 4
  data_page_offset: 29
  total_compressed_size: 544
  total_uncompressed_size: 596

Why is the Dask-created Parquet file so much larger? Alternatively, how can I investigate the possible problems further?


1 Answer


Dask appears to be saving an int64 index...

>>> meta = pq.read_metadata('example-dask.parquet/part.0.parquet')
>>> meta.row_group(0).column(1)
<pyarrow._parquet.ColumnChunkMetaData object at 0x7fa41e1babd0>
  file_offset: 40308181
  file_path: 
  physical_type: INT64
  num_values: 10000000
  path_in_schema: __null_dask_index__
  is_stats_set: True
  statistics:
    <pyarrow._parquet.Statistics object at 0x7fa41e1badb0>
      has_min_max: True
      min: 0
      max: 9999999
      null_count: 0
      distinct_count: 0
      num_values: 10000000
      physical_type: INT64
      logical_type: None
      converted_type (legacy): NONE
  compression: SNAPPY
  encodings: ('PLAIN_DICTIONARY', 'PLAIN', 'RLE', 'PLAIN')
  has_dictionary_page: True
  dictionary_page_offset: 736
  data_page_offset: 525333
  total_compressed_size: 40307445
  total_uncompressed_size: 80284661

You can disable this with write_index:

dd.from_pandas(df, npartitions=1).to_parquet('example-dask.parquet', compression='snappy', write_index=False)
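
To verify, you can write the table again without the index and re-measure the output (a minimal sketch reusing df from the question; 'example-dask-noindex.parquet' is just an illustrative name):

import os

dd.from_pandas(df, npartitions=1).to_parquet(
    'example-dask-noindex.parquet', compression='snappy', write_index=False
)

# Sum the sizes of everything Dask wrote (part files plus any metadata files);
# without the stored index column this should be kilobytes rather than ~39M.
total = sum(
    os.path.getsize(os.path.join('example-dask-noindex.parquet', name))
    for name in os.listdir('example-dask-noindex.parquet')
)
print(total)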

PyArrow won't generate any index columns.

Pandas does generate an index but, at least when using the pyarrow engine, a simple linear index is saved as schema metadata rather than as an actual column.

>>> import json
>>> table = pq.read_table('example-pandas.parquet')
>>> pandas_meta = json.loads(table.schema.metadata[b'pandas'])
>>> pandas_meta['index_columns'][0]
{'kind': 'range', 'name': None, 'start': 0, 'stop': 10000000, 'step': 1}
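
As a consequence, reading the Pandas-written file back restores the original RangeIndex from that metadata rather than from a stored column (a quick check):

import pandas as pd

# The index is rebuilt from the 'pandas' schema metadata,
# so no index column had to be materialized on disk.
df2 = pd.read_parquet('example-pandas.parquet')
print(df2.index)  # expected: RangeIndex(start=0, stop=10000000, step=1)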
  • Aha, well that of course makes perfect sense. Still, good to remember that the default index will be this large. – Dahn Aug 07 '21 at 05:58