
Is there a way to force a parquet file to encode a pd.DataFrame column as a given type, even though all values for the column are null? The fact that parquet automatically assigns "null" in its schema is preventing me from loading many files into a single dask.dataframe.

Trying to cast the pandas column using df.column_name = df.column_name.astype(sometype) didn't work.
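
For example, a cast to object is a no-op on an all-None column, so the (default) pyarrow engine still infers a null type. A minimal sketch of the failed attempt, with astype(object) standing in for sometype:

import pandas as pd

b = pd.DataFrame([None, None], columns=('value',))
b['value'] = b['value'].astype(object)  # values are still all None
b.to_parquet('b.parquet')               # pyarrow still writes the column as "null"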

Why I'm asking this

I want to load many parquet files into a single dask.dataframe. All files were generated from as many instances of pd.DataFrame, using df.to_parquet(filename). All dataframes have the same columns, but in some of them a given column might contain only null values. When trying to load all files into a single dask.dataframe (using df = dd.read_parquet('*.parquet')), I get the following error:

Schema in filename.parquet was different.
id: int64
text: string
[...]
some_column: double

vs

id: int64
text: string
[...]
some_column: null

Steps to reproduce my problem

import pandas as pd
import dask.dataframe as dd
a = pd.DataFrame(['1', '1'], columns=('value',))
b = pd.DataFrame([None, None], columns=('value',))
a.to_parquet('a.parquet')
b.to_parquet('b.parquet')
df = dd.read_parquet('*.parquet')  # Reads a and b

This gives me the following:

ValueError: Schema in path/to/b.parquet was different. 
value: null
__index_level_0__: int64
metadata
--------
{b'pandas': b'{"index_columns": ["__index_level_0__"], "column_indexes": [{"na'
            b'me": null, "field_name": null, "pandas_type": "unicode", "numpy_'
            b'type": "object", "metadata": {"encoding": "UTF-8"}}], "columns":'
            b' [{"name": "value", "field_name": "value", "pandas_type": "empty'
            b'", "numpy_type": "object", "metadata": null}, {"name": null, "fi'
            b'eld_name": "__index_level_0__", "pandas_type": "int64", "numpy_t'
            b'ype": "int64", "metadata": null}], "pandas_version": "0.22.0"}'}

vs

value: string
__index_level_0__: int64
metadata
--------
{b'pandas': b'{"index_columns": ["__index_level_0__"], "column_indexes": [{"na'
            b'me": null, "field_name": null, "pandas_type": "unicode", "numpy_'
            b'type": "object", "metadata": {"encoding": "UTF-8"}}], "columns":'
            b' [{"name": "value", "field_name": "value", "pandas_type": "unico'
            b'de", "numpy_type": "object", "metadata": null}, {"name": null, "'
            b'field_name": "__index_level_0__", "pandas_type": "int64", "numpy'
            b'_type": "int64", "metadata": null}], "pandas_version": "0.22.0"}'}

Notice how in one case we have "pandas_type": "unicode" and in the other we have "pandas_type": "empty".
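
The null type comes from pyarrow's schema inference, which can be inspected before anything is written (assuming pyarrow, the default engine here, is installed):

import pandas as pd
import pyarrow as pa

b = pd.DataFrame([None, None], columns=('value',))
# The schema pyarrow infers for an all-None object column
print(pa.Table.from_pandas(b).schema)  # value: null, __index_level_0__: int64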

1 Answer

If you instead use fastparquet, you can achieve what you want:

import pandas as pd
import dask.dataframe as dd
a = pd.DataFrame(['1', '1'], columns=('value',))
b = pd.DataFrame([None, None], columns=('value',))
# object_encoding pins a concrete on-disk type for the object column,
# even when every value in it is null
a.to_parquet('a.parquet', object_encoding='int', engine='fastparquet')
b.to_parquet('b.parquet', object_encoding='int', engine='fastparquet')

dd.read_parquet('*.parquet').compute()

gives

   value
0    1.0
1    1.0
0    NaN
1    NaN
– mdurant
  • Interestingly, the dtype of the column is float64 – mdurant May 02 '18 at 15:08
  • NaN is implemented as float: http://pandas.pydata.org/pandas-docs/stable/gotchas.html#support-for-integer-na – dylan_fan Jun 07 '18 at 14:56
  • Is there any solution using `pyarrow` ? We had a similar problem and ran `repartition` (heavy process but worked) – skibee Oct 03 '19 at 05:54
  • parquet reading and writing were recently refactored, worth trying again – mdurant Oct 03 '19 at 12:35
  • @mdurant can you please clarify why using fastparquet works? I'm facing a similar issue with Dask DataFrames and can't figure out why I can store them to disk with fastparquet but not with pyarrow. In fact, dask's `to_parquet` method even ignores the argument `schema` when the engine being used is `fastparquet` and I really can't figure out why. – Vinicius Silva Nov 01 '22 at 23:42
  • Basically, fastparquet gives you this lever to be explicit, where arrow decides for you. Only the arrow backend supports (or should) passing in a schema, as the docstring of to_parquet says. – mdurant Nov 02 '22 at 01:32
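
A minimal sketch of the pyarrow route described in the comments above, assuming a pyarrow version whose Table.from_pandas accepts an explicit schema. Declaring the column type up front keeps the file footer from ever saying "null":

import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

b = pd.DataFrame([None, None], columns=('value',))

# Declare the type explicitly instead of letting pyarrow infer "null"
schema = pa.schema([('value', pa.string())])
table = pa.Table.from_pandas(b, schema=schema, preserve_index=False)
pq.write_table(table, 'b.parquet')

print(pq.read_schema('b.parquet'))  # value: string

With dask's pyarrow engine, the same idea is the schema argument of to_parquet that the last comment mentions.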