Why does dask DataFrame.to_parquet try to infer the data schema when storing the file to disk?

Question

I'm having some trouble making sense of Dask's to_parquet method and why it has a schema argument. When I have a Dask DataFrame named ddf and access ddf.dtypes, I can see the data type of each column, meaning that Dask does know the dtype of each column, right? If that's the case, then why does it need to infer the data schema when I call ddf.to_parquet?

I'm asking this because according to Dask's documentation (https://docs.dask.org/en/stable/generated/dask.dataframe.to_parquet.html), the default value for the schema argument is "infer", which seems counter-intuitive to me.

I know this is a basic question, but I'm a Dask newbie trying to make sense of the tool.

Thanks.

score 3 · Accepted Answer · answered Nov 03 '22 at 16:56

The schema of a parquet dataset, or its arrow equivalent, is not the same kind of thing as the dtypes list of a dataframe. The mapping between the two is imprecise, so sometimes it's worth being explicit. This is particularly true when a column has dtype "object" and contains no valid data at all in some partition: it might be intended to hold strings (the most common case), but it might be something else, and you cannot tell for sure from the dask meta, or necessarily from the data. Also, arrow sometimes wants to make null-type columns when all the rows are None/NaN/NA.

NB: fastparquet provides a similar object_encoding= kwarg for the specific case of object type columns, which are nearly always the problem.

Interesting. How would one create a schema from a dataframe's dtypes list? — jtlz2, Dec 18 '22 at 21:03
Ah thanks/nvm - got it! https://arrow.apache.org/docs/python/generated/pyarrow.Schema.html#pyarrow.Schema.from_pandas — jtlz2, Dec 18 '22 at 21:06
