
I have a Dask DataFrame in which one column holds a numpy array of floats in each row:

import dask.dataframe as dd
import pandas as pd
import numpy as np

df = dd.from_pandas(
    pd.DataFrame(
        {
            'id': range(1, 6),
            'vec': [np.array([1.0, 2.0, 3.0, 4.0, 5.0])] * 5
        }), npartitions=1)

df.compute()

   id                        vec
0   1  [1.0, 2.0, 3.0, 4.0, 5.0]
1   2  [1.0, 2.0, 3.0, 4.0, 5.0]
2   3  [1.0, 2.0, 3.0, 4.0, 5.0]
3   4  [1.0, 2.0, 3.0, 4.0, 5.0]
4   5  [1.0, 2.0, 3.0, 4.0, 5.0]

If I try writing this out as parquet I get an error:

df.to_parquet('somefile')
....
Error converting column "vec" to bytes using encoding UTF8. Original error: bad argument type for built-in operation

I presume this is because the 'vec' column has dtype 'object', so the parquet serializer tries to write it as a string. Is there some way to tell either the Dask DataFrame or the serializer that the column contains arrays of floats?
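
For what it's worth, a quick dtype check on the same df seems to confirm this:

print(df.dtypes)  # 'vec' shows up with the generic 'object' dtype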


1 Answer


I have discovered that writing the array column works if the pyarrow engine is used instead of the default fastparquet:

pip/conda install pyarrow

then:

df.to_parquet('somefile', engine='pyarrow')

The docs for fastparquet at https://github.com/dask/fastparquet/ say "only simple data-types and plain encoding are supported", so I guess that means no arrays.
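
For completeness, here is a minimal round-trip sketch using the same toy frame and the 'somefile' path from the question; note that the exact type the values come back as (numpy array vs. Python list) may depend on your pandas/pyarrow versions:

import dask.dataframe as dd
import pandas as pd
import numpy as np

# Same toy frame as in the question: each 'vec' cell holds a numpy array of floats.
df = dd.from_pandas(
    pd.DataFrame({
        'id': range(1, 6),
        'vec': [np.array([1.0, 2.0, 3.0, 4.0, 5.0])] * 5
    }), npartitions=1)

# With the pyarrow engine the 'vec' column is written as a parquet list column
# rather than being coerced to a string.
df.to_parquet('somefile', engine='pyarrow')

# Reading it back preserves the per-row sequences of floats.
roundtrip = dd.read_parquet('somefile', engine='pyarrow').compute()
print(roundtrip['vec'].iloc[0])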

junichiro
  • Thanks @junichiro - in my case it was a `FIXED_BYTE_ARRAY` that was causing the error, which seems like a fairly fundamental type. I've not been able to find docs on what exactly "simple data-types" are - if anyone knows, please share! – schuess Dec 04 '20 at 21:17