I have a Dask dataframe with a column of type List[MyClass], and I want to save it to Parquet files. Dask uses pyarrow as the backend, but pyarrow seems to support only primitive types. Here is a minimal example:
import pandas as pd
import dask.dataframe as dd

class MyClass:
    def __init__(self, a):
        self.a = a

def transform(v):
    return [MyClass(v)]

a = [[1], [2], [3]]
pdf = pd.DataFrame.from_dict(a)
ddf = dd.from_pandas(pdf, npartitions=1)
result = ddf.assign(mycol=ddf[0].apply(transform))
result.to_parquet('my_parquet.parquet')
When I try to save it, I get this error:
ArrowInvalid: Error inferring Arrow data type for collection of Python objects. Got Python object of type MyClass but can only handle these types: bool, float, integer, date, datetime, bytes, unicode, decimal
Obviously I have to convert MyClass to a pyarrow-compatible struct type, but I can't find a way to do this. pyarrow and Dask have some serialization features (like this: https://arrow.apache.org/docs/python/ipc.html#serializing-custom-data-types), but that doesn't seem to be quite what I need.
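To make the intended conversion concrete, here is a minimal sketch of one possible direction, not a confirmed solution: each MyClass instance is turned into a plain dict before it goes into the column. The helper name to_record is hypothetical, and the sketch assumes every attribute of MyClass is itself a primitive value.

```python
class MyClass:
    def __init__(self, a):
        self.a = a


def to_record(obj):
    # Hypothetical helper: copy the instance's attributes into a plain dict,
    # e.g. MyClass(1) becomes {'a': 1}.
    return dict(vars(obj))


def transform(v):
    # Same shape as the original transform, but the list holds dicts
    # instead of MyClass instances.
    return [to_record(MyClass(v))]


# Build the column values for the three example rows.
records = [transform(v) for v in (1, 2, 3)]
print(records)
```

pyarrow's type inference maps Python dicts to struct types, so a column of such lists should come out as a list of structs rather than raising ArrowInvalid. Whether Dask's to_parquet additionally needs an explicit schema for this column is not something this sketch settles.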