
I have a dask dataframe with a column of type List[MyClass], and I want to save this dataframe to parquet files. Dask uses pyarrow as the backend, but when inferring an Arrow type for a column of Python objects it only supports primitive types.

import pandas as pd
import dask.dataframe as dd


class MyClass:

    def __init__(self, a):
        self.a = a


def transform(v):
    # Wrap each cell's value in a list of MyClass instances.
    return [MyClass(v)]


a = [[1], [2], [3]]
pdf = pd.DataFrame.from_dict(a)
ddf = dd.from_pandas(pdf, npartitions=1)
result = ddf.assign(mycol=ddf[0].apply(transform, meta=('mycol', 'object')))
result.to_parquet('my_parquet.parquet')  # raises ArrowInvalid

So when I try to save it, I get this error:

ArrowInvalid: Error inferring Arrow data type for collection of Python objects. Got Python object of type MyClass but can only handle these types: bool, float, integer, date, datetime, bytes, unicode, decimal.

Obviously I have to convert MyClass to a pyarrow-compatible struct type, but I can't find a way to do this. Pyarrow and dask have some serialization features (like this: https://arrow.apache.org/docs/python/ipc.html#serializing-custom-data-types), but it seems that's not quite what I need.
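For illustration, pyarrow can infer a struct type from plain dicts, so the conversion I'm after would look something like this hand-made sketch; what I'm missing is a way to register it so that to_parquet applies it to MyClass automatically:

import pyarrow as pa

# A plain dict is something pyarrow can infer a struct type from.
arr = pa.array([[{'a': 1}], [{'a': 2}], [{'a': 3}]])
print(arr.type)  # list<item: struct<a: int64>>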

cheap_grayhat
  • Are you prepared to handle the serialisation (class<->bytes) yourself, and store the bytes only? – mdurant Jan 11 '19 at 16:44
  • No, I wanted to serialize with dask/pandas and deserialize with spark. Since complex schemas are not supported yet in pyarrow, I decided to use json as the intermediate format for this case (sketched below these comments). – cheap_grayhat Jan 14 '19 at 12:32
  • Does spark handle JSON now? It previously didn't. In any case, glad if that solves things (feel free to post an answer), but I am surprised, as I thought you wanted to store python classes. – mdurant Jan 14 '19 at 14:03
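For reference, a minimal sketch of the JSON workaround mentioned in the comments (column and helper names are illustrative): each cell's list of objects is dumped to a JSON string, which is a primitive type pyarrow handles natively.

import json

import dask.dataframe as dd
import pandas as pd


class MyClass:
    def __init__(self, a):
        self.a = a


def cell_to_json(values):
    # Turn one cell's list of MyClass instances into a JSON string.
    return json.dumps([{'a': v.a} for v in values])


pdf = pd.DataFrame({'raw': [[MyClass(1)], [MyClass(2)], [MyClass(3)]]})
ddf = dd.from_pandas(pdf, npartitions=1)
ddf = ddf.assign(mycol=ddf['raw'].apply(cell_to_json, meta=('mycol', 'object')))
ddf[['mycol']].to_parquet('my_parquet.parquet')  # plain strings, no ArrowInvalid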

1 Answer


A bit late, but maybe this can help others.

It basically comes down to defining custom, hand-made serialization functions. For example, here's your class:

class MyData:
    def __init__(self, name, data):
        self.name = name
        self.data = data

You write functions to convert instances of this class to and from plain dicts, like:

def _serialize_MyData(val):
    return {'name': val.name, 'data': val.data}

def _deserialize_MyData(data):
    return MyData(data['name'], data['data'])

Then initialize a context with these functions, to be passed to the serialization/deserialization calls later:

import pyarrow as pa

context = pa.SerializationContext()
context.register_type(MyData, 'MyData',
                      custom_serializer=_serialize_MyData,
                      custom_deserializer=_deserialize_MyData)

Now you call the serialize/deserialize methods and pass them the context:

# val can be any instance of a registered type, e.g. MyData('foo', [1, 2])
buf = pa.serialize(val, context=context).to_buffer()
restored_val = pa.deserialize(buf, context=context)
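To tie this back to the question's dataframe, here is a sketch that stores each cell as opaque bytes (per mdurant's comment, the reader then has to deserialize with the same context; column names here are illustrative):

import dask.dataframe as dd
import pandas as pd


def transform_and_serialize(v):
    # Build the cell's list of MyData objects, then turn it into bytes via
    # the registered context; bytes are a type pyarrow stores natively.
    cell = [MyData('row', v)]
    return pa.serialize(cell, context=context).to_buffer().to_pybytes()


pdf = pd.DataFrame({'x': [1, 2, 3]})
ddf = dd.from_pandas(pdf, npartitions=1)
result = ddf.assign(mycol=ddf['x'].apply(transform_and_serialize,
                                         meta=('mycol', 'object')))
result.to_parquet('my_parquet.parquet')

# Reading back: deserialize each cell with the same context.
restored = dd.read_parquet('my_parquet.parquet').compute()
first_cell = pa.deserialize(restored['mycol'].iloc[0], context=context)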
Cypher
  • Do you know if there is a way to deserialize just part of the data? For example if data is huge, and I just want to extract the name, is there a way to extract just that from the serialized data? – Sukanya Dasgupta Sep 10 '20 at 15:09