I am working on a duplicate-document detection problem using the LSH algorithm. To handle large-scale data, we are using Spark.
I have around 300K documents with at least 100-200 words each. On the Spark cluster, these are the steps we perform on the DataFrame:
- Run a Spark ML pipeline to convert text into tokens (stage definitions sketched below):
```python
from pyspark.ml import Pipeline

pipeline = Pipeline().setStages([
    docAssembler,
    tokenizer,
    normalizer,
    stemmer,
    finisher,
    stopwordsRemover,
    # emptyRowsRemover
])
model = pipeline.fit(spark_df)
final_df = model.transform(spark_df)
```
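The stages above are constructed elsewhere in our code; a minimal sketch of how they might be defined, assuming docAssembler, tokenizer, normalizer, stemmer, and finisher come from the Spark NLP library and stopwordsRemover from Spark ML (all column names here are illustrative):

```python
from sparknlp.base import DocumentAssembler, Finisher
from sparknlp.annotator import Tokenizer, Normalizer, Stemmer
from pyspark.ml.feature import StopWordsRemover

# Assumed Spark NLP stages; input/output column names are placeholders.
docAssembler = DocumentAssembler().setInputCol("text").setOutputCol("document")
tokenizer = Tokenizer().setInputCols(["document"]).setOutputCol("token")
normalizer = Normalizer().setInputCols(["token"]).setOutputCol("normalized")
stemmer = Stemmer().setInputCols(["normalized"]).setOutputCol("stem")
finisher = Finisher().setInputCols(["stem"]).setOutputCols(["tokens"])
stopwordsRemover = StopWordsRemover(inputCol="tokens", outputCol="filtered_tokens")
```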
- For each document, compute its MinHash using the datasketch library (https://github.com/ekzhu/datasketch/) and store it as a new column:
```python
final_df_limit.rdd.map(lambda x: (CalculateMinHash(x),)).toDF()
```
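CalculateMinHash is a helper we wrote around datasketch; roughly, it looks like this (a sketch: num_perm and the token column name are assumptions):

```python
from datasketch import MinHash

def CalculateMinHash(row):
    # Build a MinHash signature from the document's tokens;
    # num_perm=128 and the "filtered_tokens" column are assumptions.
    m = MinHash(num_perm=128)
    for token in row["filtered_tokens"]:
        m.update(token.encode("utf-8"))
    return m
```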
The second step fails because Spark does not let us store a custom-typed value as a column; the value is an object of class MinHash, for which Spark SQL has no built-in type.
Does anyone know how I can store MinHash objects in a DataFrame?
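One direction we have considered is serializing the object rather than storing it directly, e.g. pickling each MinHash into a BinaryType column; a sketch of that idea (the schema and mapping shape are assumptions):

```python
import pickle
from pyspark.sql.types import StructType, StructField, BinaryType

schema = StructType([StructField("minhash", BinaryType(), False)])

# Pickle each MinHash object into raw bytes, which Spark can store natively.
minhash_df = final_df.rdd.map(
    lambda x: (pickle.dumps(CalculateMinHash(x)),)
).toDF(schema)

# Reading a signature back requires unpickling, e.g. inside a UDF:
# m = pickle.loads(row["minhash"])
```

Would that be a reasonable approach, or is there a more idiomatic way?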