I am working on a duplicate-document detection problem using the LSH algorithm. To handle large-scale data, we are using Spark.
I have around 300K documents with at least 100-200 words each. On the Spark cluster, these are the steps we perform on the DataFrame:
- Run a Spark ML pipeline to convert text into tokens (stage definitions sketched below):
```python
from pyspark.ml import Pipeline

pipeline = Pipeline().setStages([
    docAssembler,
    tokenizer,
    normalizer,
    stemmer,
    finisher,
    stopwordsRemover,
    # emptyRowsRemover
])
model = pipeline.fit(spark_df)
final_df = model.transform(spark_df)
```
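The stages above are constructed elsewhere in our code; a minimal sketch of how they might be defined, assuming docAssembler, tokenizer, normalizer, stemmer, and finisher come from the Spark NLP library and stopwordsRemover from Spark ML (all column names here are illustrative):

```python
from sparknlp.base import DocumentAssembler, Finisher
from sparknlp.annotator import Tokenizer, Normalizer, Stemmer
from pyspark.ml.feature import StopWordsRemover

# Assumed Spark NLP stages; input/output column names are placeholders.
docAssembler = DocumentAssembler().setInputCol("text").setOutputCol("document")
tokenizer = Tokenizer().setInputCols(["document"]).setOutputCol("token")
normalizer = Normalizer().setInputCols(["token"]).setOutputCol("normalized")
stemmer = Stemmer().setInputCols(["normalized"]).setOutputCol("stem")
finisher = Finisher().setInputCols(["stem"]).setOutputCols(["tokens"])
stopwordsRemover = StopWordsRemover(inputCol="tokens", outputCol="filtered_tokens")
```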
- For each document, compute its MinHash using the datasketch library (https://github.com/ekzhu/datasketch/) and store it as a new column:
```python
final_df_limit.rdd.map(lambda x: (CalculateMinHash(x),)).toDF()
```
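CalculateMinHash is a helper we wrote around datasketch; roughly, it looks like this (a sketch: num_perm and the token column name are assumptions):

```python
from datasketch import MinHash

def CalculateMinHash(row):
    # Build a MinHash signature from the document's tokens;
    # num_perm=128 and the "filtered_tokens" column are assumptions.
    m = MinHash(num_perm=128)
    for token in row["filtered_tokens"]:
        m.update(token.encode("utf-8"))
    return m
```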
The second step fails because Spark does not let us store a custom-typed value as a column; the value is an object of class MinHash, for which Spark SQL has no built-in type.
Does anyone know how I can store MinHash objects in a DataFrame?
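One direction we have considered is serializing the object rather than storing it directly, e.g. pickling each MinHash into a BinaryType column; a sketch of that idea (the schema and mapping shape are assumptions):

```python
import pickle
from pyspark.sql.types import StructType, StructField, BinaryType

schema = StructType([StructField("minhash", BinaryType(), False)])

# Pickle each MinHash object into raw bytes, which Spark can store natively.
minhash_df = final_df.rdd.map(
    lambda x: (pickle.dumps(CalculateMinHash(x)),)
).toDF(schema)

# Reading a signature back requires unpickling, e.g. inside a UDF:
# m = pickle.loads(row["minhash"])
```

Would that be a reasonable approach, or is there a more idiomatic way?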