Pyspark - Load trained model word2vec

Question

I want to use word2vec with PySpark to process some data. I was previously using Google's pretrained model, GoogleNews-vectors-negative300.bin, with gensim in Python.

Is there a way to load this .bin file with mllib.word2vec? Or does it make sense to export the data from Python as a dictionary {word: [vector]} (or a .csv file) and then load it in PySpark?

Thanks

I have already loaded pyspark models in the .parquet format. — igorkf, May 22 '20 at 20:47

score 2 · Accepted Answer · answered Oct 10 '21 at 10:48

Spark 3.x can read arbitrary binary files through the binaryFile data source:

spark.read.format("binaryFile").option("pathGlobFilter", "*.bin").load("/path/to/data")
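To give an idea of what decoding those bytes involves, here is a minimal sketch of a reader for the word2vec binary layout, in plain Python. It assumes the usual layout written by the original word2vec tool (a text header with vocabulary size and dimension, then each word followed by a space and its little-endian float32 values). The function name read_word2vec_binary and the toy data are made up for illustration and are not part of Spark or gensim.

```python
import io
import struct


def read_word2vec_binary(stream):
    """Parse word2vec binary data from a file-like object into {word: [floats]}.

    Assumed layout: header line b"<vocab_size> <dim>\n", then for every word
    its UTF-8 bytes, one space, and dim little-endian float32 values.
    """
    vocab_size, dim = (int(x) for x in stream.readline().split())
    vectors = {}
    for _ in range(vocab_size):
        chars = bytearray()
        while True:
            ch = stream.read(1)
            if not ch:
                raise EOFError("data ended in the middle of a word")
            if ch == b" ":
                break
            if ch != b"\n":  # a newline may follow the previous vector
                chars += ch
        raw = stream.read(4 * dim)
        if len(raw) != 4 * dim:
            raise EOFError("data ended in the middle of a vector")
        word = chars.decode("utf-8", errors="ignore")
        vectors[word] = list(struct.unpack(f"<{dim}f", raw))
    return vectors


# Toy two-word file built in memory, only to exercise the parser.
sample = (
    b"2 3\n"
    + b"cat " + struct.pack("<3f", 1.0, 2.0, 3.0) + b"\n"
    + b"dog " + struct.pack("<3f", 0.5, -1.0, 4.0) + b"\n"
)
vectors = read_word2vec_binary(io.BytesIO(sample))
```

With binaryFile, the bytes of each file arrive in the content column, so something like read_word2vec_binary(io.BytesIO(row.content)) would be the starting point. Note that a file as large as the GoogleNews model may well not fit into a single row, so treat this as an illustration of the format rather than a ready-made loader.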

However, this only returns the raw bytes, which you would still have to parse yourself. Exporting the vectors from gensim as text is therefore the simpler route:

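# Export the vectors as ";"-separated CSV (trained_model.save() would write gensim's own pickle-based format, not CSV).
# Assumes trained_model is a gensim 4.x KeyedVectors; in gensim 3.x use index2word instead of index_to_key.
import pandas as pd
filename = "stored_model.csv"
vectors = pd.DataFrame(trained_model.vectors, index=trained_model.index_to_key).add_prefix("v")
vectors.to_csv(filename, sep=";", index_label="word")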
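If the vectors are already in a plain dictionary {word: [vector]}, as suggested in the question, a CSV with a header row and ";" as separator can also be written with the standard library alone. The sketch below is only an illustration: write_vectors_csv, the column names word, v0, v1, ... and the toy dictionary are assumptions. With gensim 4.x, a dictionary of this shape could presumably be built as {w: trained_model[w] for w in trained_model.index_to_key}.

```python
import csv
import os
import tempfile


def write_vectors_csv(vectors, path):
    """Write {word: [floats]} as a ';'-separated CSV with a header row."""
    dim = len(next(iter(vectors.values())))
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f, delimiter=";")
        # One column for the word, then one column per vector component.
        writer.writerow(["word"] + [f"v{i}" for i in range(dim)])
        for word, vec in vectors.items():
            writer.writerow([word] + [repr(float(x)) for x in vec])
    return dim


# Toy data standing in for the real embeddings.
toy = {"cat": [1.0, 2.0, 3.0], "dog": [0.5, -1.0, 4.0]}
path = os.path.join(tempfile.mkdtemp(), "stored_model.csv")
write_vectors_csv(toy, path)

# Read the file back to check what was written.
with open(path, encoding="utf-8") as f:
    lines = f.read().splitlines()
```

A file written this way has one numeric column per dimension, so inferSchema on the Spark side should pick up doubles for v0, v1, and so on.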

Then load the exported vectors in PySpark:

df = spark.read.load("stored_model.csv",
                     format="csv", 
                     sep=";", 
                     inferSchema="true", 
                     header="true")
