I'm new to Spark and Scala, so I might have misunderstood some basic things here. I'm trying to train Spark's word2vec model on my own data. According to the documentation, one way to do this is
import org.apache.spark.mllib.feature.Word2Vec

val input = sc.textFile("text8").map(line => line.split(" ").toSeq)
val word2vec = new Word2Vec()
val model = word2vec.fit(input)
The text8 dataset contains one line of many words, meaning that input will become an RDD[Seq[String]].
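For context, my understanding is that once the model is fitted you query it roughly like this (findSynonyms returns the nearest words by cosine similarity; "china" is just an arbitrary example query word, not something from my data):

// Look up the words closest to a query word, as (word, similarity) pairs.
val synonyms = model.findSynonyms("china", 40)
for ((word, cosineSimilarity) <- synonyms) {
  println(s"$word $cosineSimilarity")
}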
After massaging my own dataset, which has one word per line, with various map transformations, I'm left with an RDD[String], but I can't train the word2vec model on it. I tried input.map(v => Seq(v)), which does give an RDD[Seq[String]], but that produces one sequence per word, which I guess is totally wrong.
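To illustrate what I mean, with some made-up words (not my real data):

// Wrapping each word in its own Seq gives one single-word "sentence"
// per element, so word2vec has no surrounding context to learn from.
val words = sc.parallelize(Seq("anarchism", "originated", "as", "a", "term"))
val wrapped = words.map(v => Seq(v))   // RDD[Seq[String]], but every Seq has length 1
wrapped.collect().foreach(println)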
How can I wrap a sequence around my strings, or is there something else I have missed?
EDIT
So I kind of figured it out. With my clean being an RDD[String], I do val input = sc.parallelize(Seq(clean.collect().toSeq)). This gives me the correct data structure (RDD[Seq[String]]) to fit the word2vec model. However, running collect on a large dataset gives me an out-of-memory error. I'm not quite sure how the fitting is intended to be done. Maybe it is not really parallelizable? Or maybe I'm supposed to have several semi-long sequences of strings inside an RDD, instead of one long sequence like I have now?
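If the "several semi-long sequences" idea is right, I imagine something like this would keep the data distributed instead of collecting it to the driver (just a sketch; the chunk size of 1000 is an arbitrary guess):

import org.apache.spark.mllib.feature.Word2Vec

// Group the one-word-per-line RDD into fixed-size "sentences"
// without collecting everything to the driver.
val sentenceLength = 1000L
val input = clean
  .zipWithIndex()                                          // (word, global position)
  .map { case (word, idx) => (idx / sentenceLength, (idx, word)) }
  .groupByKey()                                            // one group per chunk
  .map { case (_, indexedWords) =>
    indexedWords.toSeq.sortBy(_._1).map(_._2)              // restore word order within the chunk
  }                                                        // RDD[Seq[String]]

val model = new Word2Vec().fit(input)

The chunks wouldn't correspond to real sentence boundaries, of course, but at least nothing has to fit on the driver.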