I'm new to Spark and Scala, so I might have misunderstood some basic things here. I'm trying to train Spark's word2vec model on my own data. According to the documentation, one way to do this is
import org.apache.spark.mllib.feature.Word2Vec

val input = sc.textFile("text8").map(line => line.split(" ").toSeq)
val word2vec = new Word2Vec()
val model = word2vec.fit(input)
The text8 dataset contains one line of many words, meaning that input will become an RDD[Seq[String]].
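For context, my understanding is that once the model is fitted you query it roughly like this (findSynonyms returns the nearest words by cosine similarity; "china" is just an arbitrary example query word, not something from my data):

// Look up the words closest to a query word, as (word, similarity) pairs.
val synonyms = model.findSynonyms("china", 40)
for ((word, cosineSimilarity) <- synonyms) {
  println(s"$word $cosineSimilarity")
}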
After massaging my own dataset, which has one word per line, with various map transformations, I'm left with an RDD[String], but I can't train the word2vec model on it. I tried input.map(v => Seq(v)), which does give an RDD[Seq[String]], but that produces one sequence per word, which I guess is totally wrong.
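To illustrate what I mean, with some made-up words (not my real data):

// Wrapping each word in its own Seq gives one single-word "sentence"
// per element, so word2vec has no surrounding context to learn from.
val words = sc.parallelize(Seq("anarchism", "originated", "as", "a", "term"))
val wrapped = words.map(v => Seq(v))   // RDD[Seq[String]], but every Seq has length 1
wrapped.collect().foreach(println)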
How can I wrap a sequence around my strings, or is there something else I have missed?
EDIT
So I kind of figured it out. With my clean being an RDD[String], I do val input = sc.parallelize(Seq(clean.collect().toSeq)). This gives me the correct data structure (RDD[Seq[String]]) to fit the word2vec model. However, running collect on a large dataset gives me an out-of-memory error. I'm not quite sure how the fitting is intended to be done. Maybe it is not really parallelizable? Or maybe I'm supposed to have several semi-long sequences of strings inside an RDD, instead of one long sequence like I have now?
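If the "several semi-long sequences" idea is right, I imagine something like this would keep the data distributed instead of collecting it to the driver (just a sketch; the chunk size of 1000 is an arbitrary guess):

import org.apache.spark.mllib.feature.Word2Vec

// Group the one-word-per-line RDD into fixed-size "sentences"
// without collecting everything to the driver.
val sentenceLength = 1000L
val input = clean
  .zipWithIndex()                                          // (word, global position)
  .map { case (word, idx) => (idx / sentenceLength, (idx, word)) }
  .groupByKey()                                            // one group per chunk
  .map { case (_, indexedWords) =>
    indexedWords.toSeq.sortBy(_._1).map(_._2)              // restore word order within the chunk
  }                                                        // RDD[Seq[String]]

val model = new Word2Vec().fit(input)

The chunks wouldn't correspond to real sentence boundaries, of course, but at least nothing has to fit on the driver.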