I am applying Spark's Word2Vec to a DataFrame. Here is my code:

    val df2 = df.groupBy("LABEL").agg(collect_list("TERM").alias("TERM"))

    val word2Vec = new Word2Vec()
      .setInputCol("TERM")
      .setOutputCol("result")
      .setMinCount(0)

    val model = word2Vec.fit(df2)
    val result = model.transform(df2)

    val synonyms = model.findSynonyms("4", 10)

    //synonyms.foreach(println)

    for((synonym, cosineSimilarity) <- synonyms) {
      println(s"$synonym $cosineSimilarity")
    }

When I use synonyms.foreach(println) the code works; however, the returned results are not ordered by their similarity scores. I then tried the for loop shown at the bottom of the code instead, which throws the following error:

Error:(52, 40) missing parameter type for expanded function
The argument types of an anonymous function must be fully known. (SLS 8.5)
Expected type was: ?
    for((synonym, cosineSimilarity) <- synonyms) {
                                       ^

From similar Stack Overflow questions and the error message, it seems the exact argument types are needed. In the for loop, synonyms is a DataFrame and the returned values have types String and Double, respectively. All my attempts so far have failed. How can I remedy this?

mlee_jordan

1 Answer

The result of findSynonyms is a non-materialized Spark DataFrame, not a local Scala collection, so you cannot simply iterate over it with a for comprehension.

  def findSynonyms(word: Vector, num: Int): DataFrame = {
    ..
    sc.parallelize(wordVectors.findSynonyms(word, num)).toDF("word", "similarity")
  }

Note that foreach worked because it is a materialization method (an action) defined explicitly on DataFrame.
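
One possible way to get the loop from the question working is to materialize the DataFrame on the driver first and then iterate over plain tuples. The following is only a minimal sketch: sortedSynonyms is a hypothetical helper name, and it assumes the columns are called "word" and "similarity" (as in the toDF call above) and hold a String and a Double. The explicit orderBy is there because DataFrame.foreach runs on the executors, so the order in which rows are printed is not guaranteed.

```scala
import org.apache.spark.sql.{DataFrame, Row}
import org.apache.spark.sql.functions.desc

// Hypothetical helper (not part of Spark): sort by similarity, pull the rows
// to the driver, and convert each Row into a (word, similarity) tuple.
// Assumes columns "word" (String) and "similarity" (Double).
def sortedSynonyms(synonyms: DataFrame): Array[(String, Double)] =
  synonyms
    .orderBy(desc("similarity"))   // highest similarity first
    .collect()                     // materialize on the driver as Array[Row]
    .map { case Row(word: String, similarity: Double) => (word, similarity) }

// Usage with the model from the question:
// for ((synonym, cosineSimilarity) <- sortedSynonyms(model.findSynonyms("4", 10))) {
//   println(s"$synonym $cosineSimilarity")
// }
```

Since findSynonyms returns at most the requested number of rows, collecting them to the driver should be cheap here; for large DataFrames collect() would not be a good idea.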

WestCoastProjects