I have a DataFrame called article:

+--------------------+
|     processed_title|
+--------------------+
|[new, relictual, ...|
|[once, upon,a,time..|
+--------------------+

I want to flatten it into a bag of words. How can I achieve this? I have tried the code below, but it gives me a type mismatch error.

val bow_corpus = article.select("processed_title").rdd.flatMap(y => y)

Eventually, I want to use bow_corpus to train a Word2Vec model.

Thanks

zero323
Krishna Kalyan

1 Answer

Assuming that processed_title is represented in SQL as array<string>:

article.select("processed_title").rdd.flatMap(_.getSeq[String](0))
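The original attempt fails to compile because each element of article.rdd is a Row, and a Row is not a Scala collection, so flatMap(y => y) has nothing to flatten. A minimal sketch of the fix on made-up data follows; the local SparkSession, the sample rows, and the names bowCorpus and words are assumptions for illustration, not part of the question.

```scala
import org.apache.spark.sql.SparkSession

// Local session for illustration only; in spark-shell a `spark` value already exists.
val spark = SparkSession.builder()
  .master("local[1]")
  .appName("bow-sketch")
  .getOrCreate()
import spark.implicits._

// Hypothetical stand-in for the `article` DataFrame from the question.
val article = Seq(
  Tuple1(Seq("new", "relictual")),
  Tuple1(Seq("once", "upon", "a", "time"))
).toDF("processed_title")

// Extract the array<string> in column 0 of each Row as a Seq[String],
// then let flatMap concatenate those sequences into one RDD[String].
val bowCorpus = article.select("processed_title").rdd.flatMap(_.getSeq[String](0))

// Bring the words back to the driver (fine for a toy example only).
val words = bowCorpus.collect().toSeq
```

Here bowCorpus is an RDD[String] with one element per token across all titles.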

There is also the Word2Vec estimator in spark.ml, which can be trained directly on a DataFrame without flattening it first:

import org.apache.spark.ml.feature.Word2Vec

val word2Vec = new Word2Vec()
  .setInputCol("processed_title")
  .setOutputCol("vectors")
  .setMinCount(0)
  .fit(article)

word2Vec.findSynonyms("foo", 1)
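As a self-contained sketch of the same approach, the snippet below trains on a tiny made-up DataFrame. The sample rows, the local SparkSession, the small vector size, the fixed seed, and the names model and synonyms are assumptions chosen for illustration. Note that the word passed to findSynonyms has to be in the learned vocabulary.

```scala
import org.apache.spark.ml.feature.Word2Vec
import org.apache.spark.sql.SparkSession

// Local session for illustration only; in spark-shell a `spark` value already exists.
val spark = SparkSession.builder()
  .master("local[1]")
  .appName("word2vec-sketch")
  .getOrCreate()
import spark.implicits._

// Hypothetical stand-in for the `article` DataFrame from the question.
val article = Seq(
  Tuple1(Seq("new", "relictual")),
  Tuple1(Seq("once", "upon", "a", "time"))
).toDF("processed_title")

// fit() consumes the array<string> column directly, one row per "sentence".
val model = new Word2Vec()
  .setInputCol("processed_title")
  .setOutputCol("vectors")
  .setVectorSize(3)   // small size, only sensible for a toy corpus
  .setMinCount(0)     // keep every word, however rare
  .setSeed(42L)
  .fit(article)

// Returns a DataFrame with the columns `word` and `similarity`.
val synonyms = model.findSynonyms("once", 2)
synonyms.show()
```

With a corpus this small the similarities are not meaningful; the point is only that no RDD conversion is needed before training.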

See also Spark extracting values from a Row

Community
zero323