I am trying to create a vector of token counts for an LDA analysis in Spark 2.3.0. I have followed several tutorials, and each time they use CountVectorizer to easily convert an Array of Strings into a Vector.
I run this short example in my Databricks notebook:
import org.apache.spark.ml.feature.CountVectorizer
import spark.implicits._  // for toDF; already available in a Databricks notebook

val testW = Seq(
  (8, Array("Zara", "Nuha", "Ayan", "markle")),
  (9, Array("fdas", "test", "Ayan", "markle")),
  (10, Array("qwertzu", "test", "Ayan", "fdaf"))
).toDF("id", "filtered")

// Set params for CountVectorizer and fit it on the data
val vectorizer = new CountVectorizer()
  .setInputCol("filtered")
  .setOutputCol("features")
  .setVocabSize(5)
  .setMinDF(2)
  .fit(testW)

// Create vector of token counts
val articlesCountVector = vectorizer.transform(testW).select("id", "features")
display(articlesCountVector)
and the output is the following (a screenshot of the displayed DataFrame, where the type of the "features" column is reported as udt).
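To double-check the column type, here is a quick schema inspection I can run in the same notebook (printSchema is part of the standard DataFrame API):

// Print the schema of the transformed DataFrame to see how Spark
// reports the type of the "features" column
articlesCountVector.printSchema()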
But in all the tutorials I have read, the type of "features" is vector. Why is it udt in my case?
Did I forget something? Why is it not a vector?
Is it possible to convert it? I cannot create an LDA model with this udt type.
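For context, this is a minimal sketch of how I intend to feed the features into LDA (using the org.apache.spark.ml.clustering.LDA API; the k and maxIter values are just placeholders):

import org.apache.spark.ml.clustering.LDA

// Fit an LDA model on the token-count vectors produced by CountVectorizer
val lda = new LDA()
  .setK(3)           // placeholder number of topics
  .setMaxIter(10)    // placeholder number of iterations
  .setFeaturesCol("features")
val ldaModel = lda.fit(articlesCountVector)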