
I have a DataFrame with the following schema:

root
 |-- journal: string (nullable = true)
 |-- topicDistribution: vector (nullable = true)

The topicDistribution field is a vector of doubles: [0.1, 0.2, 0.15, ...]

What I want is to explode each row into several rows to obtain the following schema:

root
 |-- journal: string
 |-- topic-prob: double // this is the value from the vector
 |-- topic-id: integer // this is the index of the value from the vector

To clarify, I've created a case class:

case class JournalDis(journal: String, topic_id: Integer, prob: Double)

I've managed to achieve this using dataset.explode in a very awkward way:

import org.apache.spark.ml.linalg.DenseVector

val df1 = df.explode("topicDistribution", "topic") {
    topics: DenseVector => topics.toArray.zipWithIndex
}.select("journal", "topic")
val df2 = df1
  .withColumn("topic_id", df1("topic").getItem("_2"))
  .withColumn("topic_prob", df1("topic").getItem("_1"))
  .drop(df1("topic"))

But dataset.explode is deprecated. How can I achieve the same result using the flatMap method?


1 Answer


Not tested but should work:

import spark.implicits._
import org.apache.spark.ml.linalg.Vector

df.as[(String, Vector)].flatMap {
  // emit one JournalDis row per (probability, index) pair in the vector
  case (j, ps) => ps.toArray.zipWithIndex.map {
    case (p, ti) => JournalDis(j, ti, p)
  }
}
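
As a side note, the same reshaping can also be done without leaving the DataFrame API, assuming Spark 2.1+ where posexplode is available. A sketch (also untested) that uses a small UDF to turn the ml Vector into a plain array so posexplode can consume it:

import org.apache.spark.ml.linalg.Vector
import org.apache.spark.sql.functions.{col, posexplode, udf}

// hypothetical helper: convert the ml Vector into a plain array
val toArr = udf((v: Vector) => v.toArray)

// posexplode yields a (pos, col) pair per array element: the index and the value
df.select(col("journal"), posexplode(toArr(col("topicDistribution"))))
  .withColumnRenamed("pos", "topic_id")
  .withColumnRenamed("col", "topic_prob")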
  • Error:(38, 20) Unable to find encoder for type stored in a Dataset. Primitive types (Int, String, etc) and Product types (case classes) are supported by importing spark.implicits._ Support for serializing other types will be added in future releases. val df1 = df.as[(String, DenseVector)].flatMap { – Kyle Wang Dec 26 '16 at 01:38
  • The signature of flatMap: – Kyle Wang Dec 26 '16 at 01:42
  • def flatMap[U](func: (T) ⇒ TraversableOnce[U])(implicit arg0: Encoder[U]): Dataset[U] Permalink (Scala-specific) Returns a new Dataset by first applying a function to all elements of this Dataset, and then flattening the results. – Kyle Wang Dec 26 '16 at 01:43
  • Did you `import spark.implicits._`? – user7337271 Dec 26 '16 at 02:07
  • No, I didn't import spark.implicits._ – Kyle Wang Dec 27 '16 at 08:26
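
As the last comment shows, the missing spark.implicits._ import was the culprit: it brings the tuple and case-class encoders flatMap needs into scope. A minimal, untested end-to-end sketch with the import in place (the object name and input path are hypothetical, and JournalDis sits at top level so Spark can derive its encoder):

import org.apache.spark.sql.SparkSession
import org.apache.spark.ml.linalg.Vector

// defined at top level so Spark can derive an encoder for it
case class JournalDis(journal: String, topic_id: Integer, prob: Double)

object ExplodeTopics {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("explode-topics").getOrCreate()
    import spark.implicits._ // provides the tuple and case-class encoders

    val df = spark.read.parquet("journals.parquet") // hypothetical input

    val journalTopics = df.as[(String, Vector)].flatMap {
      case (j, ps) => ps.toArray.zipWithIndex.map {
        case (p, ti) => JournalDis(j, ti, p)
      }
    }

    journalTopics.show()
    spark.stop()
  }
}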