So I've got this example code where I have a Dataset[Event] that I would like to group by a key of a generic type, computed on the fly.
import org.apache.spark.sql.{ Dataset, KeyValueGroupedDataset }

case class Event(id: Int, name: String)

trait Key
case class NameKey(name: String) extends Key

abstract class EventProc[K <: Key] {
  def key(e: Event): K

  def group(ds: Dataset[Event]): KeyValueGroupedDataset[K, Event] = {
    import ds.sparkSession.implicits._
    ds.groupByKey { e => key(e) }  // this is the line the compiler rejects
  }
}

class NameEventProc extends EventProc[NameKey] {
  def key(e: Event): NameKey = NameKey(e.name)
}
The idea is that I should be able to call new NameEventProc().group(ds), or the equivalent on any other class that extends EventProc. A minimal sketch of the call site I have in mind (the SparkSession named spark and the sample events are just for illustration):
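// Intended usage (sketch): `spark` and the sample events are assumptions.
val spark = org.apache.spark.sql.SparkSession.builder()
  .master("local[*]")
  .appName("events")
  .getOrCreate()
import spark.implicits._

val ds: Dataset[Event] = Seq(Event(1, "a"), Event(2, "b")).toDS()
val grouped: KeyValueGroupedDataset[NameKey, Event] = new NameEventProc().group(ds)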
But the code above does not even compile; it fails with the following error:
<console>:26: error: Unable to find encoder for type stored in a Dataset. Primitive types (Int, String, etc) and Product types (case classes) are supported by importing spark.implicits._ Support for serializing other types will be added in future releases.
ds.groupByKey { e => key(e) }
^
From what I gathered, Spark is unable to work out what type K is, and is hence unable to find the appropriate encoder. But I am not sure how to fix this.
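If I understand the problem correctly, the encoder has to be supplied where the concrete K is known, for example via a context bound. The sketch below is my untested guess at that; the context bound on K and the implicit constructor parameter on NameEventProc are my additions, not part of the original design:

import org.apache.spark.sql.{ Dataset, Encoder, KeyValueGroupedDataset }

// Untested sketch: require an Encoder[K] via a context bound, so that
// groupByKey can resolve it without spark.implicits._ being in scope here.
abstract class EventProc[K <: Key : Encoder] {
  def key(e: Event): K

  def group(ds: Dataset[Event]): KeyValueGroupedDataset[K, Event] =
    ds.groupByKey(key _)  // Encoder[K] now comes from the context bound
}

// The subclass must capture an Encoder[NameKey] when it is constructed,
// e.g. from an `import spark.implicits._` at the call site.
class NameEventProc(implicit enc: Encoder[NameKey]) extends EventProc[NameKey] {
  def key(e: Event): NameKey = NameKey(e.name)
}

Is something along these lines the right way to do it, or am I missing a simpler approach?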