
I am trying to use Structured Streaming in Spark 2.1.1 to read from Kafka and decode Avro-encoded messages. I have a UDF defined as per this question.

import io.confluent.kafka.schemaregistry.client.CachedSchemaRegistryClient
import io.confluent.kafka.serializers.KafkaAvroDeserializer
import org.apache.spark.sql.functions.udf

val sr = new CachedSchemaRegistryClient(conf.kafkaSchemaRegistryUrl, 100)
val deser = new KafkaAvroDeserializer(sr)

val decodeMessage = udf { bytes: Array[Byte] => deser.deserialize("topic.name", bytes).asInstanceOf[DeviceRead] }

val topic = conf.inputTopic
val df = session
    .readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", conf.kafkaServers)
    .option("subscribe", topic)
    .load()

df.printSchema()

val result = df.select($"key".cast("string"), decodeMessage($"value").as("value_des"))

val query = result.writeStream
    .format("console")
    .outputMode(OutputMode.Append())
    .start()

However, I get the following failure:

Exception in thread "main" java.lang.UnsupportedOperationException: Schema for type DeviceRelayStateEnum is not supported

It fails on this line:

val decodeMessage = udf { bytes: Array[Byte] => deser.deserialize("topic.name", bytes).asInstanceOf[DeviceRead] }

An alternate approach was to define encoders for the custom classes I have:

implicit val enumEncoder = Encoders.javaSerialization[DeviceRelayStateEnum]
implicit val messageEncoder = Encoders.product[DeviceRead]

but that fails with the following error when the messageEncoder is defined:

Exception in thread "main" java.lang.UnsupportedOperationException: No Encoder found for DeviceRelayStateEnum
- option value class: "DeviceRelayStateEnum"
- field (class: "scala.Option", name: "deviceRelayState")
- root class: "DeviceRead"
    at org.apache.spark.sql.catalyst.ScalaReflection$.org$apache$spark$sql$catalyst$ScalaReflection$$serializerFor(ScalaReflection.scala:602)
    at org.apache.spark.sql.catalyst.ScalaReflection$.org$apache$spark$sql$catalyst$ScalaReflection$$serializerFor(ScalaReflection.scala:476)
    at org.apache.spark.sql.catalyst.ScalaReflection$$anonfun$9.apply(ScalaReflection.scala:596)
    at org.apache.spark.sql.catalyst.ScalaReflection$$anonfun$9.apply(ScalaReflection.scala:587)
    at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
    at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
    at scala.collection.immutable.List.foreach(List.scala:381)
    at scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:241)

When I attempt to do this with a map after the load(), I get the following compilation error:

val result = df.map((bytes: Row) => deser.deserialize("topic", bytes.getAs[Array[Byte]]("value")).asInstanceOf[DeviceRead])

Error:(76, 26) not enough arguments for method map: (implicit evidence$6: org.apache.spark.sql.Encoder[DeviceRead])org.apache.spark.sql.Dataset[DeviceRead].
Unspecified value parameter evidence$6.
      val result = df.map((bytes: Row) => deser.deserialize("topic", bytes.getAs[Array[Byte]]("value")).asInstanceOf[DeviceRead])
Error:(76, 26) Unable to find encoder for type stored in a Dataset.  Primitive types (Int, String, etc) and Product types (case classes) are supported by importing spark.implicits._  Support for serializing other types will be added in future releases.
      val result = df.map((bytes: Row) => deser.deserialize("topic", bytes.getAs[Array[Byte]]("value")).asInstanceOf[DeviceRead])
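
For what it's worth, the compile error is only about the missing implicit encoder. A minimal sketch, assuming import session.implicits._ is in scope: mapping to an already encodable type such as String compiles and runs, whereas mapping to DeviceRead would still fail with the same "No Encoder found for DeviceRelayStateEnum" error from above.

import session.implicits._

// Compiles and runs: Spark ships a built-in encoder for String
val raw = df.map((row: Row) => new String(row.getAs[Array[Byte]]("value"), "UTF-8"))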

Does that essentially mean that I cannot use Structured Streaming with Java enums, and that it can only be used with primitives or case classes?

I read a few related questions (1, 2, 3) around this, and it seems the ability to specify a custom Encoder for a class (i.e. a UDT) was removed in 2.1 without replacement functionality being added.

Any help will be appreciated.


1 Answer


I think you may be asking for too much in the current version of Structured Streaming (and Spark SQL in general).

I haven't yet managed to work out how to deal with missing encoders in a more principled way, but you'd hit the same issue if you tried to create a Dataset of enums. That may simply not be supported yet.

Structured Streaming is just a streaming library on top of Spark SQL and uses it for serialization-deserialization (SerDe).

To make a long story short, and to get you going until you figure out a better way, I'd recommend avoiding enums in the business objects you use to represent the schema of your datasets.

So I'd recommend doing something along these lines:

val decodeMessage = udf { bytes: Array[Byte] =>
  val dr = deser.deserialize("topic.name", bytes).asInstanceOf[DeviceRead]

  // do additional transformation here so you use a custom streaming-specific class
  // Here I'm using a simple tuple to hold what might be relevant
  // You could create a case class instead to have proper names
  (dr.id, dr.value)
}
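
For completeness, a minimal sketch of the same idea with a named case class instead of a tuple. The id and value fields are the hypothetical ones from the snippet above, and deviceRelayState: Option[DeviceRelayStateEnum] comes from the stack trace in the question:

// Enum-free mirror of DeviceRead: the enum travels as its String name,
// which Spark SQL can encode out of the box (including inside an Option)
case class DeviceReadFlat(id: String, value: Double, deviceRelayState: Option[String])

val decodeMessage = udf { bytes: Array[Byte] =>
  val dr = deser.deserialize("topic.name", bytes).asInstanceOf[DeviceRead]
  DeviceReadFlat(dr.id, dr.value, dr.deviceRelayState.map(_.name))
}

val result = df.select($"key".cast("string"), decodeMessage($"value").as("value_des"))

Because the UDF returns a case class, value_des comes out as a struct column, so nested fields remain addressable, e.g. $"value_des.deviceRelayState".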
  • The trouble is that we used enums and the formats defined above with the direct streaming approach in Spark 1.6, so we can't change them without impacting other applications. I'll look around to see if I can find something else. – Saket Jun 23 '17 at 14:15
  • The temporary mapping is only used while processing data with Spark (SQL and Structured Streaming), so no external application would even know about the internal "hiccup" (see the round-trip sketch below). The direct streaming approach is different since it does not use Spark SQL's encoders. I'd love to hear more about your concerns so I can offer a better solution. Mind elaborating by updating your question? – Jacek Laskowski Jun 23 '17 at 14:22
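
For reference, a tiny sketch of that last point: assuming DeviceRelayStateEnum is a standard Java enum, the String mapping is lossless and can be reversed at the output boundary, so nothing outside Spark ever sees it.

// name()/valueOf() round-trip losslessly for a standard Java enum,
// so the String form stays a purely internal representation
val asString: Option[String] = dr.deviceRelayState.map(_.name)
val backToEnum: Option[DeviceRelayStateEnum] = asString.map(DeviceRelayStateEnum.valueOf)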