7

I use Spark 2.1.

I am trying to read records from Kafka using Spark Structured Streaming, deserialize them and apply aggregations afterwards.

I have the following code:

SparkSession spark = SparkSession
        .builder()
        .appName("Statistics")
        .getOrCreate();

Dataset<Row> df = spark
        .readStream()
        .format("kafka")
        .option("kafka.bootstrap.servers", kafkaUri)
        .option("subscribe", "Statistics")
        .option("startingOffsets", "earliest")
        .load();

df.selectExpr("CAST(value AS STRING)");

What I want is to deserialize the value field into my own object instead of casting it as a String.

I have a custom deserializer for this.

public StatisticsRecord deserialize(String s, byte[] bytes)
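
That signature follows Kafka's Deserializer interface. A minimal sketch of such a deserializer, assuming the records are JSON and using Jackson purely for illustration (this is not my actual implementation), might look like this:

import java.util.Map;
import com.fasterxml.jackson.databind.ObjectMapper;
import org.apache.kafka.common.serialization.Deserializer;

// Illustrative sketch only: parses the Kafka message bytes as JSON into a StatisticsRecord.
public class StatisticsRecordDeserializer implements Deserializer<StatisticsRecord> {

    private final ObjectMapper mapper = new ObjectMapper();

    @Override
    public void configure(Map<String, ?> configs, boolean isKey) { }

    @Override
    public StatisticsRecord deserialize(String topic, byte[] bytes) {
        try {
            return mapper.readValue(bytes, StatisticsRecord.class);
        } catch (java.io.IOException e) {
            throw new RuntimeException("Could not deserialize StatisticsRecord", e);
        }
    }

    @Override
    public void close() { }
}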

How can I do this in Java?


The only relevant link I have found is https://databricks.com/blog/2017/04/26/processing-data-in-apache-kafka-with-structured-streaming-in-apache-spark-2-2.html, but it is for Scala.

dchar

2 Answers

4

Define a schema for your JSON messages.

import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructField;
import org.apache.spark.sql.types.StructType;

StructType schema = DataTypes.createStructType(new StructField[] {
                DataTypes.createStructField("Id", DataTypes.IntegerType, false),
                DataTypes.createStructField("Name", DataTypes.StringType, false),
                DataTypes.createStructField("DOB", DataTypes.DateType, false) });

Now read the messages as below. MessageData is a JavaBean for your JSON message.

Dataset<MessageData> df = spark
            .readStream()
            .format("kafka")
            .option("kafka.bootstrap.servers", kafkaUri)
            .option("subscribe", "Statistics")
            .option("startingOffsets", "earliest")
            .load()
            .selectExpr("CAST(value AS STRING) as message")   // Kafka value bytes -> JSON string
            .select(functions.from_json(functions.col("message"), schema).as("json"))   // parse with the schema
            .select("json.*")   // flatten the struct into top-level columns
            .as(Encoders.bean(MessageData.class));   // bind the columns to the MessageData bean
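
A minimal sketch of what such a MessageData bean might look like, assuming the Id/Name/DOB schema above (adjust the field names and types to your actual JSON; Spark resolves bean properties against schema columns case-insensitively by default):

import java.io.Serializable;
import java.sql.Date;

// Sketch of a JavaBean matching the Id/Name/DOB schema above.
public class MessageData implements Serializable {
    private Integer id;
    private String name;
    private Date dob;   // DateType maps to java.sql.Date

    public Integer getId() { return id; }
    public void setId(Integer id) { this.id = id; }

    public String getName() { return name; }
    public void setName(String name) { this.name = name; }

    public Date getDob() { return dob; }
    public void setDob(Date dob) { this.dob = dob; }
}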
abaghel
  • The schema is correctly applied, but I get null values for all columns. I am trying to read the columns as df.createOrReplaceTempView("data"); StreamingQuery query = spark.sql("SELECT * FROM data").writeStream().format("console").start(); Am I doing something wrong? – dchar May 12 '17 at 14:05
  • You can read the Dataset df directly like below. df.writeStream().format("console").start(); – abaghel May 12 '17 at 14:30
  • 1
    This produced the exact same results. I see the top 20 rows with "null" in all columns. – dchar May 12 '17 at 14:36
  • 1
    `null` are when `from_json` could not deJSONify the input. – Jacek Laskowski May 12 '17 at 14:39
  • Thanks Jacek Laskowski. @dchar Please check your Kafka messages, as the from_json function says "Returns `null`, in the case of an unparseable string." – abaghel May 12 '17 at 14:48
  • I have JSON with hundreds of fields; how is it feasible to create a struct for such a large POJO? – thedevd Jul 19 '18 at 09:49
  • 1
    @dchar I have faced same issue and found out that i was passing doubletype in json and reading the value as integertype , so please check the types you are passing and reading. – vamshi palutla Aug 10 '18 at 09:48
2

If you have a custom deserializer in Java for your data, use it on the bytes you get from Kafka after load().

df.select("value")

That line gives you a Dataset<Row> with just a single value column.


I work exclusively with the Spark API for Scala, so I'd do the following in Scala to handle the "deserialization" case:

import org.apache.spark.sql.Encoders
import org.apache.spark.sql.functions.udf
import spark.implicits._  // for the $"value" column syntax

implicit val statisticsRecordEncoder = Encoders.product[StatisticsRecord]
val myDeserializerUDF = udf { bytes: Array[Byte] => deserialize("hello", bytes) }
df.select(myDeserializerUDF($"value") as "value_des")

That should give you what you want...in Scala. Converting it to Java is your home exercise :)

Mind that your custom object has to have an encoder available or Spark SQL will refuse to put its objects inside a Dataset.
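
For the Java side, a rough sketch of the same idea, assuming StatisticsRecord is a JavaBean and your deserializer is wrapped in a class (named StatisticsRecordDeserializer here only for illustration), is to use a typed map over the raw bytes instead of a UDF:

import org.apache.spark.api.java.function.MapFunction;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Encoders;

// Sketch under the assumptions above: turn the Kafka value bytes into typed records.
Dataset<StatisticsRecord> records = df
        .select("value")                     // the raw Kafka payload column
        .as(Encoders.BINARY())               // Dataset<byte[]>
        .map((MapFunction<byte[], StatisticsRecord>) bytes ->
                        new StatisticsRecordDeserializer().deserialize("Statistics", bytes),
                Encoders.bean(StatisticsRecord.class));

Creating the deserializer inside the lambda keeps the closure serializable; in practice you would reuse one instance, for example per partition via mapPartitions.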

Jacek Laskowski