
I want to write a Spark Streaming job that reads from Kafka and writes to Elasticsearch, and I want to detect the schema dynamically while reading from Kafka.

Can you help me do that?

I know this can be done in Spark batch processing via the line below:

val schema = spark.read.json(dfKafkaPayload.select("value").as[String]).schema

But we cannot do the same in a Spark Streaming job, since a streaming query can have only one action.

Please let me know.

Siva Samraj

1 Answer


If you are reading from a Kafka topic, you cannot rely on Spark to automatically infer the JSON schema: the Kafka source always delivers raw bytes, and inferring the schema on every micro-batch would be far too slow. So you need to provide the schema to your application yourself.
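For the Kafka case, the usual options are to declare a StructType by hand, or to infer the schema once with a one-off batch read and reuse it in the streaming query. Below is a minimal sketch of the second option; the broker address, topic name ("events"), and Elasticsearch index are assumptions, and the "es" sink requires the elasticsearch-spark connector on the classpath.

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, from_json}

val spark = SparkSession.builder().appName("kafka-to-es").getOrCreate()
import spark.implicits._

// 1) One-off batch read of the topic, used only to infer the JSON schema.
val batch = spark.read
  .format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092") // assumed broker
  .option("subscribe", "events")                       // assumed topic
  .option("startingOffsets", "earliest")
  .load()
val schema = spark.read.json(batch.selectExpr("CAST(value AS STRING)").as[String]).schema

// 2) Streaming read that parses the Kafka bytes with the pre-computed schema.
val stream = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092")
  .option("subscribe", "events")
  .load()
  .select(from_json(col("value").cast("string"), schema).as("data"))
  .select("data.*")

// 3) Write to Elasticsearch (assumes the elasticsearch-spark connector).
stream.writeStream
  .format("es")
  .option("checkpointLocation", "/tmp/kafka-to-es-ckpt")
  .start("my-index") // hypothetical index name
  .awaitTermination()

Note that the schema is frozen at start-up: fields the producer adds later are silently dropped by from_json, so this is "dynamic" only across restarts, not within a running query.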

If you are reading from a file source, though, you can enable schema inference:

spark.conf.set("spark.sql.streaming.schemaInference", "true")
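For completeness, a minimal file-source sketch (the input path "/data/in" is a hypothetical placeholder):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("file-stream").getOrCreate()
spark.conf.set("spark.sql.streaming.schemaInference", "true")

// With schemaInference enabled, Spark infers the schema from the
// files already present in the directory at start-up.
val df = spark.readStream.json("/data/in")
df.writeStream.format("console").start().awaitTermination()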
Enes Uğuroğlu
  • The question states the data is from a Kafka source, not a file. The Kafka source is always bytes. – OneCricketeer Dec 15 '21 at 14:59
  • Hello OneCricketeer, I already posted my answer about the Kafka source; the file source was only additional information for ad-hoc schema inference :) – Enes Uğuroğlu Dec 15 '21 at 15:09
  • Sorry, got confused when the answer included a file source... In any case, I think the true answer is to not put "dynamic" JSON into a Kafka topic at all; the producer's schema should remain consistent. – OneCricketeer Dec 15 '21 at 15:13