
The internet is full of examples of streaming data from a Kafka topic to Delta tables, but my requirement is the reverse: streaming data from a Delta table to a Kafka topic. Is that possible? If yes, can you please share a code example? Here is the code I tried.

import org.apache.spark.sql.avro.functions.to_avro // the Schema Registry overloads are Databricks Runtime-specific
import org.apache.spark.sql.functions.{col, lit, struct}
import org.apache.spark.sql.streaming.Trigger
import org.apache.spark.sql.types.StringType
import spark.implicits._ // for the $"..." column syntax

val schemaRegistryAddr = "https://..."
val avroSchema = buildSchema(topic) // defined this method; returns the value schema as an Avro JSON string

val df = spark.readStream.format("delta").load("path..")
  .withColumn("key", col("lskey").cast(StringType))
  .withColumn("topLevelRecord", struct(col("col1"), col("col2") /* ...remaining columns... */))
  .select(
    to_avro($"key", lit("topic-key"), schemaRegistryAddr).as("key"),
    to_avro($"topLevelRecord", lit("topic-value"), schemaRegistryAddr, avroSchema).as("value"))


df.writeStream
  .format("kafka")
  .option("checkpointLocation", checkpointPath)
  .option("kafka.bootstrap.servers", bootstrapServers)
  .option("kafka.security.protocol", "SSL")
  .option("kafka.ssl.keystore.location", kafkaKeystoreLocation)
  .option("kafka.ssl.keystore.password", keystorePassword)
  .option("kafka.ssl.truststore.location", kafkaTruststoreLocation)
  .option("topic", topic)
  .option("kafka.batch.size", 262144) // producer configs need the "kafka." prefix to reach the producer
  .option("kafka.linger.ms", 5000)
  .trigger(Trigger.ProcessingTime("25 seconds"))
  .start()

But it fails with: org.spark_project.confluent.kafka.schemaregistry.client.rest.exceptions.RestClientException: Schema not found; error code: 40403

But when I try to write to the same topic using a batch producer, it goes through successfully. Can anyone please let me know what I am missing in the streaming write to the Kafka topic?
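Since error code 40403 means the schema was not found in the registry, one thing I am considering is registering the key and value schemas up front, before starting the stream. Below is a minimal sketch of that idea; it assumes a Confluent-compatible registry, the default TopicNameStrategy subject names, and an older client version where register() accepts an org.apache.avro.Schema (newer clients take a ParsedSchema wrapper instead).

import io.confluent.kafka.schemaregistry.client.CachedSchemaRegistryClient
import org.apache.avro.Schema

// Sketch: pre-register the key/value schemas under the subjects that
// TopicNameStrategy derives from the topic name, so the lookup done by
// to_avro can succeed.
val registryClient = new CachedSchemaRegistryClient(schemaRegistryAddr, 128)

val keySchema   = Schema.create(Schema.Type.STRING)     // the key is a plain string
val valueSchema = new Schema.Parser().parse(avroSchema) // assumes avroSchema is an Avro JSON string

registryClient.register(s"$topic-key", keySchema)
registryClient.register(s"$topic-value", valueSchema)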

Later I found this old blog post, which says that the current Structured Streaming API does not support the 'kafka' format: https://www.databricks.com/blog/2017/04/04/real-time-end-to-end-integration-with-apache-kafka-in-apache-sparks-structured-streaming.html?_ga=2.177174565.1658715673.1672876248-681971438.1669255333
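For reference, a streaming Delta-to-Kafka write that bypasses the Schema Registry entirely needs none of the Avro machinery, which can help isolate whether the failure is in the Kafka sink or in to_avro. A minimal sketch with JSON-encoded values, reusing the placeholder path and variables from above:

import org.apache.spark.sql.functions.{col, struct, to_json}

// Sketch: same Delta source, but plain JSON string values instead of
// Schema-Registry-backed Avro.
val plainDf = spark.readStream
  .format("delta")
  .load("path..")
  .select(
    col("lskey").cast("string").as("key"),
    to_json(struct(col("col1"), col("col2"))).as("value"))

plainDf.writeStream
  .format("kafka")
  .option("kafka.bootstrap.servers", bootstrapServers)
  .option("topic", topic)
  .option("checkpointLocation", checkpointPath) // each stream needs its own checkpoint location
  .start()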

edited by Oli · asked by Don Sam
  • Can you try using Delta Lake's CDC feature, so you can stream the change data capture stream to Kafka? – Denny Lee Jan 05 '23 at 04:06 (see the change data feed sketch after these comments)
  • I don't know how that's going to help, but I will look into it. Later today I found an old Databricks blog where it's mentioned that the current Spark Structured Streaming API does not support the Kafka format. https://www.databricks.com/blog/2017/04/04/real-time-end-to-end-integration-with-apache-kafka-in-apache-sparks-structured-streaming.html?_ga=2.177174565.1658715673.1672876248-681971438.1669255333 – Don Sam Jan 05 '23 at 08:02
  • 3
    Databricks works just fine with Kafka for both writes & reads. The problem here is about Schema Registry support... Do you have authentication enabled on the schema registry? – Alex Ott Jan 05 '23 at 10:36
  • @AlexOtt But I am able to write to the same Kafka topic in batch mode, and it doesn't give any SchemaNotFound exception. The problem starts only when I try to write to the Kafka topic in streaming mode, i.e. using writeStream. – Don Sam Jan 06 '23 at 04:55
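To check the authentication question from the comments, the registry's REST API can be probed directly; GET /subjects lists all registered subjects, and a 401/403 response here would point to an authentication problem. A minimal sketch, run from the driver:

import scala.io.Source

// Sketch: list the registry's subjects. With TopicNameStrategy, the
// expected entries are "<topic>-key" and "<topic>-value".
val subjects = Source.fromURL(s"$schemaRegistryAddr/subjects").mkString
println(subjects)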
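For Denny Lee's CDC suggestion, Delta Lake's change data feed can itself be read as a stream and fed to the same Kafka sink. A minimal sketch, assuming the table was created with the table property delta.enableChangeDataFeed = true and reusing the placeholder variables from the question:

import org.apache.spark.sql.functions.{col, struct, to_json}

// Sketch: stream the table's change data feed instead of the raw rows.
// Each row carries _change_type, _commit_version and _commit_timestamp
// alongside the data columns.
val cdfDf = spark.readStream
  .format("delta")
  .option("readChangeFeed", "true")
  .load("path..")

cdfDf
  .select(
    col("lskey").cast("string").as("key"),
    to_json(struct(col("col1"), col("col2"), col("_change_type"))).as("value"))
  .writeStream
  .format("kafka")
  .option("kafka.bootstrap.servers", bootstrapServers)
  .option("topic", topic)
  .option("checkpointLocation", checkpointPath) // each stream needs its own checkpoint location
  .start()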

0 Answers