
Firstly, is it possible to writeStream to the same Kafka topic from two different streaming queries? If yes, how do I readStream from such a topic? Thanks. Reference code snippet:

 val StreamingQuery1 = DataFrame1.selectExpr("to_json(struct(*)) AS value")
        .writeStream
        .format("kafka")
        .option("topic", Topic)
        .queryName("Query1")
        .option("kafka.bootstrap.servers", kafkaBootstrapServer)
        .option("checkpointLocation",checkpointPath)
        .option("kafka.sasl.mechanism", "PLAIN")
        .option("kafka.security.protocol", "SASL_SSL")
        .option("kafka.sasl.jaas.config", saslJaasCfg)
        .option("kafka.timeout.ms", 18000)
        .option("kafka.request.timeout.ms", 18000)
        .option("kafka.session.timeout.ms", 18000)
        .option("kafka.heartbeat.interval.ms", 18000)
        .option("kafka.retries", 100)
        .option("failOnDataLoss", "false")
        .option("truncate", false)
        .start()

 val StreamingQuery2 = DataFrame2.selectExpr("to_json(struct(*)) AS value")
        .writeStream
        .format("kafka")
        .option("topic", Topic)
        .queryName("Query2")
        .option("kafka.bootstrap.servers", kafkaBootstrapServer)
        .option("checkpointLocation",checkpointPath)
        .option("kafka.sasl.mechanism", "PLAIN")
        .option("kafka.security.protocol", "SASL_SSL")
        .option("kafka.sasl.jaas.config", saslJaasCfg)
        .option("kafka.timeout.ms", 18000)
        .option("kafka.request.timeout.ms", 18000)
        .option("kafka.session.timeout.ms", 18000)
        .option("kafka.heartbeat.interval.ms", 18000)
        .option("kafka.retries", 100)
        .option("failOnDataLoss", "false")
        .option("truncate", false)
        .start()
        .awaitTermination()
OneCricketeer

1 Answer


Yes, it's possible to write into the same Kafka topic from multiple streaming queries. And yes, it's possible to read this data back - you just need a single readStream on the topic (or on a list of topics if you need).
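
A minimal sketch of such a readStream, assuming a SparkSession named spark and the same kafkaBootstrapServer, Topic, and saslJaasCfg values as in the question - messages produced by both queries arrive in this one stream:

 val InputDF = spark.readStream
        .format("kafka")
        .option("kafka.bootstrap.servers", kafkaBootstrapServer)
        .option("subscribe", Topic)                       // or a comma-separated list of topics
        .option("kafka.security.protocol", "SASL_SSL")
        .option("kafka.sasl.mechanism", "PLAIN")
        .option("kafka.sasl.jaas.config", saslJaasCfg)
        .option("startingOffsets", "latest")
        .load()
        .selectExpr("CAST(value AS STRING) AS value")     // rows written by both Query1 and Query2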

The beauty of Kafka (and other similar systems) is that it decouples producers and consumers, so you can have 1:N, N:1, or N:M combinations as you need.

Update after receiving the code:

The problem could be the checkpoint location, as both writeStream operations point to the same place: .option("checkpointLocation", checkpointPath). Each streaming query needs its own checkpoint directory.
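
For example, a sketch where each query gets its own checkpoint directory (the /query1 and /query2 suffixes are just illustrative; the SASL and timeout options from the question are omitted for brevity):

 val StreamingQuery1 = DataFrame1.selectExpr("to_json(struct(*)) AS value")
        .writeStream
        .format("kafka")
        .queryName("Query1")
        .option("topic", Topic)
        .option("kafka.bootstrap.servers", kafkaBootstrapServer)
        .option("checkpointLocation", s"$checkpointPath/query1")   // unique per query
        .start()

 val StreamingQuery2 = DataFrame2.selectExpr("to_json(struct(*)) AS value")
        .writeStream
        .format("kafka")
        .queryName("Query2")
        .option("topic", Topic)
        .option("kafka.bootstrap.servers", kafkaBootstrapServer)
        .option("checkpointLocation", s"$checkpointPath/query2")   // unique per query
        .start()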

Also, instead of waiting for one specific stream to finish, it could be better to call spark.streams.awaitAnyTermination() and then check which of the streams has finished.
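
A hedged sketch of that pattern, assuming both queries were started on the same SparkSession (spark):

 // Block until any of the started queries terminates (or fails).
 spark.streams.awaitAnyTermination()

 // A terminated query is no longer listed in spark.streams.active,
 // so whatever remains here is still running.
 spark.streams.active.foreach(q => println(s"still running: ${q.name}"))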

Alex Ott
  • Thanks Alex Ott. Do I need to put some extra options in both the writeStream queries? Because what I am noticing is: when I use awaitAnyTermination(), the Spark application succeeds after a few minutes (which it ideally shouldn't, since it's a streaming job and should run indefinitely), and when I use awaitTermination() on the second query, the first query finishes and only the second keeps running. I am badly stuck here. – Nitish Joshi Apr 03 '21 at 08:19
  • It may depend on different factors, like whether you're using Trigger.Once or not. It's hard to say without looking at the actual code. – Alex Ott Apr 03 '21 at 08:32
  • @Alex Ott I have edited the question to include a code snippet, as I was unable to put code in the comments. Can you please have a look? – Nitish Joshi Apr 03 '21 at 09:12
  • Thanks, that worked (that was one bad silly mistake); both queries are running fine now. But now, when I do readStream, I can see messages from Query1 only, even though both are running fine using FAIR scheduling. – Nitish Joshi Apr 03 '21 at 10:41
  • You can put the queries into different pools: https://docs.databricks.com/spark/latest/structured-streaming/production.html#configure-apache-spark-scheduler-pools-for-efficiency - this may happen because you didn't configure a trigger, so the next job for Query1 starts right after the previous one finishes. You can set a trigger to process data every N seconds, so the 2nd query can get its share as well (see the sketch after these comments). – Alex Ott Apr 03 '21 at 10:44
  • Tried it, it isn't working. Only Query1 is producing output. The issue looks similar to https://stackoverflow.com/questions/63075499/running-multiple-spark-kafka-structured-streaming-queries-in-same-spark-session Does this mean it isn't possible to push messages to the same Kafka topic using two different queries? – Nitish Joshi Apr 03 '21 at 12:57
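
A hedged sketch of the pools-plus-trigger suggestion from the comments above. Here writer1 and writer2 stand for the two configured writeStream builders from the question (everything up to, but not including, .start()); the pool names and the 10-second interval are illustrative, not from the thread:

 import org.apache.spark.sql.streaming.Trigger

 // Run each query in its own FAIR scheduler pool and add a processing-time
 // trigger so neither query monopolizes the cluster between micro-batches.
 spark.sparkContext.setLocalProperty("spark.scheduler.pool", "pool1")
 val q1 = writer1.trigger(Trigger.ProcessingTime("10 seconds")).start()

 spark.sparkContext.setLocalProperty("spark.scheduler.pool", "pool2")
 val q2 = writer2.trigger(Trigger.ProcessingTime("10 seconds")).start()

 spark.streams.awaitAnyTermination()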