
In Spark Streaming, I am getting logs as they arrive, but I want each micro-batch to contain at least N logs in a single pass. How can this be achieved?

From this answer, it appears that such a utility exists in Kafka, but it doesn't seem to be available in Spark.


1 Answer


There is no option that lets you set a minimum number of messages to be received from Kafka. The option maxOffsetsPerTrigger only lets you set a maximum.

If you want your micro-batch to process more messages at once, you may consider increasing your trigger interval.
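
A minimal sketch of those two knobs (topic name, broker address, and the console sink are placeholder assumptions on my side):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.streaming.Trigger

val spark = SparkSession.builder().appName("kafka-batching").getOrCreate()

// maxOffsetsPerTrigger caps how many offsets a micro-batch may read
// (an upper bound only; there is no lower-bound counterpart), while a
// longer trigger interval simply lets more messages accumulate.
val df = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "host:port") // placeholder
  .option("subscribe", "logs")                    // placeholder topic
  .option("maxOffsetsPerTrigger", "10000")
  .load()

df.writeStream
  .format("console")
  .trigger(Trigger.ProcessingTime("60 seconds")) // longer interval => larger batches
  .start()
  .awaitTermination()
```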

Also (referring to the link you have provided), this is not possible to set in Kafka itself: you can set a minimum number of fetched bytes, but not a minimum number of messages.

Note that you can pass all Kafka options to the readStream in Structured Streaming via the kafka. prefix, as explained in the section Kafka Specific Configurations:

"Kafka’s own configurations can be set via DataStreamReader.option with kafka. prefix, e.g, stream.option("kafka.bootstrap.servers", "host:port")."

That way, you could also play around with the Consumer Configuration kafka.fetch.min.bytes. However, when testing this with Spark 3.0.1 against a local Kafka 2.5.0 installation, it did not have any impact. When adding the configuration kafka.fetch.max.wait.ms, the fetch timing in my tests did change, but not in a predictable manner (at least to me).
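
That experiment looked roughly like this (the localhost broker and topic name are assumptions; as said, neither option had an observable effect on batch sizes in my tests):

```scala
// Forwarding the consumer's fetch tuning options via the kafka. prefix.
val fetchTest = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092") // assumed local broker
  .option("subscribe", "logs")                         // placeholder topic
  .option("kafka.fetch.min.bytes", "100000")  // broker should wait for ~100 KB...
  .option("kafka.fetch.max.wait.ms", "5000")  // ...or at most 5 seconds
  .load()
```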

Looking at the source code of Spark's KafkaDataConsumer, it looks like the fetch does not directly account for any min/max bytes, unlike the pure KafkaConsumer.

  • Thanks. I tried the workaround option("kafka.fetch.min.bytes","1000") but it's not working, though there is no error either. – Mr. Sigma. Jan 22 '21 at 08:47
  • There are multiple JSON strings being pushed by a producer. I want only 5-6 JSON strings to be fetched in a single go. So I set `option("kafka.fetch.min.bytes","100000")` to check whether the consumer would still fetch a single JSON string at a time, only to find that it does, i.e. it's not honoring the given setting. – Mr. Sigma. Jan 22 '21 at 08:59
  • @Mr.Sigma. Apologies, I was testing this today and also realised that this configuration does not have any impact. It looks like only the trigger time and maxOffsetsPerTrigger have any direct impact on the Kafka fetcher. Sorry for the confusion; I have updated my answer accordingly. – Michael Heil Jan 22 '21 at 21:15