
We have some historical data queued up on our topics, and we don't want to process all of it in a single batch, as that is harder to do (and if it fails, it has to start again!).

Also, knowing how to control the batch size would be quite helpful in tuning jobs.

When using DStreams, the way to control the batch size as exactly as possible is described in Limit Kafka batches size when using Spark Streaming.

That approach, i.e. setting maxRatePerPartition and then tuning batchDuration, is extremely cumbersome but works with DStreams; it doesn't work at all with Structured Streaming.
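As a back-of-envelope sketch, the batch size under the DStream approach falls out of the two knobs together (this arithmetic is an assumed relationship between the settings, not an API):

```python
# Rough DStream sizing arithmetic (assumed relationship, not an API):
# records per batch ~= maxRatePerPartition * numPartitions * batchDuration (s)
def approx_batch_size(max_rate_per_partition, num_partitions, batch_duration_s):
    return max_rate_per_partition * num_partitions * batch_duration_s

# e.g. 1000 records/s/partition, 4 partitions, 5-second batches -> 20000 records
print(approx_batch_size(1000, 4, 5))
```

This is why the tuning is cumbersome: changing either knob changes the batch size, so the two must be retuned together.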

Ideally I'd like config options like maxBatchSize and minBatchSize, where I could simply set the number of records I want.

zero323
samthebest

2 Answers


The config option maxOffsetsPerTrigger does this:

Rate limit on maximum number of offsets processed per trigger interval. The specified total number of offsets will be proportionally split across topicPartitions of different volume.
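The "proportionally split" behaviour can be illustrated with a toy sketch in plain Python (`split_offsets` is a hypothetical helper for illustration, not Spark's actual implementation):

```python
def split_offsets(limit, lag_per_partition):
    """Split a total offset limit across partitions in proportion to each
    partition's available lag (toy sketch, not Spark's real code)."""
    total_lag = sum(lag_per_partition.values())
    if total_lag == 0:
        return {tp: 0 for tp in lag_per_partition}
    return {
        tp: min(lag, int(limit * lag / total_lag))
        for tp, lag in lag_per_partition.items()
    }

# A partition with 3x the backlog gets 3x the share of the limit.
lags = {"topic-0": 1000, "topic-1": 3000}
print(split_offsets(500, lags))  # {'topic-0': 125, 'topic-1': 375}
```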

Note that if you have a checkpoint directory with start and end offsets, then the application will process the offsets in the directory for the first batch, thus ignoring this config. (The next batch will respect it).
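A minimal PySpark sketch of wiring this up (a hedged example, assuming the spark-sql-kafka connector is on the classpath; the broker address, topic name, and the 10000 / 10-second figures are placeholders). The trigger interval is what controls how often a micro-batch fires:

```python
# Hedged sketch: cap each micro-batch at 10000 offsets and fire a new
# batch every 10 seconds. Broker address and topic name are placeholders.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("bounded-batches").getOrCreate()

df = (spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "host:9092")
      .option("subscribe", "events")
      .option("maxOffsetsPerTrigger", 10000)  # cap offsets read per micro-batch
      .load())

query = (df.writeStream
         .format("console")
         .trigger(processingTime="10 seconds")  # fixed-interval micro-batches
         .start())
```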

samthebest
10465355
  • Will this throttle my job? What exactly is a trigger interval? Will it read from Kafka as quickly as possible, but just limit the number of records read? – samthebest Oct 25 '18 at 07:10
  • You can use this instead and handle the offsets yourself, which will be more predictable/flexible than Structured Streaming. https://stackoverflow.com/a/53065951/1586965 – samthebest May 13 '19 at 16:03
  • @samthebest This works well to limit `batchSize`. What option shall be used to limit/control `trigger-frequency`? (Something similar to `Duration.class` in Spark Streaming). – CᴴᴀZ Aug 22 '19 at 05:27
  • 2
    @CᴴᴀZ Concept you're looking for is [trigger](https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#triggers), specifically _Fixed interval micro-batches_. – 10465355 Aug 22 '19 at 17:47

If the topic is partitioned and all partitions have messages, the minimum number of messages you can take equals the number of partitions in the topic, i.e. one record is taken from each partition that has data. If only one partition has data, the minimum you can take is one record. If the topic is not partitioned, you can take a minimum of one record and anything as a maximum.
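In other words, the smallest possible batch takes one record from each partition that currently holds data. A toy sketch (`min_batch_size` is a hypothetical illustration, not a Spark or Kafka API):

```python
def min_batch_size(records_per_partition):
    """Smallest possible batch: one record from every partition that
    currently has data (toy illustration, not a real API)."""
    return sum(1 for n in records_per_partition if n > 0)

print(min_batch_size([5, 0, 3]))  # partitions 0 and 2 have data -> 2
print(min_batch_size([0, 0, 7]))  # only one partition has data  -> 1
```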

Raptor0009