
Based on this, Databricks Runtime >= 10.2 supports the "availableNow" trigger, which can be used to perform batch processing in smaller, distinct microbatches. The size of each microbatch can be capped either by number of files (maxFilesPerTrigger) or by total size in bytes (maxBytesPerTrigger). For my purposes, I am currently using both, with the following values:

maxFilesPerTrigger = 10000
maxBytesPerTrigger = "10gb"
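For context, here is a minimal sketch of how these options are wired up in my job. The paths, schema, and the Parquet source format are placeholders, not my actual configuration:

```python
# Sketch only: paths, schema, and source format are placeholders.
df = (
    spark.readStream
    .format("parquet")
    .schema(input_schema)                 # schema assumed to be defined elsewhere
    .option("maxFilesPerTrigger", 10000)  # cap each microbatch at 10,000 files
    .option("maxBytesPerTrigger", "10gb") # or at ~10 GB, whichever is hit first
    .load("/path/to/daily/batch")
)

query = (
    df.writeStream
    .format("delta")
    .option("checkpointLocation", "/path/to/checkpoint")
    .trigger(availableNow=True)           # process all available data, then stop
    .start("/path/to/output")
)
```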

My daily batch includes somewhere between 15,000 and 20,000 files. Having set a pretty high "maxBytesPerTrigger" limit, at least relative to my data, I'd expect each batch run to form two microbatches: the first containing 10,000 files, and the second containing the rest. However, it is always the case that three microbatches are formed, with the first one containing a pretty small number of files (about 500 or so), the second containing 10,000 files, and the third containing the rest, usually 5,000 to 10,000 files.

Does anyone have an idea as to why there are three microbatches instead of two, with the first one always containing a small number of files?
