
Based on this, Databricks Runtime >= 10.2 supports the "availableNow" trigger, which can be used to perform batch processing in smaller, distinct microbatches. The size of each microbatch can be capped either by number of files (maxFilesPerTrigger) or by total size in bytes (maxBytesPerTrigger). For my purposes, I am currently using both, with the following values:

maxFilesPerTrigger = 10000
maxBytesPerTrigger = "10gb"
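For context, here is a minimal sketch of how these options are wired up in my job. The paths, schema, and the Parquet source format are placeholders, not my actual configuration:

```python
# Sketch only: paths, schema, and source format are placeholders.
df = (
    spark.readStream
    .format("parquet")
    .schema(input_schema)                 # schema assumed to be defined elsewhere
    .option("maxFilesPerTrigger", 10000)  # cap each microbatch at 10,000 files
    .option("maxBytesPerTrigger", "10gb") # or at ~10 GB, whichever is hit first
    .load("/path/to/daily/batch")
)

query = (
    df.writeStream
    .format("delta")
    .option("checkpointLocation", "/path/to/checkpoint")
    .trigger(availableNow=True)           # process all available data, then stop
    .start("/path/to/output")
)
```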

My daily batch includes somewhere between 15,000 and 20,000 files. Having set a pretty high "maxBytesPerTrigger" limit, at least relative to my data, I'd expect each batch run to form two microbatches: the first containing 10,000 files, and the second containing the rest. However, it is always the case that three microbatches are formed, with the first one containing a pretty small number of files (about 500 or so), the second containing 10,000 files, and the third containing the rest, usually 5,000 to 10,000 files.

Does anyone have an idea as to why there are three microbatches instead of two, with the first one always containing a small number of files?
