What is the difference between chunk size and partition size in Spring Batch?

I am not referring to Spring Batch partitioning, which is explained briefly here.

I am referring to DEFAULT_PARTITION_SIZE, which is also supported by Spring Batch. I am setting the value of this property as below:

    jobExecution.getExecutionContext().put("DEFAULT_PARTITION_SIZE", 300);

For my project I have a chunk size of 25 and a partition size of 300, and I want to know the difference between the two. I understand that chunk size refers to reading the data one item at a time and creating 'chunks' that are written out within a transaction boundary. But there is not much explanation of partition size in the Spring Batch docs or elsewhere on the internet.
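
For context, my step is configured roughly like this (the reader/writer beans and the LineItem type are simplified placeholders, not my actual code):

    import org.springframework.batch.core.Step;
    import org.springframework.batch.core.configuration.annotation.StepBuilderFactory;
    import org.springframework.batch.item.ItemReader;
    import org.springframework.batch.item.ItemWriter;
    import org.springframework.context.annotation.Bean;

    // Chunk-oriented step: reads items one at a time, accumulates 25 of them,
    // then writes the whole chunk within a single transaction.
    @Bean
    public Step fileStep(StepBuilderFactory steps,
                         ItemReader<LineItem> reader,   // LineItem is a placeholder domain type
                         ItemWriter<LineItem> writer) {
        return steps.get("fileStep")
                .<LineItem, LineItem>chunk(25) // commit interval = 25
                .reader(reader)
                .writer(writer)
                .build();
    }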

With a chunk size of 25 and a partition size of 300, I was expecting that 25 records would be written to the output file in each go. But in fact 300 records are getting written to the output file in each go. Why is this?


2 Answers


From this link and the documentation, it seems that chunk size dictates the number of items processed before a batch is committed: if it is set to 25, the reader reads 25 items, the processor processes them, and then the writer writes those 25 items in one transaction. DEFAULT_PARTITION_SIZE, on the other hand, seems to be a custom or less-documented parameter, most likely dictating how many records go into a single partition; if set to 300, each partition would comprise 300 records. You could try asking on the AWS Batch support forum if you are sure it is not covered there.
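
If that value is consumed by your own Partitioner, one plausible wiring is a step-scoped bean that reads it back out of the job's ExecutionContext. This is only a sketch under that assumption; DEFAULT_PARTITION_SIZE is not a documented framework constant, and the bean, key names, and record count here are illustrative:

    import java.util.HashMap;
    import java.util.Map;

    import org.springframework.batch.core.configuration.annotation.StepScope;
    import org.springframework.batch.core.partition.support.Partitioner;
    import org.springframework.batch.item.ExecutionContext;
    import org.springframework.beans.factory.annotation.Value;
    import org.springframework.context.annotation.Bean;

    // Sketch: a Partitioner that reads DEFAULT_PARTITION_SIZE from the job's
    // ExecutionContext and cuts the input into pieces of that many records.
    @Bean
    @StepScope
    public Partitioner partitioner(
            @Value("#{jobExecutionContext['DEFAULT_PARTITION_SIZE']}") Integer partitionSize) {
        int totalRecords = 3_000; // placeholder: however you determine the input size
        return gridSize -> {
            Map<String, ExecutionContext> partitions = new HashMap<>();
            for (int i = 0, start = 0; start < totalRecords; i++, start += partitionSize) {
                ExecutionContext ctx = new ExecutionContext();
                ctx.putInt("minIndex", start);
                ctx.putInt("maxIndex", Math.min(start + partitionSize, totalRecords) - 1);
                partitions.put("partition" + i, ctx);
            }
            return partitions;
        };
    }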


The partition size is also known as the grid size, which is passed as input to the StepExecutionSplitter. Its meaning, quoted from the Javadoc, is as follows:

Partition the provided StepExecution into a set of parallel executable instances with the same parent JobExecution. The grid size will be treated as a hint for the size of the collection to be returned. It may or may not correspond to the physical size of an execution grid.

Basically, it means you want to break the input data of a job down into different partitions. All data in a partition is then processed by a separate instance of a step that you define. How these partitions are processed depends on which PartitionHandler implementation you use. Normally you would execute the step for each partition concurrently on an AsyncTaskExecutor to improve performance.
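
A minimal sketch of that wiring, assuming a Partitioner bean and a chunk-oriented workerStep already exist (all names here are illustrative):

    import org.springframework.batch.core.Step;
    import org.springframework.batch.core.configuration.annotation.StepBuilderFactory;
    import org.springframework.batch.core.partition.support.Partitioner;
    import org.springframework.context.annotation.Bean;
    import org.springframework.core.task.SimpleAsyncTaskExecutor;

    // Master step: asks the Partitioner for (up to) gridSize partitions and
    // runs workerStep once per partition, each on its own thread.
    @Bean
    public Step masterStep(StepBuilderFactory steps,
                           Partitioner partitioner,
                           Step workerStep) {
        return steps.get("masterStep")
                .partitioner("workerStep", partitioner)
                .step(workerStep)
                .gridSize(300) // the grid-size "hint" from the Javadoc above
                .taskExecutor(new SimpleAsyncTaskExecutor()) // partitions run concurrently
                .build();
    }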

Suppose the job has 30,000 items to process and you divide them into 300 partitions (i.e. partition size = 300), as follows (a Partitioner producing this split is sketched after the list):

  • Partition 1 processes item 1 to item 100
  • Partition 2 processes item 101 to item 200
  • ...
  • Partition 299 processes item 29801 to item 29900
  • Partition 300 processes item 29901 to item 30000
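
A sketch of a Partitioner that produces exactly that split; the minItem/maxItem keys are just a convention for the worker steps to read back (similar to the framework's ColumnRangePartitioner sample):

    import java.util.HashMap;
    import java.util.Map;

    import org.springframework.batch.core.partition.support.Partitioner;
    import org.springframework.batch.item.ExecutionContext;

    // Splits 30,000 items into gridSize (= 300) ranges of 100 items each.
    // Each worker step reads its own range back from its ExecutionContext.
    public class RangePartitioner implements Partitioner {

        private static final int TOTAL_ITEMS = 30_000;

        @Override
        public Map<String, ExecutionContext> partition(int gridSize) {
            int itemsPerPartition = TOTAL_ITEMS / gridSize; // 100 items each
            Map<String, ExecutionContext> partitions = new HashMap<>();
            for (int i = 0; i < gridSize; i++) {
                ExecutionContext context = new ExecutionContext();
                context.putInt("minItem", i * itemsPerPartition + 1);   // 1, 101, 201, ...
                context.putInt("maxItem", (i + 1) * itemsPerPartition); // 100, 200, 300, ...
                partitions.put("partition" + (i + 1), context);
            }
            return partitions;
        }
    }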

Then the items in each partition are executed by its own step instance (ChunkOrientedTasklet). If the chunk size is 25, the step in each partition will process its items 25 at a time, batch by batch; with 100 items per partition, that means 4 chunks (and 4 transaction commits) per worker step.

With a chunk size of 25 and a partition size of 300, I was expecting that 25 records would be written to the output file in each go. But in fact 300 records are getting written to the output file in each go. Why is this?

It depends entirely on your implementation: how you implement the Partitioner, how you access the data you set up in the ExecutionContext inside the step, and so on. Since you have not posted that code, no one can say for sure. For example, if each partition holds 300 records and your writer (or whatever collects the output) flushes once per partition rather than once per chunk, you would see 300 records appear at a time.

You can refer to this for more details about how partitioning works in Spring Batch.
