This has been asked here before, but I don't think it was actually answered. The only answer talks about how the aggregator uses the correlation id, but the real issue is how the job status is updated without checking the JobExecutionId in the replies. I don't have enough reputation to comment on the existing question, so I am asking again here.
According to the javadoc on MessageChannelPartitionHandler, it is supposed to be step- or job-scoped. In a remote partitioning scenario we use RemotePartitioningManagerStepBuilder to build the manager step, which does not allow setting a PartitionHandler. Given that every job uses the same reply queue on RabbitMQ, replies from worker nodes can get crossed between jobs. There is no simple way to reproduce this, but I can observe the behavior with the manual steps below:
- Launch first job
- Kill the manager node before worker can reply
- Let worker node finish handling all partitions and send a reply on rabbitmq
- Start manager node again and launch a new job
- Have some mechanism to fail the second job, e.g. throw an exception in the reader/writer
- Check the status of 2 jobs
Expected result: job-1 is marked COMPLETED and job-2 FAILED
Actual result: job-1 remains STARTED and job-2 is marked COMPLETED even though its worker steps are marked FAILED
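My understanding of why this happens, as a plain-Java sketch rather than actual Spring Batch internals: the reply consumption on the shared queue is effectively keyed by step name / correlation, not by job execution id, so a stale reply left behind by job-1 can satisfy job-2's aggregation. The `Reply` record and ids below are made up purely for illustration:

```java
import java.util.ArrayDeque;
import java.util.Queue;

public class CrossedReplies {

    // A worker reply: it carries the job execution id, but the manager
    // side in this sketch never looks at it.
    record Reply(long jobExecutionId, String stepName, String status) {}

    public static void main(String[] args) {
        Queue<Reply> sharedReplyQueue = new ArrayDeque<>();

        // Job-1's manager died; the worker's COMPLETED reply is still queued.
        sharedReplyQueue.add(new Reply(1L, "worker", "COMPLETED"));

        // Job-2 starts and waits for a reply for step "worker". It matches
        // only on the step name, like a shared reply channel with no
        // job-execution-id check.
        long currentJobExecutionId = 2L;
        Reply reply = sharedReplyQueue.poll();
        if (reply != null && reply.stepName().equals("worker")) {
            // Job-2 consumes job-1's stale reply and gets marked COMPLETED.
            System.out.println("job-" + currentJobExecutionId
                    + " completed using reply from job-" + reply.jobExecutionId());
        }
    }
}
```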
Below is sample code that shows how the manager and worker steps are configured:
@Bean
public Step importDataStep(RemotePartitioningManagerStepBuilderFactory managerStepBuilderFactory) {
    return managerStepBuilderFactory.get("importDataStep")
            .partitioner("worker", partitioner())
            .gridSize(2)
            .outputChannel(outgoingRequestsToWorkers)
            .inputChannel(incomingRepliesFromWorkers)
            .listener(stepExecutionListener)
            .build();
}
@Bean
public Step worker(RemotePartitioningWorkerStepBuilderFactory workerStepBuilderFactory) {
    return workerStepBuilderFactory.get("worker")
            .listener(stepExecutionListener)
            .inputChannel(incomingRequestsFromManager())
            .outputChannel(outgoingRepliesToManager())
            .<String, String>chunk(10)
            .reader(itemReader())
            .processor(itemProcessor())
            .writer(itemWriter())
            .build();
}
Alternatively, I can think of using polling instead of replies, where messages cannot get crossed. But a polling-based manager step cannot be restarted if the manager node crashed while worker nodes were still processing. If I follow the same steps above using polling:
Actual result: job-1 remains STARTED, and job-2 is marked FAILED as expected
This issue does not occur with polling because each poller uses the exact jobExecutionId to poll and update the corresponding manager step/job.
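For reference, this is roughly how I configured the polling variant: no inputChannel is set, so (as I understand the builder) the manager aggregates results by polling the job repository for the worker StepExecutions of its own job execution instead of waiting for replies on the shared queue. Bean names and the interval value here are my own choices:

```java
// Sketch of the polling-based manager step (assumed configuration).
@Bean
public Step importDataStepPolling(RemotePartitioningManagerStepBuilderFactory managerStepBuilderFactory,
                                  JobExplorer jobExplorer) {
    return managerStepBuilderFactory.get("importDataStepPolling")
            .partitioner("worker", partitioner())
            .gridSize(2)
            .outputChannel(outgoingRequestsToWorkers)
            // no .inputChannel(...): worker StepExecutions are polled
            // from the job repository instead of aggregated from replies
            .jobExplorer(jobExplorer)
            .pollInterval(5000) // ms between repository polls (assumed value)
            .build();
}
```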
What am I doing wrong? Is there a better way to handle this scenario?