We run a sequence of around 80 batch jobs, more than half of them partitioned, with up to 50 partitions. As far as I can tell, the only non-standard thing we do is disable auto-startup; the gateway's start and stop are managed by a step listener. This works most of the time, but we are seeing an occasional failure. I increased the logging and can see all messages being sent out with the correlationId. The stack trace appears after the remote partitions end (in this case, about 3 minutes later):
2016-01-19 22:19:01,517 DEBUG [org.springframework.integration.jms.JmsOutboundGateway] (springbatch.partitioned.jms.taskExecutor-38) policy.estimatepayroll.outbound-gateway Sending message with correlationId d1025dfd-3551-4df8-96a7-043364c52e3d_18
2016-01-19 22:21:55,240 WARN [org.springframework.integration.jms.JmsOutboundGateway] (org.springframework.integration.jms.JmsOutboundGateway#0.replyListener-1) Failed to consume reply with correlationId d1025dfd-3551-4df8-96a7-043364c52e3d_18
java.lang.RuntimeException: No sender waiting for reply
at org.springframework.integration.jms.JmsOutboundGateway.onMessage(JmsOutboundGateway.java:945)
at org.springframework.jms.listener.AbstractMessageListenerContainer.doInvokeListener(AbstractMessageListenerContainer.java:562)
at org.springframework.jms.listener.AbstractMessageListenerContainer.invokeListener(AbstractMessageListenerContainer.java:500)
at org.springframework.jms.listener.AbstractMessageListenerContainer.doExecuteListener(AbstractMessageListenerContainer.java:468)
at org.springframework.jms.listener.AbstractPollingMessageListenerContainer.doReceiveAndExecute(AbstractPollingMessageListenerContainer.java:326)
at org.springframework.jms.listener.AbstractPollingMessageListenerContainer.receiveAndExecute(AbstractPollingMessageListenerContainer.java:264)
at org.springframework.jms.listener.DefaultMessageListenerContainer$AsyncMessageListenerInvoker.invokeListener(DefaultMessageListenerContainer.java:1069)
at org.springframework.jms.listener.DefaultMessageListenerContainer$AsyncMessageListenerInvoker.executeOngoingLoop(DefaultMessageListenerContainer.java:1061)
at org.springframework.jms.listener.DefaultMessageListenerContainer$AsyncMessageListenerInvoker.run(DefaultMessageListenerContainer.java:958)
at java.lang.Thread.run(Unknown Source)
The problem is that when this happens, the onMessage() method throws a RuntimeException, which kills the thread. Subsequent jobs have fewer threads to use, and as a result some partitions execute serially instead of in parallel.
I have reviewed the code and can't see how this problem can occur. Could it be because the replies map is a HashMap and not thread-safe?
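For reference, here is a minimal sketch of the kind of correlation pattern involved. It is not the actual JmsOutboundGateway source: the class ReplyCorrelationSketch and its methods are hypothetical, and it deliberately uses a ConcurrentHashMap so that map thread safety is ruled out. Under those assumptions, it shows that the same "No sender waiting for reply" exception can also be raised when a reply arrives after the sender has stopped waiting (for example after a receive timeout). That is only one possible explanation and has not been confirmed for the failure above.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;

public class ReplyCorrelationSketch {

    // Hypothetical stand-in for the gateway's replies map: one queue per waiting sender.
    private final Map<String, LinkedBlockingQueue<String>> replies = new ConcurrentHashMap<>();

    // Sender side: register under the correlationId, wait for a reply, then deregister.
    public String sendAndReceive(String correlationId, long timeoutMillis) {
        LinkedBlockingQueue<String> queue = new LinkedBlockingQueue<>(1);
        replies.put(correlationId, queue);
        try {
            // Returns null if no reply arrives within the timeout.
            return queue.poll(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return null;
        } finally {
            // After this point, a late reply has nobody to be delivered to.
            replies.remove(correlationId);
        }
    }

    // Listener side: hand the reply to the waiting sender, or fail if there is none.
    public void onMessage(String correlationId, String payload) {
        LinkedBlockingQueue<String> queue = replies.get(correlationId);
        if (queue == null) {
            throw new RuntimeException("No sender waiting for reply");
        }
        queue.offer(payload);
    }

    public static void main(String[] args) {
        ReplyCorrelationSketch gateway = new ReplyCorrelationSketch();

        // No reply arrives within 50 ms, so the sender gives up and deregisters.
        String reply = gateway.sendAndReceive("example-id_18", 50);
        System.out.println("sender got: " + reply);

        // The reply shows up afterwards and finds no registered sender.
        try {
            gateway.onMessage("example-id_18", "late reply");
        } catch (RuntimeException e) {
            System.out.println("listener failed: " + e.getMessage());
        }
    }
}
```

In this sketch the exception is thrown even though the map is thread-safe, so comparing the gateway's receive timeout with the roughly 3 minutes the remote partitions took may be worth doing alongside checking the HashMap.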
Thanks for any help / suggestions.