We run a sequence of around 80 batch jobs, more than half of them partitioned, with up to 50 partitions. As far as I can tell, the only non-standard thing we do is disable auto-startup; the gateway's start and stop are managed by a step listener. This works most of the time, but we are seeing an occasional failure. I increased the logging and can see every message being sent out with its correlationId. The stack trace appears after the remote partitions end (in this case, after about 3 minutes):

2016-01-19 22:19:01,517 DEBUG [org.springframework.integration.jms.JmsOutboundGateway] (springbatch.partitioned.jms.taskExecutor-38) policy.estimatepayroll.outbound-gateway Sending message with correlationId d1025dfd-3551-4df8-96a7-043364c52e3d_18


2016-01-19 22:21:55,240 WARN  [org.springframework.integration.jms.JmsOutboundGateway] (org.springframework.integration.jms.JmsOutboundGateway#0.replyListener-1) Failed to consume reply with correlationId d1025dfd-3551-4df8-96a7-043364c52e3d_18
    java.lang.RuntimeException: No sender waiting for reply
        at org.springframework.integration.jms.JmsOutboundGateway.onMessage(JmsOutboundGateway.java:945)
        at org.springframework.jms.listener.AbstractMessageListenerContainer.doInvokeListener(AbstractMessageListenerContainer.java:562)
        at org.springframework.jms.listener.AbstractMessageListenerContainer.invokeListener(AbstractMessageListenerContainer.java:500)
        at org.springframework.jms.listener.AbstractMessageListenerContainer.doExecuteListener(AbstractMessageListenerContainer.java:468)
        at org.springframework.jms.listener.AbstractPollingMessageListenerContainer.doReceiveAndExecute(AbstractPollingMessageListenerContainer.java:326)
        at org.springframework.jms.listener.AbstractPollingMessageListenerContainer.receiveAndExecute(AbstractPollingMessageListenerContainer.java:264)
        at org.springframework.jms.listener.DefaultMessageListenerContainer$AsyncMessageListenerInvoker.invokeListener(DefaultMessageListenerContainer.java:1069)
        at org.springframework.jms.listener.DefaultMessageListenerContainer$AsyncMessageListenerInvoker.executeOngoingLoop(DefaultMessageListenerContainer.java:1061)
        at org.springframework.jms.listener.DefaultMessageListenerContainer$AsyncMessageListenerInvoker.run(DefaultMessageListenerContainer.java:958)
        at java.lang.Thread.run(Unknown Source)

The problem is that when this happens, the onMessage() method throws a RuntimeException, which kills the thread. Subsequent jobs have fewer threads to use, and as a result some partitions execute in series instead of in parallel.

I have reviewed the code and can't find how this problem can occur. Could it be because replies is a HashMap and not thread-safe?

Thanks for any help / suggestions.

Mike Rother

1 Answer


The most likely cause is that the receive-timeout is too low, so the sending thread times out and is no longer waiting for the reply.

The default timeout is quite low (5 seconds).
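
To make the timeout scenario concrete, here is a minimal, self-contained sketch of the general pattern: a sender registers its correlationId, waits for a bounded time, and deregisters when it gives up. This is not the actual JmsOutboundGateway code; the class `ReplyCorrelationSketch`, its methods, and the exception type are hypothetical names chosen for illustration. It only shows how a reply that arrives after the sender has timed out ends up with nobody to deliver it to.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;

// Hypothetical illustration only; not the Spring Integration implementation.
class ReplyCorrelationSketch {

    // correlationId -> queue that the waiting sender blocks on.
    // A concurrent map is used because sender and listener threads both touch it.
    private final Map<String, LinkedBlockingQueue<String>> replies = new ConcurrentHashMap<>();

    // Registers the correlationId, waits up to timeoutMillis for a reply,
    // then deregisters. Returns null if no reply arrived in time.
    String sendAndReceive(String correlationId, long timeoutMillis) {
        LinkedBlockingQueue<String> queue = new LinkedBlockingQueue<>(1);
        replies.put(correlationId, queue);
        try {
            return queue.poll(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the interrupt status
            return null;
        } finally {
            replies.remove(correlationId); // the sender is no longer waiting
        }
    }

    // Called from the reply listener thread when a reply message arrives.
    void onReply(String correlationId, String payload) {
        LinkedBlockingQueue<String> queue = replies.get(correlationId);
        if (queue == null) {
            // The sender already timed out (or never registered).
            throw new IllegalStateException("No sender waiting for reply " + correlationId);
        }
        queue.offer(payload);
    }

    public static void main(String[] args) {
        ReplyCorrelationSketch gateway = new ReplyCorrelationSketch();
        // Nothing replies within 50 ms, so the sender gives up.
        String reply = gateway.sendAndReceive("demo_18", 50);
        System.out.println("reply = " + reply);
        try {
            gateway.onReply("demo_18", "late result"); // the reply shows up too late
        } catch (IllegalStateException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

In this sketch, a timeout shorter than the time the remote partition needs produces the late-reply failure; whether the listener thread survives that failure depends on how the caller handles the exception.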

EDIT

Sorry, forgot that this was recently fixed (in 4.2 and 4.1).

We've back-ported but not yet released 4.0.x or 3.0.x with the fix.

JIRA Here.

Gary Russell
  • It should be set to this: `policy.estimatepayroll.partitioned.timeout=500000000`, in both the outbound gateway receive-timeout and the partition handler. – Mike Rother Jan 20 '16 at 21:53
  • Sorry - didn't read your last paragraph properly. We recently fixed that. – Gary Russell Jan 20 '16 at 21:58
  • We are on Spring Batch 2.2.7 and Spring Integration 2.2.6; I assume you fixed it in a newer version. Can I easily migrate my code base? – Mike Rother Jan 20 '16 at 22:00
  • Yikes - we didn't backport all the way back to 2.2.x. I edited the answer with the versions. Batch 2.2.x will probably work with 3.0.x but we haven't released 3.0.9 with the fix yet. (We can do so and/or backport to 2.2.x as well.) – Gary Russell Jan 20 '16 at 22:03
  • Thanks for your support and extremely timely responses – Mike Rother Jan 20 '16 at 22:07
  • I backported the fix; [PR here](https://github.com/spring-projects/spring-integration/pull/1695). When it's merged we can build 2.2.7. – Gary Russell Jan 20 '16 at 22:26
  • The fix is on master but we haven't built that branch for a couple of years now and it seems some incompatibilities with the build system have crept in. Watch this space. – Gary Russell Jan 20 '16 at 23:14
  • Sorry to be a pest, but is there any timeframe for this release? Is there a way I can get the source with a build file and build it locally? – Mike Rother Jan 21 '16 at 21:21
  • We just pushed 2.2.7 to maven central; it doesn't show up in search.maven.org yet (it usually takes a few hours), but [according to bintray](https://bintray.com/spring/jars/org.springframework.integration/2.2.7.RELEASE/view#central) it was synced ok with maven central. If you can't get it from maven central you can get it from `http://repo.spring.io/release` – Gary Russell Jan 21 '16 at 21:41