We have a JEE app that uses about 40 partitioned jobs on a cluster. It can be deployed on both JBoss and WebSphere. We are experiencing 2 problems:

  • messaging system failures in both JBoss and WebSphere typically related to temporary queue connection problems

  • partitioned jobs effectively hung because of lost messages.

I read a post saying that switching to a named reply-destination on the outbound-gateway can improve robustness and allow reconnection after failures. The inbound-gateway starts 2 listeners on requestsQueue.

<int-jms:inbound-gateway id="springbatch.inbound.gateway" 
                connection-factory="springbatch.jmsConnectionFactory" 
                request-channel="springbatch.slave.jms.request" 
                request-destination="requestsQueue" 
                reply-channel="springbatch.slave.jms.response" 
                concurrent-consumers="2" 
                max-concurrent-consumers="2"/> 

Each job has its own outbound-gateway.

<int-jms:outbound-gateway 
    connection-factory="springbatch.jmsConnectionFactory" 
    request-channel="jms.channel.1" 
    request-destination="requestsQueue" 
    reply-channel="jms.channel.2" 
    reply-destination="repliesQueue"
    correlation-key="JMSCorrelationID" >
    <int-jms:reply-listener />        
</int-jms:outbound-gateway>

It runs fine on a single server, but on a cluster the partitions are distributed across the nodes and the master step never receives the acknowledgement. I thought that using JMSCorrelationID as the correlation-key would take care of matching up the JMS messages.

Am I missing a configuration piece?

Gary Russell
Mike Rother
  • I have implemented the approach discussed by Gary below, using a StepListener to start the outbound-gateway at the start of the partitioned step. On the first run, both the local and the remote parts of the partitioned step complete. On the second run of the same job, the remote step completes but the local (partition) step does not. I looked in the JBoss reply queue and the message is there. I checked the method isRunning() and it returns true. – Mike Rother Dec 30 '14 at 19:23
  • I ran some more tests and noticed that after each JBoss server restart, the first attempt at any job succeeds. Any subsequent attempt on any job leaves the messages in the queue, as if no one is listening. Is calling start on the gateway only starting the listener for the channel and not for the queue? – Mike Rother Dec 30 '14 at 22:05
  • Here are more observations. At JBoss startup, I see 0 consumers on the replies queue (from JBoss JMX). When I run the batch job the first time, it succeeds but leaves 1 consumer on the replies queue. When I run the batch job a second time, the number of consumers stays at 1. In the debugger, it is waiting on the line reply = replyQueue.poll(this.receiveTimeout, TimeUnit.MILLISECONDS); in the obtainReplyFromContainer() method of JmsOutboundGateway. – Mike Rother Dec 30 '14 at 23:58
  • Is there a possible compatibility problem with Spring Batch 2.1.8, Spring Integration 2.2.0, and Spring JMS and Framework at 3.2.0? – Mike Rother Dec 31 '14 at 00:00

1 Answer

What you have should work; in that mode, the correlation id is set to gatewayId+n (where gatewayId is a UUID and n increments). The reply container's message selector is set to `JMSCorrelationID LIKE gatewayId%`, so step execution results should be correctly routed back to the master. I suggest you turn on DEBUG logging and follow the messages on both sides to see what's happening.
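
The matching rule described above can be illustrated with a small stand-alone sketch. It uses no JMS or Spring Integration classes; the `Gateway` class, its method names, and the use of `startsWith` to mimic the `LIKE gatewayId%` selector are assumptions made purely for illustration, not the framework's actual implementation.

```java
import java.util.UUID;

// Stand-alone illustration only; no JMS or Spring Integration classes are used.
class CorrelationSketch {

    // Hypothetical stand-in for one outbound gateway that has a reply listener.
    static final class Gateway {
        // Each gateway instance gets its own UUID, so ids differ per cluster node.
        private final String gatewayId = UUID.randomUUID().toString();
        private long counter;

        // Correlation id for the next request: gatewayId + n, with n incrementing.
        String nextCorrelationId() {
            counter++;
            return gatewayId + counter;
        }

        // Mimics the selector "JMSCorrelationID LIKE gatewayId%" as a prefix match.
        boolean accepts(String jmsCorrelationId) {
            return jmsCorrelationId != null && jmsCorrelationId.startsWith(gatewayId);
        }
    }

    public static void main(String[] args) {
        Gateway nodeA = new Gateway();
        Gateway nodeB = new Gateway();

        // A reply carries the correlation id of the request sent by node A.
        String replyCorrelationId = nodeA.nextCorrelationId();

        System.out.println("node A accepts the reply: " + nodeA.accepts(replyCorrelationId));
        System.out.println("node B accepts the reply: " + nodeB.accepts(replyCorrelationId));
    }
}
```

Under these assumptions, only the gateway that sent a request selects its reply, even when several nodes listen on the same replies queue. That is why the DEBUG logs are worth checking for the correlation id on both the request and the reply.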

EDIT:

Re: Sharing JMS Endpoints (comment below).

It can be done, but would need a little restructuring.

On the producer (master) side, the gateway and a stand-alone aggregator would have to move to a parent context (with each job context being a child of it). Since the partition handler has to be in the child context, you would need a separate aggregator class; that said, the aggregation is orthogonal to the partitioning, it's just in that bean for convenience. A common aggregator is fine because it uses the partition handler's correlation id for the job execution and the reassembled step execution results will be routed to the right partition handler.

The consumer (slave) side is a bit more tricky because, if the inbound gateway is in a single (parent) context, it won't have visibility into the stepExecutionRequestHandlers' channels in the child contexts; you would need to build a router to route the requests to the appropriate job contexts. Not impossible, just a bit more work.

The dynamic-ftp Spring Integration sample and its README is a good starting point.

Gary Russell
  • I was able to get it working on a single server but am still confirming on the cluster. – Mike Rother Dec 22 '14 at 17:34
  • On a related point: since we have 40+ jobs, can we reuse the outbound gateway or listeners to minimize resource usage? (Or not, because everything has to go to a single aggregator?) I am seeing 96 listeners on the cluster queue. – Mike Rother Dec 22 '14 at 17:36
  • It appears that each JmsOutboundGateway defined with a reply-destination creates its listeners at server startup. So we see 48 x 2 listeners (48 jobs with 2 listeners each) created on startup. I expected the listeners to be created only when the outbound gateways send a message. Are they created at startup for efficiency? – Mike Rother Dec 22 '14 at 22:12
  • By default `auto-startup="true"`. You can override it to false, but you'll need to `start()` them, perhaps from an earlier job step (tasklet), and `stop()` them when the partitioned step completes. You can either get a reference to the gateway (`EventDrivenConsumer`) via normal Spring bean wiring (`@Autowired` etc.), or use a `<control-bus/>` and send `@gatewayId.start()` to it. – Gary Russell Dec 22 '14 at 22:22
  • I added a [JIRA Issue](https://jira.spring.io/browse/INT-3587) to add a lazy start option. – Gary Russell Dec 22 '14 at 22:59
  • If I understand your proposed approach, I am missing a couple of pieces of information. It sounds like I could add a step listener for each partitioned job that would start and stop the outbound gateway in the before step and after step methods. I am missing how I can access the gateway from the step listener. Or does it not have enough execution in the StepListener? In that case I was thinking if necessary I could extend the partition step class and add start and stop methods. – Mike Rother Dec 23 '14 at 19:38
  • Yes, a `StepExecutionListener` should work just fine. You should be able to inject the adapter (endpoint) into your `StepExecutionListener` bean. Use the normal Spring bean injection method of your choice. You can just inject a `Lifecycle` bean (which the endpoint implements) by the JMS gateway's `id`. `Lifecycle` has `start()`, `stop()` and `isRunning()` methods. – Gary Russell Dec 23 '14 at 20:00
  • I made the changes you suggested by adding a StepExecutionListener and have an interesting problem. If I set auto-startup=false I get a failure when I launch the job in the UnicastingDispatcher.doDispatch() method with an error that Dispatcher has no subscribers. However, if I set auto-startup=true, job launching succeeds and the listener is called. At the end of the step I call stop on the outbound gateway and isRunning() confirms it is not running. Then I can run the job and it works fine. – Mike Rother Dec 26 '14 at 13:51
  • Yes, `start()` subscribes to the channel. You need to be sure to invoke `start()` before the partitioner starts sending step execution requests. So doing it in `beforeStep()` should work as long as it's invoked on the same thread; I suggest you turn on DEBUG logging to see how/why messages are sent from the partitioner before the adapter is started. – Gary Russell Dec 26 '14 at 14:15
  • If you can't figure it out from the DEBUG logs, post them someplace (e.g. gist.github.com). Be sure to set `org.springframework` log level to `DEBUG` and make sure the thread name (`%t` in log4j) is included. Also show your listener code. – Gary Russell Dec 30 '14 at 16:54
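
The start/stop ordering discussed in these comments can be sketched as follows. To stay runnable without Spring on the classpath, the sketch declares a local `Lifecycle` interface with the same three methods as `org.springframework.context.Lifecycle`. `FakeGateway` and `GatewayStepListener` are hypothetical names; in a real application the listener would implement Spring Batch's `StepExecutionListener` and receive the gateway endpoint by injection.

```java
import java.util.ArrayList;
import java.util.List;

// Stand-alone illustration of the ordering only; no Spring classes are used.
class GatewayLifecycleSketch {

    // Local stand-in mirroring the methods of org.springframework.context.Lifecycle.
    interface Lifecycle {
        void start();
        void stop();
        boolean isRunning();
    }

    // Hypothetical fake endpoint behaving like a gateway with auto-startup="false":
    // it only accepts requests while it is started (subscribed to its channel).
    static final class FakeGateway implements Lifecycle {
        private boolean running;
        final List<String> sent = new ArrayList<>();

        @Override public void start() { running = true; }
        @Override public void stop() { running = false; }
        @Override public boolean isRunning() { return running; }

        // Mimics the "Dispatcher has no subscribers" failure when not started.
        void send(String request) {
            if (!running) {
                throw new IllegalStateException("Dispatcher has no subscribers");
            }
            sent.add(request);
        }
    }

    // Mirrors the beforeStep/afterStep callbacks of a StepExecutionListener.
    static final class GatewayStepListener {
        private final Lifecycle gateway;

        GatewayStepListener(Lifecycle gateway) {
            this.gateway = gateway;
        }

        // Must run before the partitioner sends any step execution request.
        void beforeStep() {
            if (!gateway.isRunning()) {
                gateway.start();
            }
        }

        // Releases the gateway once the partitioned step has completed.
        void afterStep() {
            if (gateway.isRunning()) {
                gateway.stop();
            }
        }
    }

    public static void main(String[] args) {
        FakeGateway gateway = new FakeGateway();
        GatewayStepListener listener = new GatewayStepListener(gateway);

        listener.beforeStep();
        gateway.send("partition0");
        listener.afterStep();

        System.out.println("requests sent: " + gateway.sent);
        System.out.println("running after step: " + gateway.isRunning());
    }
}
```

The sketch only shows why `start()` has to happen before the first request is sent. It does not model the JMS reply container, so it says nothing about the second-run behavior reported in the question's comments.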