Our Spring Web Application uses Spring Batch with Quartz to carry out complex jobs. Most of these jobs run in the scope of a transaction because if one part of the complex system fails we want any previous database works to be rolled back. We would then investigate the problem, deploy a fix, and restart the servers.
It's getting to be an issue because some of these jobs do a HUGE amount of processing and can take a long time to run. As execution time starts to surpass the 1 hour mark, we find ourselves unable to deploy fixes to production for other problems because we don't want to interrupt a vital job.
I have been reading up on the Reactor implementation as a solution to our problems. We can do a small bit of processing, publish an event, and have other systems do the appropriate action as needed. Sweet!
The only question I have is, what is the best way to handle failure? If I publish an event and a Consumer fails to conduct some critical functionality, will it restart at a later time?
What if an event is published, and before all the appropriate consumers that listen for it can handle it appropriately, the server shuts down for a deployment?