I have an architectural question about how to handle big tasks both transactional and scalable in Java/Java EE.
The general challenge
I have a web application (Tomcat right now, but that should not limit the solution space, so just take this to illustrate what I'd like to achieve). This web application is distributed over several (virtual and physical) nodes, connected to a central DBMS (MySQL in this case, but again, this should not limit the solution...) and able to handle some 1000s of users, serving pages, doing stuff, just as you'd expect from your average web-based information system.
Now, there are some tasks which affect a larger portion of data and the system should be optimized to carry out these tasks reasonably fast. (Faster than processing everything sequentially, that is). So I'd make the task parallel and distribute it over several (or all) nodes:
(Note: the data portions which are processed are independent, so there are no database or locking conflicts here).
The problem is, I'd like the (whole) task to be transactional. So if one of the parallel subtasks fails, I'd like to have all other tasks rolled back as a result. Otherwise the system would be in a potentially inconsistent state from a domain perspective.
Current implementation
As I said, the current implementation uses Tomcat and MySQL. The nodes use JMS to communicate (so there is a JMS server to which a dispatcher sends a message for each subtask; and executors take tasks from the message queue, execute them, and post the results to a result queue from which the dispatcher collects the results. The dispatcher blocks and waits for all results to come in and if anything is fine, it terminates with an OK status.
The problem here is that all the executors have their own local transaction context, so the picture would look like this:
If for some reason one of the subtasks fails, the local transaction is rolled back and the dispatcher gets an error result. (There is some failsafe mechanism here, which tries to repeat the failed transaction, but let's assume for some reason, the one task cannot be completed). The problem is that the system now is in a state where all transactions but one is already committed and completed. And because I cannot get the one final transaction to finish successfully, I cannot get out of this state.
Possible solutions
These are the thoughts which I have followed so far:
I could somehow implement a domain-specific rollback mechanism myself. Because the distributor knows which tasks have been carried out, it could revert the effects explicitly (e.g. storing old values somewhere and revert already committed values back to the previous values). Of course, in this case, I must guarantee that no other process changes something in between, so I'd also have to set the system to a read-only state, as long as the big operation is running. More or less, I'd need to simulate a transaction in business logic ...
I could choose not to parallelize and do everything on a single node in one big transaction (but as stated at the beginning, I need to speed up processing, so this is not an option...)
I have tried to find out about XATransactions or distributed transactions in general, but this seems to be an advanced Java EE feature, which is not implemented in all Java EE servers, and which would not really solve that basic problem, because there does not seem to be a way to transfer a transaction context over to a remote node in an asynchronous call. (e.g. section 4.5.3 of EJB Specification 3.1: "Client transaction context does not propagate with an asynchronous method invocation. From the Bean Developer’s view, there is never a transaction context flowing in from the client.")
The Question
Am I overlooking something? Is it not possible to distribute a task asynchronously over several nodes and at the same time have a (shared) transactional state which can be rolled back as a whole?
Thanks for any pointers, hints, propositions ...