In our current project we need to run some quite complicated calculations on data exported from our system. The calculations are handled by third-party software (which is basically a black box for us). We have this software as Linux and Windows binaries, and we know how to run it against our data from the command line.
Processing a single dataset on one CPU core takes around 200 hours. However, we can split the dataset into smaller, structurally equivalent datasets and run the calculations in parallel. Afterwards, we can easily aggregate the results. Our goal is to process each dataset in under 10 hours.
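To make the sizing concrete, here is a rough sketch in Java. It assumes the work scales linearly with the number of parts and ignores splitting, queueing and aggregation overhead; the class and method names are made up for illustration only.

```java
// Back-of-the-envelope sizing. Assumption: run time scales linearly with
// dataset size, and overhead for splitting/aggregating is ignored.
final class SplitSizing {

    /** Smallest number of equal parts so that each part fits into the target time. */
    static int minParts(double totalCoreHours, double targetHours) {
        if (totalCoreHours <= 0 || targetHours <= 0) {
            throw new IllegalArgumentException("hours must be positive");
        }
        return (int) Math.ceil(totalCoreHours / targetHours);
    }

    /** Ideal wall-clock hours per part when the work is spread evenly. */
    static double hoursPerPart(double totalCoreHours, int parts) {
        if (parts < 1) {
            throw new IllegalArgumentException("parts must be at least 1");
        }
        return totalCoreHours / parts;
    }
}
```

Under these assumptions, `SplitSizing.minParts(200, 10)` gives 20, so we would need at least 20 parts running at the same time to meet the 10-hour goal.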
Our customer has a proprietary job processing application. Its interface is file system-based: we copy the job's EXE file (yep, it's Windows-backed) and the configuration INI file to the incoming folder, the job processing app executes the job on one of its nodes (handling errors, failover etc.) and finally copies the results to the outgoing folder. This job processing system has several hundred CPU cores, so there's clearly enough power to handle our dataset in under 10 hours. Even in under 30 minutes.
Now, the thing is, our application has so far been J2EE-based, a more or less standard JBoss app. And we need to:
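For illustration, the folder-based hand-off could look roughly like this from our side. This is only a sketch: the class name, the method names and the result file name are hypothetical, and I don't yet know how the job system detects that a job is complete (e.g. whether the copy order matters).

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

// Hypothetical client for the folder-based interface described above.
final class JobFolderClient {
    private final Path incoming;
    private final Path outgoing;

    JobFolderClient(Path incoming, Path outgoing) {
        this.incoming = incoming;
        this.outgoing = outgoing;
    }

    /** Copies the job's executable and its INI file into the incoming folder. */
    void submit(Path exe, Path ini) throws IOException {
        // Assumption: plain copies are enough; the real system may need
        // a specific order or an atomic rename to avoid half-written jobs.
        Files.copy(exe, incoming.resolve(exe.getFileName()), StandardCopyOption.REPLACE_EXISTING);
        Files.copy(ini, incoming.resolve(ini.getFileName()), StandardCopyOption.REPLACE_EXISTING);
    }

    /** Returns true once a file with the given (hypothetical) name is in the outgoing folder. */
    boolean hasResult(String resultFileName) {
        return Files.exists(outgoing.resolve(resultFileName));
    }
}
```

In practice we would poll `hasResult` (or watch the folder) until the job system has delivered the output.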
- integrate with a proprietary queue-like job processing system and
- split/aggregate our datasets in a reliable fashion.
To me, many parts of what we have to do look very similar to Enterprise Integration Patterns such as Splitter and Aggregator. So I was wondering whether Apache Camel would be a good fit for the implementation:
- We'll construct our jobs (EXE + INI + dataset) in the form of messages.
- A splitter would divide large job messages into smaller ones by dividing the dataset into several smaller datasets.
- We'll probably need to implement our own messaging channels to write messages into the incoming directory and read messages from the outgoing directory of the proprietary job processing system.
- We'll need an aggregator to combine the results of the job parts into a single result for the job.
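To show what I mean by splitting and aggregating, here is a minimal sketch in plain Java. It is not Camel code and does not use any Camel API; all names are made up, the dataset is assumed to be a list of records, and the aggregation is just a placeholder that collects part results in order.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Optional;
import java.util.TreeMap;

final class SplitAggregateSketch {

    /** One part of a job: correlation id, position, total part count, and its records. */
    static final class JobPart {
        final String jobId;
        final int index;
        final int total;
        final List<String> records;

        JobPart(String jobId, int index, int total, List<String> records) {
            this.jobId = jobId;
            this.index = index;
            this.total = total;
            this.records = records;
        }
    }

    /** Splitter: divides the records into the given number of nearly equal parts. */
    static List<JobPart> split(String jobId, List<String> records, int parts) {
        if (parts < 1) {
            throw new IllegalArgumentException("parts must be at least 1");
        }
        List<JobPart> result = new ArrayList<>();
        int n = records.size();
        for (int i = 0; i < parts; i++) {
            int from = (int) ((long) n * i / parts);
            int to = (int) ((long) n * (i + 1) / parts);
            result.add(new JobPart(jobId, i, parts, new ArrayList<>(records.subList(from, to))));
        }
        return result;
    }

    /** Aggregator: collects part results per job id. Not thread-safe. */
    static final class Aggregator {
        private final Map<String, Map<Integer, String>> pending = new HashMap<>();

        /** Returns the part results in index order once all parts of the job have arrived. */
        Optional<List<String>> add(String jobId, int index, int total, String partResult) {
            Map<Integer, String> parts = pending.computeIfAbsent(jobId, k -> new TreeMap<>());
            parts.put(index, partResult); // a redelivered part simply overwrites itself
            if (parts.size() < total) {
                return Optional.empty();
            }
            pending.remove(jobId);
            return Optional.of(new ArrayList<>(parts.values()));
        }
    }
}
```

The job id plays the role of a correlation identifier, and the aggregator only releases a result once every part has come back, regardless of the order in which the parts finish.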
However, I have no experience with Apache Camel yet, so I've decided to ask for advice on its applicability.
Given the problem described above, do you think Apache Camel would be a good match for the task?
Closing note: I'm not looking for external resources or a tool/library suggestion. I just want confirmation (or the opposite) of whether I'm on the right track with Apache Camel.