
In the current project we need to run some quite complicated calculations on data exported from our system. The calculations are handled by third-party software (which is basically a black box for us). We have this software as Linux and Windows binaries, and we know how to execute it with our data on the command line.

Processing a single dataset on one CPU core takes around 200 hours. However, we can split the dataset into smaller, structurally equivalent datasets and run the calculations in parallel; afterwards we can easily aggregate the results. Our goal is to process each dataset in under 10 hours, which means splitting into more than 20 parts, assuming near-linear scaling.

Our customer has a proprietary job processing application. Its interface is file-system-based: we copy the job's EXE file (yep, it's Windows-based) and the configuration INI file to the incoming folder; the job processing app executes the job on one of its nodes (handling errors, failover etc.) and finally copies the results to the outgoing folder. This proprietary job processing system has several hundred CPU cores, so there is clearly enough power to process our dataset in under 10 hours. Even under 30 minutes.

Now, the thing is, our application is so far a more-or-less standard J2EE-based JBoss app. And we need to:

  • integrate with a proprietary queue-like job processing system and
  • split/aggregate our datasets in a reliable fashion.

To me, many parts of what we have to do look very similar to Enterprise Integration Patterns like Splitter and Aggregator. So I was wondering whether Apache Camel would be a good fit for the implementation (a rough pseudo-route follows the list):

  • We'll construct our jobs (EXE + INI + dataset) in the form of messages.
  • A splitter would divide large job messages into smaller ones by dividing the dataset into several smaller datasets.
  • We'll probably need to implement our own messaging channels to write messages to the incoming directory and read messages from the outgoing directory of the proprietary job processing system.
  • We'll need an aggregator to combine the results of the job parts into a single result for the whole job.
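
Roughly, I imagine the routes looking something like this (untested pseudo-code based only on the Camel/EIP documentation, since I haven't used the framework yet; all endpoint, bean and header names below are made up):

from("direct:jobSubmission")
    // split one large job message into N smaller ones (hypothetical bean)
    .split().method(DatasetSplitter.class, "splitDataset")
    // build the EXE + INI + partial-dataset files for each part (hypothetical bean)
    .bean(JobFileBuilder.class, "buildJobFiles")
    // hand the job part over to the proprietary job processing system
    .to("file://jobprocessor/incoming");

from("file://jobprocessor/outgoing")
    // collect all parts belonging to the same job (hypothetical correlation headers)
    .aggregate(header("jobId"), new ResultAggregationStrategy())
        .completionSize(header("numberOfParts"))
    .to("direct:jobResult");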

However, I have no experience with Apache Camel yet, so I've decided to ask for advice on its applicability.

Given the problem described above, do you think Apache Camel would be a good match for the task?

Closing note: I'm not looking for external resources or a tool/library suggestion. Just a confirmation (or the opposite) that I'm on the right track with Apache Camel.

lexicore

3 Answers


I think Apache Camel is suitable for your needs, since it is one of the best integration frameworks I have found so far.

My current project involves dealing with ECM: we have to process a huge quantity of documents, which may reach 1 million per day.

As input we have XML files, each representing a group (or lot) of documents, along with links to the real files stored on a NAS.

First of all, we had to transform all these XML files into a proprietary XML format suitable for the proprietary document importer used by our ECM system (our black box), and split them into smaller pieces in order to exploit more than one importing queue.

Then we had to monitor the importer queues and dispatch jobs appropriately to balance the queue load; after that, we had to find out the result of each operation by reading a proprietary-format output XML file generated by the importer.

Between every step of this process there was an ActiveMQ queue (with database persistence) to keep everything asynchronous, and every single phase could be scaled up by increasing the number of concurrent consumers on that specific queue.
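
A minimal sketch of that layout (queue and bean names are invented for illustration; concurrentConsumers is a standard option of the Camel ActiveMQ/JMS component):

from("activemq:queue:lots.incoming?concurrentConsumers=5")
    // transform the source XML into the importer's proprietary format (hypothetical bean)
    .bean(LotTransformer.class, "toImporterFormat")
    // emit one message per document so the next phase can scale independently
    .split().xpath("/lot/document")
    .to("activemq:queue:documents.dispatch");

from("activemq:queue:documents.dispatch?concurrentConsumers=10")
    // a hypothetical bean inspects queue depths and sets a target-queue header
    .bean(QueueBalancer.class, "chooseImporterQueue")
    .recipientList(header("targetImporterQueue"));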

Our microservices are also part of an enormous and lengthy workflow managed by an ESB, so we read input messages from ESB-provided queues and write output messages back to those queues, using small web services to get/set the objects.

We decided to go with Camel because it solved many of our integration problems, gives complete control over every single route, and can easily be monitored with hawtio.

Moreover, most of the configuration is done by writing or modifying XML context files, which gives you flexibility and saves you from writing a lot of code. The community is lively, the framework is updated very often, and you can find plenty of books and tutorials.

So I think your problem has many points of contact and affinities with what my project aimed at; again, I would definitely choose Apache Camel.

With very good results.

abarisone

You have quite a complicated use case there. Let me re-phrase what you would like to do in a simple format and provide my thoughts. If you see I misunderstood something, just leave me a comment and I will revise my post.

A JBoss-based J2EE application has a large dataset that needs to be split into smaller pieces and transformed into a custom format. This format will then be written out to disk and processed by another application, which will create new data results in an output folder on the disk. You then want to pick up this output and aggregate the results.

I would say that Apache Camel can do this, but you will have to take the time to properly tune the system to your needs and set up a few custom configurations on your components. I imagine this process looking something like:

from("my initial data source")
    .split().method(CustomBean.class, "customSplitMethod")
    // You might want some sort of round-robin pattern to
    // distribute between the different directories
    .to("file://customProgramInputDirectory");

from("file://customProgramOutputDirectory")
    // Note: a constant correlation expression groups everything together, so
    // the aggregator also needs a completion condition such as
    // .completionSize(n) or .completionTimeout(ms) to know when it is done.
    .aggregate(constant(true), new MyCustomAggregationStrategy())
    .to("output of your data source");

Since you said you will be integrating with a "proprietary queue-like job processing system", I might have misunderstood the input and output of the other program to be file directories. If it is a queue-based system and it supports JMS, there is a generic component you can use; if not, it is always possible to create a custom Camel component, so your pattern would just change from saying 'file://' to 'MyCustomEndpoint://'.
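
If it does turn out to be plain file directories, a couple of standard camel-file options are worth knowing about for this kind of handoff (directory names here are placeholders):

// Write job files under a temporary name first, so the job processor
// never picks up a half-written file (tempPrefix renames on completion).
from("direct:submitJob")
    .to("file://jobprocessor/incoming?tempPrefix=.inprogress-");

// readLock=changed waits until a result file has stopped growing before
// consuming it; move archives the consumed file afterwards.
from("file://jobprocessor/outgoing?readLock=changed&move=.done")
    .to("direct:collectResult");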

Matthew Fontana
  • Thank you very much for your answer. The proprietary application is, indeed, filesystem-based, no JMS or anything similar. I was also thinking of a similar configuration, but with more intermediate message translators from our business model to the files and configs the job processing app expects. – lexicore Oct 14 '15 at 17:00

The answer is NO: Camel is not the best framework for this, even if it can be stretched to imitate what you describe.

Apache Camel does perform splitting on an incoming unit of work, identified as an Exchange, which can of course be a file (using the camel-file component). BUT when splitting, each "chunk" is then sent to a dedicated Processor.

The problem is that each chunk is an Exchange itself and is meant to be held in memory (to be able to perform tasks in parallel later). In your case, I assume the parts of the data are still too big to be processed in memory. If not, Camel answers your needs and even performs all the polling required to integrate with the system you described.

You asked not to be given suggestions, but if I were you I would give Spring Batch a try instead.

  • Our datasets are actually rather small: the whole dataset is around 80MB, and when split into parts, the parts share around 95% of the data. So we have quite a small memory footprint. We use Spring Batch-like interfaces in other parts of the system, but that was not quite enough for our integration task. I appreciate your answer anyway. – lexicore Oct 14 '15 at 16:58
  • No offense intended; since your data volume is not that high, Camel is indeed feasible using several routes, as M. Fontana suggests. – Cédrick Lunven Oct 14 '15 at 18:36
  • *no offense* - absolutely none taken (not even considered); the downvote is not from me (and I don't think it's deserved). – lexicore Oct 14 '15 at 19:03