
We want to use pan.sh to execute multiple Kettle transformations. After exploring the script I found that it internally calls the spoon.sh script, which launches PDI. The problem is that every time a new transformation starts, it creates a separate JVM for its execution (invoked via a .bat file). I want to group the transformations so they share a single JVM, to overcome the memory constraints that multiple JVMs put on the batch server.

Could somebody guide me on how I can achieve this, or share documentation/resources with me?

Thanks for the good work.

Explorer
  • Is it possible to group the transformations into a single Job and launch that with Kitchen.sh? – Brian.D.Myers Feb 08 '16 at 20:41
  • Thanks Brian, but no; we do batch processing and each transformation has some external dependencies, so they cannot be grouped together. – Explorer Feb 08 '16 at 21:02
  • Hmm, can you tell us more about these dependencies? And how many batches/transformations are we talking about? Might want to update your question with that info. – Brian.D.Myers Feb 08 '16 at 23:25
  • @Novice, do you need to run those transformations simultaneously? If not, Brian's answer is what you need -- just run all your transformations sequentially within one job. – Andrey Khayrutdinov Feb 09 '16 at 05:47
  • @Novice I also agree with Brian and Andrey's answer. You can group all the .ktr files into a single .kjb file, which will run the job using a single JVM. In case there is a dependency, you can run the .ktr files in parallel or use conditional steps to control the flow of your .ktr files (see the sketch after these comments). – Rishu Shrivastava Feb 09 '16 at 12:16
  • Thanks all for your replies. @Brian.D.Myers We run thousands of transformations daily, and the dependencies are mostly things like whether the source file has arrived, whether the table load is done, dependencies on another transformation (which might be running in some other tool), dependencies on some scripts, etc. We use a scheduling tool for scheduling and dependency management. – Explorer Feb 09 '16 at 19:22
  • AFAIK, all you can do is try to push some of the condition tests down to the scheduling tool. That way you only launch a PDI transform when you know it can run. Then watch your server resources for usage spikes. You may need to adjust your schedules accordingly. – Brian.D.Myers Feb 10 '16 at 01:24
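
A minimal sketch of the single-job approach the commenters describe, assuming the transformations can be chained inside one .kjb: the job name, file paths, and log location below are hypothetical, and kitchen.sh option syntax may vary slightly between PDI versions.

# Run one job (a hypothetical nightly_load.kjb that chains the .ktr files),
# so every transformation executes inside the single JVM that kitchen.sh starts.
./kitchen.sh -file=/opt/etl/jobs/nightly_load.kjb -level=Basic -logfile=/var/log/pdi/nightly_load.log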

1 Answer


Use Carte. This is exactly what it is for. You can start up a server (on the local box if you like) and then submit your jobs to it. One JVM, one heap, shared resources.

The benefit of that is scalability: when your box becomes too busy, just add another server, also running Carte, and start sending some of the jobs to it.

There's an old but still relevant blog post here:

http://diethardsteiner.blogspot.co.uk/2011/01/pentaho-data-integration-remote.html

There is also documentation on the Pentaho website.

Starting the server is as simple as:

carte.sh <hostname> <port>
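
For instance, to start a local Carte instance with one suitably sized heap (the port and heap values below are only illustrative, and recent PDI versions read extra JVM options from the PENTAHO_DI_JAVA_OPTIONS variable; check how your version sets Java options):

# One heap, shared by every job and transformation this Carte instance executes (example sizes).
export PENTAHO_DI_JAVA_OPTIONS="-Xms1g -Xmx8g"
sh carte.sh localhost 8081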

There is also a status page, which you can use to query your carte servers, so if you have a cluster of servers, you can pick a quiet one to send your job to.
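
As a rough sketch, assuming the default Carte credentials (cluster/cluster) and the example port above, you could poll that status page with something like:

# Returns an overview of the transformations and jobs currently registered on this Carte server.
curl -u cluster:cluster http://localhost:8081/kettle/status/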

Codek
  • Thanks @Codek for your reply. Is there a way I can use PDI as Carte? If I use a Carte server, then I won't be able to use the JVM available on the PDI server, and PDI will just work as a mediator between the scheduler and Carte. – Explorer Feb 10 '16 at 13:02
  • Uh, eh? I don't understand what you mean. You can start Carte from the same installation where you run PDI. Are you using the DI server? If you are, then you already have a Carte server running! – Codek Feb 10 '16 at 16:51