2

It's clear by now that all steps of a transformation are executed in parallel, and there's no way to change this behavior in Pentaho.

Given that, we have a scenario with a switch task that checks a specific field (read from a filename) and decides which task (mapping - sub-transformation) will process that file. This is part of a generic flow that, before and after each mapping task, performs some boilerplate tasks such as updating DB records, sending emails, etc.

The problem is: if we have no "ACCC014" files, this transformation should not be executed. I understand that's not possible, since all tasks are started in parallel, so a second problem arises: inside some mappings, XML files are created. Even when Pentaho executes such a task with empty input, we can't find a way to avoid the XML output file being created.
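To make the problem concrete, here is a minimal sketch (plain Python, not the Pentaho API; all names are hypothetical) of why the unwanted file appears: every mapping starts even when the switch routes zero rows to it, so an output step that initializes its file on start-up still produces an empty file.

```python
from collections import defaultdict

def switch(rows, field):
    """Route rows into buckets by field value, like the switch task does."""
    buckets = defaultdict(list)
    for row in rows:
        buckets[row[field]].append(row)
    return buckets

def mapping_accc014(rows):
    """Stands in for the ACCC014 mapping: it is invoked (started)
    regardless of whether any rows were routed to it."""
    output = []            # stands in for the XML Output step's file
    for row in rows:
        output.append(row)
    return output          # an empty list ~ an empty XML file on disk

rows = [{"file_type": "ACCC020"}]          # no ACCC014 files today
buckets = switch(rows, "file_type")
# The mapping still runs, just with an empty bucket:
result = mapping_accc014(buckets.get("ACCC014", []))
print(len(result))  # 0 — the step ran and produced an empty output
```

The point of the sketch: the routing decision only affects which rows a mapping receives, not whether the mapping (and its file-creating steps) starts at all.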

We thought about moving this switch logic up to the job level, since jobs run serially in theory, but found no conditional step that would make this kind of distinction.

We also looked at the Metadata Injection task, but we don't believe it's the way to go: each sub-transformation does a really different job. Some of them update tables, others write files, others move data between different databases. All of them receive a file as input and return a send_email flag and a message string; nothing else.

Is there a way to do what we want? Or is there no way to reuse part of a logic based on default inputs/outputs?

Edit: adding the ACCC014 transformation. Yes, the "Do not create file at start" option is checked.


jfneis
  • 2,139
  • 18
  • 31
  • 2
  • Are these XML files created by an XML Output step? Back in 5.2, the XML Output step had a "Do not create file at start" option on the File tab. Have you checked it? – Andrei Luksha Oct 28 '16 at 09:09
  • Why do you need to execute the ACCC014 transform if there's no file for it? If it has side effects you want, factor them out into another transform. – Brian.D.Myers Oct 28 '16 at 18:58
  • @user4637357 yes, an XML file is created by the ACCC014 transform, for example. I checked "do not create file at start", but the problem is that the transformation is executed anyway, even if there are no records sent to it by the switch task. – jfneis Oct 31 '16 at 14:14
  • @Brian.D.Myers I don't need to and, actually, that's my problem right now. The switch task sends no records to the ACCC014 task, but it is executed by Pentaho anyway. I don't know, until the switch task runs, whether I will need to execute ACCC014 or not. Am I missing anything here? – jfneis Oct 31 '16 at 14:16
  • No, all tasks are started whether they get data or not. Is the problem that it's creating unwanted files? Did you try setting the "Do not create file at start" option? If these aren't the issue, can you post the transform run by "ACCC014". – Brian.D.Myers Oct 31 '16 at 16:00
  • @Brian.D.Myers yes, the option is checked and it creates the file anyway. I'm posting the transformation anyway. If it's always executed, even without rows, how could it not create the file? Should I have some kind of decision before file creation to avoid it? – jfneis Nov 01 '16 at 11:43

1 Answer

5

You can use the Transformation Executor step (http://wiki.pentaho.com/display/EAI/Transformation+Executor) to execute a transformation conditionally. Though I haven't really used this step much, so I can't say anything about its stability or performance.

Main transformation:

Set up your parent transformation like this:
Regarding the Injector step: in version 5.2, I was not able to get the fields created in the sub-transformation even though they were defined on the "result rows" tab, so I had to define all these fields in the Injector step instead. Not sure if that is still necessary in the current version.

Possible adjustments for Transformation Executor:

  1. Probably, you'd want to change "The number of rows to send to the transformation" value on the Row grouping tab: set it to 0 to send all rows at once instead of re-executing the transformation for every N rows.

  2. If you want to read the output of your sub-transformation, select the "This output will contain the result rows after execution" option when creating the hop to the subsequent step.

Sub-transformation:

The only change you'll probably need here is to replace your Mapping input and Mapping output steps with Get rows from result and Copy rows to result.

Known issue in 5.2: it seems the executor reads the output of the sub-transformation not from the "Copy rows to result" step, but from the most recently created step. So, if you have added steps to your sub-transformation, remember to re-create the step from which you expect to read the output: just select the "Copy rows to result" step, cut it, paste it back, and re-create the hop.
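With that change, each sub-transformation keeps the uniform contract described in the question. A minimal sketch (hypothetical names, not the Pentaho API): rows come in via "Get rows from result", modelled here as a function parameter, and go back out via "Copy rows to result", modelled as the return value carrying the send_email flag and message string.

```python
def sub_transformation(result_rows):
    """Models one sub-transformation behind Get/Copy rows from/to result."""
    # "Get rows from result" ~ the result_rows parameter
    processed = len(result_rows)       # stand-in for the real work (DB updates, file writes, ...)
    # "Copy rows to result" ~ the returned rows, one row with the agreed fields
    return [{"send_email": processed > 0,
             "message": f"processed {processed} row(s)"}]

out = sub_transformation([{"filename": "ACCC014_01.txt"}])
print(out[0]["send_email"], out[0]["message"])  # True processed 1 row(s)
```

Because every sub-transformation honors the same input/output shape, the parent transformation's boilerplate (DB updates, emails) can stay generic regardless of which mapping actually ran.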

Andrei Luksha
  • 1,020
  • 1
  • 9
  • 13
  • Just to confirm: the Transformation Executor isn't executed in parallel despite not receiving rows from the previous step, as a Mapping is? – jfneis Nov 01 '16 at 17:39
  • According to my tests it does not execute the transformation if there are no input rows. – Andrei Luksha Nov 01 '16 at 17:40
  • You're right, I forgot to mark it as correct! We are using the Transformation Executor and everything is working fine! Tks! – jfneis Dec 06 '16 at 17:10
  • Curious if you have any alternative approaches - the Transformation Executor works, but it runs really slowly. I think under the hood everything is being rebuilt/re-instantiated for each call. Even though it's not as clean, I'd rather keep everything in the same transformation and not sacrifice performance, but I don't see how to process X rows at a time within the same transformation. – WhyGeeEx Dec 04 '18 at 17:41
  • Have you checked "The number of rows to send to the transformation" value on the Row grouping tab? Also, a sub-transformation might be faster than the Transformation Executor, though less flexible. – Andrei Luksha Dec 06 '18 at 10:56