
Here's a distilled version of what we're trying to do. The transformation step is a "Table Input":

SELECT DISTINCT ${SRCFIELD} FROM ${SRCTABLE}

We want to run that SQL with variables/parameters set from each line in our CSV:

SRCFIELD,SRCTABLE
carols_key,carols_table
mikes_ix,mikes_rec
their_field,their_table

In this case we'd want it to run the transformation three times, one for each data line in the CSV, to pull unique values from those fields in those tables. I'm hoping there's a simple way to do this.
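
To make that concrete, across the three data lines the step would effectively run:

SELECT DISTINCT carols_key FROM carols_table
SELECT DISTINCT mikes_ix FROM mikes_rec
SELECT DISTINCT their_field FROM their_table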

I think the only difficulty is that we haven't stumbled across the right step/entry and the right settings.

Poking around in a "parent" transformation, the highest hopes we had were:

  1. We tried chaining CSV file input to Set Variables (hoping to feed it to Transformation Executor one line at a time) but that gripes when we have more than one line from the CSV.
  2. We tried piping CSV file input directly to Transformation Executor but that only sends TE's "static input value" to the sub-transformation.

We also explored using a job with a Transformation entry; we were very hopeful we'd stumble into what "Execute every input row" applies to, but we haven't figured out how to pipe data to it one row at a time.

Suggestions?

1 Answer

Aha!

To do this, we must create a JOB with TWO TRANSFORMATIONS. The first reads "parameters" from the CSV and the second does its duty once for each row of CSV data from the first.

PDI Job with two Transformations

In the JOB, the first transformation is set up like this:

  1. Options/Logging/Arguments/Parameters tabs are all left as default

  2. In the transformation itself (right click, open referenced object->transformation):

    • Step1: CSV file input
    • Step2: Copy rows to result <== that's the magic part

Back in the JOB, the second transformation is set up like so:

  1. Options: "Execute every input row" is checked

  2. Logging/Arguments tabs are left as default

  3. Parameters:

    • "Copy results to parameters" is checked
    • "Pass parameter values to sub transformation" is checked
    • Parameter: SRCFIELD; Parameter to use: SRCFIELD
    • Parameter: SRCTABLE; Parameter to use: SRCTABLE
  4. In the transformation itself (right click, open referenced object->transformation):

    • Table input "SELECT DISTINCT ${SRCFIELD} code FROM ${SRCTABLE}"
    • Note: "Replace variables in script" must be checked; that's what expands ${SRCFIELD} and ${SRCTABLE}, as in the example below
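
For example, when the CSV row with SRCFIELD = carols_key and SRCTABLE = carols_table reaches this second transformation, variable substitution means the Table input actually runs:

SELECT DISTINCT carols_key code FROM carols_table

(The trailing "code" is just a column alias, giving the output field a consistent name regardless of which source field is being pulled.)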

So the first transformation gathers the "config" data from the CSV and, one record at a time, passes those values to the second transformation (since "Execute every input row" is checked).

So now with a CSV like this:

SRCTABLE,SRCFIELD
person_rec,country
person_rec,sex
application_rec,major1
application_rec,conc1
status_rec,cur_stat

We can pull distinct values for all those specific fields, and lots more. And it's easy to maintain which tables and which fields are examined.

Expanding this idea to a data flow where the second transformation updates code fields in a datamart isn't much of a stretch:

SRCTABLE,SRCFIELD,TARGETTABLE,TARGETFIELD
person_rec,country,dim_country,country_code
person_rec,sex,dim_sex,sex_code
application_rec,major1,dim_major,major_code
application_rec,conc1,dim_concentration,concentration_code
status_rec,cur_stat,dim_current_status,cur_stat_code

We'd need to pull unique ${TARGETTABLE}.${TARGETFIELD} values as well, use a Merge rows (diff) step, use a Filter rows step to find only the 'new' ones, and then an Execute SQL script step to update the targets.
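
As a rough sketch (the exact SQL depends on how the dimension tables are keyed, and it assumes the Execute SQL script step is set to run once per incoming row, binding the 'new' code value to the ? placeholder), those two extra pieces might look like:

-- Second Table input: codes already in the target, aliased to "code" so
-- Merge rows (diff) compares the same field name in both streams
SELECT DISTINCT ${TARGETFIELD} code FROM ${TARGETTABLE}

-- Execute SQL script (run per row): add each 'new' code to the target
INSERT INTO ${TARGETTABLE} (${TARGETFIELD}) VALUES (?)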

Exciting!