
Working with result rows in Kettle is the only way to pass lists internally in the program. But how does this work exactly? This topic has not been well documented and there are a lot of questions.

For example, in a job containing 2 transformations, result rows can be sent from the first to the second. But what if there's a third transformation getting the result rows? What is the scope? Can you pass result rows to a sub-job as well? Can you clear the result rows based on logic inside a transformation?

Working with lists and arrays is useful and necessary in programming, but confusing in PDI Kettle.

Phoenexus

1 Answer


I agree that working with result rows may be confusing, but you can be confident: it works.

Yes, you can pass them to a sub-job, and through a series of sub-jobs (for a first test, define the scope as "Valid in the Java Virtual Machine").

And no, there is no way to clear the result rows from within a transformation (and certainly not based on a formula). That would mean a terrible maintenance overhead.

Kettle is not an imperative language; it belongs more to the data-flow family. That means it is closer to the way you think when developing an ETL, and much, much more performant. The drawback is that lists and arrays have no meaning: there is only a flow of data.

And that is what a result set is: a flow of data, like the result set of an SQL query. The next job has to open it, pass each row to the transformation, and close it after the last row.
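To make the "flow of data" picture concrete, here is a minimal Java sketch of the mechanism as I understand it. It does not use the real Kettle API; the class and method names are invented for illustration. In PDI terms, the first transformation would end with a "Copy rows to result" step, and the job entry running the second transformation would consume those rows (for example with a "Get rows from result" step, or by executing the transformation once per input row):

```java
import java.util.ArrayList;
import java.util.List;

// Conceptual model only: these names are hypothetical, not the Kettle API.
public class ResultRowsDemo {

    // Stands in for the first transformation ending with "Copy rows to result":
    // its output rows become the job's result rows.
    static List<Object[]> firstTransformation() {
        List<Object[]> resultRows = new ArrayList<>();
        resultRows.add(new Object[] {"file_1.csv", 120});
        resultRows.add(new Object[] {"file_2.csv", 340});
        return resultRows;
    }

    // Stands in for the next transformation, here run once per result row.
    static void secondTransformation(Object[] row) {
        System.out.println("Processing " + row[0] + " (" + row[1] + " rows)");
    }

    public static void main(String[] args) {
        // The job opens the result rows like a cursor on an SQL result set,
        // passes each row to the next transformation, and closes the flow
        // after the last row. No list is ever manipulated in place.
        List<Object[]> resultRows = firstTransformation();
        for (Object[] row : resultRows) {
            secondTransformation(row);
        }
    }
}
```

The point of the sketch is only the shape of the flow: one transformation produces rows into the result, the job hands them over, the next transformation consumes them; nothing in between behaves like a mutable array you could filter or clear.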

AlainD
  • Thank you for the insight. I wish this topic was better documented. – Phoenexus Jun 12 '18 at 07:02
  • Ten years ago, the only documentation was the Java source code. That's no longer the case. You may be interested in *Pentaho Kettle Solutions: Building Open Source ETL Solutions with Pentaho Data Integration* by Matt Casters, Roland Bouman, Jos van Dongen – AlainD Jun 12 '18 at 17:26
  • As much as I appreciate suggestions of good literature, it doesn't help the case of having documentation for how result rows work between jobs. The book you recommend doesn't include any information about the scope of result rows. I have been working with Pentaho ETL for a couple of years and am at the moment only interested in the advanced and niche topic detailed here. – Phoenexus Jun 13 '18 at 18:27
  • Well, in two words: contrary to the naive impression, the data does not move in PDI. It is read once and for all, and then a set of pointers tells which step each row is in. This set of pointers is destroyed when the transformation finishes, except for the rows that have been put in the "result"; those are pushed onto a stack with maximal scope (if I remember correctly). At job level, the mechanism is similar, except that you can control the scope. – AlainD Jun 14 '18 at 13:23
  • The development has been subject to a lot of trial and error, guided by user experience of various levels. So it is a domain where, in theory, there is no difference between theory and practice, but in practice, there is. – AlainD Jun 14 '18 at 13:25
  • What do you mean by _"define the scope as "valid in the java machine" for the first test"?_ Where would that parameter be? – leokhorn May 17 '19 at 13:12