How to store a variable from one line for use in later lines in Pentaho kettle?

Question

I have to process a spreadsheet that has multiple levels of aggregation within it. Mostly, this is just fine, but in one case, I need to use information from the highest aggregation level in conjunction with information from the next aggregation level. Here's an example:

Title,         Platform,    Quantity, Revenue
"Some title",          ,    100,      1000.00
            , "Platform A", 12,       120.00
            , "Platform B", 20,       200.00
            , "Platform C", 68,       680.00

"Some other title",       , 20,       200.00
            , "Platform B", 20,       200.00
...

The first record has a title, "Some Title", and aggregated quantity and revenue metrics for all platforms. This is mostly what I need to read. However, in one case, say for "Platform B", I need to retrieve the specific values for that platform, and I need to know what the title is for those values.

Anyone know how to get this done?

I would like to just note that this is exactly not congruent to how Kettle processes rows. Row processing is supposed to be parallelizable and distributable. The local lack of information and the significance of order to impute it violates the necessary assumptions. I would recommend an independent preprocessing step where you apply an LOCF (last observation carried forward) process. +1 though b/c it is a reasonable use case. — Raffael, Dec 16 '14 at 08:41

score 2 · Answer 1 · answered Dec 16 '14 at 08:29

I raised a jira for this:

http://jira.pentaho.com/browse/PDI-13296

Because it's not immediately obviously straightforward.

I can only think to do it in a "modified javascript step" a tiny snippet of code that remembers previous values of title and then substitutes in when title is null.

That does assume your data is always sorted in this way, and also assumes your first row always has a value.

If you had something you could group by (i.e. some additional attribute you're not showing?) then you could use the "pass all rows" option in that step and get the "first non null" value for the group on the aggregate for the title field.

score 0 · Answer 2 · answered May 04 '16 at 16:43

To me, it looks like you are trying to flatten your data (sort of).

I'm probably doing it wrong, but here is what worked for me with a similar data set:

(Assuming your data is already in order as shown above)
(Marker) Formula Step: IF([TITLE]="";1;0) - This flags 0 at the start of a new group
(Seq) Sequence Step: Just a straight sequence starting with 1
(SeqCounter) Add value fields, changing sequence step: Use "Marker" as the change field.
(Factor) Calculator Field: A*B; SeqCounter * Marker
(Group) Calculator Field: A+B; Seq + Factor

This SHOULD give you a "GROUP" field that tells you which group each line is part of.

Then I just split out the data and rejoin using the GROUP as the JOIN field.

I know this is an old post, but since I did not find a solution when I searched, I thought I would put this up for others to find.

How to store a variable from one line for use in later lines in Pentaho kettle?

2 Answers2