3

I'm reading an excel sheet using Kettle and this sheet has three fields: code, description and date.

Example:

1 - description A - 01/JAN/2013
2 - description A - 15/JAN/2013
3 - description A - 04/JAN/2013
4 - description B - 02/JAN/2013
5 - description B - 16/JAN/2013
6 - description B - 11/JAN/2013

What I want is to put an additional field Max date for each row, where max date is the latest date on a group based on description, like this:

Example 2

1 - description A - 01/JAN/2013 - 15/JAN/2013
2 - description A - 15/JAN/2013 - 15/JAN/2013
3 - description A - 04/JAN/2013 - 15/JAN/2013
4 - description B - 02/JAN/2013 - 16/JAN/2013
5 - description B - 16/JAN/2013 - 16/JAN/2013
6 - description B - 11/JAN/2013 - 16/JAN/2013

The group of rows with description A has the latest date 15/JAN/2013 (because of row 2). The group of rows with description B has the latest date 16/JAN/2013 (because of row 5).

jacktrade
  • 3,125
  • 2
  • 36
  • 50
rafaelim
  • 644
  • 6
  • 13

3 Answers3

1

This is the perfect job for the row denormalizer step:

http://wiki.pentaho.com/display/EAI/Row+denormaliser

jacktrade
  • 3,125
  • 2
  • 36
  • 50
  • Agreed - This step is confusing to use though, so check the samples and all will become clear! – Codek May 21 '13 at 05:26
  • I'll try the row de-normalizer! – rafaelim May 21 '13 at 23:29
  • Wouldn't denormalizer result in only 1 record for all records with same key(so in the example result would be only 2 records instead of 6)? Should be row denormalizer used to get max value and then some merge to stick this value to every record with same key? – Dolfa May 29 '13 at 12:00
  • getting the "description" column as key will return merged values , see the doc link sample – jacktrade May 29 '13 at 20:27
1

It's actually easier just to split your data in two streams (copy to consecutive steps) use stream 1 using the Group Step and base the group on the description and set as aggregate max value to find the maximum date for each group (should result in two output rows based on your example).

description A - 15/JAN/2013
description B - 16/JAN/2013

Next you use for stream 2 a Stream look-up Step to look up the max_date aggregate from the Group Step - stream 1 as use the description fields as key fields and use max_date aggregate as the value to retrieve.

Remember as you use the group step, it's necessary to sort your data accordingly.

bsecker
  • 86
  • 2
0

You can accomplish this using Max as an array formula.

So, suppose you had your original data in columns A thru C in your sheet, in cell D1, you would type:

=MAX(--($B$1:$B$4=B1)*($C$1:$C$4))

And then, to make it an array formula, you hit CTRL + SHIFT + ENTER

Of course, change the range for columns B & C to match your dataset.

Hope this helps and makes sense.

John Bustos
  • 19,036
  • 17
  • 89
  • 151