1

Using pentaho kettle (also known as pdi), I have a "Join Rows (cartesian product)" step which merges two streams of data.

Both the first and second stream have a numeric value attached. For example,

Stream 1 - Values 1, 3, 5
Stream 2 - Values 2, 4, 6

I want to join the two streams to get the following output:

(1, 2)
(3, 4)
(5, 6)

I would describe the correct output as having stream 1 pick the smallest value which is larger than the value from stream 1.

Within the Join Rows step, I can specify stream 2 having a value greater than the stream 1 value. Unfortunately, this produces the following incorrect outcome:

(1, 2)
(1, 4)
(1, 6)
(3, 4)
(3, 6)
(5, 6)

Is there a different step that I should use instead of "Join Rows" in Kettle? Or am I missing a setting on the join rows step?

Note: I also looked at using a Stream Lookup step, but it only works for equals and not for my logic.

Thanks.

David
  • 1,097
  • 5
  • 20
  • 34

1 Answers1

1

You're already half way there.

  • You have two inputs: Stream1 (1, 3, 5) and Stream2(2, 4, 6)
  • You join rows (make sure you sort them before joining) on value(stream2) > value(stream1)
  • You sort resulting stream on {value(Stream1), value(Stream2) This gives you
(1, 2)
(1, 4)
(1, 6)
(3, 4)
(3, 6)
(5, 6)
  • Put the "Add Value Fields Changing Sequence" step and set the "Init sequence if value of the following fields change" to value(Stream1). Resulting stream is:
(Stream1, Stream2, result)
(1, 2, 1)
(1, 4, 2)
(1, 6, 3)
(3, 4, 1)
(3, 6, 2)
(5, 6, 1)
  • Put a filter step and filter on "result=1".
  • Resulting stream from "true" branch of the filter is the deisired result.

I uploaded "example.ktr" with the solution (I used Kettle 4.3. version):

example.ktr

Milos
  • 192
  • 3
  • 11