looping in a Kettle transformation

Question

I want to repetitively execute an SQL query looking like this:

SELECT '${date.i}' AS d, 
  COUNT(DISTINCT xid) AS n
FROM table 
WHERE date 
  BETWEEN DATE_SUB('${date.i}', INTERVAL 6 DAY) 
    AND '${date.i}'
;

It is basically a grouping by time spans, just that those are intersecting, which prevents usage of GROUP BY.

That is why I want to execute the query repetitively for every day in a certain time span. But I am not sure how I should implement the loop. What solution would you suggest?

The Kettle variable date.i is initialized from a global variable. The transformation is just one of several in the same transformation bundle. The "stop trafo" would be implemented maybe implicitely by just not reentering the loop.

Here's the flow chart:

enter image description here

Raffael · Accepted Answer · 2013-04-11T18:02:49.050

Flow of the transformation:

enter image description here

In step "INPUT" I create a result set with three identical fields keeping the dates from ${date.from} until ${date.until} (Kettle variables). (for details on this technique check out my article on it - Generating virtual tables for JOIN operations in MySQL).

In step "SELECT" I set the data source to be used ("INPUT") and that I want "SELECT" to be executed for every row in the served result set. Because Kettle maps parameters 1 on 1 by a faceless question-mark I have to serve three times the same paramter - for each usage.

The "text file output" finally outputs the result in a generical fashion. Just a filename has to be set.

Content of the resulting text output for 2013-01-01 until 2013-01-05:

d;n
2013/01/01 00:00:00.000;3038
2013/01/02 00:00:00.000;2405
2013/01/03 00:00:00.000;2055
2013/01/04 00:00:00.000;2796
2013/01/05 00:00:00.000;2687

I am not sure if this is the slickest solution but it does the trick.

enter image description here

Gordon Seidoh Worley · Answer 2 · 2013-04-10T14:00:52.703

In Kettle you want to avoid loops and they can cause real trouble in transforms. Instead you should do this by adding a step that will put a row in the stream for each date you want (with the value stored in a field) and then using that field value in the query.

ETA: The stream is the thing that moves rows (records) between steps. It may help to think of it as consisting of a table at each hop that temporarily holds rows between steps.

You want to avoid loops because a Kettle transform is only sequential at the row level: rows may process in parallel and out of order and the only guarantee is that the row will pass through the steps in order. Because of this a loop in a transform does not function as you would intuitively expect.

FYI, it also sounds like you might need to go through some of the Kettle tutorials if you are still unclear about what the stream is.

looping in a Kettle transformation

2 Answers2