
Using PDI (Kettle) I am filling the entry stage of my database using a CSV Input and a Table Output step. This works great; however, I also want to make sure that the data that was just inserted fulfills certain criteria, e.g. fields not being NULL, etc.

Normally this would be a job for database constraints. However, we want to keep the data in the database even if it's faulty (for debugging purposes — it is a pain trying to debug a .csv file...). As it is just a staging table anyway, it doesn't cause any trouble for integrity, etc.

So to do just that, I wrote some `SELECT Count(*) as test123 ...` statements that instantly show whether something is wrong and are easy to evaluate (if the value of test123 is 0 all is good, otherwise the job needs to be aborted).
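One of these check statements looks roughly like this (table and column names are placeholders, not my actual schema):

```sql
-- Count rows in the staging table that violate a criterion,
-- e.g. a mandatory field being NULL. 0 rows = data is fine.
SELECT COUNT(*) AS test123
FROM staging_table
WHERE mandatory_field IS NULL;
```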

I am executing these statements using an Execute SQL Statements step within a PDI transformation. I expected the result to be automatically passed into my datastream, so I also used a Copy rows to result step to pass it up to the executing job.

This is most likely where the problem lies. I think the result of the SELECT statement was not automatically passed into my datastream, because when I do a Simple evaluation in the main job using the variable ${test123} (which I thought would be implicitly created by executing `SELECT Count(*) as test123 ...`), I never get the expected result.

I couldn't really find any clues to this problem in the PDI documentation so I hope that someone here has some experience with PDI and might be able to help. If something is still unclear, just hint at it and I will edit the post with more information.

best regards

Edit: This is a simple model of my main job:

Start --> Load data (Transformation) --> Check data (Transformation) --> Simple Evaluation --> ...

daZza
  • Do you run these SQL statements in a transformation where the datastream comes from the job, or do you check by connecting directly to your database? – Seb Aug 27 '14 at 11:12
  • The `Execute SQL Statements` step connects to the database and executes the statement. This is done in a separate transformation, which is called by a job. – daZza Aug 27 '14 at 11:14
  • Is SQL a must? What about a JavaScript-Step in your "Check data"-Transformation? – Seb Aug 27 '14 at 11:25
  • What would JavaScript do in that case? I need to query my data to see if it's in order or not – daZza Aug 27 '14 at 11:30
  • Ok, sorry. I thought your data was already there. As an idea: maybe you need to generate an empty row named test123 before the SQL step – Seb Aug 27 '14 at 11:48
  • And what if you test the incoming data of your SQL step with a Select values step? What rows are displayed? I could imagine that Pentaho has problems detecting your new rows. – Seb Aug 27 '14 at 11:50
  • Well, the data is there. But I can't be sure that it is also correct (hence the "check data" transformation) – daZza Aug 27 '14 at 11:50
  • When I try to select it with `Select values`, PDI throws an error: `Couldn't find field '${test123}' in row!`, which seems to confirm my notion that the result is not automatically pushed into the datastream – daZza Aug 27 '14 at 11:56
  • Yes. Did you check the variables box in your SQL step? There should be a checkbox for that. If that's not the point: what comes up when you click the Get Fields button in the Select values step? Nothing? Then try the mentioned Generate Rows step so that PDI knows the new row. Hope it helps – Seb Aug 27 '14 at 12:03
  • You mean the "variable substitution" checkbox? No, I did not check that, because I am not substituting any variables in my script. When I click the Get Fields button nothing happens at all. I will try the Generate Rows step, maybe that helps. Edit: Nope, no change when using it. – daZza Aug 27 '14 at 12:06
  • Mmh. Your transformation should start with Generate Rows, where you define test123 as a new field (string). The second step is your SQL with your statement, `select ... AS 'test123'`. The third step is Select values; clicking Get Fields should bring up test123 as a field, or maybe two fields, test123 and test123_1 (I know this from an HTML step I use) – Seb Aug 27 '14 at 12:14
  • A new idea: take a Table input step instead of your Execute SQL step; there you don't need a Generate Rows – Seb Aug 27 '14 at 12:18
  • Let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/60077/discussion-between-seb-and-dazza). – Seb Aug 27 '14 at 12:22

1 Answer


You are mixing up a few concepts, if I read your post correctly.

  1. You don't need an Execute SQL script step; this is a job for the Table input step. Just type your query in the Table input and you can preview your data and see it coming from the step into the data stream by using the preview on a subsequent step. The Execute SQL script step is not an input step, which means it will not add external data to your data stream.

  2. The output fields are not variables. A variable is set using the Set Variables step, which takes a single input row and maps a specific field to a variable, which can be persisted at the parent job or root job level. Fields are just that: fields. They are passed from one step to the next through hops, and eventually to the parent job if you have a Copy rows to result step, but they are NOT variables.
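Putting the two points together: in the "Check data" transformation, a Table input step would run the check query (so the count actually enters the data stream as a field), and a Set Variables step would then map that field to a variable the parent job's Simple evaluation can read. The query itself stays what you already have, e.g. (hypothetical table/column names):

```sql
-- Run in a Table input step; the result becomes a field named
-- test123 in the data stream, which Set Variables can then
-- expose to the parent job as ${test123}.
SELECT COUNT(*) AS test123
FROM staging_table
WHERE mandatory_field IS NULL;
```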

nsousa