
I'm new to PDI and still learning about it. I'm trying to create a transformation that will read all the CSV files from one folder, check that the data in each file is correct (meaning there are no rows with missing, erroneous, or wrongly formatted values), then store it in a database.

What I have tried is:

  1. Use Text File Input to access the CSV files over FTP using Apache Commons VFS.
  2. Validate and build conditions to check the data in the CSV (checking the filename, and whether the fields exist) using Filter Rows.
  3. Output into a PostgreSQL table using Synchronize After Merge. I used this because I also join the CSV data with data from another table.

The result from my second step is not what I want. Currently it checks only after all the CSV files have been read, and passes all the data to the next step; what I want is to check while reading the data, so that only correct data is passed to the next step. How can I do that? Any suggestions? (I need brainstorming.)

And if that is impossible to implement in PDI, then it's okay to read all the data and pass it to the next step, but then validate it again before inserting the data.
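
In plain Python terms, the check-while-reading behaviour I'm after would look roughly like this (the column names and folder are just placeholders for my actual setup):

```python
import csv
import glob

EXPECTED_FIELDS = {"id", "name", "amount"}  # placeholder column names

def row_is_valid(row):
    # A row passes only if all expected fields exist and none is empty.
    if not EXPECTED_FIELDS <= row.keys():
        return False
    return all(row[field] not in (None, "") for field in EXPECTED_FIELDS)

for path in glob.glob("/data/incoming/*.csv"):  # placeholder folder
    with open(path, newline="") as f:
        for row in csv.DictReader(f):
            if row_is_valid(row):
                pass  # only correct rows would go on to the database step
```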

Rio Odestila
  • Can you give more precision on the condition? The `Filter Rows` step works row by row, leaving in the flow only the rows matching the condition. You seem to need a global condition, accepting or rejecting the whole file based on some criterion. In any case, if that condition is `missing/error/wrong format`, then you have to read the whole file before deciding. Please clarify. – AlainD Jul 18 '18 at 07:16
  • Yes, it will read the whole file first, then compare the filename with data from a table to check that it is the correct file, check that all the fields are correct, and check that no rows in it are null. If any of those conditions is not met, then the data from that file is not processed and not inserted, even if only one row in that file is null. – Rio Odestila Jul 18 '18 at 08:17

1 Answer


You can only validate a file after all its data has been completely read and checked.

The right way to do this is a job that orchestrates several transformations (one to read the directory, one to check whether the files are valid, one to load the data of the validated files).
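
(If you ever script that orchestration outside Spoon, a minimal sketch using Pan, PDI's command-line transformation runner, could look like the following; the three `.ktr` file names are placeholders for your own transformations.)

```python
import subprocess
import sys

# Placeholder transformations, run in sequence as a master job would.
STEPS = ["read_directory.ktr", "check_files.ktr", "load_validated_files.ktr"]

for ktr in STEPS:
    # pan.sh -file=<transformation> runs one transformation; a non-zero
    # exit code means it failed, so the chain stops there.
    result = subprocess.run(["pan.sh", f"-file={ktr}"])
    if result.returncode != 0:
        sys.exit(f"{ktr} failed with exit code {result.returncode}")
```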

Now, writing a job may seem a daunting task until you have written half a dozen of them, so you can do it all in one transformation instead. In fact, it is a standard pattern for making decisions or computations based on indicators defined over the whole input data (sketched in code right after the list below):

  1. Get the list of files.
  2. Read them, keeping track of the filename (in the Additional output fields tab).
  3. Make the check line by line, as you did.
  4. Make a summary that rejects a file if it contains at least one error.
  5. Take back the main stream of step 2, and for each row look up whether its filename was rejected (the lookup stream is the result of the Group By).
  6. Filter out the rows with a rejected filename.
  7. Put them into Postgres (after enriching the data with other files or tables).
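
Outside of Spoon, the same seven steps can be sketched in a few lines of Python to make the pattern concrete; the paths and the error rule are placeholders, and each commented block maps to one PDI step:

```python
import csv
import glob

def row_has_error(row):
    # Step 3: the line-by-line check (placeholder rule: no empty values).
    return any(value in (None, "") for value in row.values())

# Steps 1 and 2: get the file list and read it, keeping the filename per row.
rows = []
for path in glob.glob("/data/incoming/*.csv"):      # Get File Names
    with open(path, newline="") as f:
        for row in csv.DictReader(f):               # Text File Input
            rows.append((path, row))                # additional output field

# Step 4: summary -- a file is rejected if at least one of its rows errors.
rejected = {path for path, row in rows if row_has_error(row)}   # Group By

# Steps 5 and 6: look up each row's filename and drop the rejected files.
good_rows = [row for path, row in rows if path not in rejected]

# Step 7: insert good_rows into Postgres (enrichment and insert not shown).
```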

Just a remark: in your specific case, I would change the flow a bit, testing for the accepted filename in the first filter and removing the Group By and the second filter. But I thought it would be more useful for you to have the standard pattern.
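
A sketch of that simplified variant, reusing `rows` and `row_has_error` from the block above, and assuming `accepted` is a hypothetical set of filenames looked up from your reference table:

```python
# One pass, one filter: keep a row only if its file is in the accepted
# list and the row itself passes the line-by-line check.
good_rows = [
    (path, row)
    for path, row in rows
    if path in accepted and not row_has_error(row)
]
```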

But, again, for various reasons, good practice would be to do it with a master job.

[Screenshot of the transformation flow in Spoon]

AlainD
  • I didn't know that I can put `Text File Input` after `Get File Names`. Nice info, by the way. And what is the use of `Get File Names`? It doesn't pass any field to the next step. – Rio Odestila Jul 19 '18 at 02:06
  • How about error handling in Text File Input? I don't understand how to use it. Is it possible to meet my requirement without setting a status, just filtering out files that have a wrong format, etc.? – Rio Odestila Aug 02 '18 at 10:05
  • Matt Casters, the author of PDI, once told me: "Read the doc". It was brutal, but it is there: https://wiki.pentaho.com/display/EAI/Text+File+Input. And when you try to connect the step to the next one, Spoon asks you whether it is the normal hop or the error-handling hop, which you can also choose with a right-click on the step. – AlainD Aug 02 '18 at 10:34