0

I'm working with a project using Kettle (PDI). I have to input multiple file of .csv or .xls and insert it into DB.

The file name are AAMMDDBBBB, where AA is code for city and BBBB is code for shop. MMDD is date format like MM-DD. For example LA0326F5CA.csv.

The Regexp I use in the Input file steps look like LA.\*\\.csv or DT.*\\.xls, which is return all files to insert it into DB.

Can you indicate me how to select the files the file just for yesterday (based on the MMDD of the file name).

Rio Odestila
  • 125
  • 2
  • 19

2 Answers2

0

As you need some "complex" logic in your selection, you cannot filter based only on regexp. I suggest you first read all filenames, then filter the filenames based on their "age", then read the file based on the selected filenames.

In detail:

  1. Use the Get File Names step with the same regexp you currently use (LA.*\.csv or DT.*\.xls). You may be more restrictive at that stage with a Regexp like LA\d\d\d\d.....csv, to ensure MM and DD are numbers, and DDDD is exactly 4 characters.

  2. Filter based on the date. You can do this with a Java Filter, but it would be an order of magnitude easier to use a Javascript Script to compute the "age" of you file and then to use a Filter rows to keep only the file of yesterday.

    To compute the age of the file, extract the MM and DD, you can use (other methods are available):

    var regexp = filename.match(/..(\d\d)(\d\d).*/);
    if(regexp){
        var age = new Date() - new Date(2018, regexp[1], regexp[2]);
        age = age /1000 /60 /60 /24;
        };

If you are not familiar with Javascript regexp: the match will test the filename against the regexp and keep the values of the parenthesis in an array. If the test succeed (which you must explicitly check to avoid run time failure), use the values of the match to compute the corresponding date, and subtract the date of today to get the age. This age is in milliseconds, which is converted in days.

  1. Use the Text File Input and Excel Input with the option Accept file from previous step. Note that CSV Input does not have this option, but the more powerful Text File Input has.
AlainD
  • 6,187
  • 3
  • 17
  • 31
  • Okay, thanks for your answer. I will try and reply the result. – Rio Odestila Apr 20 '18 at 01:30
  • in the secont point that u says, using Java Filter, its have field like destination and condition, so js that u write i can put it in coindition? or js is use another tools? #NewbieStillLearning – Rio Odestila Apr 20 '18 at 01:52
  • my transformation flow like : Get Filename - Java Filter - Text File Input - Insert / Update – Rio Odestila Apr 20 '18 at 02:07
  • References for Java Filter is there: https://wiki.pentaho.com/display/EAI/Java+Filter. – AlainD Apr 20 '18 at 11:35
  • However, as I said, it will be much easier to develop and maintain, if your flow looks like: Get Filename --> Javascript (for age) --> Filter row--> Text File Input --> Insert/Update. – AlainD Apr 20 '18 at 11:38
0

well I change the Java Filter with Modified Java Script Value and its work fine now. Another question, how can I increase the Performance and Speed of my current transformation(now I have 2 trans. for 2 cities)? My insert update make my tranformation to slow and need almost 1hour and 30min to process 500k row of data with alot of field(300mb) and my data not only this, if is it work more fast and my company like to used it, im gonna do it with 10TB of data/years and its alot of trans and rows. I need sugestion about it

Rio Odestila
  • 125
  • 2
  • 19
  • Please, If it is another question then ask an other question. It is rather difficult to answer a answer. – AlainD Apr 23 '18 at 06:57
  • You have many ways, but it relates to your database. Fist try to change the commit size . Second (and more efficient) do not use the Insert/Update [you have to look for an existing record, erase it and put the new one: 3 database operations]. Instead use the `Merge diff` step which tells you if a record is new/changed/deleted or identical, then a `Switch/Case` step, then one of `Insert Table`, `Update` or `Delete`. Third (even more efficient but requires lot of storage space), create a new table with all the data, erase the old one and rename the new table. – AlainD Apr 23 '18 at 07:06