0

I need guidance regarding the most approriate approach to perform a index function using pentaho Data integration ( kettle )

my situation is as following :

using the GLOBAL voip system report, I stored all data in a Mysql Database, which gives me several id number + name and lastname but whithout the departement name.

each departement name has it's own excel reports that can be identified by the group file name, which is not available in the Global file.

what i am trying to achieve is a lookup for each identification number to identify the departement where he belongs using the report filename and store it on the approriate column.

Any help will be appreciated.

  • Can you please give some precision about the structure of your Excels. If I understand each Excel is for a department (which you do not know from the main stream) and contains one sheet par id. – AlainD Jul 24 '18 at 13:34
  • Hello Alain, Thank you for the feedback. Here is an example of a excel report containing the ID's and the departement name. which also mentionned on the filename P004 is the id of the departement name. i have all global ID's on a database column. or / and a global excel extraction. https://github.com/rootinshell/stack/blob/master/P004_Queue%20performance__2018072318480674.xls PS. i did somechanges to hide informations Regards. – Rootinshell Jul 25 '18 at 09:06
  • So, you want to know that, for example, id=784521 belongs to P004_Queue performance, where 784521 comes from Column B of the xls file named P004_Queue performance_*.xls ? – AlainD Jul 26 '18 at 11:13

2 Answers2

1

Assuming you're using the Excel File Input step, there is an option on the Additional Output Fields tab that will allow you to specify the Full Filename Field. You can name this whatever you want, and it will add an additional column to your incoming Excel data that has the name of the file as one of the columns. You may need to do some regex cleanup on that fields since it's the full file path, not just the filename.

As far as doing the lookup, there are many lookup options to merge streams in the Lookup category of the design tab. I think the Stream Lookup is the step you'll want.

enter image description here

eicherjc
  • 301
  • 1
  • 4
0

As far as I understood your need, you have to first build a "mapping table" of two columns: the department (aka the start of the xls filename) and the employee (aka its ID).

This table does not need to be materialized and may stay in a step of the the PDI. So

  1. Read all the xls files with a Microsoft Excel File. In case you do not know how to do it: Browse to any of these file, press the Add button, then in the Selected files table, remove the filename to keep only its directory path and write .*\.xls in the Regex wildcard. Check you select the appropriates files with the Show filename button.

    In the same step, define the Sheet to be "Fiche technique" (assuming they are all the same). Define the field to be "A" with type String (an empty column) and "ID" also with type String (otherwise you'll have a un-trappable error on "Agent ID" and "Total". Also follow @eicherjc suggestion and keep the filename, although I suggest you keep the Short file name and call it filename.

You should get a two column stream: ID and filename, which need some bit of data massage before to be used. The ID contains non-integer fields and the file name contains extra characters.

  1. The simplest way to do this is with a Modified Javascript Value. I may suggest the code:

    var ID = Number(ID);
    var regex = filename.match(/(.*)__\d+\.xls/);
    if(regex) filename = regex[1];

    and do not forget specify the the ID has now a type Integer and to put a "Y" in the Replace value in field of the Fields`` table at the bottom.

    The first line will convert any number in its value, and non-number in a 0, which is an ID that does not exists.

    The next lines will extract the department from the filename with a Regex. If you do not like regex, you may use a filename = filename.substr(0, filename.indexOf('__')), or any formula that will do the job.

Now you have a stream ready to be used, except that some employees may, right or wrong, be in more than one department. If it does not matter which one, then leave it like that. Otherwise you have to provide some logic to filter the correct department.

  1. You can now use a Lookup Stream to read the department of each employee. The Lookup step is the Modified Javascript value (or whatever name you gave to this step). The field to lookup is the field of the ID in your mySql. The Lookup field is the ID (or whatever name you gave to the column B of your xls files). And the field to retrieveenter code here is the filename (or more precisely, the department name extracted from the filename).

enter image description here

Community
  • 1
  • 1
AlainD
  • 6,187
  • 3
  • 17
  • 31