
I need to read these data and then load them into Oracle tables using (if possible) a single Spoon transformation.

The data are stored in pairs of files. Each pair has one data file (data-type_yyyymmdd.txt) in which each record occupies one line, terminated by a carriage return, with no delimiters between the fields. The key to extracting those fields is the second file (data-type_yyyymmdd.ref), which describes how the fields are laid out along each row. Let's look at an example:

data-type_yyyymmdd.ref file layout:

OPERATION_ID                             A 011
USER                                     A 008
DATE_TRANSACTION                         A 0008
TIME_TRANSACTION                         N 0004

The first column is the field name, the second is the data type (A = alphanumeric, N = numeric), and the third is the field length. The columns occupy fixed positions along the line.

IMPORTANT: Different pairs may have REF files with different structures, but pairs of the same kind (identified by the data-type part of the file name) share the same REF structure.
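For illustration, here is a minimal Python sketch (not part of the question; the function name and file handling are assumptions) of how a REF file like the one above could be parsed into a field layout:

```python
# Hypothetical sketch: parse a .ref file into a list of (name, type, length) tuples.
# Assumes whitespace-separated columns, as in the example layout above.
def read_ref(ref_path):
    layout = []
    with open(ref_path) as ref:
        for line in ref:
            if not line.strip():
                continue  # skip blank lines
            name, ftype, length = line.split()
            layout.append((name, ftype, int(length)))  # "0008" -> 8
    return layout

# read_ref("data-type_yyyymmdd.ref") would return something like:
# [('OPERATION_ID', 'A', 11), ('USER', 'A', 8),
#  ('DATE_TRANSACTION', 'A', 8), ('TIME_TRANSACTION', 'N', 4)]
```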

data-type_yyyymmdd.txt file layout:

A line of this file:

50593749120ABCDEFGH201701021444

The fields for this line have these values:

OPERATION_ID=50593749120
USER=ABCDEFGH
DATE_TRANSACTION=20170102
TIME_TRANSACTION=1444
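Each field is simply cut from the line by accumulating the lengths given in the REF file. A minimal Python sketch of that slicing logic (the layout list mirrors the example REF file; converting N fields to integers is an assumption):

```python
# Hypothetical sketch: slice one fixed-width line according to the REF layout.
def split_line(line, layout):
    record, pos = {}, 0
    for name, ftype, length in layout:
        value = line[pos:pos + length]
        record[name] = int(value) if ftype == 'N' else value  # N = numeric
        pos += length
    return record

layout = [('OPERATION_ID', 'A', 11), ('USER', 'A', 8),
          ('DATE_TRANSACTION', 'A', 8), ('TIME_TRANSACTION', 'N', 4)]
print(split_line("50593749120ABCDEFGH201701021444", layout))
# {'OPERATION_ID': '50593749120', 'USER': 'ABCDEFGH',
#  'DATE_TRANSACTION': '20170102', 'TIME_TRANSACTION': 1444}
```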

So the Pentaho transformation has to read these files, extract the rows/fields, and insert them into a table.

How do we identify the table into which these data should be inserted?

The key is the data-type part of the file name. We have a set of tables whose names match this part, and each table's structure corresponds to its related REF file. Example:

The file pair userlogins_20170701.ref and userlogins_20170701.txt holds data for the userlogins Oracle table. Suppose this pair has the same structure as the example above; then the table has this structure:

ID - numeric(16)
OPERATION_ID - varchar2(11)
USER - varchar2(8)
DATE_TRANSACTION - varchar2(8)
TIME_TRANSACTION - numeric(4)

All tables have an ID field as the primary key, populated by a database sequence. The other fields follow the REF file structure: fields of type A are mapped to VARCHAR2 columns and fields of type N to numeric columns.
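Since the tables already exist, the remaining pieces are deriving the table name from the file name and mapping REF types to column types. A rough Python sketch of both, assuming the date is always the last underscore-separated part of the file name and that the numeric columns are Oracle NUMBER columns:

```python
import os

# Hypothetical sketch: derive the target table name from a data file name
# and map REF data types to Oracle column types (A -> VARCHAR2, N -> NUMBER).
def table_name(path):
    base = os.path.basename(path)                    # "userlogins_20170701.txt"
    return base.rsplit('.', 1)[0].rsplit('_', 1)[0]  # -> "userlogins"

def oracle_type(ftype, length):
    return f"VARCHAR2({length})" if ftype == 'A' else f"NUMBER({length})"

print(table_name("userlogins_20170701.txt"))  # userlogins
print(oracle_type('A', 11))                   # VARCHAR2(11)
print(oracle_type('N', 4))                    # NUMBER(4)
```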

gwarah
  • So you have one file that contains the metadata (field name, type and length), and another that contains the data (fixed-width fields with the lengths specified in the first file). Two more pieces of information are needed: a) how to create the output table (manually, automatically once, automatically one per input, ...); b) what is the scale? For two or three file types, do it with a hardcoded input/output chain; for a hundred, cut the txt rows based on the values in the ref file; for thousands, use metadata injection. Please specify. – AlainD Jul 06 '17 at 07:46
  • a) The tables have been previously created; you identify them by the `data-type` part of the filename. b) We have inputs of one or many file pairs of different `data-type`s. Each pair feeds a table also named after its `data-type`. – gwarah Jul 10 '17 at 15:48
  • Then I think it is exactly the case covered by http://diethardsteiner.github.io/pdi/2015/10/31/Transformation-Executor-Record-Groups.html. Note: code by incremental improvement: do it for one file pair before you automate it, because the automation is a project in itself. – AlainD Jul 11 '17 at 11:53
  • @AlainD, first of all, sorry for our late feedback. Excellent example; it is indeed exactly the solution in terms of Pentaho. But our team chose to pre-process these files with a shell/awk script that generates a CSV file from the original pairs. Thanks anyway. – gwarah May 21 '18 at 18:42

1 Answer


Here, if the .ref file will not change, you can take the stream of data from the source file into "copy rows to result", then run a job with the setting "run this job for each row", then apply a cut-string component to each row to set four variables, and store them in four columns row by row.

karan arora
  • The .ref file can change according to the file name, and we have one or more pairs of files from which to extract the data. I've added more explanation to clarify my question. Could you please read the question again? – gwarah Jul 10 '17 at 15:55
  • @gwarah First, copy the data of the .ref file into a table. We know that there will be three fields, so create a table with three columns. While inserting into the table, use a replace-string component to replace "A" with varchar2 and "N" with numeric, and also trim the leading "0" from the third field. The data in the table will then look like this: OPR_ID varchar2 11, USR varchar2 8, DATE_TRN varchar2 8, TIME_TRN numeric 4. Then build a dynamic table based on those values. – karan arora Jul 11 '17 at 05:41
  • @gwarah I have renamed the columns because of the comment length limit. If you are OK with the above approach (which is not fully explained), we can discuss it further. – karan arora Jul 11 '17 at 05:46
  • Sorry for our late feedback. Have you seen the comments below the question? There is a suggested solution and our pre-processing solution. – gwarah May 21 '18 at 18:51