
I'm still pretty new to Pentaho Spoon. I'd like to know if the following can be achieved.

In the past I had many bad experiences with SSIS, so at the time I decided to develop my own ETL in C#. In practice, .NET only does the extract step: data is inserted into the DB, and the rest of the ETL is done by the MSSQL query engine from plain SQL stored in .txt files, which .NET reads and executes on MSSQL.

My idea is to move from .NET to Java and use Spoon's features. The advantage is that I'd have Spoon's components available, Table Output for example.

One issue I have is that some flat files arrive corrupted. For example, accented letters get replaced by the separator character, so I can't just tell the ETL tool to split columns on the separator; I first need to check how many separators are present and handle lines that have more than expected.

I also need to verify whether a file was already processed, whether it has finished being copied over the network, etc. I also don't want SQL code stored in Execute SQL Script components; I want it saved in plain .txt files so that Subversion can track changes to it, and the ETL tool should read these files and send them to MSSQL for execution.

So my idea is to use Spoon's GUI to build the ETL normally, and then use Eclipse to develop against its SDK to customize the execution. For example, I'd use the standard Text File Input component in the GUI, but my jar would have its own class extending the standard one, overriding the method responsible for receiving a line string and splitting it into fields, so it can handle any issue.

My jar would instantiate my class instead of Spoon's and provide its object to the engine.
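That overridden splitting method could delegate to a small helper like the one below. This is a minimal sketch in plain Java, assuming a fixed expected field count and a single-character separator; the `LineSplitter` name and the merge-into-first-field heuristic are hypothetical, not part of any Kettle API:

```java
public class LineSplitter {

    /**
     * Splits a line on the separator, but first verifies the separator
     * count. If corrupted accents introduced extra separators, the surplus
     * pieces are merged back into the first field with a '?' placeholder
     * (a simple heuristic; adapt it to where corruption occurs in your data).
     */
    public static String[] splitChecked(String line, char sep, int expectedFields) {
        String quoted = java.util.regex.Pattern.quote(String.valueOf(sep));
        String[] parts = line.split(quoted, -1);  // -1 keeps trailing empty fields
        if (parts.length == expectedFields) {
            return parts;
        }
        if (parts.length < expectedFields) {
            throw new IllegalArgumentException(
                "Too few fields: " + parts.length + " < " + expectedFields);
        }
        // More separators than expected: merge the surplus into field 0,
        // marking each spot where the corrupted character used to be.
        int surplus = parts.length - expectedFields;
        StringBuilder first = new StringBuilder(parts[0]);
        for (int i = 1; i <= surplus; i++) {
            first.append('?').append(parts[i]);
        }
        String[] fixed = new String[expectedFields];
        fixed[0] = first.toString();
        System.arraycopy(parts, surplus + 1, fixed, 1, expectedFields - 1);
        return fixed;
    }
}
```

For example, a line like `Jo;o;b;c` (where the `é` of `Joé` was replaced by `;`) comes back as three fields, the first being `Jo?o`, instead of crashing the split into four.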

Is this viable, or too complex to be worth the trouble?

Hikari

1 Answer


It seems like the right approach to me.

In the case of accents being replaced by separators, first check whether it is a file character-set issue. If it is, and it varies from file to file, put the charset in a variable, or derive it with some logic based on the file name or directory.
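That file-name-based charset logic can live in a tiny helper. A sketch in plain Java, where the `legacy_` naming convention is purely an assumed example of "some logic by file name":

```java
import java.io.IOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;

public class CharsetByFile {

    /**
     * Hypothetical convention: files from the legacy system are named
     * "legacy_*" and are Latin-1; everything else is UTF-8. A wrong charset
     * here is exactly what turns accented letters into garbage bytes.
     */
    public static Charset charsetFor(String fileName) {
        return fileName.startsWith("legacy_")
                ? StandardCharsets.ISO_8859_1
                : StandardCharsets.UTF_8;
    }

    /** Reads all lines of the file with the charset chosen for its name. */
    public static List<String> readLines(Path file) throws IOException {
        return Files.readAllLines(file, charsetFor(file.getFileName().toString()));
    }
}
```

In Spoon itself the equivalent is setting the Text File Input "Encoding" field from a variable rather than hard-coding it.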

If it is not a charset issue, then read each row of the file as one long string and drop it into a JavaScript step to split it with a regexp. The JavaScript step is easier to handle than the Java or Regex steps, which are good choices but less flexible to start with.

To check whether a file has been processed or not, a simple scheme works: move processed files into a given directory, then use the File Exists step to decide whether to process them (adapt freely to your case).
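The same scheme in plain Java, for the custom-jar route. The directory layout and the size-stability check are assumptions; the latter is one crude way to address the "finished being copied over the network" concern from the question:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

public class ProcessedFiles {

    /** True if a file with the same name already sits in the processed directory. */
    public static boolean alreadyProcessed(Path file, Path processedDir) {
        return Files.exists(processedDir.resolve(file.getFileName()));
    }

    /** After a successful load, move the file into the processed directory. */
    public static void markProcessed(Path file, Path processedDir) throws IOException {
        Files.createDirectories(processedDir);
        Files.move(file, processedDir.resolve(file.getFileName()),
                StandardCopyOption.REPLACE_EXISTING);
    }

    /**
     * Crude check that a network copy has finished: the size must not
     * change between two reads spaced waitMillis apart.
     */
    public static boolean sizeStable(Path file, long waitMillis)
            throws IOException, InterruptedException {
        long before = Files.size(file);
        Thread.sleep(waitMillis);
        return Files.size(file) == before;
    }
}
```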

To execute a SQL script contained in a file, read its content into a field and give it to the Dynamic SQL Row step.
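Outside of Spoon, the same idea in plain Java: load the Subversion-tracked .sql file and expand `${variable}` placeholders before handing the string to the database. The substitution syntax deliberately mirrors Kettle's `${var}` convention; the class name and the naive replace loop are a hypothetical sketch, not Kettle's own implementation:

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Map;

public class SqlFileRunner {

    /**
     * Reads a version-controlled .sql file and substitutes ${name}
     * placeholders from the given variable map. The resulting string can
     * then be sent to MSSQL through a plain JDBC Statement.
     */
    public static String loadSql(Path sqlFile, Map<String, String> vars) throws IOException {
        String sql = new String(Files.readAllBytes(sqlFile), StandardCharsets.UTF_8);
        for (Map.Entry<String, String> e : vars.entrySet()) {
            sql = sql.replace("${" + e.getKey() + "}", e.getValue());
        }
        return sql;
    }
}
```

Note the naive `replace` does no SQL escaping, so it is only safe for trusted values such as table names or dates coming from the job itself, which matches how Kettle variable substitution is normally used.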

You can put all of that into a custom plugin, so each time you open Spoon you will have your own step. This is really easy to do but, to tell the truth, I have not done that in a long time. It is much easier to use the Transformation Executor step in another transformation or in a job.

And while we are on this kind of question, let me mention Chapter 22 of the Pentaho Kettle Solutions book, which explains (with examples) how to call Kettle from Java. That is great fun.

-- If you need further support, accept this answer and post a new question. --

AlainD
  • Thanks a lot for the tips! Indeed, it'd be better to develop new components in Eclipse and attach them to Spoon than to have a jar executing Spoon and doing the work! Is that done with the Transformation Executor? Could you point me to a tutorial for that? – Hikari Aug 22 '17 at 15:51
  • Of course, Java would be the last resort. First I'd try to handle it using Modified JS Value. Is it possible to keep the JS code in a txt file, as I want to do with the SQL code? I tried Dynamic SQL Row; it seems to require the SQL to be in a field. Isn't there a place to add job parameters, or a browse button to point to the SQL file? Yes, I'm still a noob. – Hikari Aug 22 '17 at 15:54
  • You have an example of the `Transformation executor`, among others, in the samples directory which sits right next to your spoon.bat (spoon.sh). Note: the difference between a Job and a Transformation is that in a transformation all the steps run at the same time, while in a job they are executed in sequence. This lets you control parallelism but may raise surprises (blocking write locks, involuntary truncates, variables assigned too late, ...) – AlainD Aug 23 '17 at 10:01
  • No, the `Modified Javascript Value` step is not dynamic, in the sense that you cannot read the code from a field or a variable. The reason is that the JS code has to be in sync with the metadata (field names, types, etc.). – AlainD Aug 23 '17 at 10:08
  • Yes, the **`Dynamic SQL Row`** step executes the SQL contained in a field. Note that you can use ${variable} in the script, which will be replaced by the variable defined in the job (or the parameter value you specify when you press the Run button). The **`Execute SQL Script`** step executes a script contained in a variable (press Ctrl-Space for a drop-down list of variables), provided the variable was specified as a transformation parameter (right-click anywhere, Properties, Parameters, ...). The `Execute row SQL script` step allows you to give the SQL in a file, but this last step will require some trial and error. – AlainD Aug 23 '17 at 10:23
  • When you have long questions, ask a new question rather than using comments. – AlainD Aug 23 '17 at 10:23
  • Thanks a lot! I'm gonna ask another question related to this one. – Hikari Aug 24 '17 at 13:04