Pentaho PDI: how to get XML files from many .tgz archives

Question

I would like to read the XML files contained in many .tgz archives.

I have already tried this to read a single .tgz file:

folder/file               |   regex 
tgz:C:\tmp\file_01.tgz!   |   .*\.xml

But in my case, I don't know how many .tgz files there will be. I tried something like this, but it doesn't work:

tgz:C:\tmp\file_*.tgz!

score 0 · Answer 1 · answered Apr 24 '18 at 11:58

Create a Job (not a transformation), and use the Unzip file entry with a regex to unzip all the archives into a temp directory. On the Advanced tab, check Add extracted file to result so that the list of extracted files is kept in the job result.

Then let the job execute a transformation whose first step is Get files from result, which puts the list of file names in a field, and pass the flow to Get data from XML. On that step, specify that the file name comes from a field of the previous step.

Then go back to the Job and add a Delete file name from result entry after the transformation. Make sure the arrow is green (follow on success), so the extracted XML is deleted only if it was read successfully.

That way, your tmp directory will contain only the files that were not read, which is pretty easy to maintain in the long run, especially if, on the Unzip file entry, you check the option that automatically appends the date to the name of each extracted XML file.
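If it helps to see the same idea outside PDI, here is a minimal Python sketch of the extract-everything-to-a-temp-directory step. It is not a PDI feature: it assumes Python is available on the machine, and the function name extract_xml_from_tgz and the output directory are hypothetical names chosen for illustration. It expands a wildcard over the archives, reads each one as a gzip-compressed tar, and copies only the .xml members out, prefixing each with the archive name so files from different archives do not overwrite each other.

```python
import glob
import os
import tarfile


def extract_xml_from_tgz(pattern, out_dir):
    """Extract every .xml member of each archive matching `pattern` into `out_dir`.

    Hypothetical helper for illustration; returns the list of extracted paths.
    """
    os.makedirs(out_dir, exist_ok=True)
    extracted = []
    for archive in sorted(glob.glob(pattern)):
        # Archive name without extension, used as a prefix for the output files.
        prefix = os.path.splitext(os.path.basename(archive))[0]
        with tarfile.open(archive, "r:gz") as tar:
            for member in tar.getmembers():
                # Skip directories, links, and anything that is not an XML file.
                if not member.isfile() or not member.name.lower().endswith(".xml"):
                    continue
                target = os.path.join(
                    out_dir, prefix + "_" + os.path.basename(member.name)
                )
                # Copy the member's bytes out instead of trusting its stored path.
                with tar.extractfile(member) as source, open(target, "wb") as dest:
                    dest.write(source.read())
                extracted.append(target)
    return extracted
```

Calling extract_xml_from_tgz(r"C:\tmp\file_*.tgz", r"C:\tmp\extracted") would fill the (assumed) C:\tmp\extracted folder, which Get data from XML could then read with a plain folder plus the .*\.xml regex.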

AlainD


Thanks, but I get this error: Unable to get VFS File object for filename 'zip:file:///C:\tmp\file_01.tgz' : Could not open Zip file "C:\tmp\file_01.tgz". Can't PDI unzip a tgz file? – faradole Apr 24 '18 at 13:26
Yes, it can, on Linux at least. As you are on Windows, you can install 7-zip [https://www.7-zip.org/]. Just install it and try your ETL as is. – AlainD Apr 25 '18 at 17:02
I'm on Windows, but even with 7-zip I can't open the tgz file: "Unsupported command: C:\tmp\..." – faradole Apr 26 '18 at 08:26
