
I have an ETL requirement like this:

I need to fetch around 20000 records from a table and process each record separately. (The processing of each record involves a couple of steps, like creating a table for that record and inserting some data into it.) For a prototype I implemented it with two jobs (with corresponding transformations). Rather than a table, I just created a simple empty file per record. But even this simple case doesn't work smoothly. (When I actually do create a table for each record, Kettle exits after 5000 records.)

Flow

When I run this, Kettle slows down and seems to hang after 2000-3000 files, although the processing does eventually complete after a very long time. Is my design approach right? When I replace the write-to-file with the actual requirement (creating a new table for each id through an SQL script step and inserting data into it), Kettle exits after 5000 records. What do I need to do so that the flow works? Increase the Java memory (Xmx is already at 2 GB)? Is there any other configuration I can change, or is there another way to do this? Extra run time is not a constraint, but the flow has to work.

My initial guess was that, since we are not storing any data, the prototype at least should run smoothly. I am using Kettle 3.2.
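To make the per-record step concrete, here is roughly what each iteration of the loop has to do, sketched in plain JDBC outside Kettle. The connection string, table and column names (source_table, rec_<id>, payload) are made up for illustration only and are not from my actual job:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.Statement;

public class PerRecordSketch {
    public static void main(String[] args) throws Exception {
        // Hypothetical connection details; the real job takes them from the Kettle DB connection.
        try (Connection con = DriverManager.getConnection(
                "jdbc:mysql://localhost/etl_demo", "user", "password")) {

            // Fetch the ids to process (source_table/id are placeholder names).
            try (Statement st = con.createStatement();
                 ResultSet rs = st.executeQuery("SELECT id FROM source_table")) {

                while (rs.next()) {
                    long id = rs.getLong("id");

                    // One table per record, like the SQL script step in the real flow.
                    String table = "rec_" + id;
                    try (Statement ddl = con.createStatement()) {
                        ddl.execute("CREATE TABLE " + table + " (payload VARCHAR(255))");
                    }

                    // Insert some data for that record.
                    try (PreparedStatement ins = con.prepareStatement(
                            "INSERT INTO " + table + " (payload) VALUES (?)")) {
                        ins.setString(1, "data for id " + id);
                        ins.executeUpdate();
                    }
                }
            }
        }
    }
}
```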

  • It is hard to guess how your flow is implemented; we only have bits and pieces instead of the complete picture. What kind of error do you get, an OutOfMemory exception? – simar Sep 05 '16 at 08:46
  • If you have complex data processing, including file-system calls for each row, you should at least limit the number of rows in one set sent to the transformation. Configure this in the transformation properties, "Miscellaneous" tab, number of rows in rowset. – simar Sep 05 '16 at 08:49
  • When you use 'Copy rows to result', all records are kept in memory, so if the data is big you will most probably get an OutOfMemory error. Redesign and get rid of copying rows between jobs; only copy rows if the amount is small, and otherwise keep the processing inside transformations. – Dzmitry Prakapenka Sep 05 '16 at 10:35
  • Thanks guys! There is not much in the flow: simply fetching ids (around 10000, which are copied to result rows) from the DB and looping through them with "execute for each input row". In the processing I am just creating a file. Kettle doesn't seem to release the memory; when I watch it with a tool like VisualVM/JConsole there is a linear rise in memory that is released only when the job with the loop ends. –  Sep 06 '16 at 10:36
  • This link helps a bit: http://forums.pentaho.com/showthread.php?187491-PDI-memory-leak-when-job-workflow-is-quot-too-long-quot&p=408653&posted=1#post408653 but adding a dummy transformation with "clear the list of result rows/files" seems to slow things down. –  Sep 06 '16 at 10:36
  • In fact, the memory doesn't get released even after the job ends. –  Sep 06 '16 at 12:39

1 Answer


I seem to remember this is a known issue/restriction, which is why job looping is deprecated these days.

Are you able to rebuild the job using the transformation executor and/or job executor steps? You can execute any number of rows via those steps.

These steps have their own issues, namely that you have to handle errors explicitly, but it's worth a try just to see if you can achieve what you want. It's a slightly different mindset, but a nicer way to build loops than the job approach.
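The executor steps themselves are configured in Spoon, so there is nothing to paste here, but the same "run one small transformation per input row" pattern can also be sketched with the PDI Java API if you prefer to drive the loop from code. Treat this as a rough sketch only: the class names are from the PDI 4.x+ API (your 3.2 install uses a slightly different initialisation call), and process_one_id.ktr plus the ID variable are placeholders.

```java
import org.pentaho.di.core.KettleEnvironment;
import org.pentaho.di.trans.Trans;
import org.pentaho.di.trans.TransMeta;

public class RunTransPerId {
    public static void main(String[] args) throws Exception {
        // One-time initialisation of the Kettle environment (PDI 4.x+ API).
        KettleEnvironment.init();

        // Parse the small per-record transformation once and reuse its metadata.
        TransMeta transMeta = new TransMeta("process_one_id.ktr");

        for (long id : new long[] {1, 2, 3}) {   // in reality: the ids from the source table
            Trans trans = new Trans(transMeta);

            // Pass the current id to the transformation as a variable (${ID} inside the .ktr).
            trans.setVariable("ID", String.valueOf(id));

            trans.execute(null);        // start the transformation
            trans.waitUntilFinished();  // block until this id is done

            if (trans.getErrors() > 0) {
                throw new RuntimeException("Processing failed for id " + id);
            }
            // No result rows are accumulated here, so memory stays flat across iterations.
        }
    }
}
```

Whether you loop from Java like this or use the transformation executor step, the key point is the same: each row runs in its own small transformation and nothing piles up in 'copy rows to result'.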

Codek