I am working on a new task where my input CSV file has about 200 to 300 million records. My requirement is to sort the incoming data, perform lookups to get the key values, and insert the rows into a target table. One suggestion was to write a Java plugin that sorts the data into multiple temp files (say a million records each) and retrieves it from there. I was also thinking of using the Sort step in Pentaho and setting the number of copies to start. But I am not sure what the best approach is. Can anyone suggest how to go about this? Thanks.
1 Answer
I have used PDI to sort this many rows. The Sort step works fine, though it can be finicky. I set my "Free memory threshold (in %)" to ~50. The step will generate gobs of temp files in your "Sort-directory"; if the job crashes (usually by running out of memory) you will have to remove the temp files manually.
If I had to do it again I'd probably set the "Compress TMP Files?" option, since multiple failures ran me out of disk space. Good luck!
A custom sort in Java may give you better performance, but development time will be significant. If you're going to sort this many rows daily or weekly, it's probably worth it. If not, just stick with PDI's Sort step.
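For reference, a custom Java sort for a file this size would normally be an external merge sort: sort a bounded number of lines in memory, spill each sorted chunk to a temp file, then k-way merge the chunks with a priority queue. Below is a minimal sketch of that approach. The class name, the 1,000,000-line chunk size, the file paths in `main`, and the whole-line lexicographic comparison are illustrative assumptions, not anything provided by PDI; a real version would parse the CSV, compare on your key column(s), and handle the header row.

```java
import java.io.*;
import java.nio.charset.StandardCharsets;
import java.nio.file.*;
import java.util.*;

// External merge sort for a large CSV: sort fixed-size chunks in memory, spill each
// chunk to a temp file, then k-way merge the temp files with a priority queue.
// Sorts whole lines lexicographically; adapt the comparator to sort on your key column.
public class ExternalCsvSort {

    private static final int CHUNK_SIZE = 1_000_000; // lines held in memory per chunk

    public static void sort(Path input, Path output) throws IOException {
        mergeChunks(splitAndSortChunks(input), output);
    }

    // Phase 1: read CHUNK_SIZE lines at a time, sort in memory, spill each chunk to a temp file.
    private static List<Path> splitAndSortChunks(Path input) throws IOException {
        List<Path> chunkFiles = new ArrayList<>();
        try (BufferedReader reader = Files.newBufferedReader(input, StandardCharsets.UTF_8)) {
            List<String> buffer = new ArrayList<>(CHUNK_SIZE);
            String line;
            while ((line = reader.readLine()) != null) {
                buffer.add(line);
                if (buffer.size() == CHUNK_SIZE) {
                    chunkFiles.add(writeSortedChunk(buffer));
                    buffer.clear();
                }
            }
            if (!buffer.isEmpty()) {
                chunkFiles.add(writeSortedChunk(buffer));
            }
        }
        return chunkFiles;
    }

    private static Path writeSortedChunk(List<String> buffer) throws IOException {
        Collections.sort(buffer); // lexicographic; swap in a key-based Comparator as needed
        Path chunk = Files.createTempFile("sort-chunk-", ".tmp");
        try (BufferedWriter writer = Files.newBufferedWriter(chunk, StandardCharsets.UTF_8)) {
            for (String s : buffer) {
                writer.write(s);
                writer.newLine();
            }
        }
        return chunk;
    }

    // Phase 2: k-way merge of the sorted chunk files using a priority queue of cursors.
    private static void mergeChunks(List<Path> chunks, Path output) throws IOException {
        PriorityQueue<ChunkCursor> heap =
                new PriorityQueue<>(Comparator.comparing((ChunkCursor c) -> c.current));
        List<BufferedReader> readers = new ArrayList<>();
        try (BufferedWriter writer = Files.newBufferedWriter(output, StandardCharsets.UTF_8)) {
            for (Path chunk : chunks) {
                BufferedReader r = Files.newBufferedReader(chunk, StandardCharsets.UTF_8);
                readers.add(r);
                String first = r.readLine();
                if (first != null) {
                    heap.add(new ChunkCursor(first, r));
                }
            }
            while (!heap.isEmpty()) {
                ChunkCursor smallest = heap.poll();
                writer.write(smallest.current);
                writer.newLine();
                String next = smallest.reader.readLine();
                if (next != null) {
                    heap.add(new ChunkCursor(next, smallest.reader));
                }
            }
        } finally {
            for (BufferedReader r : readers) { r.close(); }
            for (Path chunk : chunks) { Files.deleteIfExists(chunk); }
        }
    }

    private static final class ChunkCursor {
        final String current;
        final BufferedReader reader;
        ChunkCursor(String current, BufferedReader reader) {
            this.current = current;
            this.reader = reader;
        }
    }

    public static void main(String[] args) throws IOException {
        // Hypothetical file names; point these at your real input and staging files.
        sort(Paths.get("input.csv"), Paths.get("sorted.csv"));
    }
}
```

This is essentially the same chunk-and-spill strategy the Sort step uses internally (hence the temp files it generates), so the main gain from rolling your own is control over key parsing and I/O, not a different algorithm.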

Brian.D.Myers
- I tried a few tests with the Pentaho Sort step; it works fine when I use a Row Generator step but freezes or crashes when I use a CSV Input. Not sure why. – DUnkn0wn1 Jan 26 '17 at 17:15
- Probably because of the size of your data set. Be sure "Lazy conversion" is turned off. Try tweaking the Free memory threshold setting. – Brian.D.Myers Jan 26 '17 at 19:35
- Thanks, that helped. Reducing the buffer size also helps. – DUnkn0wn1 Jan 30 '17 at 16:03