I have a use case where I need to sort a huge CSV file, say 10 million records, and write the result to another file. Does Hazelcast Jet / Hazelcast provide an external sort capability that works when RAM is limited?
You could do it using custom processors, but that's an advanced usage. By using the built-in features it's not possible as of today (Jet 4.2). – Oliv Aug 05 '20 at 16:14
2 Answers
I'm currently working on introducing this feature as part of my GSoC project for Hazelcast Jet.
The sorting is built on the RocksDB state backend feature I developed earlier, so you can sort datasets larger than memory.
It's currently intended for batch use cases. To use it in a pipeline, you call BatchStage.sort(keyFn), where keyFn extracts the key to sort on.
You can see the code in this PR.

Mohamed Mandouh
10 million records is nothing. I doubt Hazelcast is what you really need here. Use the sort command that comes with Unix:
sort --field-separator=',' --key=2 source.csv > target.csv
You can wrap this command with Java code like this:
Process sortProcess = Runtime.getRuntime().exec(cmd);
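One caveat when wrapping the command: Runtime.exec does not run a shell, so a `>` redirection inside the command string is passed to sort as an ordinary argument rather than redirecting output. A minimal sketch of one way around that, assuming a Unix-like system with sort on the PATH, is to let ProcessBuilder write the output file. The class and method names here (UnixSortWrapper, sortCommand, sortCsv) are made up for illustration:

```java
import java.io.IOException;
import java.nio.file.Path;
import java.util.List;

public class UnixSortWrapper {

    /** Builds the argument list equivalent to: sort --field-separator=',' --key=N source */
    public static List<String> sortCommand(Path source, int keyColumn) {
        // Note: --key=N sorts on field N through the end of the line;
        // "--key=N,N" would restrict the key to field N only.
        return List.of("sort", "--field-separator=,", "--key=" + keyColumn, source.toString());
    }

    /** Runs sort and writes its output to target; returns the exit code of sort. */
    public static int sortCsv(Path source, Path target, int keyColumn)
            throws IOException, InterruptedException {
        Process sortProcess = new ProcessBuilder(sortCommand(source, keyColumn))
                .redirectOutput(target.toFile())                 // replaces the shell's "> target.csv"
                .redirectError(ProcessBuilder.Redirect.INHERIT)  // show sort's error messages
                .start();
        return sortProcess.waitFor();
    }
}
```

Calling UnixSortWrapper.sortCsv(Path.of("source.csv"), Path.of("target.csv"), 2) should then behave like the command line above; a non-zero return value means sort reported an error.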
If you insist on using Hazelcast, you need to keep your memory footprint low. Keep only the columns you are sorting on in parsed form, and store everything else as a byte array.
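As a rough illustration of that idea in plain Java (no Hazelcast API involved), each record can carry only its sort key as a parsed value and the whole line as raw bytes. The names CompactRows, Row, toRow and sortLines are hypothetical, the split is naive (it assumes no quoted fields containing commas), and how much memory this saves depends on the JVM and the data:

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

public class CompactRows {

    /** One CSV record: the sort key as a String, the full line as raw bytes. */
    public static final class Row {
        final String key;
        final byte[] line;

        Row(String key, byte[] line) {
            this.key = key;
            this.line = line;
        }
    }

    /** Extracts the key column and keeps the rest of the record unparsed. */
    public static Row toRow(String csvLine, int keyIndex) {
        // Naive split: assumes fields never contain quoted commas.
        String key = csvLine.split(",", -1)[keyIndex];
        return new Row(key, csvLine.getBytes(StandardCharsets.UTF_8));
    }

    /** Sorts lines by the given zero-based column, comparing keys as strings. */
    public static List<String> sortLines(List<String> lines, int keyIndex) {
        List<Row> rows = new ArrayList<>(lines.size());
        for (String line : lines) {
            rows.add(toRow(line, keyIndex));
        }
        rows.sort(Comparator.comparing((Row r) -> r.key));

        List<String> sorted = new ArrayList<>(rows.size());
        for (Row r : rows) {
            sorted.add(new String(r.line, StandardCharsets.UTF_8));
        }
        return sorted;
    }
}
```

This still holds every record in memory, so it only reduces the footprint; it is not an external sort.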

Alexander Petrov