What's a good algorithm for sorting text files that are larger than available memory (many tens of gigabytes) and contain variable-length records? All the algorithms I've seen assume either that the data fits in memory or that records are fixed-length. But imagine a big CSV file that I want to sort by the "BirthDate" field (the 4th field):
```
Id,UserId,Name,BirthDate
1,psmith,"Peter Smith","1984/01/01"
2,dmehta,"Divya Mehta","1985/11/23"
3,scohen,"Saul Cohen","1984/08/19"
...
99999999,swright,"Shaun Wright","1986/04/12"
100000000,amarkov,"Anya Markov","1984/10/31"
```
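For what it's worth, GNU `sort` already does a multi-threaded external merge sort that spills sorted runs to temporary files, so something like `tail -n +2 big.csv | sort -t, -k4,4 --parallel=8 -S 8G -T /bigdisk/tmp` seems worth benchmarking first (GNU coreutils; the file name, thread count, buffer size, and temp directory are placeholders, and splitting on `-t,` assumes no commas inside the quoted fields). Still, I'd like to understand the algorithm underneath.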
I know that:
- This would run on one machine (not distributed).
- The machine I'd be running this on has several processors.
- The files I'd be sorting could be larger than the physical memory of the machine.
- Files contain variable-length lines. Each line consists of a fixed number of columns (delimiter-separated values), and the file would be sorted by one specific field (e.g. the 4th field, as above).
- An ideal solution would probably be "use this existing sort utility", but what I'm really after is the best algorithm (my rough understanding so far is sketched after this list).
- I don't expect a fully-coded, working answer; I'm after something more along the lines of "check this out; here's roughly how it works and why it fits this problem." I just don't know where to look...
- This isn't homework!
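To help frame answers: the closest fit I've found while reading around is the classic two-phase external merge sort (sort memory-sized runs, spill them to disk, then k-way merge). Here's a minimal Python sketch of my understanding; `big.csv`, `lines_per_run`, and the plain `split(",")` key extraction are all placeholder assumptions (quoted fields with embedded commas would need the `csv` module):

```python
import heapq
import itertools
import tempfile

def sort_key(line):
    # The 4th comma-separated field (BirthDate). A plain split breaks on
    # quoted fields containing commas; the csv module would be the robust
    # choice, but this keeps the sketch short.
    return line.split(",")[3]

def external_sort(in_path, out_path, lines_per_run=1_000_000):
    runs = []
    with open(in_path) as f:
        header = next(f)  # keep the header line out of the sort
        while True:
            # Phase 1: read a memory-sized chunk, sort it, and spill it
            # to disk as a sorted "run".
            chunk = list(itertools.islice(f, lines_per_run))
            if not chunk:
                break
            if not chunk[-1].endswith("\n"):
                chunk[-1] += "\n"  # guard the file's final line
            chunk.sort(key=sort_key)
            run = tempfile.TemporaryFile(mode="w+")
            run.writelines(chunk)
            run.seek(0)
            runs.append(run)
    with open(out_path, "w") as out:
        # Phase 2: k-way merge of the runs. heapq.merge is lazy, so only
        # one line per run is held in memory at any moment.
        out.write(header)
        out.writelines(heapq.merge(*runs, key=sort_key))
    for run in runs:
        run.close()  # TemporaryFile is deleted on close

external_sort("big.csv", "big_sorted.csv")
```

If that's the right family of algorithm, the run-sorting phase looks like the part that would parallelize across the cores, since the final merge is presumably I/O-bound anyway.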
Thanks! ♥