6

What's a good algorithm for sorting text files that are larger than available memory (many 10s of gigabytes) and contain variable-length records? All the algorithms I've seen assume 1) data fits in memory, or 2) records are fixed-length. But imagine a big CSV file that I wanted to sort by the "BirthDate" field (the 4th field):

Id,UserId,Name,BirthDate
1,psmith,"Peter Smith","1984/01/01"
2,dmehta,"Divya Mehta","1985/11/23"
3,scohen,"Saul Cohen","1984/08/19"
...
99999999,swright,"Shaun Wright","1986/04/12"
100000000,amarkov,"Anya Markov","1984/10/31"

I know that:

  1. This would run on one machine (not distributed).
  2. The machine that I'd be running this on would have several processors.
  3. The files I'd be sorting could be larger than the physical memory of the machine.
  4. A file contains variable-length lines. Each line would consist of a fixed number of columns (delimiter-separated values). A file would be sorted by a specific field (ie. the 4th field in the file).
  5. An ideal solution would probably be "use this existing sort utility", but I'm looking for the best algorithm.
  6. I don't expect a fully-coded, working answer; something more along the lines of "check this out, here's kind of how it works, or here's why it works well for this problem." I just don't know where to look...
  7. This isn't homework!

Thanks! ♥

Sophie
  • 63
  • 3

4 Answers4

3

This class of algorithms is called external sorting. I would start by checking out the Wikipedia entry. It contains some discussion and pointers.

NPE
  • 486,780
  • 108
  • 951
  • 1,012
1

Suggest the following resources:

Merge Sort: http://en.wikipedia.org/wiki/Merge_sort

Seminumerical Algorithms, vol 2 of The Art of Computer Programming: Knuth: Addison Wesley:ISBN 0-201-03822-6(v.2)

Chris Walton
  • 2,513
  • 3
  • 25
  • 39
0

No need to sort. Read the file ALL.CSV and append each read line to a file per day, like 19841231.CSV. For each existing day with data, in numerical order, read that CSV file and append those lines to a new file. Optimizations are possible by, for example, processing the original file more than once or by recording days actually occuring in the file ALL.CSV.

So a line containing "1985/02/28" should be added to the file 19850228.CSV. The file 19850228.CSV should be appended to NEW.CSV after the file 19850227.CSV was appended to NEW.CSV. The numerical order avoids the use of all sort algorithms, albeit it could torture the file system.

In reality the file ALL.CSV could be split in a file per, for example, year. 1984.CSV, 1985.CSV, and so on.

0

A standard merge sort approach will work. The common schema is

  1. Split the file into N parts of roughly equal size
  2. Sort each part (in memory if it's small enough, otherwise recursively apply the same algorithm)
  3. Merge the sorted parts
Maxim Razin
  • 9,114
  • 7
  • 34
  • 33