I want to sort on the order of four million long longs in C. Normally I would just malloc() a buffer to use as an array and call qsort() but four million * 8 bytes is one huge chunk of contiguous memory.

What's the easiest way to do this? I rate ease over pure speed for this. I'd prefer not to use any libraries and the result will need to run on a modest netbook under both Windows and Linux.

hippietrail
  • Where are the values coming from, and going to? Do you have them all in memory to begin with? – Graham Borland Apr 07 '11 at 21:47
  • Where are they currently stored, on disk? I assume you are not running a 64bit system? – Yann Ramin Apr 07 '11 at 21:48
  • 4 million times 8 is ~32 megabytes. It doesn't need to be contiguous either -- you just need contiguous address space for the mapped addresses of a lot of 4K blocks. IOW, malloc/qsort should be fine. – Jerry Coffin Apr 07 '11 at 21:48
  • 32MB? That's not a huge chunk. That's a tiny chunk. – Vilx- Apr 07 '11 at 21:49
  • 4 million * 8 bytes = 32MB. This is not too much for `malloc()`. – pajton Apr 07 '11 at 21:49
  • @Yann Ramin. They are stored on disk in a raw file. I wouldn't mind a disk-based qsort but that seems harder to implement than disk-based bsearch which I have done before. – hippietrail Apr 07 '11 at 21:52
  • @Jerry Coffin: I thought the C `qsort()` function only worked with a contiguous array. – hippietrail Apr 07 '11 at 21:53
  • @hippietrail: The array will appear contiguous to your code, but that's only an illusion created by the memory management hardware. In reality, it's allocated as smaller (4k, or on some hardware, 8K) blocks. Bottom line: unless you're on a system with *really* constrained memory, it won't be a problem. – Jerry Coffin Apr 07 '11 at 21:58
  • "640K ought to be enough for anybody." – pmg Apr 07 '11 at 21:59
  • I once sorted 4 *billion* long longs. Now *that* took alternative mechanisms. But in the end I still used `qsort()` on batches of 19 million entries at a time... – Ben Jackson Apr 07 '11 at 22:03

3 Answers

Just allocate a buffer and call qsort. 32MB isn't so very big these days even on a modest netbook.
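For what it's worth, a minimal sketch of that approach in C (the file name `values.bin` and the element count are placeholders, assuming the values are stored on disk as raw long longs, as described in the comments):

```c
#include <stdio.h>
#include <stdlib.h>

/* Comparator for long long values; (a > b) - (a < b) avoids the
   overflow that a plain subtraction could cause. */
static int cmp_ll(const void *pa, const void *pb)
{
    long long a = *(const long long *)pa;
    long long b = *(const long long *)pb;
    return (a > b) - (a < b);
}

int main(void)
{
    size_t n = 4000000;                        /* roughly the size in question */
    long long *buf = malloc(n * sizeof *buf);  /* ~32 MB */
    if (!buf) { perror("malloc"); return 1; }

    FILE *f = fopen("values.bin", "rb");       /* hypothetical raw input file */
    if (!f) { perror("fopen"); free(buf); return 1; }
    n = fread(buf, sizeof *buf, n, f);         /* number actually read */
    fclose(f);

    qsort(buf, n, sizeof *buf, cmp_ll);

    /* ... write buf back out, or use it in place ... */
    free(buf);
    return 0;
}
```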

If you really must split it up: sort smaller chunks, write them to files, and merge them (a merge takes a single linear pass over each of the things being merged). But, really, don't. Just sort it.
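If you do end up splitting, the merge itself is only a few lines; here is a hedged two-way sketch (the chunk file names are hypothetical), assuming each chunk was already sorted and written out as raw long longs:

```c
#include <stdio.h>

/* Merge two sorted binary files of long longs into one sorted output.
   For more than two chunks, either merge pairwise or keep one "current"
   value per input and always emit the smallest. */
static void merge2(FILE *a, FILE *b, FILE *out)
{
    long long va, vb;
    int ha = fread(&va, sizeof va, 1, a) == 1;
    int hb = fread(&vb, sizeof vb, 1, b) == 1;

    while (ha && hb) {
        if (va <= vb) {
            fwrite(&va, sizeof va, 1, out);
            ha = fread(&va, sizeof va, 1, a) == 1;
        } else {
            fwrite(&vb, sizeof vb, 1, out);
            hb = fread(&vb, sizeof vb, 1, b) == 1;
        }
    }
    while (ha) { fwrite(&va, sizeof va, 1, out); ha = fread(&va, sizeof va, 1, a) == 1; }
    while (hb) { fwrite(&vb, sizeof vb, 1, out); hb = fread(&vb, sizeof vb, 1, b) == 1; }
}

int main(void)
{
    FILE *a = fopen("chunk0.bin", "rb");    /* hypothetical sorted chunks */
    FILE *b = fopen("chunk1.bin", "rb");
    FILE *out = fopen("merged.bin", "wb");
    if (!a || !b || !out) { perror("fopen"); return 1; }
    merge2(a, b, out);
    fclose(a); fclose(b); fclose(out);
    return 0;
}
```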

(There's a good discussion of the sort-and-merge approach in volume 3 of Knuth, where it's called "external sorting". When Knuth was writing that, the external data would have been on magnetic tape, but the principles aren't very different with discs: you still want your I/O to be as sequential as possible. The tradeoffs are a bit different with SSDs.)

Gareth McCaughan
  • The only thing I'd add is that if the data is already raw binary on disk, you could `mmap` (or equivalent) instead of loading and writing it back. But if you care about the safety of your data in the event of a system failure, this is probably a bad idea (a sketch follows after these comments). – R.. GitHub STOP HELPING ICE Apr 07 '11 at 22:57
  • `qsort()` did work fine of course - I don't know what I was worried about. I probably wasn't aware of just how much the memory management hardware provides, since I moved from C to scripting languages back when a couple of megabytes was a lot of RAM. – hippietrail Apr 08 '11 at 08:15
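A POSIX-only sketch of the mmap-and-sort-in-place idea from the comment above (Windows would need its own file-mapping API instead, and, as noted, a crash mid-sort can leave the file corrupted):

```c
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

static int cmp_ll(const void *pa, const void *pb)
{
    long long a = *(const long long *)pa;
    long long b = *(const long long *)pb;
    return (a > b) - (a < b);
}

int main(void)
{
    int fd = open("values.bin", O_RDWR);   /* hypothetical raw file of long longs */
    if (fd < 0) { perror("open"); return 1; }

    struct stat st;
    if (fstat(fd, &st) < 0) { perror("fstat"); return 1; }

    /* Map the file read/write and sort it in place; the OS writes the
       changes back to the file. */
    long long *p = mmap(NULL, st.st_size, PROT_READ | PROT_WRITE,
                        MAP_SHARED, fd, 0);
    if (p == MAP_FAILED) { perror("mmap"); return 1; }

    qsort(p, st.st_size / sizeof *p, sizeof *p, cmp_ll);

    munmap(p, st.st_size);
    close(fd);
    return 0;
}
```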

32 MB? That's not too big... quicksort should do the trick.

Keith Nicholas

Your best option would be to avoid having the data unordered in the first place, if possible. As has been mentioned, you'd be better off reading the data from disk (or the network, or whatever the source is) directly into a self-organizing container (a tree; perhaps std::set will do).

That way, you'll never have to sort through the lot, or have to worry about memory management. If you know the required capacity of the container, you might squeeze out additional performance by using std::vector(initialcapacity) or calling vector::reserve up front.

You'd then be best advised to use std::make_heap to heapify any existing elements, and then add element by element using push_heap (see also pop_heap). This is essentially the same paradigm as the self-ordering set, but

  • duplicates are ok
  • the storage is 'optimized' as a flat array (which is perfect for e.g. shared memory maps or memory mapped files)

(Oh, minor detail, note that sort_heap on the heap takes at most N log N comparisons, where N is the number of elements)

Let me know if you think this is an interesting approach. I'd really need a bit more info on the use case.

sehe
  • Bloody ... :) I am blind today. Okay, there must be equivalent approaches in C; I hope this still has some value. – sehe Apr 07 '11 at 22:00
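For reference, a hedged C adaptation of the flat-array heap idea above (not code from the answer itself): maintain a binary max-heap as values arrive, then unwind it into ascending order, which is essentially heapsort.

```c
#include <stddef.h>

/* Restore the max-heap property below index i in a heap of n long longs. */
static void sift_down(long long *a, size_t n, size_t i)
{
    for (;;) {
        size_t l = 2 * i + 1, r = l + 1, big = i;
        if (l < n && a[l] > a[big]) big = l;
        if (r < n && a[r] > a[big]) big = r;
        if (big == i) break;
        long long t = a[i]; a[i] = a[big]; a[big] = t;
        i = big;
    }
}

/* Rough equivalent of push_heap: a[n-1] was just appended; sift it up. */
static void heap_push(long long *a, size_t n)
{
    size_t i = n - 1;
    while (i > 0) {
        size_t parent = (i - 1) / 2;
        if (a[parent] >= a[i]) break;
        long long t = a[i]; a[i] = a[parent]; a[parent] = t;
        i = parent;
    }
}

/* Rough equivalent of sort_heap: repeatedly move the max to the end. */
static void heap_sort(long long *a, size_t n)
{
    while (n > 1) {
        long long t = a[0]; a[0] = a[n - 1]; a[n - 1] = t;
        sift_down(a, --n, 0);
    }
}
```

Usage: store each incoming value at a[count++] and call heap_push(a, count); once everything has arrived, heap_sort(a, count) leaves the array in ascending order.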