
Say there is an algorithm X that requires two steps to produce its final output in a file.

  1. collect data
  2. sort data

Let us also say that the collected data is too large to be held in RAM and is written to a file before step 2 begins.

As an example, take a 500 GB file of numbers, as output by step 1, with one number on each line. Step 2 must sort the lines in ascending order.

How would step 2 go about efficiently sorting the numbers without reading the input file as a whole?

Niklas R
  • http://stackoverflow.com/questions/4358087/sort-with-the-limited-memory – DrV Jun 19 '14 at 22:32
  • possible duplicate of [file based merge sort on large datasets in Java](http://stackoverflow.com/questions/6314598/file-based-merge-sort-on-large-datasets-in-java) – Jim Mischel Jun 20 '14 at 02:42
  • This can be useful as well: http://stackoverflow.com/q/22807456/660408 – gkiko Jun 20 '14 at 11:22
  • If you are implementing both the collector and sorter at the same time, why not sort the data as it arrives piece by piece (insertion sort etc) so that the resulting file contains a sorted list? http://softwareengineering.stackexchange.com/questions/302541/sort-a-list-while-putting-together-or-after – Ryan Griggs Dec 27 '16 at 00:57

2 Answers


Most efficient is to increase your swap space by 500 GB and do a single sort, letting the OS memory manager handle the cache.
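For illustration, a minimal sketch of the swap route on Linux (the /swapfile path and the 500G size are assumptions; the comments below explain why this tends to thrash):

```sh
# Assumed path and size, purely illustrative.
sudo fallocate -l 500G /swapfile
sudo chmod 600 /swapfile
sudo mkswap /swapfile
sudo swapon /swapfile
# A program that loads every number into memory and sorts can now lean on
# the OS paging that memory out to disk. (GNU sort would not benefit much:
# it already spills to its own temp files instead of relying on swap.)
```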

An alternative is to divide the data into pieces that do fit in RAM, say 250 files of 2 GB each. Sort each piece in memory, then merge the sorted pieces back into one file.
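A minimal sketch of that chunk-and-merge approach in Python (the chunk size, file handling, and the use of heapq.merge for the k-way merge are my assumptions for illustration, not part of the answer):

```python
import heapq
import os
import tempfile

CHUNK_LINES = 50_000_000  # assumed: how many lines fit comfortably in RAM

def external_sort(in_path, out_path):
    # Pass 1: read fixed-size chunks, sort each in memory, spill to temp files.
    run_paths = []
    with open(in_path) as f:
        while True:
            chunk = [int(line) for _, line in zip(range(CHUNK_LINES), f)]
            if not chunk:
                break
            chunk.sort()
            fd, path = tempfile.mkstemp(text=True)
            with os.fdopen(fd, "w") as run:
                run.writelines(f"{n}\n" for n in chunk)
            run_paths.append(path)

    # Pass 2: stream a k-way merge of the sorted runs into the output file.
    runs = [open(p) for p in run_paths]
    try:
        streams = [(int(line) for line in r) for r in runs]
        with open(out_path, "w") as out:
            out.writelines(f"{n}\n" for n in heapq.merge(*streams))
    finally:
        for r in runs:
            r.close()
        for p in run_paths:
            os.remove(p)

# external_sort("numbers.txt", "sorted.txt")
```

With 250 runs, heapq.merge holds only one pending line per run in memory, so the merge pass is purely IO-bound, as the comments below note.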

stark
  • The first approach is liable to lead to horrendous thrashing, I would imagine. – Oliver Charlesworth Jun 19 '14 at 21:37
  • The second approach is the correct one. Sort as much as you can in RAM with an efficient algorithm, then merge the chunks. The IO will be the bottleneck, so optimising the merge reads and writes may be a good idea. In practice, using non-HDD storage (SSD, cloud) may help. – DrV Jun 19 '14 at 22:35
  • Going to swap has horrible performance problems. Do not take the first approach. – btilly Sep 23 '20 at 16:46

"GNU CoreUtils" available as source, or built-in to Linux distributions are available pre-compiled for Windows, contains the Gnu split and Gnu sort routines.

If your data can be arranged so that each record to sort is on a separate line, then split will divide one large file into multiple smaller files. Each smaller file can be sorted individually in memory with GNU sort, and finally all the sorted smaller files can be merged back into one large file using GNU sort's merge option.
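A concrete sketch of that pipeline (the file names and the 10-million-line chunk size are assumptions for illustration):

```sh
split -l 10000000 input.txt chunk_        # 1. split the big file into chunks
for f in chunk_??; do                     # 2. sort each chunk individually
    sort -n "$f" -o "$f.sorted"
done
sort -n -m chunk_*.sorted -o output.txt   # 3. merge the pre-sorted chunks
```

The -m flag tells GNU sort that its inputs are already sorted, so the final step is a cheap streaming merge rather than a full re-sort.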

See:

Paddy3118