
Suppose you are given:

  • a certain amount of data
  • memory whose size is half the size of the data
  • part of the data is already sorted
  • you do not know the size of the sorted part.

Which sorting algorithm would you choose? I am debating between insertion sort and quicksort. I know that the best case for insertion sort is O(n), but the worst case is O(n²). Also, since the memory is limited, I would divide the data into two parts, run quicksort on each of them, and then merge everything together. It would take O(n) time to split the data, O(n) to merge the data, and O(n log n) to sort the data using quicksort, for a net runtime of O(n log n).
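
To make this concrete, here is a rough sketch in Python of what I have in mind (the chunk size, the temporary files, and the assumption that every record is one newline-terminated line are just placeholders for illustration):

    import heapq
    import os
    import tempfile

    def external_sort(input_path, output_path, max_lines_in_memory):
        """Sort a file of lines while holding only one chunk in memory at a time."""
        run_paths = []
        with open(input_path) as f:
            while True:
                # Load one memory-sized chunk (two chunks total in my case).
                chunk = [line for _, line in zip(range(max_lines_in_memory), f)]
                if not chunk:
                    break
                chunk.sort()  # in-memory sort; quicksort would also work here
                fd, path = tempfile.mkstemp(text=True)
                with os.fdopen(fd, "w") as run:
                    run.writelines(chunk)  # write the sorted run back to disk
                run_paths.append(path)

        # Merge the sorted runs in a single linear pass.
        runs = [open(p) for p in run_paths]
        with open(output_path, "w") as out:
            out.writelines(heapq.merge(*runs))
        for r in runs:
            r.close()
        for p in run_paths:
            os.remove(p)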

Does anyone have any suggestions on how to improve this?

FranXh
  • Is this homework? It has an air of homework-ness to it. – Cameron Skinner Feb 29 '12 at 03:29
  • You should consider posting this in the Programmers section. – Rudy Feb 29 '12 at 03:30
  • No, just revising data structures. I found some awesome lessons on YouTube from UC Berkeley, and I am trying to practice sorting algorithms. – FranXh Feb 29 '12 at 03:31
  • @Rudy it's just data structures – FranXh Feb 29 '12 at 03:32
  • @Mohamed But heap sort requires an array, which means that an array of all the data I have would exceed the size of my memory. Or should I still divide the data and then sort the pieces using heap sort? Wouldn't it be the same time complexity anyway? – FranXh Feb 29 '12 at 03:38
  • Given how cheap memory is these days, I would just make sure I have enough memory. You can buy a machine with 32 GB for a reasonable price and machines up to 1 TB for those who have a use for it. – Peter Lawrey Feb 29 '12 at 08:28

2 Answers


Your mergesort-like approach seems very reasonable. More generally, this type of sorting algorithm is called an external sorting algorithm. These algorithms often work as you've described - load some subset of the data into memory, sort it, then write it back out to disk. At the end, use a merging algorithm to merge everything back together. The choice of how much to load in and what sorting algorithm to use are usually the dominant concerns. I'll focus mostly on the sorting algorithm choice.
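
For the merge phase, a k-way merge that keeps just one buffered item per run stays within a small, fixed memory budget. Here's a minimal sketch (the run iterators are assumed to already yield their items in sorted order):

    import heapq

    def k_way_merge(sorted_runs):
        """Merge any number of sorted iterators, holding one item per run in memory."""
        heap = []
        for run_id, run in enumerate(map(iter, sorted_runs)):
            first = next(run, None)
            if first is not None:
                heap.append((first, run_id, run))  # run_id breaks ties, never the iterator
        heapq.heapify(heap)
        while heap:
            value, run_id, run = heapq.heappop(heap)
            yield value
            nxt = next(run, None)
            if nxt is not None:
                heapq.heappush(heap, (nxt, run_id, run))

For example, list(k_way_merge([[1, 4, 7], [2, 5], [3, 6, 8]])) yields 1 through 8 in order. (In Python, heapq.merge already does exactly this, so in practice you wouldn't roll your own.)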

Your concerns about quicksort's worst-case behavior are, generally speaking, nothing to worry about: if you choose the pivot randomly, the probability of getting a really bad runtime is low. The random-pivot strategy also works well even if the data is already sorted, since it has no fixed worst-case inputs (unless someone knows your random number generator and its seed). If you want to rule out the worst case entirely, you could also use a quicksort variant like introsort, which falls back to heapsort once the recursion gets too deep and so guarantees O(n log n).
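
Concretely, randomizing the pivot is a small change in most quicksort implementations. Here's a minimal in-place sketch (Lomuto partitioning; it does not bound the recursion depth the way introsort would):

    import random

    def quicksort(a, lo=0, hi=None):
        """In-place quicksort with a randomly chosen pivot."""
        if hi is None:
            hi = len(a) - 1
        if lo >= hi:
            return
        # Swap a randomly chosen element into the pivot position.
        r = random.randint(lo, hi)
        a[r], a[hi] = a[hi], a[r]
        pivot, i = a[hi], lo
        for j in range(lo, hi):
            if a[j] <= pivot:
                a[i], a[j] = a[j], a[i]
                i += 1
        a[i], a[hi] = a[hi], a[i]
        quicksort(a, lo, i - 1)
        quicksort(a, i + 1, hi)

With the random pivot, an already-sorted input is no worse in expectation than any other input.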

That said, since you know that the data is already partially sorted, you may want to look into an adaptive sorting algorithm for your sorting step. You've mentioned insertion sort for this, but there are much better adaptive algorithms out there. If memory is scarce (as you've described), you might want to try looking into the smoothsort algorithm, which has best-case runtime O(n), worst-case runtime O(n log n), and uses only O(1) memory. It's not as adaptive as some other algorithms (like Python's timsort, natural mergesort, or Cartesian tree sort), but it has lower memory usage. It's also not as fast as a good quicksort, but if the data really is mostly sorted it can do pretty well.

Hope this helps!

templatetypedef

On the face of it, I would divide & conquer with quicksort and call it a day. Many algorithm problems are over-thought.

Now, if you do have test data to work with and really want to get a grip on this, stick an abstract class in the middle and benchmark. We can hem and haw over things all day, but since the data is already partially sorted, you'll have to test. Sorted data brings about worst-case performance in naive quicksort implementations (for example, those that always pick the first or last element as the pivot).
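
Something along these lines, where the sorter is pluggable behind an abstract class and you time each implementation on your actual data (the class names and the synthetic partially-sorted data below are purely illustrative):

    import abc
    import random
    import time

    class Sorter(abc.ABC):
        @abc.abstractmethod
        def sort(self, data):
            """Return the items of data in sorted order."""

    class BuiltinSorter(Sorter):
        def sort(self, data):
            return sorted(data)

    def benchmark(sorter, data, trials=3):
        """Return the best wall-clock time over a few trials."""
        best = float("inf")
        for _ in range(trials):
            copy = list(data)  # keep one trial from pre-sorting the next
            start = time.perf_counter()
            sorter.sort(copy)
            best = min(best, time.perf_counter() - start)
        return best

    # Partially sorted input, as in the question: first 80% sorted, last 20% shuffled.
    data = list(range(500000))
    tail = data[400000:]
    random.shuffle(tail)
    data[400000:] = tail
    print(benchmark(BuiltinSorter(), data))

Swap in Sorter implementations for quicksort, smoothsort, or whatever else you want to compare, and let the numbers decide.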

Consider that there are many sorting algorithms, and some are better suited to partially sorted sets. Also, when you know a chunk is sorted, you can merge it with another sorted chunk in O(n) time. Thus, identifying the already-sorted chunks first might save you a lot of time: you add a single O(n) pass, but you greatly reduce the chance of quicksort degrading to O(n²).
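
For example, spotting the existing runs is one linear scan, and merging two sorted runs is a single O(n) pass; a rough sketch:

    def find_runs(a):
        """Return (start, end) index pairs of the maximal non-decreasing runs in a."""
        runs, start = [], 0
        for i in range(1, len(a)):
            if a[i] < a[i - 1]:  # a run ends where the order breaks
                runs.append((start, i))
                start = i
        runs.append((start, len(a)))
        return runs

    def merge(left, right):
        """Merge two sorted lists in O(n) time."""
        out, i, j = [], 0, 0
        while i < len(left) and j < len(right):
            if left[i] <= right[j]:
                out.append(left[i])
                i += 1
            else:
                out.append(right[j])
                j += 1
        out.extend(left[i:])
        out.extend(right[j:])
        return out

Sort only the runs that aren't already in order, then merge all the runs pairwise (or with a k-way merge), and the presorted portion of the data costs you almost nothing.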

Jeff Ferland
  • True, totally forgot that quicksort does not behave well with sorted data. – FranXh Feb 29 '12 at 03:55
  • That said, quicksort can easily be modified to not have this pathological case on already-sorted sequences by using a different pivoting strategy (for example, choosing randomly). – templatetypedef Feb 29 '12 at 04:02
  • He's said he can't fit the data into memory, so quicksort is not a good choice. – Joel Feb 29 '12 at 22:16
  • 1
  • @Joel - You could quicksort blocks of the data that do fit into memory, though, then merge them together. This is a perfectly reasonable approach. – templatetypedef Feb 29 '12 at 22:18
  • @Joel: "Divide & conquer"... parallel quick-sorting chunks with a merge at the end is very common for both speed and memory reasons. – Jeff Ferland Feb 29 '12 at 22:46
  • When dealing with external sorts, the in-memory sort is irrelevant, as it takes up a tiny amount of time. Nowhere in the original answer did you say anything about merging. Given that quicksort is also a divide-and-conquer algorithm (dividing around the partition), you may understand my confusion. – Joel Feb 29 '12 at 22:48
  • @Joel No, I don't understand your confusion. Rather, I'm confused by you... you recognize in your last comment that it's an external sort so the total memory available isn't a primary limit, but you tell me before that it can't fit in memory so quicksort is a bad choice. – Jeff Ferland Feb 29 '12 at 22:53
  • Nowhere in your original comment did you say anything about merge sort or merging, or use the word merge in any way. Hence, quicksort as external sort, which is not possible. – Joel Feb 29 '12 at 22:56