
It's a well-known issue with Quicksort that when the data set is in, or almost in, sorted order, performance degrades horribly. In this case Insertion Sort, which is normally very slow, is easily the best choice. The question is knowing when to use which.

Is there an algorithm available to run through a data set, apply a comparison factor, and return a report on how close the data set is to being in sort order? I prefer Delphi/Pascal, but I can read other languages if the example isn't overly complex.
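
For instance, something as simple as the sketch below is the kind of "report" I have in mind. It's in Python only because it's short to read (the function name and the 0.0–1.0 scale are my own invention, not a standard routine); the same single pass would be trivial in Delphi.

    def sortedness(data):
        """Return the fraction of adjacent pairs already in order (1.0 = fully sorted).

        A single O(n) pass over the data: count how many of the n-1 adjacent
        pairs satisfy data[i] <= data[i+1].  Values near 1.0 mean "nearly sorted".
        """
        if len(data) < 2:
            return 1.0
        in_order = sum(1 for a, b in zip(data, data[1:]) if a <= b)
        return in_order / (len(data) - 1)

    print(sortedness([1, 2, 3, 5, 4]))  # 0.75 -- almost sorted
    print(sortedness([5, 4, 3, 2, 1]))  # 0.0  -- reverse sorted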

Mason Wheeler
    This slowness of quicksort with pre-sorted sequences is only an issue, AFAIK, if the implementation is too simple with respect to the choice of a pivot element. See http://www.cprogramming.com/tutorial/computersciencetheory/quicksort.html for example. – Dirk Dec 04 '09 at 20:06

8 Answers


As you'd expect, quite a lot of thought goes into this. The median-of-three technique means that quicksort's worst-case behaviour doesn't occur for sorted data, but instead for less obvious cases.

Introsort is quite exciting, since it avoids quicksort's quadratic worst case altogether. Instead of your natural question, "how do I detect that the data is nearly-sorted", it in effect asks itself as it's going along, "is this taking too long?". If the answer is yes, it switches from quicksort to heapsort.
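
To make the "is this taking too long?" test concrete, here's a rough sketch of introsort (Python rather than Delphi for brevity; the depth budget of roughly 2 * log2(n) is a typical choice, not something mandated):

    import heapq
    from math import log2

    def introsort(a, lo=0, hi=None, depth=None):
        """Quicksort that gives up and heapsorts a range when recursion gets
        suspiciously deep -- i.e. when it notices it is "taking too long"."""
        if hi is None:
            hi = len(a) - 1
        if depth is None:
            depth = 2 * int(log2(len(a) or 1)) + 1
        if lo >= hi:
            return
        if depth == 0:
            # Worst-case territory: finish this slice with heapsort instead.
            a[lo:hi + 1] = heapsort(a[lo:hi + 1])
            return
        p = partition(a, lo, hi)
        introsort(a, lo, p - 1, depth - 1)
        introsort(a, p + 1, hi, depth - 1)

    def partition(a, lo, hi):
        """Plain Lomuto partition with the last element as pivot (kept simple)."""
        pivot = a[hi]
        i = lo
        for j in range(lo, hi):
            if a[j] <= pivot:
                a[i], a[j] = a[j], a[i]
                i += 1
        a[i], a[hi] = a[hi], a[i]
        return i

    def heapsort(items):
        """O(n log n) whatever the input; built on heapq purely for brevity."""
        heap = list(items)
        heapq.heapify(heap)
        return [heapq.heappop(heap) for _ in range(len(heap))]

Real implementations also finish small ranges with insertion sort, which is where the question's observation about insertion sort comes back in.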

Timsort combines merge sort with insertion sort, and performs very well on sorted or reverse-sorted data, and on data that includes sorted or reverse-sorted subsets.

So probably the answer to your question is, "you don't need a pre-pass analysis, you need an adaptive sort algorithm".

Steve Jessop

There's also SmoothSort, which is apparently quite tricky to implement, but it varies between O(N log N) and O(N) depending on how sorted the data is to start with.

http://en.wikipedia.org/wiki/Smoothsort

Long tricky PDF: http://www.cs.utexas.edu/users/EWD/ewd07xx/EWD796a.PDF

However, if your data is truly huge and you have to access it serially, mergesort is probably the best. It's always O(N log N) and it has excellent 'locality' properties.

wowest

I've not heard of any pre-sorting analysis, but my opinion is that if you are going to go through the data set to analyze it, then you are already cutting into the performance of your overall sorting time.

martinatime
    That's a good point, but if the analysis pass is O(n), it will not dominate the asymptotic sorting time. And if it can help avoid an O(n^2) worst-case sorting time, it could be a net benefit in sorting time for large datasets. – ddaa Dec 04 '09 at 20:14
    @ddaa: That would be true for comparison sorts, but O(n) sorting is possible with Radix Sort, or Bucket Sort. If we include these algorithms the sort time could be dominated by the analysis time... – Restore the Data Dumps Dec 04 '09 at 20:28
    @Jason: You wouldn't perform this analysis on data which you are about to bucket sort. The question is about choosing between quicksort and insertion sort, and you're planning to do neither... – Steve Jessop Dec 04 '09 at 20:59

One possible solution is to take the first, last, and middle elements in the current sort range (during the QuickSort operation) and choose the middle (median) of the three as the pivot element.
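
Something like this sketch, in Python only because it's compact (the helper name is mine):

    def median_of_three(a, lo, hi):
        """Of a[lo], a[mid] and a[hi], return the index of the value that is
        neither the smallest nor the largest of the three.  On already-sorted
        input this is the true median of the range, so the partitions stay
        balanced instead of degenerating."""
        mid = (lo + hi) // 2
        # Order the three (value, index) candidates and take the middle one.
        candidates = sorted([(a[lo], lo), (a[mid], mid), (a[hi], hi)])
        return candidates[1][1]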

gabr

To fully analyze the data for the purpose of deciding which algorithm to use, you are going to do nearly the work of sorting it. You could do something like checking the values at a small percentage of random but increasing indexes (i.e., analyze a small sample of the items).
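
For example (a Python sketch; the sample size of 100 is an arbitrary choice):

    import random

    def sample_looks_sorted(data, sample_size=100):
        """Check a small random sample of positions, taken in increasing index
        order, and report whether the sampled values are non-decreasing.
        Only a heuristic: a sorted sample does not guarantee sorted data."""
        if len(data) < 2:
            return True
        indexes = sorted(random.sample(range(len(data)), min(sample_size, len(data))))
        sampled = [data[i] for i in indexes]
        return all(a <= b for a, b in zip(sampled, sampled[1:]))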

µBio

You would still have to run through all the records to determine whether they're sorted or not, so to improve performance, start with your first record and run through the rest until you either notice something not properly sorted or reach the end of the list. If you find a mismatch, then only sort the items from that position to the end (since the beginning of the list is already sorted).

For each item in the second part, see if the item is less than the last element of the first part; if so, use an insertion sort into ONLY the first part. Otherwise, Quicksort all the other items in the second part. This way the sort is optimized for the specific case.
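
Sketched in Python (the naming is mine, and the built-in sorted() stands in for the quicksort pass), that looks roughly like this:

    import bisect

    def hybrid_sort(data):
        """Keep the already-sorted prefix, insert small stragglers into it,
        and quicksort only what remains."""
        data = list(data)
        if len(data) < 2:
            return data

        # 1. Walk forward until something is out of order; everything before
        #    that point is the already-sorted prefix.
        split = 1
        while split < len(data) and data[split - 1] <= data[split]:
            split += 1
        prefix, rest = data[:split], data[split:]
        if not rest:
            return prefix                    # the whole list was already sorted

        boundary = prefix[-1]                # largest value in the sorted prefix
        remainder = []
        for item in rest:
            if item < boundary:
                bisect.insort(prefix, item)  # insertion into ONLY the first part
            else:
                remainder.append(item)       # everything here is >= boundary

        # 2. Quicksort the remainder (sorted() plays that role here) and append.
        return prefix + sorted(remainder)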

skamradt

QuickSort being a problem only when the data set is huge and already mostly sorted, I would use the following heuristics (pending a full-blown solution):

  • Don't bother if the data set size is below a threshold.

  • If you have quick (indexed) access to the records (items), take a sample of one record in every N and see if they are already sorted (see the sketch below). It should be quick enough for a small sample, and you can then decide whether to use quicksort or not.
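
A sketch of that second bullet (Python for compactness; N is whatever stride suits your data):

    def every_nth_looks_sorted(records, n=1000):
        """Look at every Nth record only and report whether that sample is
        non-decreasing.  Cheap, but as the comments below point out, it can
        miss out-of-order records between the sampled positions."""
        sample = records[::n]
        return all(a <= b for a, b in zip(sample, sample[1:]))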

Francesca
  • But the sample fails if every Nth record happens to be in order while some record in between isn't. You may still have to read every record to see if ONE of them not sampled is out of order. – skamradt Dec 04 '09 at 21:40
  • Agreed, but there is statistically very little chance that the sample would deviate so much from the overall population, especially if you randomize N a little bit. – Francesca Dec 05 '09 at 00:34

To make a conceptual point that people haven't yet made: Quicksort is a common-sense divide-and-conquer algorithm with an obvious bug in rare cases. Suppose that you want to sort a stack of student papers. (Which I have to do with some regularity.) In the quicksort algorithm, you pick some paper, the pivot. Then divide the other papers according to whether they are before or after the pivot. Then repeat that with the two subpiles. What's the bug? The pivot could be a name that is near one end of the list instead of in the middle, so that it doesn't accomplish much to divide it into two piles.

Merge sort is another divide-and-conquer algorithm that works in a different order. You can merge two sorted lists in linear time. Divide the papers into two equal or nearly equal piles, then recursively sort each one, then merge. Merge sort doesn't have any bugs. One reason that quicksort is more popular than merge sort is historical: Quicksort is fast (usually) and it works without any extra memory. But these days, it can be more important to save comparisons than to save memory, and the actual rearrangement is often abstracted by permuting pointers. If things had always been that way, then I suspect that merge sort would simply have been more popular than quicksort. (And maybe adding "quick" to the name was good salesmanship.)
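
For concreteness, the linear-time merge and the sort built on it look like this (a plain sketch in Python, not tuned code):

    def merge(left, right):
        """Merge two already-sorted lists in linear time."""
        merged, i, j = [], 0, 0
        while i < len(left) and j < len(right):
            if left[i] <= right[j]:
                merged.append(left[i]); i += 1
            else:
                merged.append(right[j]); j += 1
        merged.extend(left[i:])
        merged.extend(right[j:])
        return merged

    def merge_sort(items):
        """Divide into two (nearly) equal piles, sort each recursively, merge.
        Always O(n log n), at the cost of O(n) extra space for the merge."""
        if len(items) <= 1:
            return list(items)
        mid = len(items) // 2
        return merge(merge_sort(items[:mid]), merge_sort(items[mid:]))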

Greg Kuperberg
  • From my POV the benefit of an in-place sort is not so much that it saves *memory*, as that it saves a memory allocation and hence cannot fail. So when sorting an array, quicksort/heapsort/insertion sort/bubble sort all have nicer user interfaces than mergesort. If mergesort were preferred to quicksort, then of course you could attempt to allocate the memory, and if it fails do a quicksort instead. If you're allocating a secondary array of pointers anyway and sorting that, then you're introducing the possibility of failure there, and hence might as well allow failure elsewhere. – Steve Jessop Jul 09 '12 at 09:21
  • @SteveJessop That's a fair point. However, that concern, while still significant in some cases, is also a bit dated. I agree that it is non-trivial for the outer environment to fairly allocate memory to every client program or function that wants it. However, even that has gotten better over time in a lot of environments. – Greg Kuperberg Dec 04 '12 at 15:22
  • I don't think it's really a question of fairness, so much as what happens when you run out, and whether you're robust to that. If allocation can fail then you write your program one way. If instead the OS blows something out of the water until it has enough memory to satisfy the request or the page fault on first access, then you write your program another way. Some languages take a middle path, where in theory you *could* catch out-of-memory exceptions and continue, but in practice you don't, you let the exception kill you. I suppose that could be considered the "up-to-date" way to do it ;-) – Steve Jessop Dec 04 '12 at 16:53