
I'm looking for an algorithm to sort a large number of items using the fewest comparisons. My specific case makes it unclear which of the obvious approaches is appropriate: the comparison function is slow and non-deterministic (it can make errors) because it's a human brain.

In other words, I want to sort arbitrary items on my computer into a list from "best" to "worst" by comparing them two at a time. They could be images, strings, songs, anything. My program would display two things for me to compare. The program doesn't know anything about what is being compared; its job is just to decide which pairs to compare. That gives the following criteria (a minimal sketch of this setup follows the list):

  1. It's a comparison sort - The only time the user sees items is when comparing two of them.
  2. It's an out-of-place sort - I don't want to move the actual files, so items can have placeholder values or metadata files.
  3. Comparisons are slow - at least compared to a computer. Data locality won't have an effect, but comparing obviously different items will be quick, while comparing similar items will be slow.
  4. Comparison is subjective - comparison results could vary slightly at different times.
  5. Items don't have a total order - the desired outcome is an order that is "good enough" at runtime, which will vary depending on context.
  6. Items will rarely be almost sorted - in fact, the goal is to get random data to an almost-sorted state.
  7. Sets will usually contain runs - If every song on an album is a banger, it might be faster because of (3) to compare them to songs from the next album rather than to each other. Imagine a set {10.0, 10.2, 10.9, 5.0, 4.2, 6.9} where comparisons across different integer parts are fast but comparisons between nearby floats are very slow.
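
For concreteness, here is a minimal sketch (in Python) of the setup described above. The `PairwiseOracle` name and the `ask_user` callback are just placeholders for whatever UI actually shows the two items; nothing here is a proposed solution.

```python
class PairwiseOracle:
    """Wraps items behind placeholder ids and caches every answer, so the
    slow human comparison is never repeated for the same pair."""

    def __init__(self, item_ids, ask_user):
        # item_ids are opaque placeholders (paths, database keys, ...);
        # the actual files are never touched or moved (criterion 2).
        self.item_ids = list(item_ids)
        self.ask_user = ask_user      # callback (id_a, id_b) -> id of the preferred item
        self.results = {}             # frozenset({a, b}) -> winner id

    def prefer(self, a, b):
        key = frozenset((a, b))
        if key not in self.results:
            # The only point where the user ever sees two items (criterion 1).
            self.results[key] = self.ask_user(a, b)
        return self.results[key]

    def comparisons_used(self):
        return len(self.results)
```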

There are many different ways to approach this problem. In addition to sorting algorithms, it's similar to creating tournament brackets and to voting systems, and there are countless ways to define and solve the problem based on various criteria. For this question I'm only interested in treating it as a sorting problem where the user compares two items at a time and chooses a preference. So what approach makes sense for either of the two following versions of the question?

  1. How to choose pairs so as to get the best result in O(n) or fewer comparisons? (For example, compare random pairs of items with n/2 comparisons, then use the remaining n/2 comparisons to spot-check or fine-tune; a rough sketch follows below.)
  2. How to create the best order using additional operations but no additional comparisons? (E.g. similar items are sorted into buckets or losers are removed; anything that doesn't increase the number of comparisons.)

The representation of comparison results can be anything that makes the solution convenient - it can be dictionary keys corresponding to the final order, a "score" accumulated from comparison results, a database, etc.
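
To illustrate version (1) and the "score" representation, here is a rough, non-authoritative sketch. The `prefer` callback stands in for the human comparison, and the phase split is only an example of the n/2 + n/2 idea.

```python
import random

def rough_order(item_ids, prefer, budget):
    """Sketch of version (1): spend about n/2 comparisons on random disjoint
    pairs, keep a win-count "score" per item, then use the remaining budget
    to spot-check adjacent items in the provisional order.

    prefer(a, b) is the slow human comparison and returns the preferred id."""
    ids = list(item_ids)
    score = dict.fromkeys(ids, 0)

    # Phase 1: random disjoint pairs, roughly n/2 comparisons.
    random.shuffle(ids)
    for a, b in zip(ids[0::2], ids[1::2]):
        score[prefer(a, b)] += 1

    # Provisional order from the scores (this is the "score" representation).
    ordered = sorted(ids, key=score.get, reverse=True)

    # Phase 2: spot-check neighbours with whatever budget is left,
    # swapping when the user disagrees with the current order.
    for k in range(min(budget - len(ids) // 2, len(ordered) - 1)):
        a, b = ordered[k], ordered[k + 1]
        if prefer(a, b) == b:
            ordered[k], ordered[k + 1] = b, a

    return ordered, score
```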

Edit: The comments have helped clarify the question: the goal is similar to something like bucket sort, samplesort, or the partitioning phase of quicksort. So the question could be rephrased as how to choose good partitions based on comparisons, but I'm also interested in any other ways of using the comparison results that wouldn't be applicable in a standard in-place comparison sort, such as keeping a score for each item.
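
For example, the kind of partitioning I have in mind looks roughly like samplesort, with the human comparison as the `prefer` callback (names are illustrative, not a proposed answer):

```python
import random

def bucket_by_pivots(item_ids, prefer, num_pivots=3):
    """Samplesort-style sketch: pick a few random pivots, order them using
    the human comparisons, then drop every other item into a bucket by
    binary-searching it against the ordered pivots ("best" comes first).

    prefer(a, b) is the slow human comparison and returns the preferred id."""

    def rank(x, ordered):
        # Position of x among `ordered`, using the human comparison.
        lo, hi = 0, len(ordered)
        while lo < hi:
            mid = (lo + hi) // 2
            if prefer(x, ordered[mid]) == x:  # x is better, so it goes earlier
                hi = mid
            else:
                lo = mid + 1
        return lo

    ids = list(item_ids)
    random.shuffle(ids)
    pivots, rest = ids[:num_pivots], ids[num_pivots:]

    ordered_pivots = []
    for p in pivots:                      # binary-insert the pivots themselves
        ordered_pivots.insert(rank(p, ordered_pivots), p)

    buckets = [[] for _ in range(len(ordered_pivots) + 1)]
    for item in rest:                     # bucket k sits between pivots k-1 and k
        buckets[rank(item, ordered_pivots)].append(item)

    return ordered_pivots, buckets
```

Each non-pivot item then costs only about log2(num_pivots + 1) comparisons, and the buckets can be refined later as time allows.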

Salvatore Ambulando
  • Are you looking for a [statistical model](https://en.wikipedia.org/wiki/Bayesian_network) approach? – Neil Mar 16 '22 at 20:59
  • @Neil Hmm, I don't think so, but this is interesting reading! – Salvatore Ambulando Mar 16 '22 at 22:40
  • I have the idea that what you want is clustering, not sorting. Or maybe supervised/unsupervised learning. You said compare two objects, but on what criteria? How would you sort, say, an image? – Neil Mar 17 '22 at 03:55
  • @Neil No, I want sorting. The user interface is "here are two things, which one is better (based on whatever criteria you want)". The question is how to choose the two things to show to the user to get the best results from the fewest comparisons. Maybe I'll try and clarify the question if that isn't coming across clearly enough. – Salvatore Ambulando Mar 17 '22 at 12:58
  • @Neil I've updated the question. To address your idea specifically, clustering could give interesting results too and I'm open to that. I tried to be very specific because I didn't want to run afoul of the "one question per post" rule. The main criterion is that the user is comparing two items and the program is only handling which items to compare and the results of the comparison. AFAIK any machine learning would require the app to learn about the data, which is interesting but out of scope. I'm happy to hear about any algorithms that give interesting results within those constraints! – Salvatore Ambulando Mar 17 '22 at 13:09
  • Thanks for the clarification. [Merge-insertion sort](https://en.wikipedia.org/wiki/Merge-insertion_sort) "uses fewer comparisons in the worst case than the best previously known algorithms," but that's a lot of comparing very similar objects. I wonder if (a few) pivots of quick-sort would be helpful; deciding which is better is easier if there's a constant pivot, I think. – Neil Mar 18 '22 at 06:14
  • Yeah, that's pretty much what I was looking at when I started on this question, but the issue is how to take advantage of the fuzzy ordering to reduce comparisons. The multiple pivot idea is an interesting one that might help! – Salvatore Ambulando Mar 18 '22 at 13:15
  • The pivot is carefully chosen in [Bentley, McIlroy, 1993](https://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.14.8162&rep=rep1&type=pdf); this has influenced `qsort`, and I would imagine it would be even more important here. `O(n)` is tricky. Do you want to take into account metadata like file creation time, directory, file name similarity, to automatically group them into clusters? – Neil Mar 18 '22 at 15:16
  • In practice I think that could be very useful, especially with data that inherently contains runs like albums or photos. However, since I'm trying to find a context-agnostic solution (which may or may not exist), I think it should be excluded for now. Wikipedia says tournament sort is used to feed data to external sorting algorithms, which seems to be going for a similar "almost sorted" result, but I'm having trouble understanding how it differs from heapsort. – Salvatore Ambulando Mar 18 '22 at 15:30
  • @Neil Your comments really helped clarify the question for me, and I think maybe comparing it to in-place algorithms was a mistake. It's probably closer to bucket sort / samplesort and the real question is how to choose buckets/pivots based on a few comparisons. Then use any suitable algorithm to sort buckets as time allows. I'll do some reading on methods to choose useful pivots and see where that gets me. – Salvatore Ambulando Mar 18 '22 at 16:05
  • Another interesting option is [shell sort](https://en.wikipedia.org/wiki/Shellsort), which is insertion sort with gaps. "starting with far apart elements, it can move some out-of-place elements into position faster"; this allows partial sorting: if the process is taking too long, just stop. It's also fairly simple. (A minimal sketch follows this thread.) – Neil Mar 18 '22 at 16:44
  • I tried to narrow down the question [here](https://math.stackexchange.com/questions/4410359/algorithim-to-choose-comparison-pairs-for-topological-sorting) and as I keep trying to figure it out I realize your suggestion of a statistical model isn't too far off (though it's beyond my skillset). Creating disconnected subtrees and then using the number of wins as a way to infer likely places in the final sort before attempting to merge them seems promising. I think it's similar to picking a median of medians for qsort. I just can't shake the feeling there's a known answer. – Salvatore Ambulando Mar 24 '22 at 15:29
  • Topological sorting works with a partial order (a DAG) rather than a list, and a partial order may be what you actually want. Not knowing the context, I would be surprised if you could get more accurate than random joining to create an MST. It sounds to me like this is not even guaranteed to be in topological order. I would try shell sort first: it's very simple. It is not ideal in the number of comparisons, but it's a baseline for comparing more experimental methods. – Neil Mar 25 '22 at 03:41
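
A minimal sketch of the shell sort baseline suggested in the comments above, with an explicit comparison budget so it can stop early and leave the list only partially sorted. The `prefer` callback is again a stand-in for the human comparison; this is a baseline, not a claimed answer.

```python
def shell_sort_partial(items, prefer, budget):
    """Gapped insertion sort (shell sort) that simply stops when the
    comparison budget runs out, leaving the list roughly sorted.
    "Best" ends up first.

    prefer(a, b) is the slow human comparison and returns the preferred item."""
    items = list(items)
    used = 0
    gap = len(items) // 2
    while gap > 0 and used < budget:
        for i in range(gap, len(items)):
            j = i
            while j >= gap and used < budget:
                used += 1
                if prefer(items[j], items[j - gap]) == items[j]:
                    # The later item is better: move it toward the front.
                    items[j - gap], items[j] = items[j], items[j - gap]
                    j -= gap
                else:
                    break
            if used >= budget:
                break
        gap //= 2
    return items
```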
