73

I was asked this interview question recently:

You're given an array that is almost sorted, in that each of the N elements may be misplaced by no more than k positions from the correct sorted order. Find a space-and-time efficient algorithm to sort the array.

I have an O(N log k) solution as follows.

Let's denote arr[i..j) to mean the elements of the array from index i (inclusive) to j (exclusive).

  • Sort arr[0..2k)
    • Now we know that arr[0..k) are in their final sorted positions...
    • ...but arr[k..2k) may still be misplaced by k!
  • Sort arr[k..3k)
    • Now we know that arr[k..2k) are in their final sorted positions...
    • ...but arr[2k..3k) may still be misplaced by k
  • Sort arr[2k..4k)
  • ....
  • Until you sort arr[ik..N), then you're done!
  • This final step may be cheaper than the other steps when fewer than 2k elements remain

In each step, you sort at most 2k elements in O(k log k), putting at least k elements in their final sorted positions at the end of each step. There are O(N/k) steps, so the overall complexity is O(N log k).
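The passes described above can be sketched directly with Arrays.sort on each 2k-wide window (the class and method names here are mine, for illustration):

```java
import java.util.Arrays;

// Sketch of the overlapping-block approach: each pass sorts a window
// of at most 2k elements, after which the first k elements of that
// window are in their final positions. O(N/k) passes of O(k log k).
public class AlmostSorted {
  public static void sortAlmostSorted(int[] arr, int k) {
    int n = arr.length;
    for (int i = 0; i < n; i += k) {
      // Sort arr[i .. min(i + 2k, n)); the last window may be shorter.
      Arrays.sort(arr, i, Math.min(i + 2 * k, n));
    }
  }

  public static void main(String[] args) {
    int[] arr = {2, 1, 4, 3, 6, 5, 8, 7};  // each element at most 1 away
    sortAlmostSorted(arr, 1);
    System.out.println(Arrays.toString(arr));  // [1, 2, 3, 4, 5, 6, 7, 8]
  }
}
```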

My questions are:

  • Is O(N log k) optimal? Can this be improved upon?
  • Can you do this without (partially) re-sorting the same elements?
Mat
polygenelubricants
  • 9
    I wonder if you couldn't take advantage of the fact that after Step 1, [k..2k) are in sorted relative to each other? So instead of sorting [k..3k), you could sort [2k..4k) and merge the last half of the 1st ([k..2k)) with the first half of the second ([2k..3k)). – Phil Apr 28 '10 at 04:33
  • 1
    Yes it is optimal. Simple proof it meets the lower bound is we randomly permute each block of k elements: (k)(k)(k)(k). Thus we need to do N/k sorts each taking k*log(k). Can we do it without resorting elements? Yes. As above sort each block of k elements independently. Then serially go over and in place merge block i with block i+1. Except on boundaries we can also do the merges independently in parallel. Thanks for the question. This algorithm is actually useful for a problem I am working on :) – Chad Brewbaker Jul 02 '10 at 05:44
  • 2
    @Chad: In fact Moron gives an even simpler proof: consider the possibility that k = n. Then an algorithm faster than O(n log k) would contradict the known optimality of O(n log n) algorithms for comparison-based sorting. – j_random_hacker Jul 12 '10 at 00:26
  • 2
    I prefer constructive proofs :) http://en.wikipedia.org/wiki/Necessary_and_sufficient_condition – Chad Brewbaker Jul 30 '10 at 21:04
  • @j_random_hacker I disagree with your proof. Imagine I have an algorithm that runs in O(k log(k)). This is definitely much faster than O(n log(k)). And yet in the case k = n, there is no contradiction. – Stef Mar 05 '23 at 13:41
  • @Stef: I don't follow, sorry. Is the input size still given by n? If so, then I think it's clear that unless k is restricted to be Omega(n) (in other words, if k can still be chosen independent of n, as it is in the original problem statement), there can be no correct algorithm that runs in time O(k log(k)): We could always construct an input with n much larger than k log(k), and run out of time even to read all the input. – j_random_hacker Mar 06 '23 at 02:27
  • @j_random_hacker Yes, your last comment is a good proof that O(k log(k)) is not possible. I was talking about your previous comment: *" In fact Moron gives an even simpler proof: consider the possibility that k = n. Then an algorithm faster than O(n log k) would contradict the known optimality of O(n log n) algorithms for comparison-based sorting."*. That one is incorrect. – Stef Mar 06 '23 at 09:43

5 Answers

42

As Bob Sedgewick showed in his dissertation work (and follow-ons), insertion sort absolutely crushes the "almost-sorted array". In this case your asymptotics look good but if k < 12 I bet insertion sort wins every time. I don't know that there's a good explanation for why insertion sort does so well, but the place to look would be in one of Sedgewick's textbooks entitled Algorithms (he has done many editions for different languages).

  • I have no idea whether O(N log k) is optimal, but more to the point, I don't really care—if k is small, it's the constant factors that matter, and if k is large, you may as well just sort the array.

  • Insertion sort will nail this problem without re-sorting the same elements.

Big-O notation is all very well for algorithm class, but in the real world, constants matter. It's all too easy to lose sight of this. (And I say this as a professor who has taught Big-O notation!)

Norman Ramsey
  • 6
    Can you explain more of what he said instead of just linking to it? References in answers are awesome, but substantive content on stackoverflow itself is even awesomer! – polygenelubricants Apr 28 '10 at 04:44
  • 1
    Also, I haven't read his definition of what "almost sorted" means, but I've seen variants that define it as "at most `k` elements are misplaced". Here, all `N` elements may be misplaced (bounded by `k`). – polygenelubricants Apr 28 '10 at 04:51
  • 5
    Well, even in the real world, when the input sizes grow large enough, the asymptotics matter more than the constants. :-) Insertion sort has a very good constant, but the fact that O(n log k) is asymptotically better than O(nk) *can* matter — for example, what if k ≈ √n as n grows large? (It also depends on what the interviewer was looking for. :p) – ShreevatsaR Apr 28 '10 at 05:08
  • 2
    I could not help but note that Bubble Sort would also have a O(nk) complexity. It's not that often that it can be cited :) – Matthieu M. Apr 28 '10 at 06:28
  • 1
    This answers neither of the questions asked! Plus the link is pretty useless. -1 till you edit it to make it more informative, professor. –  Apr 28 '10 at 23:02
  • @polygenelubriants, @Moron: I don't know much more than what I said already, but what little more I know, I've added. – Norman Ramsey Apr 28 '10 at 23:39
  • 4
    @Norman: Perhaps you can actually point to the paper/book chapter which has the claim about almost sorted arrays? Just a link to the homepage is practically useless. Also, just saying insertion sort will nail it is useless, if k = sqrt(n) for instance. I really don't understand why this answer has so many votes. –  Apr 28 '10 at 23:47
  • 1
    @Norman: There is a big range of values k can take, just saying if k is not 'small' sort the whole thing is not good enough. Take k = log(n) for instance. –  Apr 28 '10 at 23:55
  • 3
    @Moron: If k = log n then k is small. log base 2 of one million is just 20. @Everyone: SO is a *programming* site, not a *CS theory* site! – Norman Ramsey Apr 29 '10 at 02:15
  • @Norman: Then take k = (log(n))^20. 20^20 is huge compared to 20*log(20), even while programming and not just in theory. –  Apr 29 '10 at 03:11
  • 5
    While SO is a programming site, I think questions still deserve correct answers. For example, we ought not to say that all algorithms are O(1), even though in programming all running times encountered are bounded by a constant (like 10^1000). More to the point here, *whatever* the constant of insertion sort, there is some sufficiently large k after which insertion sort is no longer faster, and we cannot "may as well" sort the whole array. (I really doubt, even with a trillion elements (k=40) whether insertion sort is faster.) – ShreevatsaR Apr 30 '10 at 05:44
23

If using only the comparison model, O(n log k) is optimal. Consider the case when k = n.

To answer your other question: yes, it is possible to do this without re-sorting the same elements, by using a heap.

Use a min-heap of 2k elements. Insert 2k elements first, then remove min, insert next element etc.

This guarantees O(n log k) time and O(k) space and heaps usually have small enough hidden constants.

  • +1. I also came up with the min-heap approach (can't you just limit the size to `k` instead of `2k`?), and was told to improve it so that it doesn't use the extra space. – polygenelubricants Apr 29 '10 at 00:24
  • 2
    @polygenelubricants: You can do this in-place. Start from the far end, and use a max-heap instead of a min-heap. Heapify that final block of 2k elements in-place. Store the first extracted element in a variable; subsequent elements go in the positions vacated immediately before the final block of 2k (which contains the heap structure), similar to regular heapsort. When only 1 block remains, heapsort it in place. A final O(n) pass is needed to "rotate" the final block back to the initial block. The rotation is not trivial but can be done in O(n) and O(1) space. – j_random_hacker Jul 11 '10 at 14:33
  • @polygenelubricants: Strange I seem to have missed your comment to this answer! @j_random_hacker: Seems right. –  Jul 11 '10 at 16:42
  • BTW @Moron, I really like your argument that O(n log k) is optimal: "Consider k = n". Doesn't get much simpler than that! – j_random_hacker Jul 12 '10 at 00:18
  • @j_random_hacker: Yes :-) btw, your solution to the in-place problem was good! –  Jul 12 '10 at 04:49
  • 3
    @j_random_hacker Can you explain why the heap has to be of size 2k? In the examples I've done k+1 is big enough. – JohnS Jan 30 '13 at 06:58
  • @JohnS: I believe you're right: k+1 should be enough. Also moving forwards and using a min-heap would also work -- dunno why I went backwards! – j_random_hacker Jan 30 '13 at 09:57
  • I think time complexity of using heap will be: O(k) + O((n-k)*logK), am I right? – Hengameh Aug 04 '15 at 14:48
  • @Hengameh Can you explain how the complexity is o(k) + O((n-k)*logK) ? – user2896235 Feb 03 '18 at 19:10
  • @JohnS you look forwards and backwards K elements for potential out of place elements. – Union find Jan 20 '21 at 05:04
  • *"If using only the comparison model, O(n log k) is optimal. Consider the case when k = n."* <<< This is not a proof. Imagine I have an algorithm that runs in O(k log(k)). This is definitely much faster than O(n log(k)). And yet in the case k = n, there is no contradiction. – Stef Mar 05 '23 at 13:44
8

Since k is apparently supposed to be pretty small, an insertion sort is probably the most obvious and generally accepted algorithm.

In an insertion sort on random elements, you have to scan through N elements, and you have to move each one an average of N/2 positions, giving ~N*N/2 total operations. The "/2" constant is ignored in a big-O (or similar) characterization, giving O(N²) complexity.

In the case you're proposing, the expected number of operations is ~N*k/2 -- but since k is treated as a constant, the whole k/2 factor is ignored in a big-O characterization, so the overall complexity is O(N).
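To see that bound concretely, here's insertion sort instrumented to count shifts on a k-displaced input (the counting harness is mine, not part of the answer): on such inputs each inner loop runs O(k) times, so the shift count grows like N*k rather than N²/2.

```java
import java.util.Arrays;

// Insertion sort that counts element shifts. On an input where each
// element is at most k positions from its sorted place, each inner
// loop runs O(k) times, so total shifts grow like N*k, not N*N/2.
public class InsertionShiftCount {
  public static long insertionSortCountingShifts(int[] arr) {
    long shifts = 0;
    for (int i = 1; i < arr.length; i++) {
      int key = arr[i];
      int j = i - 1;
      while (j >= 0 && arr[j] > key) {
        arr[j + 1] = arr[j];
        j--;
        shifts++;
      }
      arr[j + 1] = key;
    }
    return shifts;
  }

  public static void main(String[] args) {
    // Swap adjacent pairs: every element is exactly 1 away (k = 1).
    int n = 1000;
    int[] arr = new int[n];
    for (int i = 0; i < n; i += 2) {
      arr[i] = i + 1;
      arr[i + 1] = i;
    }
    long shifts = insertionSortCountingShifts(arr);
    // One shift per inverted pair: 500, far below n*n/2 = 500000.
    System.out.println(shifts);  // 500
  }
}
```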

Jerry Coffin
8

Your solution is a good one if k is large enough. There is no better solution in terms of time complexity: each element might be out of place by up to k positions, which means you need to learn log₂ k bits of information to place it correctly, which means you need to make at least log₂ k comparisons per element -- so it's got to be a complexity of at least O(N log k).

However, as others have pointed out, if k is small, the constant terms are going to kill you. Use something that's very fast per operation, like insertion sort, in that case.

If you really wanted to be optimal, you'd implement both methods, and switch from one to the other depending on k.
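A dispatch along those lines might look like this (the crossover constant and the class/method names are illustrative guesses, not tuned or taken from the answer):

```java
import java.util.Arrays;
import java.util.PriorityQueue;

// Sketch of the "switch on k" idea: insertion sort when k is small,
// a sliding-window min-heap when k is large. The crossover value is
// an assumption; a real implementation would benchmark to pick it.
public class HybridKSort {
  static final int CROSSOVER = 16;  // assumed, not tuned

  public static void sort(int[] arr, int k) {
    if (k < CROSSOVER) {
      insertionSort(arr);  // O(N*k) with a tiny constant factor
    } else {
      heapSort(arr, k);    // O(N log k) time, O(k) extra space
    }
  }

  static void insertionSort(int[] arr) {
    for (int i = 1; i < arr.length; i++) {
      int key = arr[i];
      int j = i - 1;
      while (j >= 0 && arr[j] > key) {
        arr[j + 1] = arr[j];
        j--;
      }
      arr[j + 1] = key;
    }
  }

  static void heapSort(int[] arr, int k) {
    PriorityQueue<Integer> heap = new PriorityQueue<>();
    for (int i = 0; i < Math.min(k, arr.length); i++) {
      heap.add(arr[i]);
    }
    for (int i = 0; i < arr.length; i++) {
      if (i + k < arr.length) {
        heap.add(arr[i + k]);  // keep the k+1-wide window filled
      }
      arr[i] = heap.remove();  // smallest element in the window
    }
  }

  public static void main(String[] args) {
    int[] arr = {3, 1, 2, 6, 4, 5};  // every element at most 2 away
    sort(arr, 2);
    System.out.println(Arrays.toString(arr));  // [1, 2, 3, 4, 5, 6]
  }
}
```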

Rex Kerr
8

It was already pointed out that one of the asymptotically optimal solutions uses a min heap and I just wanted to provide code in Java:

import java.util.PriorityQueue;

public void sortNearlySorted(int[] nums, int k) {
  // Sliding-window min-heap: when nums[i] is popped, the heap holds
  // nums[i..i+k], which must include the smallest remaining element.
  PriorityQueue<Integer> minHeap = new PriorityQueue<>();
  for (int i = 0; i < Math.min(k, nums.length); i++) {
    minHeap.add(nums[i]);
  }

  for (int i = 0; i < nums.length; i++) {
    if (i + k < nums.length) {
      minHeap.add(nums[i + k]);  // keep the window topped up
    }
    nums[i] = minHeap.remove();  // smallest element in the window
  }
}
Ivaylo Toskov