First, most places that claim to implement bucket sort are actually implementing counting sort. My question is about bucket sort as implemented on Geek Viewpoint and Wikipedia. I don't really understand (or like) the hash function on Geek Viewpoint, and I don't understand the one on Wikipedia. Can someone explain a simpler method for creating a good hash function for bucket sort? Something an average person can understand and remember.

Katedral Pillon
  • for instance where does wikipedia get the `k` from for the call `msbits(array[i], k)`? – Katedral Pillon Jul 20 '15 at 21:03
  • As to this question, `k` determines the number of buckets (i.e. there are `2^k` buckets in total). You can view this as a hash function. But note that the expression `(floor(x/2^(size(x)-k)))` in Wiki is not quite correct, when `size(x)` is smaller than `k`. – WhatsUp Jul 20 '15 at 21:06
  • So you mean `n=2^(k-1)` where both `n` and `k` are the variables I see in the example on Wikipedia? – Katedral Pillon Jul 20 '15 at 21:09
  • Yes, `n = 2^(k - 1)` (or `n = 2^k`, depending on how you calculate the function `msbits` - in Wiki it's not correctly calculated). – WhatsUp Jul 20 '15 at 21:12
  • @WhatsUp There is no point choosing a `k` that is bigger than `size(x)`, as you'd end up with more buckets than possible values. – biziclop Jul 20 '15 at 21:15
  • @biziclop Please read the mentioned Wiki page before commenting... – WhatsUp Jul 20 '15 at 21:17
  • @WhatsUp I already have. That's why I'm saying there's no point choosing a `k` bigger than `size(x)`. – biziclop Jul 20 '15 at 21:21
  • @biziclop Then... please read more carefully... Note that the function `msbits` is used as a Hash function on values `array[i]`, which means that in the expression `size(x)`, the `x` stands for `array[i]`, an element in the array you want to sort. It has nothing to do with the SIZE of the array. – WhatsUp Jul 20 '15 at 21:25
  • @WhatsUp Exactly. Therefore if you're sorting 16 bit numbers for example, there's no point having more than `2^16` buckets because there's no way you can fill them all, regardless of the size of the array. And if you use exactly `2^16` buckets, `msbits(x,16) = x` and you get counting sort. – biziclop Jul 20 '15 at 21:30
  • @biziclop Seems you still don't get it. Let's say, my array has 65536 elements, which are all 16 bit numbers. Then, as you said, there is no point choosing `k` bigger than `16`. But let's say we choose `k = 8`. Then, what if there is an element in my array that equals `1`? To decide in which bucket it should go, you calculate `msbits(1, 8)`, and we are in the case `k = 8` and `x = 1`. This conversation is becoming too stupid, I won't explain any more. – WhatsUp Jul 20 '15 at 21:36
  • @WhatsUp I genuinely don't understand your point. You said the formula isn't correct because what if `k>size(x)`. I explained that there's no point choosing such a `k`. You seem to agree. So where's the problem with the formula then? :) – biziclop Jul 20 '15 at 21:40
  • Let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/83807/discussion-between-whatsup-and-biziclop). – WhatsUp Jul 20 '15 at 21:43
  • Oh, I think I get it. It's with the meaning of `size(x)`. `size(x)` is always 16 for a 16 bit number, regardless of the value of `x`. – biziclop Jul 20 '15 at 21:44
  • Aha, now I get it too... so anyway that expression on Wiki is not correct - `size(x)` should be a function of the number `x`, independent of how you store it. – WhatsUp Jul 20 '15 at 21:50
  • Sorry, I didn't want to sound argumentative, just genuinely didn't understand what the problem was. – biziclop Jul 20 '15 at 21:56
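
To make the `size(x)` point from the comments concrete, here is a minimal Python sketch. It assumes non-negative integers stored in a fixed number of bits; the `width` parameter is a name chosen for this sketch (it is not in the Wikipedia pseudocode) and plays the role of `size(x)`, so it is the same for every element regardless of its value.

```python
def msbits(x, k, width=16):
    """Return the k most significant bits of x, treating x as a width-bit unsigned integer.

    `width` is an assumed parameter for this sketch; it stands in for size(x)
    and does not depend on the value of x.
    """
    if not 0 <= x < (1 << width):
        raise ValueError("x does not fit in the given width")
    if not 0 < k <= width:
        raise ValueError("k must be between 1 and width")
    return x >> (width - k)  # same as floor(x / 2**(width - k))


# With k = 8 there are 2**8 = 256 buckets, indexed 0..255.
print(msbits(1, 8))        # 0
print(msbits(65535, 8))    # 255
print(msbits(0x1234, 16))  # 4660: with k == width the hash is the identity
```

With `k = 8` the value `1` lands in bucket `0`, and with `k = 16` every 16-bit value gets its own bucket, which is the counting-sort case mentioned in the comments.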

1 Answer

I don't think there's a universally good hash function; that's the catch with bucket sort. A hash is good if it produces buckets of roughly equal size, but that obviously depends on the distribution of the values you're sorting. This is why bucket sort works so well when you have a priori knowledge of the distribution, for example when you have to sort records of people by their height.
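
As an illustration, one hash that is easy to remember is linear scaling: map each value to a bucket in proportion to where it sits between the minimum and the maximum. The sketch below is not the Geek Viewpoint or Wikipedia code; the function name, `num_buckets`, and the sample heights are made up for the example, and it assumes the values are spread fairly evenly over their range.

```python
def bucket_sort(values, num_buckets=10):
    """Sort numbers by scattering them into equal-width range buckets."""
    if not values:
        return []
    lo, hi = min(values), max(values)
    if lo == hi:
        return list(values)  # all values equal, nothing to sort

    buckets = [[] for _ in range(num_buckets)]
    span = hi - lo
    for v in values:
        # Hash: scale v linearly into 0..num_buckets-1.
        # The min() keeps the maximum value inside the last bucket.
        index = min(int((v - lo) / span * num_buckets), num_buckets - 1)
        buckets[index].append(v)

    # The hash is monotonic, so sorting each bucket and concatenating
    # the buckets in order yields a fully sorted list.
    result = []
    for bucket in buckets:
        result.extend(sorted(bucket))
    return result


heights_cm = [172, 158, 190, 165, 181, 175, 169]  # made-up sample data
print(bucket_sort(heights_cm, num_buckets=4))
# [158, 165, 169, 172, 175, 181, 190]
```

The only property the hash really needs is that it never maps a larger value to an earlier bucket. How evenly it fills the buckets decides how fast the sort is, not whether it is correct.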

Furthermore, the worst case of bucket sort isn't counting sort, as the Geek Viewpoint link erroneously suggests. The worst case (regarding time complexity) is when all the elements go into the same bucket, because then the whole job falls to whatever algorithm sorts that single bucket.

And of course counting sort is a kind of bucket sort, specifically a bucket sort with hash `h(x) = x`. What sets counting sort apart is that once you know each bucket will only ever hold a single value, you don't need the buckets to store the elements themselves, just their counts.
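
To show what "just their count" means, here is a minimal counting sort sketch in Python. It assumes non-negative integers no larger than a known `max_value` (a parameter name chosen for this example).

```python
def counting_sort(values, max_value):
    """Sort non-negative integers in 0..max_value using one counter per possible value."""
    counts = [0] * (max_value + 1)
    for v in values:
        counts[v] += 1  # h(v) = v: the bucket index is the value itself

    # Each "bucket" is only a count, so rebuild the output from the counts.
    result = []
    for value, count in enumerate(counts):
        result.extend([value] * count)
    return result


print(counting_sort([3, 1, 4, 1, 5, 2, 3], 5))  # [1, 1, 2, 3, 3, 4, 5]
```

No bucket ever needs sorting here, which is exactly why the elements themselves do not have to be stored.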

biziclop
  • Concerning your first part, Wikipedia seems to think `msbits(array[i], k)` would do the trick, regardless of particularities. I just don't know where they get k from. – Katedral Pillon Jul 20 '15 at 21:05
  • @KatedralPillon No, that's just an example. It's easy to construct an input set where `msbits()` will return the same value for every element and thus put everything in the same bucket. – biziclop Jul 20 '15 at 21:09
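
A quick way to see such an input, as a sketch: take 16-bit storage and `k = 8`, and feed in values that are all below 256. The `width` parameter and the sample values are assumptions made for this example.

```python
from collections import Counter


def msbits(x, k, width=16):
    # Top k bits of x, treating x as a width-bit unsigned integer.
    return x >> (width - k)


values = [3, 17, 42, 99, 200, 255]  # all below 2**8, so their top 8 bits are zero
print(Counter(msbits(v, 8) for v in values))  # Counter({0: 6})
```

Every element ends up in bucket `0`, so the bucketing step does no useful work for this input.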