
You have an empty ice cube tray which has n little ice cube buckets, forming a natural hash space that's easy to visualize.

Your friend has k pennies which he likes to put in ice cube trays. He uses a random number generator repeatedly to choose which bucket to put each penny. If the bucket determined by the random number is already occupied by a penny, he throws the penny away and it is never seen again.

Say your ice cube tray has 100 buckets (i.e., it would make 100 ice cubes). If you notice that your tray has c = 80 pennies, what is the most likely number of pennies (k) that your friend started out with?

If c is low, the odds of collisions are low enough that the most likely value of k equals c. E.g. if c = 3, then it's most likely that k was 3. However, collisions become increasingly likely as k grows: after about k = 14, the odds are that there has been one collision, so maybe it's maximally likely that k = 15 if c = 14.
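The intuition above can be checked with the standard expected-occupancy formula for throwing k balls uniformly at random into n bins (this sketch is mine, not part of the question): the expected number of filled buckets is n * (1 - (1 - 1/n)^k), so the expected number of discarded pennies is k minus that.

```python
def expected_filled(n, k):
    """Expected number of occupied buckets after k uniform random throws
    into n buckets (pennies landing in an occupied bucket are discarded)."""
    return n * (1 - (1 - 1 / n) ** k)

n = 100
for k in (3, 14, 15):
    e = expected_filled(n, k)
    print(f"k={k:3d}  E[filled]={e:6.2f}  E[discarded]={k - e:.2f}")
```

With n = 100, the expected number of discarded pennies crosses 1 right around k = 15, matching the guess in the question.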

Of course if n == c then there would be no way of knowing, so let's set that aside and assume c < n.

What's the general formula for estimating k given n and c (given c < n)?
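One heuristic answer (a moment-matching sketch of my own, not a full Bayesian treatment like the answer below discusses): set the observed count c equal to the expected number of filled buckets, c = n * (1 - (1 - 1/n)^k), and solve for k.

```python
import math

def estimate_k(n, c):
    """Moment-matching estimate of k from n buckets and c filled buckets.

    Inverts c = n * (1 - (1 - 1/n)**k), giving
        k = log(1 - c/n) / log(1 - 1/n).
    Requires c < n (if c == n the estimate diverges, matching the
    question's observation that the case c == n is uninformative)."""
    if c >= n:
        raise ValueError("requires c < n")
    return math.log(1 - c / n) / math.log(1 - 1 / n)

print(estimate_k(100, 80))  # roughly 160 pennies to leave 80 filled buckets
```

For small c this estimate stays close to c itself (e.g. estimate_k(100, 3) is about 3.03), consistent with the "k == c when c is low" intuition.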

ʞɔıu

1 Answer


The problem as it stands is ill-posed.

Let n be the number of buckets.
Let X be the random variable for the number of pennies your friend started with.
Let Y be the random variable for the number of filled buckets.

What you are asking for is the mode of the distribution P(X|Y=c).
(Or maybe the expectation E[X|Y=c] depending on how you interpret your question.)

Let's take a really simple case: the distribution P(X|Y=1). Then

P(X=k|Y=1) = (P(Y=1|X=k) * P(X=k)) / P(Y=1)
= ((1/n)^(k-1) * P(X=k)) / P(Y=1)

Since P(Y=1) is a normalizing constant, we can say P(X=k|Y=1) is proportional to (1/n)^(k-1) * P(X=k).

But P(X=k) is a prior probability distribution. You have to assume some probability distribution on the number of coins your friend has to start with.

For example, here are two priors I could choose:

  1. My prior belief is that P(X=k) = 1/2^k for k > 0.
  2. My prior belief is that P(X=k) = 1/2^(k-100) for k > 100.

Both would be valid priors; the second assumes that X > 100. But they would give wildly different estimates for X: prior 1 would put X around 1 or 2, while prior 2 would put X around 100.
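The effect of the prior is easy to see numerically. Here is a sketch under the answer's Y = 1 setup, using the posterior proportionality derived above (posterior weight ∝ (1/n)^(k-1) * P(X=k)); n = 100 and the two example priors are taken from the text:

```python
n = 100

def posterior_mode(prior, ks):
    """MAP estimate of k over candidate values ks, using the
    unnormalized posterior (1/n)**(k-1) * prior(k)."""
    weights = {k: (1 / n) ** (k - 1) * prior(k) for k in ks}
    return max(weights, key=weights.get)

prior1 = lambda k: 0.5 ** k          # P(X=k) = 1/2^k for k > 0
prior2 = lambda k: 0.5 ** (k - 100)  # P(X=k) = 1/2^(k-100) for k > 100

print(posterior_mode(prior1, range(1, 1000)))    # prints 1
print(posterior_mode(prior2, range(101, 1000)))  # prints 101
```

Since each extra penny multiplies the posterior weight by less than 1 under both priors, the mode lands on the smallest supported k: 1 for prior 1, and 101 for prior 2 (the smallest value with k > 100).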

I would suggest, if you continue to pursue this question, that you just go ahead and pick a prior. Something like a geometric distribution with support k > 0 and mean 10^4 would work nicely (see WolframAlpha).

Timothy Shields