I'm trying to understand the probability of collision of new hashes, given no collisions in the existing hash table yet.
For illustration, let's say I have a table where I store hashes of each row.
- The table currently has 1 billion rows
- There are no hash collisions amongst those 1 billion rows.
- I'm using a 64-bit hash algorithm.
Now imagine I insert 10 million new rows of data into the table. What is the probability that I have a hash collision now? I think the answer is the following:
Each new row's hash cannot have the same value of any of the existing rows or the new ones processed before itself. That removes 1 billion hash values from the 2^64 possibilities, so the probability of new collisions should be:
Does that sound right?