
I am trying to create an alternative to a Bloom filter. I am using a bit array with a capacity of 100 billion bits (around 12.5 GB). Initially, all the bits are set to zero. The steps I will take to build it are as follows:

  1. I will take an input, generate its SHA-256 hash (chosen for its low collision probability), and reduce the hash modulo 100 billion to obtain a value, say N.
  2. I will set the bit at position N in the array to 1.
  3. If the bit at position N is already set, I will add the input to a bucket specific to that bit.
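
A minimal sketch of the three steps above, scaled down so it runs on a laptop. The class name `BitArrayWithBuckets`, its methods, and the choice of a `bytearray` plus a dictionary of lists for the buckets are assumptions made for illustration, not part of the original design.

```python
import hashlib
from collections import defaultdict


class BitArrayWithBuckets:
    """Hypothetical illustration of the scheme: one bit per slot, plus overflow buckets."""

    def __init__(self, capacity: int):
        self.capacity = capacity
        # One bit per slot, all initially zero.
        self.bits = bytearray((capacity + 7) // 8)
        # Bucket per slot, created lazily on the first collision.
        self.buckets = defaultdict(list)

    def _position(self, item: bytes) -> int:
        # Step 1: SHA-256 digest interpreted as an integer, reduced modulo capacity.
        digest = hashlib.sha256(item).digest()
        return int.from_bytes(digest, "big") % self.capacity

    def add(self, item: bytes) -> bool:
        """Insert item; return True if its bit was already set."""
        n = self._position(item)
        byte_index, mask = n >> 3, 1 << (n & 7)
        if self.bits[byte_index] & mask:
            # Step 3: bit already set, so the input goes into that slot's bucket.
            self.buckets[n].append(item)
            return True
        # Step 2: first arrival at this slot just sets the bit.
        self.bits[byte_index] |= mask
        return False
```

Note that, as described, the first input to reach a slot only sets the bit and is not stored, so a repeated input and a genuine collision look the same to this structure.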

How do I find the increase in the number of collisions caused by reducing the hash value modulo the array size?

If I have 40 billion entries as input, what is the probability of collisions with the proposed method?
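
One way to estimate this, assuming the SHA-256 output behaves like a uniformly random 256-bit integer: the bias introduced by the modulo step is then on the order of 10^11 / 2^256, which is negligible, so collisions are driven almost entirely by the number of slots. Under that assumption, the expected number of occupied slots after n insertions into m slots is m(1 - (1 - 1/m)^n), and every input beyond that count landed on an already-set bit. The function name `expected_collisions` below is hypothetical.

```python
import math


def expected_collisions(n_items: int, n_slots: int) -> float:
    """Expected number of inputs that land on an already-set bit,
    assuming each input picks a slot uniformly and independently."""
    # (1 - 1/m)^n computed as exp(n * log1p(-1/m)) to stay accurate for huge m.
    log_empty = n_items * math.log1p(-1.0 / n_slots)
    # Expected occupied slots: m * (1 - (1 - 1/m)^n).
    occupied = -n_slots * math.expm1(log_empty)
    return n_items - occupied


if __name__ == "__main__":
    n, m = 40_000_000_000, 100_000_000_000
    c = expected_collisions(n, m)
    print(f"expected collisions: {c:.3e} ({c / n:.1%} of inputs)")
```

For 40 billion inputs and 100 billion slots this model gives roughly 7 billion colliding inputs, about 17.6% of the total, so at least one collision is a practical certainty.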

DarkSquirrel42
Aniketh Jain
  • I remember using an MD5 hash with a modulus operation in the past. The distribution was very poor, with a lot of collisions. We then changed the modulus to a prime number and all the problems disappeared. The distribution became as uniform as one could want. – Codo Dec 04 '16 at 17:12
  • Thanks for the quick response. Is there a formula for the probability of collisions if I apply a modulus to the generated hash value? – Aniketh Jain Dec 04 '16 at 18:18
  • I'm not good at probability calculations. It's probably the same formula as the one for the probability that two people in a group share a birthday. – Codo Dec 04 '16 at 21:15
