I am saving URLs in a database, and when I insert a new one, I want to check whether that URL already exists in the database.
A common practice (if I'm not mistaken) is to hash the URLs using MD5, SHA-1, etc., and to check that hash field in the database for duplicates before inserting a new one.
I know MD5 can produce collisions, and so can SHA-1.
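
To make the setup concrete, here is a minimal sketch of the hash-then-check approach I have in mind. It assumes Python with an in-memory SQLite database purely for illustration; the table name `urls`, the column `url_hash`, and the helpers `url_key`, `make_db` and `insert_if_new` are hypothetical names, not my real schema.

```python
import hashlib
import sqlite3


def url_key(url: str) -> bytes:
    # 16-byte MD5 digest of the URL; any other hash could be swapped in here.
    return hashlib.md5(url.encode("utf-8")).digest()


def make_db() -> sqlite3.Connection:
    # Hypothetical schema: the UNIQUE constraint gives us the index on the hash.
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE urls (url_hash BLOB NOT NULL UNIQUE, url TEXT NOT NULL)"
    )
    return conn


def insert_if_new(conn: sqlite3.Connection, url: str) -> bool:
    # Returns True if the URL was inserted, False if its hash already exists.
    key = url_key(url)
    row = conn.execute(
        "SELECT 1 FROM urls WHERE url_hash = ?", (key,)
    ).fetchone()
    if row is not None:
        return False
    conn.execute("INSERT INTO urls (url_hash, url) VALUES (?, ?)", (key, url))
    return True


conn = make_db()
print(insert_if_new(conn, "http://example.com/a"))  # first insert of this URL
print(insert_if_new(conn, "http://example.com/a"))  # same URL again
```

Note that in this sketch a hash collision would make a genuinely new URL look like a duplicate, which is the kind of error I am willing to tolerate.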
What would you suggest? My requirements are:
DB size: eventually 10 to 20 million records in the database
Performance/speed: a small hash size, so the database is not put under heavy load when checking for duplicates (there will of course be an index on that field)
Tolerance: I don't care if I get 1 collision in every 100,000 records. I care more about performance (small hash) than about 0% collisions (big hash).
Chance of an attack using specially crafted URLs to produce collisions on purpose: extremely low
Maximum possible damage from such a successful attack: extremely low
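
To put rough numbers on the size and tolerance requirements above, here is a small sketch using the usual birthday approximation: with n records and a b-bit hash, the expected number of colliding pairs is about n(n-1)/2 divided by 2^b. It assumes the hash output is uniformly distributed, and `expected_collisions` is just a hypothetical helper name.

```python
def expected_collisions(n_records: int, hash_bits: int) -> float:
    # Birthday approximation: number of record pairs divided by the number
    # of possible hash values, assuming a uniformly distributed hash.
    pairs = n_records * (n_records - 1) / 2
    return pairs / 2 ** hash_bits


# Compare a few candidate hash sizes at the upper end of my expected DB size.
for bits in (32, 64, 128):
    print(bits, expected_collisions(20_000_000, bits))
```

If I have the math right, a 32-bit hash would give tens of thousands of expected collisions at 20 million records, which is well beyond my tolerance, while 64 bits already puts the expected count far below one.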
Questions:
Do you believe MD5 is enough, or is there something better you would suggest?
Maybe MD5 is even overkill for me, and I could get serious performance benefits by using something simpler?
Thank you in advance, guys!