Short answer: no. 8 bytes (64 bits) of hash is not enough to reliably distinguish "millions" of URLs.
See this table: https://en.wikipedia.org/wiki/Birthday_problem#Probability_table
The table gives the probability of a collision for a given hash size and number of elements. Whether that probability is acceptable depends on two things: how many elements are in your set, and your definition of "reliable".
For software that is intended to run repeatedly, you should probably assess:
- The maximum number of elements (URLs) you expect to be present, particularly if you're progressively adding to the set.
- The number of distinct datasets (or the number of runs) your software will process over its expected lifetime.
- The acceptable error rate for your software, per period or over its lifetime.
- How expensive it would be to fix an error: is the hash in memory so that you just need to re-run, or are you collecting these persistently in a DB/file where they are expensive to recollect?
There are a few calculators available to help with this. I've used this one: https://kevingal.com/apps/collision.html (switch it to 'bits' mode rather than buckets).
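If you would rather check the numbers yourself, here is a minimal sketch of the usual birthday approximation, p ≈ 1 - exp(-n(n-1) / (2 · 2^bits)). The function name `collision_probability` is made up for this illustration, and the formula assumes the hash output is uniformly distributed.

```python
import math

def collision_probability(n_items: int, hash_bits: int) -> float:
    """Approximate chance that at least two of n_items share a hash value."""
    # Birthday approximation: p ~= 1 - exp(-n(n-1) / (2 * 2^bits)).
    exponent = -n_items * (n_items - 1) / (2.0 * 2.0 ** hash_bits)
    # expm1 keeps precision when the probability is tiny.
    return -math.expm1(exponent)

# Compare a few hash sizes for 200 million elements.
for bits in (64, 74, 128):
    print(bits, collision_probability(200_000_000, bits))
```

For 200 million elements this puts 64 bits at roughly a one-in-a-thousand chance of a collision, while 128 bits is negligible.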
An example assessment:
- 10 million URLs added per year
- Software lifetime is 20 years & run 100 times a year, but it's always the same dataset
- We accept only a one-in-a-million chance of error.
- = 200 million URLs (10 million per year x 20 years; re-running on the same dataset adds no new collision risk)
- => By calculator, a hash of 74 bits would give you ~1.1e-6 chance of collision -- but just go for 128 bits.
Another possible assessment:
- 50 million URLs or elements in each dataset
- Software lifetime is 20 years & run 100 times a year, each time with an independent dataset
- We accept only a one-in-a-million lifetime chance of error.
- = 50 million URLs
- = Software needs to run 2000 times, with a lifetime probability of error of one-in-a-million
- => Therefore the probability of error on any one run needs to be ~5e-10 (1e-6 / 2000).
- => By calculator, a hash of 81 bits would give you 5.2e-10 chance of collision -- just go for 128 bits.
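Both assessments can also be worked backwards, from an error budget to a minimum hash size. This is a rough sketch using the small-probability form of the birthday approximation, p ≈ n(n-1) / (2 · 2^bits); `min_hash_bits` is a hypothetical helper name, not part of any library.

```python
import math

def min_hash_bits(n_items: int, max_probability: float) -> int:
    """Smallest bit width whose approximate collision chance is <= max_probability."""
    # From p ~= pairs / 2^bits, solve for bits and round up to a whole bit.
    pairs = n_items * (n_items - 1) / 2.0
    return math.ceil(math.log2(pairs / max_probability))

# First assessment: 200 million URLs, one-in-a-million budget.
print(min_hash_bits(200_000_000, 1e-6))
# Second assessment: 50 million URLs per run, 1e-6 budget spread over 2000 runs.
print(min_hash_bits(50_000_000, 1e-6 / 2000))
```

This prints 75 and 82, one bit more than the 74 and 81 quoted above, because those sizes land just over the budget (~1.1e-6 and 5.2e-10). Either way the answer is well under 128 bits.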
Conclusion: just keep it simple & robust, use 128 bits of hash.
A few last words: XOR'ing is neither necessary nor desirable; just truncate the hash to the number of bytes you are going to use.
If robust uniqueness is important, I suggest you stick with a cryptographic-grade hash such as SHA-256 or SHA-1. Prefer SHA-256 if the URLs could come from an adversary: practical collision attacks on SHA-1 have been demonstrated, whereas none are known for SHA-256. Both algorithms are still very fast.
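As an illustration of plain truncation, here is a minimal sketch using `hashlib` from the Python standard library. The helper name `url_key`, the 16-byte default and the example URLs are assumptions for this example only.

```python
import hashlib

def url_key(url: str, num_bytes: int = 16) -> bytes:
    """Return the first num_bytes of the SHA-256 digest of a URL."""
    digest = hashlib.sha256(url.encode("utf-8")).digest()
    # Plain truncation: no XOR folding of the remaining bytes.
    return digest[:num_bytes]

# Deduplicate URLs by their 128-bit keys.
seen = set()
for url in ["https://example.com/a", "https://example.com/b", "https://example.com/a"]:
    key = url_key(url)
    if key in seen:
        print("duplicate:", url)
    seen.add(key)
```

Storing the raw 16 bytes rather than a hex string halves the space per key, which matters if you keep millions of them.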