1

I was wondering if anyone is able to provide some insight into the chances of collisions when using FARM_FINGERPRINT in BigQuery to generate INT64 hashes to be used as Surrogate Keys on tables?

Going with a normal UUID increases storage of the key columns x4. I was thinking FARM_FINGERPRINT(GENERATE_UUID()) might provide an INT64 alternative. I know collisions are always a concern but reading the SMHasher output for FarmHash it looks like it could be an option as it is not showing any collision issues at present.

Other than the size I have users concerned about join performance on STRING vs INT64 surrogate keys in BigQuery. I cannot find anything official that speaks to it to calm the fears. Hence why considering this method to generate an INT64 hash.

ldrg
  • 4,150
  • 4
  • 43
  • 52
Twist
  • 462
  • 1
  • 5
  • 13
  • 1
    You can read about it here: https://github.com/google/farmhash . Realistically you are probably fine, as the INT64 space is in the quintillions of possibilities. If you are really concerned, just concatenate any values and use that as a surrogate key. – rtenha Aug 09 '19 at 15:51
  • Thank you @rtenha I have read that already but was wondering of there was anywhere that a more formal investigation of collision possibilities were published. – Twist Aug 20 '19 at 14:25
  • In regards to join performance: in a real relational database, if you made an index for the key column, then it would hash the strings and save the hashes. So the performance would be the same either way. BigQuery however does not have indices (partitioning is similar but not supported for string columns). So at best, the processing for the first merge is likely to include time to calculate all the hashes. It does some internal caching - how much that helps you would have to discover by experiment. – chrishmorris May 30 '23 at 07:09

0 Answers0