35

GOAL: Map every URL handled by a server to 0, 1, 2, or 3, distributing as uniformly as possible.

While the documentation for Ruby's String#hash method says it will "return a hash based on the string's length and content," this clearly isn't the whole story. A given string's hash is not consistent across invocations of the interpreter:

$ irb
ruby-1.9.2-p180 :001 > "foo".hash
 => 360517580588231756 
ruby-1.9.2-p180 :002 > ^D

$ irb
ruby-1.9.2-p180 :001 > "foo".hash
 => -2716152678666510148 

This means a particular string's hash value may differ across, say, servers. Rails uses String#hash internally to map a URL path to one of four asset hosts (if the app's asset_host is so configured), but this feature is a lot less efficient than it could be because of the cross-machine inconsistencies; different servers may map the same URL to different asset hosts, reducing the effectiveness of caches, clouding skies, cooling cups of tea prematurely, besmirching the reputations of otherwise fine programmers.

Can you suggest an alternate hash function that could effectively and speedily distribute hashes across a typical app's URL space, preferably one that produces a Fixnum since, in the end, I'll want to map it into one of four asset hosts?

rogerdpack
Rob Davis
  • Just wondering, did you ever find a good solution to this? – robd Mar 15 '13 at 17:43
  • 3
    CRC may be a good solution - see http://stackoverflow.com/questions/4452161/do-ruby-1-8-and-1-9-have-the-same-hash-code-for-a-string – robd Mar 15 '13 at 17:55
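The CRC suggestion in the comment above could be sketched roughly as follows. This is only a minimal sketch: `asset_host_for` and the `assets%d.example.com` host pattern are hypothetical names chosen for illustration, and the only library call relied on is `Zlib.crc32` from Ruby's standard library. Unlike String#hash, CRC32 is a fixed algorithm, so the result should be the same in every process and on every server.

```ruby
require 'zlib'

# Hypothetical helper: map a URL path to one of `host_count` asset hosts.
# Zlib.crc32 returns a non-negative integer that depends only on the bytes
# of the string, so the mapping is stable across interpreter runs.
def asset_host_for(path, host_count = 4)
  index = Zlib.crc32(path) % host_count
  format("assets%d.example.com", index) # assumed host naming scheme
end

puts asset_host_for("/images/logo.png")
puts asset_host_for("/stylesheets/app.css")
```

Whether CRC32 spreads a particular app's URLs evenly enough is something to measure on real paths rather than assume.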

4 Answers

45

There is a lot of this kind of functionality in Ruby's Digest module: http://ruby-doc.org/stdlib/libdoc/digest/rdoc/index.html

A simple example:

require 'digest/sha1'
Digest::SHA1.hexdigest("some string")
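To get from the hex digest to one of the four hosts, one option is to parse the digest as a base-16 integer and reduce it modulo 4. A minimal sketch, where `asset_host_index` is a hypothetical helper name and not part of Digest or Rails:

```ruby
require 'digest/sha1'

# Hypothetical helper: turn a URL path into an integer in 0...host_count.
# hexdigest returns a 40-character hex string; to_i(16) parses it into
# an arbitrarily large Integer, which is then reduced with %.
def asset_host_index(path, host_count = 4)
  Digest::SHA1.hexdigest(path).to_i(16) % host_count
end

puts asset_host_index("/images/logo.png")
puts asset_host_index("/javascripts/app.js")
```

Because SHA1 depends only on the input bytes, every server should compute the same index for the same path.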
keymone
  • 4
    True, but are SHA1 and MD5 overkill? Too slow? – Rob Davis Jun 30 '11 at 15:09
  • 1
    fast enough i'd say. you can check google's CityHash which targets performance and minimizing collisions in general text strings but give sha1/md5 a try and test the performance. – keymone Jun 30 '11 at 15:11
  • 3
    @RobDavis the "hex" in #hexdigest suggests that output is a hexadecimal number, to convert it to int all you have to do is call to_i(16) – keymone Jun 12 '15 at 12:38
  • i don't know constraints of your system but sha *is* a real solution to your problem, just as md5 is and murmur hash and crc32. the difference is the space you're mapping urls onto - only 4bn numbers in 32 bits which is not good enough for almost any problem i can think of. instead of restricting yourself to short 32bit integers, try representing large integers in higher base - base64/62 is a great way to represent huge integers with few characters. – keymone Jun 15 '15 at 08:10
6

The easiest consistent way may be this, and it's fast:

"https://www.example.com/abc/def/123?hij=345".sum % 4

That will always produce an integer from 0 to 3 and should be fairly well distributed (though I haven't actually run tests on distribution).
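Since the distribution is untested, a quick check along these lines may help. It is a rough sketch over made-up sample paths (not taken from any real app), so the counts it prints say nothing definite about a real URL space. It also shows one known weakness: String#sum just adds up byte values, so paths that are rearrangements of the same characters land in the same bucket.

```ruby
# Hypothetical sample paths, invented purely for illustration.
paths = (1..1000).map { |i| "/assets/images/photo-#{i}.png" }

# Count how many paths fall into each of the four buckets.
counts = Hash.new(0)
paths.each { |path| counts[path.sum % 4] += 1 }

counts.sort.each { |bucket, n| puts "bucket #{bucket}: #{n}" }

# Rearranged characters give the same sum, hence the same bucket.
puts "/a/b.png".sum == "/b/a.png".sum
```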

Jason
2

There is a tiny library, xxHash:

XXhash.xxh32('qwe') #=> 2396643526
XXhash.xxh64('qwe') #=> 9343136760830690622

It may produce more collisions, but it is about 10x faster than SHA1:

require 'benchmark'
require 'digest/sha1'
require 'xxhash' # provided by the xxhash gem

# Time 100,000 hashes of a short string with each function.
Benchmark.bm do |x|
  n = 100_000
  str = 'qweqweqwe'
  x.report('xxhash32') { n.times { XXhash.xxh32(str) } }
  x.report('xxhash64') { n.times { XXhash.xxh64(str) } }
  x.report('hexdigest') { n.times { Digest::SHA1.hexdigest(str) } }
end

#       user     system      total        real
# xxhash32  0.020000   0.000000   0.020000 (  0.021948)
# xxhash64  0.040000   0.000000   0.040000 (  0.036340)
# hexdigest  0.240000   0.030000   0.270000 (  0.276443)
Lev Lukomsky
  • Just a benchmark note: on my 8th-gen i7 Lenovo laptop, `XXhash.xxh32(str)` took `0.016127`, `XXhash.xxh64(str)` took `0.020158`, and `hexdigest` took `0.212843`, so I would definitely go with xxh32. – equivalent8 Oct 24 '19 at 09:26
1

You can try to_i(36).

"Hash me please :(".to_i(36)
=> 807137
retro
  • 1
    Although that only seems to look at the first four characters: `"Hash something else".to_i(36)` also produces 807137 – Rob Davis Jun 30 '11 at 16:47
  • 1
    It works only until first space. So it is usable for URLs. There is similar method on Fixnum called to_s(36). – retro Jun 30 '11 at 23:33
  • Actually, it seems to stop at the first slash. – Rob Davis Jul 01 '11 at 13:20
  • 2
    It's going to stop at the first non-alphanumeric character, because you are specifying a number base. You could strip such characters, but then you might have to worry about overflow (since you are basically asking it to parse a huge number). – benzado Mar 21 '12 at 04:59
  • 1
    This also doesn't distribute evenly, but rather is weighted according to the frequency of the last letter. E.g. "e" is more common in English, so you'd have more requests going to the server associated with e. – Edward Anderson Mar 13 '17 at 02:31
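The behaviour described in these comments can be reproduced with a short sketch. It relies only on String#to_i from core Ruby, which parses the leading run of valid base-36 digits and ignores everything after the first invalid character; the URL-like paths are made up for illustration.

```ruby
a = "Hash me please :(".to_i(36)     # only "Hash" is parsed
b = "Hash something else".to_i(36)   # same leading "Hash", same value
c = "/images/logo.png".to_i(36)      # leading "/" is not a digit, so 0

puts a        # => 807137
puts a == b   # => true
puts c        # => 0

# Paths that share a first segment collide, since parsing stops at "/".
puts "images/a.png".to_i(36) == "images/b.png".to_i(36)  # => true
```

For typical URL paths, which start with a slash, this means every path would map to 0.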