If I run the following one-line Ruby script multiple times, it produces a different output value each time.

    puts "This is a string".hash

What's going on here? How should I change it to get a consistent, reproducible value from .hash for any given input string?

Edit: The "possible duplicate" suggests other hashing methods. I'm trying to reproduce the behavior of another script I have no control over that uses .hash and gets consistent results. Changing hashing methods is not an option.

Edit #2: As noted in another comment below, the other script whose behavior I want to reproduce is inside an .exe wrapper. It dates from 2006, which means the Ruby version must be 1.8.5 or earlier. Did the #hash method work differently in earlier versions of Ruby, and if so, has anyone produced a script that replicates the behavior of those earlier versions? (It can be by a different name.)

  • Possible duplicate of [Consistent String#hash based only on the string's content](https://stackoverflow.com/questions/6536885/consistent-stringhash-based-only-on-the-strings-content) – pr0f3ss Aug 22 '19 at 14:25
  • Per the documentation: "The hash value for an object may not be identical across invocations or implementations of Ruby. If you need a stable identifier across Ruby invocations and implementations you will need to generate one with a custom method." https://ruby-doc.org/core-2.4.1/Object.html#method-i-hash Can you give more info regarding what gets consistent results? – Sara Fuerst Aug 22 '19 at 15:02
  • You are evidently using a version of Ruby prior to 2.3. That version made a change to how literals are stored. Specifically, [literals having the same value point to the same object](https://www.wyeworks.com/blog/2015/12/01/immutable-strings-in-ruby-2-dot-3/), so `"This is a string".hash == "This is a string".hash #=> true` for v2.3+. I expect that for earlier version `"This is a string".freeze.hash == "This is a string".hash #=> true`. Can you test that? – Cary Swoveland Aug 22 '19 at 16:32
  • @CarySwoveland: There is still no guarantee that this value will be the same across invocations. – Jörg W Mittag Aug 22 '19 at 16:33
  • @Jörg, yes, but we don't know if the OP is asking about consistency between invocations or consistency for a single invocation. – Cary Swoveland Aug 22 '19 at 16:36
  • @CarySwoveland: Running a script multiple times sure sounds like multiple invocations. – Jörg W Mittag Aug 22 '19 at 16:39
  • @Jörg, where's my coffee? – Cary Swoveland Aug 22 '19 at 16:42
  • I was looking for consistency between invocations. The other script I was referring to is actually embedded in a .exe wrapper that has no dependency on the user having any version of ruby installed. That .exe is more than 10 years old, and if .hash behaved differently (i.e. producing consistent output) in older versions of Ruby, that's probably what's going on here. – Joe McCauley Aug 22 '19 at 21:21

2 Answers

> What's going on here?

#hash should be different for different objects and the same for equal objects during the lifetime of the program. There is absolutely no guarantee whatsoever about what the value is across different invocations of the program.
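A minimal sketch of that distinction, assuming a current MRI build where String#hash is seeded per process. The helper name `hash_in_fresh_process` is made up for illustration and is not part of Ruby:

```ruby
require "rbconfig"

# Two distinct String objects with equal content.
first  = "This is a string".dup
second = "This is a string".dup

# Within one process, equal strings report equal hash values.
equal_within_process = (first.hash == second.hash)
puts "equal within this process: #{equal_within_process}"

# Hypothetical helper: computes the hash of +str+ in a brand-new Ruby process
# by launching the same interpreter that is running this script.
def hash_in_fresh_process(str)
  IO.popen([RbConfig.ruby, "-e", "print ARGV[0].hash", str], &:read).to_i
end
```

Calling `hash_in_fresh_process("This is a string")` twice will usually return two unrelated integers, which matches what the question observes when the script is run repeatedly.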

The documentation is very explicit here (bold emphasis mine):

> The hash value for an object may not be identical across invocations or implementations of Ruby. If you need a stable identifier across Ruby invocations and implementations you will need to generate one with a custom method.

[Note: for some reason, the documentation for current versions of Ruby isn't rendered correctly on ruby-doc.org. It is identical in the current master branch, though.]

> How should I change it to get a consistent, reproducible value from .hash for any given input string?

Not use it.

Jörg W Mittag

I think it might be helpful to understand what #hash is for. It is used to place a Ruby object into a specific bucket of a Hash data structure, or alternatively to include it in a Set (an implementation detail, because Ruby Sets are implemented "on top" of a Hash). It is not used to digest a value. Once you know that, it becomes apparent that #hash does not need to satisfy the following constraints:

  • Minimize collisions - occasional collisions are fine, since a bucket in a Hash can fall back to a linear search when it holds multiple items
  • Stable across lifetimes of the virtual machine - not required, because Hashes are rebuilt from scratch in every process, even when they are restored through marshaling

It should satisfy the following constraints:

  • Stable within the same lifetime of a VM - otherwise the item would have to be "migrated" to a different bucket in a Hash, which the Hash cannot do on its own. This is why unfrozen Strings are copied and frozen when used as Hash keys
  • Fast to compute
  • Fit into the arbitrary "key size" used by the Ruby Hash buckets (in MRI it is the size of st_index_t I believe)

The second of these requirements (fast to compute) can be satisfied in multiple ways. For example, it can be satisfied by using a faster hashing function. But it can also be satisfied by keeping a lookup of already computed "arbitrary" hash values for, say, Strings, and reusing the stored value when a given String is a duplicate of another. Another approach, which is also sometimes applied, is to derive the hash value from the Ruby object ID, which by definition changes across runs of the virtual machine.
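The within-process guarantees listed above can be sketched in a few lines. `Point` is a hypothetical class invented for this example; the String part relies on the documented behavior of `Hash#[]=` with unfrozen String keys:

```ruby
# Hypothetical value object used as a Hash key.
class Point
  attr_reader :x, :y

  def initialize(x, y)
    @x = x
    @y = y
  end

  # Hash uses eql? to decide whether two keys are the same key.
  def eql?(other)
    other.is_a?(Point) && x == other.x && y == other.y
  end
  alias_method :==, :eql?

  # Equal points must report equal hash values; delegate to Array#hash.
  def hash
    [x, y].hash
  end
end

labels = { Point.new(1, 2) => "found" }
lookup_result = labels[Point.new(1, 2)] # a different but equal object

# An unfrozen String key is copied, and the copy is frozen on insertion,
# so later changes to the original cannot alter the stored key's hash.
key = "name".dup
table = {}
table[key] = 1
stored_key = table.keys.first
```

Here `lookup_result` is `"found"` because both points report the same #hash and are eql?, and `stored_key` is a frozen copy while `key` itself stays mutable. Nothing in this sketch says anything about the values #hash returns in another process.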

So indeed, as Jörg said, the #hash method is not a good fit for your purpose, because it is made for a different use case. There are a number of alternatives though - the usual SHAs, MurmurHash, xxHash and so on - which might satisfy your requirements and are guaranteed to be content-derived.
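As a rough illustration of the content-derived route, here is a sketch using only Ruby's standard library (`digest` and `zlib`). The helper name `stable_id` is an assumption made for this example, and none of this reproduces the numbers an old Ruby's String#hash would have returned:

```ruby
require "digest"
require "zlib"

# Hypothetical helper: a 64-bit integer derived only from the string's bytes.
def stable_id(str)
  # First 8 bytes of the SHA-256 digest, read as a big-endian unsigned integer.
  Digest::SHA256.digest(str).unpack1("Q>")
end

text = "This is a string"
puts Digest::SHA256.hexdigest(text) # 64 hex characters, same on every run
puts Zlib.crc32(text)               # 32-bit checksum, same on every run
puts stable_id(text)                # 64-bit integer, same on every run
```

Which of these fits depends on what the value is used for: CRC32 is small and fast but collides easily, while a SHA-256 prefix is slower but far less likely to collide.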

Julik