I think it helps to understand what #hash is for. It is used to place a Ruby object into a specific bucket of a Hash data structure, or alternatively to include it in a Set (an implementation detail, because Ruby Sets are implemented "on top" of a Hash). It is not used to digest a value. Once you know that, it becomes apparent that #hash does not need to satisfy the following constraints:
- Never collide - occasional collisions are fine, since a bucket in a Hash that holds multiple items falls back to a search among them
- Stable across lifetimes of the virtual machine - not required, because Hashes are rebuilt anew every time, even when you marshal them
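To illustrate the point about collisions, here is a minimal sketch. The class AlwaysCollides is made up for this example, and returning a constant from #hash is deliberately the worst case, not something you would do in real code. The Hash still behaves correctly because it uses eql? to tell colliding keys apart; it just gets slower.

```ruby
# Hypothetical class whose #hash collides for every instance.
class AlwaysCollides
  attr_reader :name

  def initialize(name)
    @name = name
  end

  # Every instance reports the same hash value, so all keys collide.
  def hash
    42
  end

  # With colliding hash values, Hash relies on eql? to compare keys.
  def eql?(other)
    other.is_a?(AlwaysCollides) && name == other.name
  end
  alias_method :==, :eql?
end

table = {}
table[AlwaysCollides.new("a")] = 1
table[AlwaysCollides.new("b")] = 2

puts table.size                      # both keys are kept despite the collision
puts table[AlwaysCollides.new("b")]  # lookup with an equal key still works
```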
It should satisfy the following constraints:
- Stable within the same lifetime of a VM - otherwise the item would have to be "migrated" to a different bucket in a Hash, which the Hash cannot do on its own. This is why unfrozen Strings get copied and frozen when used as Hash keys
- Fast to compute
- Fit into the arbitrary "key size" used by the Ruby Hash buckets (in MRI it is the size of st_index_t, I believe)
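The String key behaviour mentioned above can be observed directly. This is a small sketch; the variable names are only for illustration.

```ruby
key = +"abc"             # unary plus gives us an unfrozen String
table = {}
table[key] = 1           # Hash#[]= stores a frozen copy of an unfrozen String key

stored = table.keys.first
puts stored.frozen?      # the key held by the Hash is frozen
puts stored.equal?(key)  # and it is not the object we passed in

key << "d"               # mutating our own String does not touch the stored key
puts table["abc"]        # the entry is still found under the original content

# Within one process, equal content gives the same #hash value.
puts "abc".hash == "abc".dup.hash
```

Because the Hash keeps its own frozen copy, the hash value of the stored key cannot change behind its back.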
The second requirement can be satisfied in multiple ways, for example by using a faster hashing function. It can also be satisfied by keeping a lookup of already computed hash values for, say, Strings, and reusing the stored value when a given String is a duplicate of another. Another approach, which is also sometimes applied, is to derive the hash value from the Ruby object ID, which by definition changes across runs of the virtual machine.
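As a rough sketch of two of these approaches: a plain object falls back to the default identity-based #hash, while a value object can compute a content-based hash once and reuse it. PlainPoint and ValuePoint are hypothetical classes written for this example, and precomputing in the constructor is just one way to make #hash cheap, not what MRI does internally.

```ruby
# Plain objects use the default identity-based #hash and #eql?.
class PlainPoint
  def initialize(x, y)
    @x = x
    @y = y
  end
end

# Hypothetical value object with a content-based hash computed once.
class ValuePoint
  # attr_reader :hash exposes the precomputed value as the #hash method.
  attr_reader :x, :y, :hash

  def initialize(x, y)
    @x = x
    @y = y
    @hash = [self.class, x, y].hash  # computed once, reused on every lookup
    freeze                           # the content can no longer change
  end

  def eql?(other)
    other.is_a?(ValuePoint) && x == other.x && y == other.y
  end
  alias_method :==, :eql?
end

plain = { PlainPoint.new(1, 2) => :found }
puts plain[PlainPoint.new(1, 2)].inspect  # nil: a different object is a different key

value = { ValuePoint.new(1, 2) => :found }
puts value[ValuePoint.new(1, 2)].inspect  # :found: equal content is the same key
```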
So indeed, as Jörg said, #hash is not a good fit for your purpose, because it is made for a different use case. There are a number of alternatives though - the usual SHAs, MurmurHash, xxHash and so on - which might satisfy your requirements and are guaranteed to be content-derived.
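For instance, a content-derived digest could look roughly like this, using SHA-256 from Ruby's standard library. The helper name content_digest is made up for this sketch, and SHA-256 is only one of the options listed above; whether it is the right choice depends on your speed and collision requirements.

```ruby
require "digest"

# Hypothetical helper: returns a hex digest that depends only on the bytes of
# the given String, so it is the same in every process and on every machine.
def content_digest(value)
  Digest::SHA256.hexdigest(value)
end

puts content_digest("hello")
puts content_digest("hello") == content_digest("hello".dup)  # same content, same digest
```

Unlike #hash, this value can be stored or sent elsewhere and compared later.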