Ruby 1.8.6 Array#uniq not removing duplicate hashes

Question

I have this array, in a ruby 1.8.6 console:

arr = [{:foo => "bar"}, {:foo => "bar"}]

both elements are equal to each other:

arr[0] == arr[1]
=> true
#just in case there's some "==" vs "===" oddness...
arr[0] === arr[1]
=> true

But, arr.uniq doesn't remove the duplicates:

arr.uniq
=> [{:foo=>"bar"}, {:foo=>"bar"}]

Can anyone tell me what's going on here?

EDIT: I can write a not very clever uniqifier which uses include? as follows:

uniqed = []
arr.each do |hash|
  unless uniqed.include?(hash)
    uniqed << hash
  end
end;false
uniqed
=> [{:foo=>"bar"}]

This produces the correct result, which makes the failure of uniq even more mysterious.

EDIT 2: Some notes on what's going on, possibly just for my own clarity. As @Ajedi32 points out in the comments, the failure to uniqify comes from the fact that the two elements are different objects. Some classes define eql? and hash methods, used for comparison, to mean "are these effectively the same thing, even if they're not the same object in memory". String does this for example, which is why you can define two variables to be "foo" and they are said to be equal to one another, even though they're not the same object.

The Hash class doesn't do this, in Ruby 1.8.6, and so when .eql? and .hash are called on a hash object (the .hash method has nothing to do with the Hash data type - it's like the checksum kind of hash) it falls back to using the methods defined in the Object base class, which simply say "Is it the same object in memory".

The == and === operators, for hash objects, already do what I want, i.e. they say that two hashes are the same if their contents are the same. I've overridden Hash#eql? to use these, like so:

class Hash
  def eql?(other_hash)
    self == other_hash
  end
end

But, I'm not sure how to handle Hash#hash: that is, I don't know how to generate a checksum which will be the same for two hashes whose contents are the same and always different for two hashes with different contents.

@Ajedi32 suggested I have a look at Rubinius' implementation of the Hash#hash method here https://github.com/rubinius/rubinius/blob/master/core/hash.rb#L589 , and my version of Rubinius' implementation looks like this:

class Hash
  def hash
    result = self.size
    self.each do |key,value|
      result ^= key.hash 
      result ^= value.hash 
    end
    return result
  end
end

and this does seem to work, although I don't know what the "^=" operator does, which makes me a bit nervous. Also, it's very slow - about 50x as slow based on some primitive benchmarking. This might make it too slow to use.

EDIT 3: A bit of research has revealed that "^" is the Bitwise Exclusive OR operator. When we have two inputs, an XOR returns 1 if the inputs are different (ie it returns 0 for 0,0 and 1,1 and 1 for 0,1 and 1,0).
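To make the XOR behaviour concrete, here is a small sketch using plain integers. The values are arbitrary and chosen only for illustration; nothing here is specific to Ruby 1.8.6.

```ruby
a = 0b1100
b = 0b1010

# Bit by bit: 1 where the inputs differ, 0 where they match.
puts((a ^ b).to_s(2))            # 110

# XOR-ing a value with itself cancels out.
puts(a ^ a)                      # 0

# XOR is commutative and associative, so folding the same numbers in a
# different order gives the same result.
forward  = [5, 9, 12].inject(3) { |acc, n| acc ^ n }
shuffled = [12, 5, 9].inject(3) { |acc, n| acc ^ n }
puts(forward == shuffled)        # true
```

That order-independence is presumably why XOR suits combining the per-entry hashes: Ruby 1.8 hashes do not guarantee enumeration order, so two hashes with equal contents still need to produce the same result.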

So, at first I thought that means that

result ^= key.hash

is shorthand for

result = result ^ key.hash

In other words, do an XOR between the current value of result and the other thing, and then save that in result. I still don't quite get the logic of this though. I thought that perhaps the ^ operator was something to do with pointers, because calling it on variables works while calling it on the value of the variable doesn't work: eg

var = 1
=> 1
var ^= :foo
=> 14904
1 ^= :foo
SyntaxError: compile error
(irb):11: syntax error, unexpected tOP_ASGN, expecting $end

So, it's fine with calling ^= on a variable but not the value of the variable, which made me think it's something to do with referencing/dereferencing.
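For what it's worth, the behaviour above looks consistent with the plain shorthand reading rather than anything pointer-related: a compound assignment needs something assignable on its left-hand side, and a literal is not assignable. A minimal sketch with arbitrary integers:

```ruby
var = 6
var ^= 3               # shorthand form
puts var               # 5

other = 6
other = other ^ 3      # longhand form, same result
puts other             # 5

# A literal cannot be the target of an assignment, so "1 ^= 3" fails to
# parse for the same reason that "1 = 3" does. eval is used here only so
# the rest of the script can still be parsed and run.
begin
  eval("1 ^= 3")
rescue SyntaxError
  puts "SyntaxError: cannot assign to a literal"
end
```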

Later implementations of Ruby also have C code for the Hash#hash method, and Rubinius' implementation seems too slow. Bit stuck...

Ruby `1.8.6` was released more than 10 years ago, the updated `1.8.7` version was released 9 years ago, and all `1.8.x` versions reached end of life more than 4 years ago. Why do you even care? — spickermann, Nov 14 '17 at 16:44
Why does anyone care about old versions of anything? Because they're forced to use them with legacy sites. — Max Williams, Nov 14 '17 at 16:45
Having to work with legacy apps is one thing, but having to work with an application that uses a version that has been outdated for more than 9 years? Wow, that is ridiculous... — spickermann, Nov 14 '17 at 16:49

Ajedi32 · Accepted Answer · 2017-11-15T14:51:21.590

2

For efficiency reasons, Array#uniq does not compare values using == or even ===. According to the docs:

It compares values using their hash and eql? methods for efficiency.

(Note I linked the docs for 2.4.2 here. While the docs for 1.8.6 do not include this statement, I believe it still holds true for that version of Ruby.)
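The effect of the quoted behaviour can be seen on any Ruby version with a small custom class. This is only an illustrative sketch: `LooseTag` and `StrictTag` are hypothetical names, not anything from the standard library.

```ruby
# Hypothetical class: defines == but leaves hash and eql? at the
# Object defaults, which are based on object identity.
class LooseTag
  attr_reader :name

  def initialize(name)
    @name = name
  end

  def ==(other)
    other.is_a?(LooseTag) && name == other.name
  end
end

# Same idea, but eql? and hash are also defined in terms of the contents.
class StrictTag < LooseTag
  def eql?(other)
    other.is_a?(StrictTag) && name.eql?(other.name)
  end

  def hash
    name.hash
  end
end

loose  = [LooseTag.new("a"), LooseTag.new("a")]
strict = [StrictTag.new("a"), StrictTag.new("a")]

puts loose[0] == loose[1]   # true, yet uniq does not consult ==
puts loose.uniq.size        # 2
puts strict.uniq.size       # 1
```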

In Ruby 1.8.6, neither Hash#hash nor Hash#eql? is implemented, so hashes fall back to using Object#hash and Object#eql?:

Equality—At the Object level, == returns true only if obj and other are the same object. Typically, this method is overridden in descendent classes to provide class-specific meaning.

[...]

The eql? method returns true if obj and anObject have the same value. Used by Hash to test members for equality. For objects of class Object, eql? is synonymous with ==.

So according to Array#uniq, those two hashes are different objects, and are therefore unique.

To fix this, you can try defining Hash#hash and Hash#eql? yourself. The details of how to do this are left as an exercise to the reader. You may find it helpful however to refer to Rubinius's implementation of these methods.
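If you would rather not reopen Hash, a rough alternative is a standalone helper that applies the same "compare a cheap digest first, then confirm with real equality" idea. This is a sketch, not Rubinius's code: `content_digest` and `uniq_hashes` are hypothetical names, it assumes `Array#hash` is content-based on your Ruby, and I have not been able to test it on 1.8.6 itself. Nested hashes as values would also still hash by identity there.

```ruby
# Hypothetical helper: an order-independent digest of a hash's contents.
# Each key/value pair is hashed together, so {1 => 2} and {2 => 1} are
# unlikely to collide, and XOR makes the enumeration order irrelevant.
def content_digest(h)
  digest = h.size
  h.each do |key, value|
    digest ^= [key, value].hash
  end
  digest
end

# Hypothetical helper: like Array#uniq, but for hashes compared by content.
def uniq_hashes(hashes)
  buckets = {}   # digest => hashes already seen with that digest
  result = []
  hashes.each do |h|
    bucket = (buckets[content_digest(h)] ||= [])
    next if bucket.include?(h)   # Array#include? compares with ==
    bucket << h
    result << h
  end
  result
end

arr = [{:foo => "bar"}, {:foo => "bar"}, {:foo => "baz"}]
puts uniq_hashes(arr).size   # 2
```

Note that the digest does not have to differ for every pair of different hashes. It only has to be equal for equal contents; collisions are resolved by the `==` check inside each bucket.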

edited Nov 15 '17 at 14:51

answered Nov 14 '17 at 16:47

Ajedi32


Ah - yes: arr[0].eql?(arr[1]) returns false, so it looks like that's the key. thanks! – Max Williams Nov 14 '17 at 16:52
As a bonus question, what do you think is the neatest way to define `eql?` for Hash, to make it behave as expected? (ie to test if the contents are identical). I thought I could just do `self == other_hash` but that feels like trouble for some reason... – Max Williams Nov 14 '17 at 16:56
@MaxWilliams I'd probably check that the class and length of each Hash is the same, then iterate through each key and value using `Hash#each_pair` and compare those with `eql?`. (Though I'm not sure if order should matter?) You might also need to define `Hash#hash` using a similar approach. – Ajedi32 Nov 14 '17 at 16:59
Actually @Ajedi32, I think that `.eql?` isn't the key after all. I added an extension to Hash to make the `eql?` method just do `self == other_hash`, and now I get `arr[0].eql?(arr[1]) => true`. Great. However, `uniq` still keeps the duplicates, making me think that `uniq` doesn't use `eql?` after all. According to my API docs, the source for `uniq` looks like C code. – Max Williams Nov 14 '17 at 17:03
@MaxWilliams Like I said, you probably need to define `Hash#hash` too. "It compares values using their **hash** and eql? methods for efficiency." I suspect it checks whether the hashes match, then if they do it checks eql? to be sure. (Probably implemented using a binary search tree or another, similar data structure.) You may also wish to refer to Rubinius's implementation of these methods, since that implementation is written in Ruby: https://github.com/rubinius/rubinius/blob/master/core/hash.rb#L589 – Ajedi32 Nov 14 '17 at 17:07
ah sorry, I missed that bit. `Hash#hash`, how confusing :) – Max Williams Nov 15 '17 at 08:54
I wrote a large (sorry) couple of Edits to my post about the Hash#hash method, would you mind reading it? I can copy (I think) Rubinius' Hash#hash method, but I don't understand some of it. My version is very slow and I don't know if that's because I've made a mistake "translating" it out of Rubinius... – Max Williams Nov 15 '17 at 10:55
@MaxWilliams Yeah, that's pretty long. Personally I'd suggest splitting that into a couple new questions; one about what the `^=` operator does, and perhaps another one on https://codereview.stackexchange.com/ about how to improve the performance of your existing code. Link them here when they're ready so I can take a look at them; this certainly seems like an interesting problem. – Ajedi32 Nov 15 '17 at 14:42

score 0 · Answer 2 · answered Nov 14 '17 at 16:54


How about converting each hash to a JSON string, calling uniq on the strings, and parsing them back, as you would with JSON.stringify and JSON.parse in JavaScript?

require 'json'
arr.map { |x| x.to_json}.uniq.map { |x| JSON.parse(x) }

The JSON methods might not be available in 1.8.6; if so, use whichever serializer is supported.
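One caveat with the round trip is that `JSON.parse` returns string keys, so a hash keyed by `:foo` comes back keyed by `"foo"`. A possible variation, sketched below, uses the JSON string only as a lookup key and keeps the original objects. `uniq_by_json` is a hypothetical helper name, and this assumes a `json` library is installed for your Ruby.

```ruby
require 'json'

original   = { :foo => "bar" }
round_trip = JSON.parse(original.to_json)

puts round_trip == original          # false
puts round_trip.keys.first.class     # String

# Hypothetical helper: keep the first hash seen for each JSON
# serialization, returning the original (unparsed) objects.
def uniq_by_json(hashes)
  seen = {}
  hashes.select do |h|
    key = h.to_json
    first_time = !seen.key?(key)
    seen[key] = true
    first_time
  end
end

arr = [{ :foo => "bar" }, { :foo => "bar" }]
puts uniq_by_json(arr).size          # 1
```

Equal hashes whose keys were inserted in a different order may serialize to different strings, so this only catches duplicates that serialize identically.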


Nandu Kalidindi

