Check if two hashes have the same set of keys

Question

What is the most efficient way to check whether two hashes h1 and h2 have the same set of keys, disregarding order? Can it be made faster than the answer I posted, or more concise at comparable speed?

I tried it with a limited example, and `h1.keys.sort == h2.keys.sort` was a bit slower. But I am not sure if this is the case in general. — sawa, Dec 09 '12 at 10:22
I think you should mention that in the question. And also I would post the solution as part of the question, not as an answer. — Sergio Tulentsev, Dec 09 '12 at 10:25
I didn't think my answer was any more special than the others. I thought it would be better to have all possible answers, including mine, listed together rather than having one random one (mine) within the question. — sawa, Dec 09 '12 at 10:26
I think that it's pure convenience. You write "could it be easier than my answer"? Now I have to scroll down, parse answers and find yours. It's extra work for me for no reason. — Sergio Tulentsev, Dec 09 '12 at 10:29
I am asking for the fastest solution. In any case, you would be comparing with other answers posted at that point. — sawa, Dec 09 '12 at 10:30
A little off-topic question: is it for pure fun or do you have VERY large hashes (and you have profiled your code) and improving this part of code will give you HUGE performance boost? — Tomek Wałkuski, Dec 09 '12 at 10:52
I am using this iteratively many times for hashes that are not so big. — sawa, Dec 09 '12 at 11:08
@TomaszWałkuski this is not offtopic, but the most ontopic question I can think of! Any solution will depend on the use case. — akuhn, Dec 09 '12 at 12:15

Jan · Answer 1 · 2012-12-09T13:06:40.157

Alright, let's break all rules of savoir vivre and portability. MRI's C API comes into play.

/* Name this file superhash.c. An appropriate Makefile is attached below. */
#include <ruby/ruby.h>

static int key_is_in_other(VALUE key, VALUE val, VALUE data) {
  struct st_table *other = ((struct st_table**) data)[0];
  if (st_lookup(other, key, 0)) {
    return ST_CONTINUE;
  } else {
    int *failed = ((int**) data)[1];
    *failed = 1;
    return ST_STOP;
  }
}

static VALUE hash_size(VALUE hash) {
  if (!RHASH(hash)->ntbl)
    return INT2FIX(0);
  return INT2FIX(RHASH(hash)->ntbl->num_entries);
}

static VALUE same_keys(VALUE self, VALUE other) {
  if (CLASS_OF(other) != rb_cHash)
    rb_raise(rb_eArgError, "argument needs to be a hash");
  if (hash_size(self) != hash_size(other))
    return Qfalse;
  if (!RHASH(self)->ntbl || !RHASH(other)->ntbl)
    return Qtrue;
  int failed = 0;
  void *data[2] = { RHASH(other)->ntbl, &failed };
  rb_hash_foreach(self, key_is_in_other, (VALUE) data);
  return failed ? Qfalse : Qtrue;
}

void Init_superhash(void) {
  rb_define_method(rb_cHash, "same_keys?", same_keys, 1);
}

Here's a Makefile.

CFLAGS=-std=c99 -O2 -Wall -fPIC $(shell pkg-config ruby-1.9 --cflags)
LDFLAGS=-Wl,-O1,--as-needed $(shell pkg-config ruby-1.9 --libs)
superhash.so: superhash.o
	$(LINK.c) -shared $^ -o $@

An artificial, synthetic and simplistic benchmark shows what follows.

require 'superhash'
require 'benchmark'
n = 100_000
h1 = h2 = {a:5, b:8, c:1, d:9}
Benchmark.bm do |b|
  # freemasonjson's state of the art.
  b.report { n.times { h1.size == h2.size and h1.keys.all? { |key| !!h2[key] }}}
  # This solution
  b.report { n.times { h1.same_keys? h2} }
end
#       user     system      total        real
#   0.310000   0.000000   0.310000 (  0.312249)
#   0.050000   0.000000   0.050000 (  0.051807)

Whoa, that's awesome! I def gotta go back to knowing C. — strider, Dec 09 '12 at 17:31

score 7 · Answer 2 · edited May 23 '17 at 12:10


Combining freemasonjson's and sawa's ideas:

h1.size == h2.size and (h1.keys - h2.keys).empty?
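
One way to see why a single difference is enough here: hash keys are unique, so if both hashes have the same number of keys and every key of h1 is in h2, there is no room left in h2 for an extra key. A minimal sketch with made-up hashes (the helper name `same_keys_one_way?` is hypothetical, not part of the answer):

```ruby
# Hypothetical helper wrapping the one-liner above.
def same_keys_one_way?(h1, h2)
  h1.size == h2.size && (h1.keys - h2.keys).empty?
end

base     = { a: 1, b: 2 }
shuffled = { b: 0, a: 0 }         # same keys, different order and values
superset = { a: 1, b: 2, c: 3 }   # one extra key
swapped  = { a: 1, c: 3 }         # same size, one key differs

r_shuffled = same_keys_one_way?(base, shuffled)
r_superset = same_keys_one_way?(base, superset)
r_swapped  = same_keys_one_way?(base, swapped)

puts r_shuffled  # true
puts r_superset  # false (sizes differ)
puts r_swapped   # false (:b is not a key of swapped)
```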

answered Dec 09 '12 at 11:39 by Jan; edited May 23 '17 at 12:10 by Community

strider · Accepted Answer · 2012-12-09T11:43:31.903

score 5

Try:

# Check that both hashes have the same number of entries before anything else
if h1.size == h2.size
    # all? stops iterating and returns false at the first mismatched key,
    # otherwise it returns true.
    # Caveat: !!h2[key] also treats a key as missing when its value is nil or false.
    h1.keys.all?{ |key| !!h2[key] }
end

Enumerable#all?

Worst-case scenario, you'd only be iterating through the keys once.
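
A small sketch of where a value-based check such as `!!h2[key]` and a key-based check such as `Hash#key?` disagree. The hashes are made up for illustration, and the size comparison is left out because all of them have two entries:

```ruby
h1 = { a: 1, b: 2 }
h2 = { a: nil, b: false }   # same keys as h1, but falsy values

by_value = h1.keys.all? { |key| !!h2[key] }      # looks at the values
by_key   = h1.keys.all? { |key| h2.key?(key) }   # looks at the keys only

puts by_value  # false, although the key sets are identical
puts by_key    # true

# A default value causes the opposite error: missing keys look present.
h3 = Hash.new(0)
h3[:a] = 1
h3[:c] = 1                  # same size as h1, but :b is absent

default_by_value = h1.keys.all? { |key| !!h3[key] }
default_by_key   = h1.keys.all? { |key| h3.key?(key) }

puts default_by_value  # true, although :b is not a key of h3
puts default_by_key    # false
```

If the values are known to be truthy and the hashes have no default, both checks give the same answer.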

answered Dec 09 '12 at 11:29 by strider; edited Dec 09 '12 at 11:43

Even better, `h2.include?(key)`. – akuhn Dec 09 '12 at 11:49
I did some benchmarks and it seems that this answer is a clear winner so far. Using `Hash#include?` doesn't bring any improvements to performance but it's surely a good step forward in terms of readability. – Jan Dec 09 '12 at 11:51
`if a then b end` -> `a && b` – tokland Dec 09 '12 at 12:01
@Jan caution with benchmarks. In particular synthetic ones! This solution (whether using include or not) will be faster if and only if the key sets differ more often than not. If the dominating case is equal key sets, it will be slower. – akuhn Dec 09 '12 at 12:13
@akuhn, thanks for the comment. I agree with you regarding synthetic benchmarks. However, in the benchmark I did `h1 == h2` and as a result key sets _were equal_. – Jan Dec 09 '12 at 12:24
@Jan interesting, so even for equal key sets this one is faster than all other solutions presented here? – akuhn Dec 09 '12 at 12:32
@akuhn, that's what my benchmark showed. It came as a surprise, but when I gave it a thought it made sense. Unlike other answers, this solution doesn't create many additional objects in memory. As a result it's GC-friendly, which in the light of MRI's GC's performance is a huge benefit. – Jan Dec 09 '12 at 12:34
@Jan interesting. Also, it might be that Ruby's implementation of `Array#-` isn't the smartest either. Did you try `Set.new(h1.keys) == Set.new(h2.keys)` !? – akuhn Dec 09 '12 at 12:37
@akuhn, no, I haven't. OP mentioned small hashes so I don't think it's a good direction. – Jan Dec 09 '12 at 12:40

Vincent B. · Answer 4 · 2012-12-16T01:55:23.200

Just for the sake of having at least a benchmark on this question...

require 'securerandom'
require 'benchmark'

a = {}
b = {}

# Use uuid to get a unique random key
(0..1_000).each do |i|
  key = SecureRandom.uuid
  a[key] = i
  b[key] = i
end

Benchmark.bmbm do |x|
  x.report("#-") do
    1_000.times do
      (a.keys - b.keys).empty? and (b.keys - a.keys).empty?
    end
  end

  x.report("#&") do
    1_000.times do
      computed = a.keys & b.keys
      computed.size == a.size
    end
  end

  x.report("#all?") do
    1_000.times do
      a.keys.all?{ |key| !!b[key] }
    end
  end

  x.report("#sort") do
    1_000.times do
      a_sorted = a.keys.sort
      b_sorted = b.keys.sort
      a_sorted == b_sorted
    end
  end
end

Results are:

Rehearsal -----------------------------------------
#-      1.000000   0.000000   1.000000 (  1.001348)
#&      0.560000   0.000000   0.560000 (  0.563523)
#all?   0.240000   0.000000   0.240000 (  0.239058)
#sort   0.850000   0.010000   0.860000 (  0.854839)
-------------------------------- total: 2.660000sec

            user     system      total        real
#-      0.980000   0.000000   0.980000 (  0.976698)
#&      0.560000   0.000000   0.560000 (  0.559592)
#all?   0.250000   0.000000   0.250000 (  0.251128)
#sort   0.860000   0.000000   0.860000 (  0.862857)

I have to agree with @akuhn that this would be a better benchmark if we had more information on the dataset you are using. That being said, I believe this question really needed some hard facts.

I'd recommend adding the name of the benchmark to the `report` method as a parameter. That will enable adding the name to the result report, making it a lot easier to read. — the Tin Man, Dec 09 '12 at 16:02

score 3 · Answer 5 · edited Dec 09 '12 at 15:59

It depends on your data.

There is no general case really. For example, retrieving the entire key set at once is generally faster than checking inclusion of each key separately. However, if the key sets in your data differ more often than not, a slower solution that fails fast might win overall. For example:

h1.size == h2.size and h1.keys.all?{|k|h2.include?(k)}
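
To make the "fails fast" point concrete, here is a rough sketch that counts how many keys are actually looked up before the check returns. The helper `compare_counting` and the sample hashes are assumptions for illustration only:

```ruby
# Hypothetical helper: returns [result, number of keys actually looked up].
def compare_counting(h1, h2)
  lookups = 0
  same = h1.size == h2.size && h1.keys.all? { |k| lookups += 1; h2.include?(k) }
  [same, lookups]
end

h1      = { a: 1, b: 2, c: 3, d: 4 }
equal   = { d: 0, c: 0, b: 0, a: 0 }   # same keys in another order
early   = { z: 0, b: 0, c: 0, d: 0 }   # the first key of h1 is missing
smaller = { a: 0 }                     # different size

p compare_counting(h1, equal)    # [true, 4]
p compare_counting(h1, early)    # [false, 1]
p compare_counting(h1, smaller)  # [false, 0]
```

When the key sets are equal, every key is still looked up, which is the case where fetching and comparing whole key sets may do better.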

Another factor to consider is the size of your hashes. If they are big, a solution with a higher setup cost, like calling Set.new, might pay off; if they are small, it won't:

require 'set'
h1.size == h2.size and Set.new(h1.keys) == Set.new(h2.keys)
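
A short sketch of the Set-based comparison on made-up hashes. As a side note, it also works when the keys are of mixed types that cannot be ordered, where a comparison based on sorting the key arrays raises:

```ruby
require 'set'

h1 = { :a => 1, "b" => 2 }
h2 = { "b" => 0, :a => 0 }   # same keys, different order

set_equal = h1.size == h2.size && Set.new(h1.keys) == Set.new(h2.keys)
puts set_equal  # true

# Sorting needs mutually comparable keys; a Symbol and a String are not.
sort_error =
  begin
    h1.keys.sort == h2.keys.sort
    nil
  rescue ArgumentError => e
    e.class
  end
puts sort_error.inspect  # ArgumentError
```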

And if you happen to compare the same immutable hashes over and over again, it would definitely pay off to cache the results.
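
A possible shape for such a cache, as a sketch only. The class name `KeyComparisonCache` and its `misses` counter are hypothetical, and the approach assumes the hashes really are never mutated after the first comparison:

```ruby
# Hypothetical memoizing wrapper; assumes the compared hashes never change.
class KeyComparisonCache
  attr_reader :misses

  def initialize
    # Both levels compare keys by object identity, so a cache lookup never
    # has to hash the contents of the (possibly large) hashes themselves.
    @results = Hash.new { |outer, h| outer[h] = {}.compare_by_identity }
    @results.compare_by_identity
    @misses = 0
  end

  def same_keys?(h1, h2)
    per_h1 = @results[h1]
    return per_h1[h2] if per_h1.key?(h2)

    @misses += 1
    per_h1[h2] = h1.size == h2.size && h1.keys.all? { |k| h2.key?(k) }
  end
end

a = { x: 1, y: 2 }.freeze
b = { y: 0, x: 0 }.freeze

cache = KeyComparisonCache.new
3.times { cache.same_keys?(a, b) }

cached_result = cache.same_keys?(a, b)
puts cached_result  # true
puts cache.misses   # 1
```

The cache keeps references to every hash it has seen, so it only makes sense when the set of compared hashes is small and long-lived.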

Ultimately, only a benchmark will tell, but to write one we'd need to know more about your use case. Testing a solution with synthetic data (for example, randomly generated keys) will certainly not be representative.

score 1 · Answer 6 · answered Dec 09 '12 at 10:15


This is my try:

(h1.keys - h2.keys).empty? and (h2.keys - h1.keys).empty?
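
A brief illustration of why both differences are needed when the sizes are not compared first. The hashes are made up for the example:

```ruby
h1 = { a: 1, b: 2 }
h2 = { a: 1, b: 2, c: 3 }   # every key of h1, plus one more

one_way   = (h1.keys - h2.keys).empty?
both_ways = (h1.keys - h2.keys).empty? && (h2.keys - h1.keys).empty?

puts one_way    # true, because h1's keys are a subset of h2's
puts both_ways  # false, because :c is only in h2
```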

answered Dec 09 '12 at 10:15 by sawa

score 0 · Answer 7 · answered Jul 18 '18 at 19:23

Here is my solution:

class Hash
  # Compares top-level keys only (nested hashes are not checked recursively).
  # Returns nil when the argument is not a Hash.
  def same_keys?(compare)
    return nil unless compare.class == Hash
    return false unless size == compare.size

    keys.all? { |key| compare.key?(key) }
  end
end

a = c = {  a: nil,    b: "whatever1",  c: 1.14,     d: true   }
b     = {  a: "foo",  b: "whatever2",  c: 2.14,   "d": false  }
d     = {  a: "bar",  b: "whatever3",  c: 3.14,               }

puts a.same_keys?(b)                    # => true
puts a.same_keys?(c)                    # => true
puts a.same_keys?(d)                    # => false   
puts a.same_keys?(false).inspect        # => nil
puts a.same_keys?("jack").inspect       # => nil
puts a.same_keys?({}).inspect           # => false
