I want to optimize my code and learn how Python behaves in terms of speed. Can you show the fastest way to compare two sets/dicts to find out whether they share any elements?

I did some research, but I'm still not sure whether this is the final solution.

from timeit import Timer
import random

random.seed(1)
x = 10

a = dict(zip(random.sample(range(x), x), random.sample(range(x), x)))
b = dict(zip(random.sample(range(x), x), random.sample(range(x), x)))

def setCompare():
  return len(set(a) & set(b)) > 0

def setDisjointCompare():
  return set(a).isdisjoint(set(b))

def dictCompare():
  for i in a:
    if i in b:
      return False
  return True

# print(...) with a single argument works on both Python 2 and Python 3
print(Timer(setCompare).timeit())
print(Timer(setDisjointCompare).timeit())
print(Timer(dictCompare).timeit())

Current results are:

3.95744682634
2.87678853039
0.762627652397
Chameleon
  • I don't believe your measures are that meaningful -- not only do you not include the time spent building the dicts in solution (3), but you do include the time spent building the sets in solutions (1) and (2). Were this situation reversed, (1) and (2) may very well end up being faster than (3). – Frédéric Hamidi Aug 13 '14 at 14:00
  • @FrédéricHamidi It should not include object construction, since object construction is not what should be measured. That is not wrong; it is the right choice. This limitation should not be broken. – Chameleon Aug 13 '14 at 14:04
  • The second `set()` call in `.isdisjoint` is unnecessary. – Ashwini Chaudhary Aug 13 '14 at 14:08
  • You can re-write your `dictCompare` as `return any(k in b for k in a)` – Jon Clements Aug 13 '14 at 14:16

1 Answer

The comments are correct that you are measuring inconsistently, and I'll show you why. With your current code, I got similar results:

1.44653701782
1.15708184242
0.275780916214

If we change dictCompare() to the following:

def dictCompare():
    temp = set(b)
    for i in set(a):
        if i in temp:
            return False
    return True

We get this result instead:

1.46354103088
1.14659714699
1.09220504761

This time, they are all similar (and slow) because the majority of the time is spent in constructing the sets. By including the set creation in your timing of your first two methods while having the third method utilize existing objects, you were introducing inconsistency.

In your comments, you said you want to exclude the time it takes to create the objects you're going to compare. So let's do this in a consistent way:

# add this below the definitions of a and b
c = set(a)
d = set(b)

# change setCompare and setDisjointCompare()

def setCompare():
    return len(c & d) > 0

def setDisjointCompare():
    return c.isdisjoint(d)

# restore dictCompare() so it matches the OP

Now we get this result:

0.518588066101
0.196290016174
0.269541025162

We've evened the playing field by making all three methods use existing objects. The first two use existing sets, the third uses existing dictionaries. It should come as no surprise that the built-in method (#2) is now the fastest. But remember that we had to take the time to generate the sets before using it, so even though the isdisjoint() method is the fastest, changing our dictionaries to sets just for a comparison is actually going to be slower than the third method, if all we want is a dictionary comparison in the first place.
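If the only goal is to avoid the conversion cost, one option the timings above do not cover is calling isdisjoint() on the dictionary's key view, which avoids building a set at all. The sketch below assumes Python 3, where a.keys() returns a set-like view (on Python 2.7 the equivalent would be a.viewkeys()). The helper name keys_disjoint and the sample dicts are made up for illustration, and this variant has not been benchmarked here.

```python
def keys_disjoint(first, second):
    """Return True when the two dicts share no keys."""
    # dict.keys() is a set-like view in Python 3, so no new set is built.
    # isdisjoint() accepts any iterable; iterating a dict yields its keys.
    return first.keys().isdisjoint(second)


a = {1: "x", 2: "y", 3: "z"}
b = {3: "p", 4: "q"}
c = {7: "r", 8: "s"}

print(keys_disjoint(a, b))  # False: key 3 appears in both
print(keys_disjoint(a, c))  # True: no shared keys
```

Like isdisjoint() on a real set, this returns True when there is no overlap, so it follows the same convention as dictCompare().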

There is one more option though, similar to what was suggested in the comments:

def anyCompare():
    return not any(k in b for k in a)
# side note: we want to invert the result because we want to return false once
# we find a common element

Adding this as the fourth method has this result:

0.511568069458
0.196676969528
0.268508911133
0.853673934937

Unfortunately, this appears to be slower than the others, which surprised me. As far as I know, any() short-circuits the same way our explicit loop does (according to the docs), so the difference should not come from when the search stops. The negation is applied only once, after any() has returned, so it cannot delay the short-circuit either. The more likely cause is overhead: any() needs a generator object to be created and then resumed for each item, while the explicit loop runs inline. With only 10 keys, all of which appear in both dictionaries, both versions stop at the very first key, so that fixed overhead dominates the measurement.
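A quick way to check where any() stops is to count how many keys it actually tests. The sketch below is only an illustration: the helper is_shared and the list checked are hypothetical names, and the two dicts are made-up data in which every key overlaps, as in the question's data.

```python
checked = []


def is_shared(key, other):
    # Record each key that is actually tested so we can count them afterwards.
    checked.append(key)
    return key in other


a = dict.fromkeys(range(10))
b = dict.fromkeys(range(10))  # every key of a is also in b

# Same shape as anyCompare(): negate the result of any().
result = not any(is_shared(k, b) for k in a)

print(result)        # False: the dicts share keys
print(len(checked))  # 1: any() stopped after the first match
```

Only one key is tested, so the negation does not postpone the short-circuit. That makes the cost of setting up and resuming the generator a more plausible explanation for the timing gap.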

Among these options, the explicit loop in dictCompare() appears to be the fastest way to check whether two dictionaries have overlapping keys.

BTW, the first method you're using also needs to have its result inverted to be consistent with the others, assuming you want to return False when there is overlap, the same way isdisjoint() does.
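As a minimal sketch of that inversion, the version of setCompare() below returns True when the sets are disjoint, so all three functions agree. The two small dicts are made-up data chosen so that exactly one key is shared.

```python
a = {1: "x", 2: "y"}
b = {2: "p", 3: "q"}

c = set(a)
d = set(b)


def setCompare():
    # An empty intersection is falsy, so "not" yields True when disjoint.
    return not (c & d)


def setDisjointCompare():
    return c.isdisjoint(d)


def dictCompare():
    for i in a:
        if i in b:
            return False
    return True


# All three return False here because key 2 is present in both dicts.
print(setCompare())
print(setDisjointCompare())
print(dictCompare())
```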

skrrgwasme