57

I am performing multiple iterations of the type:

masterSet=masterSet.union(setA)

As the set grows the length of time taken to perform these operations is growing (as one would expect, I guess).

I expect that the time is taken up checking whether each element of setA is already in masterSet?

My question is: if I KNOW that masterSet does not already contain any of the elements in setA, can I do this more quickly?

[UPDATE]

Given that this question is still attracting views I thought I would clear up a few of the things from the comments and answers below:

When iterating through, there were many iterations where I knew setA would be distinct from masterSet because of how it was constructed (without having to run any checks), but on a few iterations I needed the uniqueness check.

I wondered if there was a way to 'tell' the masterSet.union() procedure not to bother with the uniqueness check this time around: I know this set is distinct from masterSet, so just add these elements quickly, trusting the programmer's assertion that they are definitely distinct. Perhaps through calling some different ".unionWithDistinctSet()" procedure or something.

I think the responses have suggested that this isn't possible (and that set operations should really be quick enough anyway), but that masterSet.update(setA) should be used instead of union as it's slightly quicker still.

I have accepted the clearest response along those lines, resolved the issue I was having at the time and got on with my life, but I would still love to hear whether my hypothesised .unionWithDistinctSet() could ever exist.

Stewart_R

4 Answers

112

You can use set.update to update your master set in place. This saves allocating a new set all the time so it should be a little faster than set.union...

>>> s = set(range(3))
>>> s.update(range(4))
>>> s
set([0, 1, 2, 3])

Of course, if you're doing this in a loop:

masterSet = set()
for setA in iterable:
    masterSet = masterSet.union(setA)

You might get a performance boost by doing something like:

masterSet = set().union(*iterable)

Ultimately, membership testing of a set is O(1) (in the average case), so testing if the element is already contained in the set isn't really a big performance hit.
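As a minimal sketch of the three approaches above, assuming a small made-up list of ranges named `chunks` standing in for `iterable`, the following builds the same set each way. Only the loop with `union` allocates a new set on every pass.

```python
# `chunks` is a hypothetical stand-in for `iterable`.
chunks = [range(0, 5), range(3, 8), range(10, 12)]

# 1. union in a loop: a new set is created on every iteration.
rebuilt = set()
for chunk in chunks:
    rebuilt = rebuilt.union(chunk)

# 2. update in a loop: the same set object is grown in place.
in_place = set()
for chunk in chunks:
    in_place.update(chunk)

# 3. a single union call over all the chunks at once.
one_call = set().union(*chunks)

# All three hold the same elements.
same_result = rebuilt == in_place == one_call
```

Note that the single-call form unpacks `chunks` into arguments, so it needs the whole collection to be available up front.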

mgilson
  • @jamylak -- Updating sets is much less common than `dict.update` or `set.union` for some reason. (I had to `dir(set)` to figure out that it wasn't `set.extend` ;-) – mgilson Jun 05 '13 at 12:22
  • I was going to suggest `|=` – jamylak Jun 05 '13 at 12:29
  • @jamylak -- I would guess they end up being the same method ... Although maybe not. With `|=`, the right-hand side might need to be a set, whereas that restriction is relaxed with `set.update`. – mgilson Jun 05 '13 at 12:31
  • Oh right, I remember coming to that conclusion as well and therefore didn't post that – jamylak Jun 05 '13 at 12:32
  • Why is `set.union` slower than `set.update`? They seem to do the same thing – theonlygusti Dec 10 '22 at 22:43
  • The source is helpful here -- https://github.com/python/cpython/blob/2e279e85fece187b6058718ac7e82d1692461e26/Objects/setobject.c#L1115-L1136. `c = a.union(b)` is effectively `c = a.copy(); c.update(b)` so it has that additional copy in the mix compared to `update`. – mgilson Dec 11 '22 at 23:20
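The two points raised in the comments above can be checked directly. This is a small sketch with made-up values: it compares `union` against an explicit copy followed by `update`, and tries `|=` with a list on the right-hand side.

```python
a = {1, 2, 3}
b = {3, 4}

# union leaves `a` untouched and returns a new set ...
c = a.union(b)

# ... which matches copying `a` and then updating the copy in place.
d = a.copy()
d.update(b)

# update accepts any iterable, not only sets.
e = {1, 2}
e.update([2, 3])

# The |= operator requires a set on the right-hand side,
# so a list raises TypeError and `e` is left unchanged.
try:
    e |= [4]
    ior_accepts_list = True
except TypeError:
    ior_accepts_list = False
```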
9

As mgilson points out, you can use update to update a set in-place from another set. That actually works out slightly quicker:

import timeit

def union():
    i = set(range(10000))
    j = set(range(5000, 15000))
    return i.union(j)

def update():
    i = set(range(10000))
    j = set(range(5000, 15000))
    i.update(j)
    return i

timeit.Timer(union).timeit(10000)   # 10.351907968521118
timeit.Timer(update).timeit(10000)  # 8.83384895324707
Daniel Roseman
6

If you know your elements are unique, a set is not necessarily the best structure.

A simple list is way faster to extend.

masterList = list(masterSet)
masterList.extend(setA)
njzk2
  • On some iterations I need to test for uniqueness, but on most I know that setA and masterSet are distinct. Is there no way to 'tell' the union() method (or use an alternative method) that I know the sets are distinct? – Stewart_R Jun 05 '13 at 13:05
  • @mgilson you are assuming OP is not making any membership checks?? – jamylak Jun 05 '13 at 13:05
  • @jamylak -- Yes. I feel that this answers the last sentence in the question. – mgilson Jun 05 '13 at 13:14
  • On the other hand, a check in a set is O(1), so the overhead in union is not very large. – njzk2 Jun 05 '13 at 13:37
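A hedged sketch of how the list approach might fit the mixed case raised in the comments above: extend the list without any hashing on iterations known to be distinct, and fall back to a set only when a uniqueness check is needed. The variable names and sample data are illustrative assumptions.

```python
master_list = [1, 2, 3]

# Iteration known to be distinct: plain extend, no hashing or comparison.
known_distinct = {4, 5}
master_list.extend(known_distinct)

# Iteration that may overlap: go through a set to drop duplicates.
maybe_overlapping = {5, 6}
master_set = set(master_list)
master_set.update(maybe_overlapping)
master_list = list(master_set)
```

Rebuilding the set hashes every element again, so this is only likely to pay off when the iterations that need checking are rare.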
1

For sure, forgoing this check could be a big saving when the __eq__(..) method is very expensive. In the CPython implementation, __eq__(..) is called with every element already in the set that hashes to the same number. (Reference: source code for set.)
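The effect can be observed with a small experiment. This is a sketch only: `Keyed` is a hypothetical element type that counts `__eq__` calls, and the counts reflect CPython's behaviour as described above rather than a language guarantee.

```python
class Keyed:
    """Hypothetical element type that counts how often __eq__ runs."""
    eq_calls = 0

    def __init__(self, key, hash_value):
        self.key = key
        self.hash_value = hash_value

    def __hash__(self):
        return self.hash_value

    def __eq__(self, other):
        Keyed.eq_calls += 1
        return isinstance(other, Keyed) and self.key == other.key


def count_eq_calls(elements):
    """Insert the elements into a fresh set; return (set size, __eq__ calls)."""
    Keyed.eq_calls = 0
    s = set()
    s.update(elements)
    return len(s), Keyed.eq_calls


# Every element has its own hash: no equality comparison is needed.
distinct_size, distinct_calls = count_eq_calls(
    [Keyed(i, i) for i in range(100)]
)

# Every element shares one hash: each insert is compared with earlier ones.
colliding_size, colliding_calls = count_eq_calls(
    [Keyed(i, 1) for i in range(100)]
)
```

With well-distributed hashes the equality check rarely runs at all, which is why skipping it would usually save little.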

However, there will never be this functionality in a million years, because it opens up another way to violate the integrity of a set. The trouble associated with that far outweighs the (typically negligible) performance gain. If this does turn out to be a performance bottleneck, it's not hard to write a C++ extension and use its STL <set>, which should be faster by one or more orders of magnitude.

Evgeni Sergeev