11

Given a list of sets:

allsets = [set([1, 2, 4]), set([4, 5, 6]), set([4, 5, 7])]

What is a Pythonic way to compute the corresponding list of sets containing only the elements that appear in no other set?

only = [set([1, 2]), set([6]), set([7])]

Is there a way to do this with a list comprehension?

Steve
  • Related: [Replace list of list with "condensed" list of list while maintaining order](http://stackoverflow.com/q/13714755/4279) – jfs Jan 30 '16 at 07:34

4 Answers

18

To avoid quadratic runtime, you'd want to make an initial pass to figure out which elements appear in more than one set:

import itertools
import collections
element_counts = collections.Counter(itertools.chain.from_iterable(allsets))
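
For the allsets example from the question, the counts come out like this (the display order of a Counter may vary between Python versions):

element_counts
# Counter({4: 3, 5: 2, 1: 1, 2: 1, 6: 1, 7: 1})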

Then you can simply make a list of sets retaining all elements that only appear once:

nondupes = [{elem for elem in original if element_counts[elem] == 1}
            for original in allsets]
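
For the sample input this gives exactly the result asked for in the question:

nondupes
# [{1, 2}, {6}, {7}]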

Alternatively, instead of constructing nondupes from element_counts directly, we can make an additional pass to construct a set of all elements that appear in exactly one input. This requires an additional statement, but it allows us to take advantage of the & operator for set intersection to make the list comprehension shorter and more efficient:

element_counts = collections.Counter(itertools.chain.from_iterable(allsets))
all_uniques = {elem for elem, count in element_counts.items() if count == 1}
#                                                     ^ viewitems() in Python 2.7
nondupes = [original & all_uniques for original in allsets]
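
With the sample data, all_uniques works out to {1, 2, 6, 7}, so each original set is simply intersected with it:

all_uniques
# {1, 2, 6, 7}
nondupes
# [{1, 2}, {6}, {7}]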

Timing seems to indicate that using an all_uniques set produces a substantial speedup for the overall duplicate-elimination process: up to about 3.5x on Python 3 for heavily duplicated input sets, but only about 30% on Python 2, where more of the runtime is dominated by constructing the Counter. That speedup is worthwhile, though not nearly as important as avoiding quadratic runtime by using element_counts in the first place. If you're on Python 2 and this code is speed-critical, you'd want to use an ordinary dict or a collections.defaultdict instead of a Counter.
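
As a rough sketch of how such a comparison can be set up with timeit (the helper names and the randomly generated, heavily duplicated input below are assumptions for illustration, not the data the timings above were measured on):

import collections, itertools, random, timeit

random.seed(0)
# Heavily duplicated input: 1000 sets drawn from a small pool of values.
testsets = [{random.randrange(50) for _ in range(20)} for _ in range(1000)]

def per_element_filter():
    counts = collections.Counter(itertools.chain.from_iterable(testsets))
    return [{e for e in s if counts[e] == 1} for s in testsets]

def intersect_uniques():
    counts = collections.Counter(itertools.chain.from_iterable(testsets))
    uniques = {e for e, c in counts.items() if c == 1}
    return [s & uniques for s in testsets]

print(timeit.timeit(per_element_filter, number=100))
print(timeit.timeit(intersect_uniques, number=100))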

Another way would be to construct a dupes set from element_counts and use original - dupes instead of original & all_uniques in the list comprehension, as suggested by munk. Whether this performs better or worse than using an all_uniques set and & would depend on the degree of duplication in your input and what Python version you're on, but it doesn't seem to make much of a difference either way.

user2357112
    Certainly a better way. Some links for the OP 1. [`chain.from_iterable`](https://docs.python.org/3/library/itertools.html#itertools.chain.from_iterable) 2. [`collections.Counter`](https://docs.python.org/3/library/collections.html#collections.Counter) – Bhargav Rao Jan 29 '16 at 20:36
    Literal syntax could be a little nicer with [{elem for elem in original...}] – munk Jan 29 '16 at 20:37
  • @munk: Oh, right. I keep forgetting to use set literals and set comprehensions. – user2357112 Jan 29 '16 at 20:39
  • Intersecting with unique elements is about 6x faster than subtracting duplicates on my real world data set. In my dataset, the unique elements are rare and the duplicates are plentiful. – Steve Feb 01 '16 at 14:25
8

Yes, it can be done, but it is hardly Pythonic:

>>> [i - set.union(*[j for j in allsets if j != i]) for i in allsets]
[set([1, 2]), set([6]), set([7])]

Some reference material on sets can be found in the documentation. The * operator is known as the unpacking operator.
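
To make the unpacking concrete for the first element i = set([1, 2, 4]): the inner list is [set([4, 5, 6]), set([4, 5, 7])], so:

>>> set.union(*[set([4, 5, 6]), set([4, 5, 7])])  # same as set([4, 5, 6]).union(set([4, 5, 7]))
set([4, 5, 6, 7])
>>> set([1, 2, 4]) - set([4, 5, 6, 7])
set([1, 2])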

Bhargav Rao
6

A slightly different solution, using Counter and comprehensions to take advantage of the - operator for set difference:

from itertools import chain
from collections import Counter

allsets = [{1, 2, 4}, {4, 5, 6}, {4, 5, 7}]
element_counts = Counter(chain.from_iterable(allsets))

dupes = {key for key in element_counts 
         if element_counts[key] > 1}

only = [s - dupes for s in allsets]
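
For the sample allsets this yields dupes == {4, 5}, and:

only
# [{1, 2}, {6}, {7}]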
munk
    I actually thought about that after I posted my original solution, though I used `&` and made a `unique_elements` set instead of a `dupes` set. [Timing](http://ideone.com/8b70l4) showed `&` to be about 30% faster than running a Python-level set comprehension every time. Whether `&` or `-` performs better probably depends on the degree of element duplication and what Python version you're on. – user2357112 Jan 29 '16 at 21:03
  • Selecting this solution as the best answer because 1) it is very readable, 2) 15%-30% faster than the user2357112 solution on my real world data – Steve Feb 01 '16 at 13:54
  • Very nice and readable solution. I originally selected this as the best answer based on readability and speed. Later changed to user2357112's answer, which upon further testing is significantly faster. – Steve Feb 01 '16 at 14:10
2

Another solution with itertools.chain:

>>> from itertools import chain
>>> [x - set(chain(*(y for y in allsets if y != x))) for x in allsets]
[set([1, 2]), set([6]), set([7])]

It's also doable without the unpacking, using chain.from_iterable instead:
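
>>> [x - set(chain.from_iterable(y for y in allsets if y != x)) for x in allsets]
[set([1, 2]), set([6]), set([7])]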

timgeb