Efficiently compute xor / symmetric difference of many sets (list of sets)

Question

I have an arbitrary number of Python sets, e.g.

>>> a = {1, 2, 3}
>>> b = {3, 4, 5}
>>> c = {5, 6, 7}
>>> d = {7, 8, 1}

I want to compute their "combined" symmetric difference, i.e. I want to xor all of them:

>>> a ^ b ^ c ^ d
{2, 4, 6, 8}

In my use-case, I am actually dealing with lists of sets:

>>> l = [a, b, c, d]
>>> l
[{1, 2, 3}, {3, 4, 5}, {5, 6, 7}, {1, 7, 8}]

Currently, I am iterating over the list in order to achieve what I want:

>>> res = l[0].copy()
>>> for item in l[1:]:
...     res.symmetric_difference_update(item)
>>> res
{2, 4, 6, 8}

I am wondering whether there is a more efficient method, ideally one that avoids a Python for-loop. Set operations are really fast in Python, but my lists can become rather long, so the for-loop itself ironically becomes the bottleneck.

EDIT (1)

I am assuming that no entry occurs more than twice across all sets in my list.

EDIT (2)

Some benchmarks:

from typing import List, Set
from functools import reduce
from collections import defaultdict

length = 1_000
data = [
    {idx - 1, idx, idx + 1}
    for idx in range(3_000, 3_000 + length * 2, 2)
]

def test_loop1(l: List[Set[int]]) -> Set[int]:
    res = l[0].copy()
    for item in l[1:]:
        res.symmetric_difference_update(item)
    assert len(res) == len(l) + 2
    return res

test_loop1: 121 µs ± 321 ns

def test_loop2(l: List[Set[int]]) -> Set[int]:
    res = set()
    for item in l:
        res.symmetric_difference_update(item)
    assert len(res) == len(l) + 2
    return res

test_loop2: 112 µs ± 3.16 µs

def test_reduce1(l: List[Set[int]]) -> Set[int]:
    # use the built-in set type here; typing.Set is only meant for annotations
    res = reduce(set.symmetric_difference, l)
    assert len(res) == len(l) + 2
    return res

test_reduce1: 9.89 ms ± 20.6 µs
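
A likely reason for the slow `reduce` result is that `symmetric_difference` builds a new set on every step, so the growing accumulator is copied again and again. A minimal sketch that lets `reduce` update one accumulator in place instead, using `operator.ixor` and a fresh `set()` as the initial value. The name `xor_reduce_inplace` is made up for this sketch, and it has not been timed against the benchmarks in this question:

```python
from functools import reduce
from operator import ixor
from typing import List, Set


def xor_reduce_inplace(l: List[Set[int]]) -> Set[int]:
    # ixor(acc, item) performs `acc ^= item` and returns acc, so the same
    # accumulator set is updated in place instead of being rebuilt each step.
    # Starting from a fresh set() keeps the input sets untouched.
    return reduce(ixor, l, set())


sets = [{1, 2, 3}, {3, 4, 5}, {5, 6, 7}, {7, 8, 1}]
print(xor_reduce_inplace(sets) == {2, 4, 6, 8})  # True
```

This is still a Python-level call per list element, so it should be expected to land near the explicit loops rather than beat them.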

def test_dict1(l: List[Set[int]]) -> Set[int]:
    """
    A general solution allowing for entries to occur more than twice in the input data
    """
    d = defaultdict(int)
    for item in l:
        for entry in item:
            d[entry] += 1
    # keep entries with an odd count: xor retains an entry that occurs 1, 3, 5, ... times
    res = {entry for item in l for entry in item if d[entry] % 2 == 1}
    assert len(res) == len(l) + 2
    return res

test_dict1: 695 µs ± 5.11 µs
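
The counting idea can also be written so that the nested loops run inside `collections.Counter` and `itertools.chain` rather than in Python bytecode. The following is a sketch under that assumption, not a measured result. `xor_by_counting` is a hypothetical name, and the odd-count filter reflects that an entry survives a chain of xor operations exactly when it occurs an odd number of times:

```python
from collections import Counter
from itertools import chain
from typing import Iterable, Set


def xor_by_counting(sets: Iterable[Set[int]]) -> Set[int]:
    # Counter consumes the flattened stream of entries and tallies each one
    counts = Counter(chain.from_iterable(sets))
    # an entry belongs to the combined xor exactly when its count is odd
    return {entry for entry, n in counts.items() if n % 2 == 1}


length = 1_000
data = [
    {idx - 1, idx, idx + 1}
    for idx in range(3_000, 3_000 + length * 2, 2)
]
print(len(xor_by_counting(data)) == length + 2)  # True
```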

The real problem with your loop is that you're copying the list before running it, i.e. `l[1:]` is performing a list copy. You can avoid it by using a range for the loop, beginning with 1, then indexing into the list. That way you don't have to allocate a copy of the list. If you really want to avoid the explicit loop, you can use `functools.reduce` to perform your reduction, but it probably won't be faster once you fix the list-copying problem. — Tom Karzes, Jun 01 '21 at 13:00
@TomKarzes Yep, you are right. The copying, although shallow, is actually contributing a bit. `reduce` is also actually slower, although I might have chosen a non-ideal way to call it. I added some benchmarks to my question. — s-m-e, Jun 01 '21 at 15:04
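
The suggestion from the comments can be sketched as follows: start from a copy of the first set and walk the remaining ones without slicing the list. This sketch uses `itertools.islice` rather than the `range`-based indexing mentioned in the comment. That substitution is a choice made here, `xor_without_slice` is a hypothetical name, and the variant has not been timed:

```python
from itertools import islice
from typing import List, Set


def xor_without_slice(l: List[Set[int]]) -> Set[int]:
    res = l[0].copy()  # copy so the first input set is not mutated
    # islice(l, 1, None) iterates from index 1 onward without building a new list
    for item in islice(l, 1, None):
        res.symmetric_difference_update(item)
    return res


sets = [{1, 2, 3}, {3, 4, 5}, {5, 6, 7}, {7, 8, 1}]
print(xor_without_slice(sets) == {2, 4, 6, 8})  # True
```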
