Efficiently remove duplicates, order-agnostic, from list of lists

Question

The following list has some duplicated sublists, with elements in different order:

l1 = [
    ['The', 'quick', 'brown', 'fox'],
    ['hi', 'there'],
    ['jumps', 'over', 'the', 'lazy', 'dog'],
    ['there', 'hi'],
    ['jumps', 'dog', 'over','lazy', 'the'],
]

How can I remove duplicates, retaining the first instance seen, to get:

l1 = [
    ['The', 'quick', 'brown', 'fox'],
    ['hi', 'there'],
    ['jumps', 'over', 'the', 'lazy', 'dog'],
]

I tried to:

[list(i) for i in set(map(tuple, l1))]

However, this attempt does not work as desired, and I do not know whether it would be the fastest approach for large lists. How can I remove the duplicates efficiently?

wim · Accepted Answer · 2019-08-12T18:52:38.183

5

This one is a little tricky. You want to key a dict off of frozen counters, but counters are not hashable in Python. For a small degradation in the asymptotic complexity, you could use sorted tuples as a substitute for frozen counters:

seen = set()
result = []
for x in l1:
    key = tuple(sorted(x))
    if key not in seen:
        result.append(x)
        seen.add(key)

The same idea in a one-liner would look like this:

[*{tuple(sorted(k)): k for k in reversed(l1)}.values()][::-1]
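For reference, here is a minimal sketch of the loop above wrapped in a function and run on the question's data. The function name `dedupe_unordered` is made up for illustration. It assumes the elements of each sublist are hashable and mutually comparable (so that `sorted` works), and the one-liner additionally relies on dicts preserving insertion order (Python 3.7+).

```python
def dedupe_unordered(lists):
    """Keep the first sublist of each group that has the same elements in any order."""
    seen = set()
    result = []
    for sub in lists:
        # Sorted tuple as a canonical, hashable form of the sublist
        key = tuple(sorted(sub))
        if key not in seen:
            seen.add(key)
            result.append(sub)
    return result


l1 = [
    ['The', 'quick', 'brown', 'fox'],
    ['hi', 'there'],
    ['jumps', 'over', 'the', 'lazy', 'dog'],
    ['there', 'hi'],
    ['jumps', 'dog', 'over', 'lazy', 'the'],
]

deduped = dedupe_unordered(l1)
print(deduped)
# [['The', 'quick', 'brown', 'fox'], ['hi', 'there'], ['jumps', 'over', 'the', 'lazy', 'dog']]

# The one-liner gives the same result on this input
one_liner = [*{tuple(sorted(k)): k for k in reversed(l1)}.values()][::-1]
print(one_liner == deduped)  # True
```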

edited Aug 12 '19 at 18:52

answered Aug 12 '19 at 18:21

wim


Thanks for the help – anon Aug 12 '19 at 18:32
You can get away with `set(tuple(sorted(x)) for x in l1)` if you don't care about preserving the order of entries in l1. – smci Aug 13 '19 at 06:49

score 3 · Answer 2 · edited Aug 13 '19 at 06:52

I did a quick benchmark, comparing the various answers:

l1 = [['The', 'quick', 'brown', 'fox'], ['hi', 'there'], ['jumps', 'over', 'the', 'lazy', 'dog'], ['there', 'hi'], ['jumps', 'dog', 'over','lazy', 'the']]

from collections import Counter

def method1():
    """manually construct set, keyed on sorted tuple"""
    seen = set()
    result = []
    for x in l1:
        key = tuple(sorted(x))
        if key not in seen:
            result.append(x)
            seen.add(key)
    return result

def method2():
    """frozenset-of-Counter"""
    # reverse the result so the original order is restored, as in the answer being benchmarked
    return list({frozenset(Counter(lst).items()): lst for lst in reversed(l1)}.values())[::-1]

def method3():
    """wim's one-liner: dict keyed on sorted tuple"""
    return [*{tuple(sorted(k)): k for k in reversed(l1)}.values()][::-1]

from timeit import timeit

print(timeit(lambda: method1(), number=1000))
print(timeit(lambda: method2(), number=1000))
print(timeit(lambda: method3(), number=1000))

Prints:

0.0025010189856402576
0.016385524009820074
0.0026451340527273715

This is sort of problematic because the strings are so tiny (3 or 4 characters each) that O(n log(n)) hardly matters relative to the counter approach, which has better asymptotic complexity but more allocation overhead. I think a fairer approach would be benchmarking with thousand-length strings as well. – ggorlen Jun 03 '20 at 02:19
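In the spirit of that comment, here is a rough sketch of a benchmark that varies both the number of sublists and their length. The helper names (`by_sorted_tuple`, `by_counter`, `make_data`) and the sizes are assumptions chosen for illustration, not part of the original benchmark, and the timings will depend on the machine and on the real data.

```python
import random
from collections import Counter
from timeit import timeit


def by_sorted_tuple(lists):
    """De-duplicate using a sorted tuple as the key."""
    seen, result = set(), []
    for x in lists:
        key = tuple(sorted(x))
        if key not in seen:
            seen.add(key)
            result.append(x)
    return result


def by_counter(lists):
    """De-duplicate using a frozenset of (item, count) pairs as the key."""
    seen, result = set(), []
    for x in lists:
        key = frozenset(Counter(x).items())
        if key not in seen:
            seen.add(key)
            result.append(x)
    return result


def make_data(n_lists, inner_len, seed=0):
    """Random sublists followed by a shuffled copy of each, so duplicates exist."""
    rng = random.Random(seed)
    vocab = [str(i) for i in range(inner_len)]
    base = [[rng.choice(vocab) for _ in range(inner_len)] for _ in range(n_lists)]
    shuffled = [rng.sample(x, len(x)) for x in base]
    return base + shuffled


# Many short sublists versus few long sublists
for n_lists, inner_len in [(10_000, 5), (100, 1_000)]:
    data = make_data(n_lists, inner_len)
    # Both keys define the same groups, so the results should agree
    assert by_sorted_tuple(data) == by_counter(data)
    t_sorted = timeit(lambda: by_sorted_tuple(data), number=5)
    t_counter = timeit(lambda: by_counter(data), number=5)
    print(f"{n_lists} lists of length {inner_len}: "
          f"sorted={t_sorted:.4f}s counter={t_counter:.4f}s")
```

Which approach wins is left to the measurement; no timings are claimed here.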

blhsing · Answer 3 · 2019-08-12T18:37:53.313

2

@wim's answer is inefficient since it sorts the list items as a way to uniquely identify a set of counts of list items, which costs O(n log n) in time complexity for each sublist.

To achieve the same in linear time, you can instead use a frozenset of item counts built with the collections.Counter class. Since a dict comprehension retains the last value for duplicate keys, while you want to retain the first, you have to construct the dict from the list in reverse order and then reverse the resulting list of de-duplicated sublists:

from collections import Counter
list({frozenset(Counter(lst).items()): lst for lst in reversed(l1)}.values())[::-1]

This returns:

[['The', 'quick', 'brown', 'fox'], ['hi', 'there'], ['jumps', 'over', 'the', 'lazy', 'dog']]
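As a small illustration of why the key is built from `Counter(lst).items()` rather than from the sublist directly, the sketch below compares a plain `frozenset` key with the counter-based key on sublists that contain repeated items. The helper name `counter_key` and the sample lists are made up for this example, and the elements are assumed to be hashable.

```python
from collections import Counter


def counter_key(lst):
    """Hashable key that records each item together with its count."""
    return frozenset(Counter(lst).items())


a = ['x', 'x', 'y']
b = ['x', 'y', 'y']  # same distinct items as a, different counts
c = ['y', 'x', 'x']  # same items and counts as a, different order

# A plain frozenset ignores counts, so a and b would be treated as duplicates
print(frozenset(a) == frozenset(b))      # True

# The counter-based key keeps them apart while still ignoring order
print(counter_key(a) == counter_key(b))  # False
print(counter_key(a) == counter_key(c))  # True
```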

edited Aug 12 '19 at 18:37

answered Aug 12 '19 at 18:32

blhsing


1

Thanks for the help – anon Aug 12 '19 at 18:37
2

I considered this, but unless the *inner* lists in the data are very long then the overhead from building all those frozenset and Counter instances is likely to be significantly worse than just sorting in the first place. i.e. a large coefficient on the O(k) is probably gonna be worse than an asymptotic O(k * log k) in practice. – wim Aug 12 '19 at 18:46
1

@wim If you had considered this, you would not have said that "counters are not hashable in Python", and that your solution is at a cost of "a small degradation in performance". Without knowing the actual use case of the OP, we should always err on the side of better scalability. – blhsing Aug 12 '19 at 18:52
2

*Scalability in which axis, though?* I think the most obvious big data case here would be a very long list of shorter inner lists, in which case your solution will have the *worse* scalability. A degradation in asymptotic complexity is not necessarily a degradation in performance - you'd have to measure with the real data to make those conclusions. – wim Aug 12 '19 at 19:02

score 2 · Answer 4 · answered Aug 12 '19 at 18:36


This:

l1 = [['The', 'quick', 'brown', 'fox'], ['hi', 'there'], ['jumps', 'over', 'the', 'lazy', 'dog'], ['there', 'hi'], ['jumps', 'dog', 'over','lazy', 'the']]
s = {tuple(item) for item in map(sorted, l1)}
l2 = [list(item) for item in s]

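l2 is the list with the reordered duplicates removed. Note that each sublist comes back sorted, and the original order of the sublists is not preserved. Compare with: Pythonic way of removing reversed duplicates in list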
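A quick check of what this snippet produces on the question's data, as a sketch: three groups survive, but each one is the sorted form of a sublist, and since a set has no guaranteed iteration order the outer list is sorted here before printing.

```python
l1 = [
    ['The', 'quick', 'brown', 'fox'],
    ['hi', 'there'],
    ['jumps', 'over', 'the', 'lazy', 'dog'],
    ['there', 'hi'],
    ['jumps', 'dog', 'over', 'lazy', 'the'],
]

s = {tuple(item) for item in map(sorted, l1)}
l2 = [list(item) for item in s]

print(len(l2))  # 3

# Sort the outer list only to get a stable display; the set itself is unordered
print(sorted(l2))
# [['The', 'brown', 'fox', 'quick'], ['dog', 'jumps', 'lazy', 'over', 'the'], ['hi', 'there']]
```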


John B. Walugembe


@wim could you please explain how the output is incorrect? I've checked and it seems I get a correct output (three nested lists) – John B. Walugembe Aug 12 '19 at 19:53
The expected output is written in the question, this one is not matching because ordering was lost – wim Aug 12 '19 at 19:55
Well, the ordering is changed, but the essence of the question seems to have been on removing lists with similar elements. – John B. Walugembe Aug 12 '19 at 20:04
