
What is the fastest / most Pythonic way to check if a list is a mathematical set in Python?

I know the following works:

ListInstance = [1,2,3,4,5,6]
ListIsMathSet = (len(set(ListInstance)) == len(ListInstance) )

Is there a better/faster way to check this?

D A
  • I don't think you can do much better unless you can make some guarantees about the input list ... e.g. if it's sorted. – mgilson Oct 14 '15 at 22:28
  • 2
    Is there some *problem* with what you have? Is it a bottleneck? And why worry about the code being Pythonic when your variable names aren't? – jonrsharpe Oct 14 '15 at 22:29
  • 1
    Unless you're working with such large datasets that the overhead for hashing them is becoming prohibitive, I would just use that. Since you've already written it, the development time for it compared to other approaches is zero. – TigerhawkT3 Oct 14 '15 at 22:31
  • @jonrsharpe does it really make a big difference how he capitalizes his variable names? – Jacob Ritchie Oct 15 '15 at 07:55
  • 2
    They can do what they like in isolation, but if you're sharing your code with others you should follow [PEP-8](https://www.python.org/dev/peps/pep-0008/) unless there's good reason otherwise. – jonrsharpe Oct 15 '15 at 07:58
  • Feel free to edit the variable names in my question if it makes a difference – D A Oct 18 '15 at 19:29

1 Answer

2

It's not usually going to be faster, but if the values aren't hashable but they are comparable, and especially if they're already sorted, you can lazily determine if any elements are non-unique:

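import itertools

def is_unique(items, key=None):
    # Sort so equal items become adjacent, then walk the runs of equal items.
    for k, g in itertools.groupby(sorted(items, key=key), key=key):
        # Pull at most two items from each run; getting two means a duplicate.
        if len(list(itertools.islice(g, 2))) > 1:
            return False
    return True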

After the up-front sort, this stops as soon as the first duplicate is detected and checks no more than necessary, which may run faster (particularly in the "input already sorted" case). A similar early-out approach works with set: add elements one at a time and stop at the first repeat, which minimizes the number of elements hashed and stored when uniqueness is violated early (adapted from the unique_everseen recipe in itertools):

def is_unique(iterable):
    seen = set()
    seen_add = seen.add
    for element in iterable:
        if element in seen:
            return False
        seen_add(element)
    return True

Note: Neither of the above solutions is better in the typical case of a small number of hashable inputs where uniqueness is common (or at least, not violated early in the set of inputs). The simple solution you gave is concise, obvious, and performs most of the work at the C layer in CPython, so it has a much lower fixed overhead compared to methods that execute a lot of Python code. But they may be useful for large inputs, already sorted inputs, and/or inputs where uniqueness is uncommon (and therefore the early out behavior saves you some work).
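
To illustrate how the two ideas above can fit together, here is a minimal sketch that tries the simple set-based check first and falls back to sorting only when the elements turn out to be unhashable. The name `is_unique_any` is hypothetical and not part of the original answer, and the fallback assumes the elements are mutually comparable.

```python
def is_unique_any(items):
    """Hypothetical helper: set-based fast path, sort-based fallback."""
    items = list(items)  # materialize so the input can be scanned twice
    try:
        # Fast path from the question: most of the work happens in C.
        return len(set(items)) == len(items)
    except TypeError:
        # Unhashable elements (e.g. lists). Assumption: they are comparable,
        # so sorting puts equal elements next to each other.
        ordered = sorted(items)
        return all(a != b for a, b in zip(ordered, ordered[1:]))
```

For example, `is_unique_any([[1, 2], [3], [1, 2]])` takes the fallback path, because lists are not hashable, and reports a duplicate.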

ShadowRanger
  • Note: There is an alternative to the `sorted` based approach based on `heapq` that avoids up-front `O(n log n)` work in the case where the input isn't already sorted by paying only an `O(n)` initial cost for `heapq.heapify`, then popping off values one at a time (`O(log n)` per pop) and comparing the popped value to `theheap[0]`, looking for duplicates. It's rarely worthwhile though; `sorted` is so much faster than hand-implementing a lazy sort using `heapq` that it's only worthwhile if you'll usually find non-unique elements in the first small fraction (IIRC, about a sixth?) of the input. – ShadowRanger Oct 15 '15 at 00:14
  • I don't have the reputation to edit your answer: I am basically seeing that if the List Instance you start with is sorted, then yes you can do better. You iterate through the thing, and check if two neighboring elements are equal. An iterating neighbor equality check method should take N time on a sorted list. – D A Oct 18 '15 at 19:39
  • @DAdams: Correct. And the nature of Python's TimSort algorithm is that sorting an already sorted list takes work proportional to N. If you know it's sorted, it's a little faster to not sort again, but only a little. – ShadowRanger Oct 18 '15 at 21:07
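
The two variants discussed in the comments above can be sketched roughly as follows. Neither function appears in the original thread: `is_unique_sorted` and `is_unique_heap` are hypothetical names, the first assumes its input is already sorted, and the second assumes the elements are mutually comparable.

```python
import heapq

def is_unique_sorted(sorted_items):
    """Neighbor-equality check. Assumes the input is already sorted."""
    it = iter(sorted_items)
    try:
        prev = next(it)
    except StopIteration:
        return True  # an empty input has no duplicates
    for cur in it:
        if cur == prev:
            return False  # two equal neighbors: not a set
        prev = cur
    return True

def is_unique_heap(items):
    """Lazy-sort check using heapq, as described in the comment."""
    heap = list(items)
    heapq.heapify(heap)  # O(n) up-front cost
    while heap:
        smallest = heapq.heappop(heap)  # O(log n) per pop
        # After the pop, heap[0] is the next smallest value, so any
        # duplicate of `smallest` must be sitting there.
        if heap and heap[0] == smallest:
            return False
    return True
```

The first function does a single O(N) pass, while the second only pays off when a duplicate tends to show up among the smallest values, as the comment notes.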