
I've got the following list of nested dictionaries:

[{'permission': 'full',
  'permission_type': 'allow',
  'trustee': {'id': 'SID:S-1-5-32-545',
              'name': 'Users',
              'type': 'group'}},
 {'permission': 'full',
  'permission_type': 'allow',
  'trustee': {'id': 'SID:S-1-5-32-545',
              'name': 'Users',
              'type': 'group'}},
 {'permission': 'full',
  'permission_type': 'allow',
  'trustee': {'id': 'SID:S-1-5-32-544',
              'name': 'Administrators',
              'type': 'group'}}]

I want to remove the duplicates from this list and have tried different suggestions with no success. Can someone help me do this in Python 2.6? There is no key or unique field in the data above. I expect the following result (one member of the list is removed because it is a full duplicate):

[{'permission': 'full',
  'permission_type': 'allow',
  'trustee': {'id': 'SID:S-1-5-32-545',
              'name': 'Users',
              'type': 'group'}},
 {'permission': 'full',
  'permission_type': 'allow',
  'trustee': {'id': 'SID:S-1-5-32-544',
              'name': 'Administrators',
              'type': 'group'}}]
Intra
  • can you define "uniqify" exactly? – Grijesh Chauhan Dec 09 '14 at 08:25
  • Unique by what measure? That all values for given keys are the same? Or is it fuzzier than that? – Martijn Pieters Dec 09 '14 at 08:26
  • remove duplicate items where the dictionaries contain the same data – Intra Dec 09 '14 at 08:26
  • the top 2 list items are identical – Intra Dec 09 '14 at 08:27
  • @Intra which one should be removed? Remember, a list is an ordered sequence – Grijesh Chauhan Dec 09 '14 at 08:28
  • I've edited the question, order of the list is not important – Intra Dec 09 '14 at 08:31
  • Please specify in your question that you mean unique by `trustee`'s `id` field. – Michele d'Amico Dec 09 '14 at 08:36
  • If your dictionaries have the same keys in all instances, a simple way to do it would be to do: `list(set([str(c) for c in my_list]))` – Asish M. Dec 09 '14 at 08:36
  • @M.Klugerford: that depends on the insertion order of the keys being the same then; the way dictionaries list key-values can *differ* based on their insertion and deletion history. You'd have equal dictionaries, but *different string representations*. Besides, your method builds a set of unique strings, not actual dictionaries. – Martijn Pieters Dec 09 '14 at 08:41
  • @Micheled'Amico: Presumably it also needs to be unique by the `permission` and `permission_type` fields. E.g. **all key-value combinations**. – Martijn Pieters Dec 09 '14 at 08:43
  • @MartijnPieters That is true, however, doesn't recasting an existing dict into a new one essentially eliminate the difference in insertion / deletion order? In that case, will `str(dict(c))` instead of `str(c)` be valid? Oh and true, my method builds a set of strings. – Asish M. Dec 09 '14 at 08:45
  • @M.Klugerford: sure, but you still end up with duplicates then, as the strings are used to track unique dictionaries. Those strings will still differ. – Martijn Pieters Dec 09 '14 at 08:46
  • Surely doable in O(N^2), but can it be done faster? I.e. hash a hairy POD structure canonically somehow? – Dima Tisnek Dec 09 '14 at 09:07
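
The string-based idea discussed in the comments above can be made independent of key order by serializing with sorted keys. The following is only a sketch, not something proposed in the thread: it assumes every key is a string and every value is JSON-serializable (strings, numbers, lists, nested dictionaries), and `dedupe_by_json` is a hypothetical name. The `json` module is in the standard library from Python 2.6 onwards.

```python
import json

def dedupe_by_json(dicts):
    """Return the items of `dicts` with exact duplicates removed, keeping order."""
    seen = set()
    result = []
    for d in dicts:
        # sort_keys=True makes the text independent of key insertion order,
        # including inside nested dictionaries.
        key = json.dumps(d, sort_keys=True)
        if key not in seen:
            seen.add(key)
            result.append(d)  # keep the original dictionary, not the string
    return result

# Two equal dictionaries whose keys were inserted in a different order
a = {'permission': 'full', 'trustee': {'id': 'SID:S-1-5-32-545', 'name': 'Users'}}
b = {'trustee': {'name': 'Users', 'id': 'SID:S-1-5-32-545'}, 'permission': 'full'}
unique = dedupe_by_json([a, b])
```

Because the strings are used only as set members, the result still contains the original dictionaries. Data with non-string keys or values that JSON cannot represent would need a different key function.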

2 Answers


You need to track whether you have already seen a dictionary. Unfortunately, dictionaries are not hashable and do not track order, so you need to convert each dictionary to something that is hashable. A frozenset() of the key-value pairs (as tuples) would do, but nested dictionaries then have to be converted recursively:

def set_from_dict(d):
    # Hashable, order-independent stand-in for d; nested dicts are converted too
    return frozenset(
        (k, set_from_dict(v) if isinstance(v, dict) else v)
        for k, v in d.iteritems())

These frozenset() objects represent the dictionary contents well enough to track unique items:

seen = set()    # representations of the dictionaries kept so far
result = []
for d in inputlist:
    representation = set_from_dict(d)
    if representation in seen:
        continue    # exact duplicate of an earlier item, skip it
    result.append(d)
    seen.add(representation)

This preserves the original order of your input list, minus duplicates. If you are using Python 2.7 and up, an OrderedDict would have been helpful here, but you are using Python 2.6, so we need to do it slightly more verbosely.

The above approach takes O(N) time for N input dictionaries of bounded size: each dictionary is converted once, and testing membership in a set takes O(1) time on average.
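
The conversion above assumes that every non-dictionary value is already hashable. If some values were lists, the same idea could be extended as in the sketch below. The names `freeze` and `unique_dicts` and the `sids` field are hypothetical and only serve as illustration; `.items()` is used instead of `.iteritems()` so the code also runs on Python 3. One caveat of this sketch: a list and a tuple with the same items freeze to the same value.

```python
def freeze(value):
    """Recursively convert dicts and lists into hashable equivalents."""
    if isinstance(value, dict):
        # Order-independent: a frozenset of (key, frozen value) pairs
        return frozenset((k, freeze(v)) for k, v in value.items())
    if isinstance(value, list):
        # Order matters for lists, so use a tuple
        return tuple(freeze(v) for v in value)
    return value  # assumed to be hashable already (str, int, ...)

def unique_dicts(dicts):
    """Return the items of `dicts` without duplicates, preserving order."""
    seen = set()
    result = []
    for d in dicts:
        key = freeze(d)
        if key not in seen:
            seen.add(key)
            result.append(d)
    return result

# Illustrative data only; the 'sids' list is not part of the question's data
acl = [
    {'permission': 'full', 'trustee': {'name': 'Users', 'sids': ['S-1-5-32-545']}},
    {'permission': 'full', 'trustee': {'name': 'Users', 'sids': ['S-1-5-32-545']}},
    {'permission': 'full', 'trustee': {'name': 'Administrators', 'sids': ['S-1-5-32-544']}},
]
deduped = unique_dicts(acl)
```

Here `deduped` keeps the first and third entries of `acl`, in their original order.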

Demo:

>>> inputlist = [{'permission': 'full',
...   'permission_type': 'allow',
...   'trustee': {'id': 'SID:S-1-5-32-545',
...               'name': 'Users',
...               'type': 'group'}},
...  {'permission': 'full',
...   'permission_type': 'allow',
...   'trustee': {'id': 'SID:S-1-5-32-545',
...               'name': 'Users',
...               'type': 'group'}},
...  {'permission': 'full',
...   'permission_type': 'allow',
...   'trustee': {'id': 'SID:S-1-5-32-544',
...               'name': 'Administrators',
...               'type': 'group'}}]
>>> def set_from_dict(d):
...     return frozenset(
...         (k, set_from_dict(v) if isinstance(v, dict) else v)
...         for k, v in d.iteritems())
... 
>>> seen = set()
>>> result = []
>>> for d in inputlist:
...     representation = set_from_dict(d)
...     if representation in seen:
...         continue
...     result.append(d)
...     seen.add(representation)
... 
>>> from pprint import pprint
>>> pprint(result)
[{'permission': 'full',
  'permission_type': 'allow',
  'trustee': {'id': 'SID:S-1-5-32-545', 'name': 'Users', 'type': 'group'}},
 {'permission': 'full',
  'permission_type': 'allow',
  'trustee': {'id': 'SID:S-1-5-32-544',
              'name': 'Administrators',
              'type': 'group'}}]
Martijn Pieters
  • Unfortunately I cannot install anything on the system, am limited to Python 2.6 and don't have "collections". I've tried to run set_from_dict on my list and am getting the following error: AttributeError: 'list' object has no attribute 'iteritems' – Intra Dec 09 '14 at 08:36
  • @Intra: already adjusted for. :-) – Martijn Pieters Dec 09 '14 at 08:37
  • Thanks, it looks like the most efficient way of doing it! – Intra Dec 09 '14 at 08:50

Your items are dicts, which are not hashable, so you won't be able to use set directly (check frozenset or this question/answer).
But you can still compare the items:

>>> l[0]==l[1]
True
>>> l[0]==l[2]
False

So simply add each element to a new list if it is not already present:

>>> l2=[]
>>> for i in l:
...   if i not in l2:
...     l2.append(i)
...
>>> pprint(l2)
[{'permission': 'full',
  'permission_type': 'allow',
  'trustee': {'id': 'SID:S-1-5-32-545', 'name': 'Users', 'type': 'group'}},
 {'permission': 'full',
  'permission_type': 'allow',
  'trustee': {'id': 'SID:S-1-5-32-544',
              'name': 'Administrators',
              'type': 'group'}}]
fredtantini
  • This takes quadratic time as you now test *each new dictionary* against all dictionaries seen so far. – Martijn Pieters Dec 09 '14 at 08:33
  • What Martijn said; it's ok for short lists, where it has less overhead than Martijn's method, although it will get bogged down for large lists. Perhaps we need a timeit showdown to see where the break-even point is... – PM 2Ring Dec 09 '14 at 08:36
  • @PM2Ring: the dictionary comparisons do the same work my conversion to sets does, when comparing values; the keys are the same here. The only difference is that the comparison is done in C. But this method does far more comparisons; that 'break-even' point will be lower than you think. – Martijn Pieters Dec 09 '14 at 08:38
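
The "timeit showdown" suggested in the comments above could be sketched roughly as follows. This is only an illustration under assumptions: `to_key`, `dedupe_hashed`, `dedupe_scan` and `make_data` are hypothetical names, the synthetic data merely mimics the shape of the question's list, and no timings are quoted because they depend on the machine and the Python version.

```python
import timeit

def to_key(d):
    # Same idea as set_from_dict, using .items() so it runs on Python 2 and 3
    return frozenset(
        (k, to_key(v) if isinstance(v, dict) else v) for k, v in d.items())

def dedupe_hashed(dicts):
    """Set-based approach: one conversion and one set lookup per item."""
    seen = set()
    result = []
    for d in dicts:
        key = to_key(d)
        if key not in seen:
            seen.add(key)
            result.append(d)
    return result

def dedupe_scan(dicts):
    """List-based approach: compares each item against all items kept so far."""
    result = []
    for d in dicts:
        if d not in result:
            result.append(d)
    return result

def make_data(n):
    # Synthetic input: n entries in which every distinct trustee id appears twice
    return [{'permission': 'full',
             'permission_type': 'allow',
             'trustee': {'id': 'SID:S-1-5-32-%d' % (i // 2),
                         'name': 'Users',
                         'type': 'group'}}
            for i in range(n)]

for n in (10, 100, 1000):
    data = make_data(n)
    t_hashed = timeit.timeit(lambda: dedupe_hashed(data), number=5)
    t_scan = timeit.timeit(lambda: dedupe_scan(data), number=5)
    print('%5d items: hashed %.4fs, scan %.4fs' % (n, t_hashed, t_scan))
```

Both functions return the same list for the same input, so only the running times differ.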