Removing Duplicates From Dictionary

Question

I have the following Python 2.7 dictionary data structure (I do not control source data - comes from another system as is):

{112762853378: 
   {'dst': ['10.121.4.136'], 
    'src': ['1.2.3.4'], 
    'alias': ['www.example.com']
   },
 112762853385: 
   {'dst': ['10.121.4.136'], 
    'src': ['1.2.3.4'], 
    'alias': ['www.example.com']
   },
 112760496444: 
   {'dst': ['10.121.4.136'], 
    'src': ['1.2.3.4']
   },
 112760496502: 
   {'dst': ['10.122.195.34'], 
    'src': ['4.3.2.1']
   },
 112765083670: ...
}

The dictionary keys will always be unique. Dst, src, and alias can be duplicates. All records will always have a dst and src but not every record will necessarily have an alias as seen in the third record.

In the sample data either of the first two records would be removed (doesn't matter to me which one). The third record would be considered unique since although dst and src are the same it is missing alias.

My goal is to remove all records where the dst, src, and alias have all been duplicated - regardless of the key.

How does this rookie accomplish this?

Also, my limited understanding of Python interprets the data structure as a dictionary with the values stored in dictionaries... a dict of dicts, is this correct?

Andrew Cox · Accepted Answer · 2012-01-05T20:57:58.150

53

You could go through each of the items (the key-value pairs) in the dictionary and add them to a result dictionary if the value is not already in the result dictionary.

input_raw = {112762853378: 
   {'dst': ['10.121.4.136'], 
    'src': ['1.2.3.4'], 
    'alias': ['www.example.com']
   },
 112762853385: 
   {'dst': ['10.121.4.136'], 
    'src': ['1.2.3.4'], 
    'alias': ['www.example.com']
   },
 112760496444: 
   {'dst': ['10.121.4.136'], 
    'src': ['1.2.3.4']
   },
 112760496502: 
   {'dst': ['10.122.195.34'], 
    'src': ['4.3.2.1']
   }
}

result = {}

for key,value in input_raw.items():
    if value not in result.values():
        result[key] = value

print result
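
The linear search over result.values() on every iteration can be avoided by remembering a hashable summary of each record in a set. The following is only a minimal sketch, not part of the original answer: the helper names record_key and dedupe are made up for illustration, and it assumes every record has 'dst' and 'src' lists plus an optional 'alias' list, as described in the question. It is written to run on Python 3 and should also work on 2.7.

```python
def record_key(record):
    """Return a hashable summary of a record's dst, src and optional alias."""
    # Lists are unhashable, so convert each one to a tuple; a missing
    # field (only 'alias' can be missing, per the question) becomes None.
    return tuple(
        tuple(record[field]) if field in record else None
        for field in ('dst', 'src', 'alias')
    )


def dedupe(records):
    """Return a new dict that keeps one record per distinct (dst, src, alias)."""
    seen = set()
    result = {}
    for key, value in records.items():
        summary = record_key(value)
        if summary not in seen:  # set membership is O(1) on average
            seen.add(summary)
            result[key] = value
    return result


sample = {
    112762853378: {'dst': ['10.121.4.136'], 'src': ['1.2.3.4'],
                   'alias': ['www.example.com']},
    112762853385: {'dst': ['10.121.4.136'], 'src': ['1.2.3.4'],
                   'alias': ['www.example.com']},
    112760496444: {'dst': ['10.121.4.136'], 'src': ['1.2.3.4']},
    112760496502: {'dst': ['10.122.195.34'], 'src': ['4.3.2.1']},
}
deduped = dedupe(sample)
```

Which of the two duplicate keys survives depends on iteration order, which the question says does not matter.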

edited Jan 05 '12 at 20:57

answered Jan 05 '12 at 20:40

Andrew Cox


This is a good starting point, but I feel compelled to point out that it will be slow for large collections of data, because with every loop, it creates a new list of values and does a linear search over it. – senderle Jan 05 '12 at 21:23
@senderle: I appreciate your thought and comment regarding the speed and will take this into consideration if necessary. Do you care to expand on this answer to increase performance? – Bit Bucket Jan 05 '12 at 21:27
This doesn't answer the question as posed. – joel3000 Jan 05 '12 at 21:45
Looks fine to me Joel. What's wrong with it? *edit:* I think I see where your concern is, but `in` will work fine here. – Brigand Jan 05 '12 at 21:54
@FakeRainBrigand - it does not delete only the entries for which dst, src, and alias are duplicates. 'My goal is to remove all records where the dst, src, and alias have all been duplicated - regardless of the key.' My answer and some of the others below do that. – joel3000 Jan 05 '12 at 22:18
@joel3000 I still don't see why Andrew Cox's answer doesn't do what is asked. It is not a good answer because of its algorithm, as senderle pointed out, but it works correctly, I think. Your answer is just as complicated and suffers from bad performance too, in my opinion. – eyquem Jan 06 '12 at 01:08
@eyquem. The current solution given by joel3000, doesn't actually work at all. It tries to use lists as keys - but even if they are converted to tuples, the output is still completely wrong. – ekhumoro Jan 06 '12 at 01:59
@ekhumoro OK, thank you. I didn't see that, I didn't execute his code and didn't go detailing his answer. – eyquem Jan 06 '12 at 02:05
@ekhumoro - the question asks for this - "My goal is to remove all records where the dst, src, and alias have all been duplicated - regardless of the key." That means the first two entries should be deleted and only the last two shown. Andrew's does not do that. Mine does. I fixed the bug. FWIW. – joel3000 Jan 08 '12 at 05:30

senderle · Answer 2 · 2012-01-05T21:28:08.167

One simple approach would be to create a reverse dictionary using the concatenation of the string data in each inner dictionary as a key. So say you have the above data in a dictionary, d:

>>> import collections
>>> reverse_d = collections.defaultdict(list)
>>> for key, inner_d in d.iteritems():
...     key_str = ''.join(inner_d[k][0] for k in ['dst', 'src', 'alias'] if k in inner_d)
...     reverse_d[key_str].append(key)
... 
>>> duplicates = [keys for key_str, keys in reverse_d.iteritems() if len(keys) > 1]
>>> duplicates
[[112762853385, 112762853378]]

If you don't want a list of duplicates or anything like that, but just want to create a duplicate-less dict, you could just use a regular dictionary instead of a defaultdict and re-reverse it like so:

>>> reverse_d = {}
>>> for key, inner_d in d.iteritems():
...     key_str = ''.join(inner_d[k][0] for k in ['dst', 'src', 'alias'] if k in inner_d)
...     reverse_d[key_str] = key
... 
>>> new_d = dict((val, d[val]) for val in reverse_d.itervalues())
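
One caveat with plain concatenation: two different records can join to the same string, and only the first element of each list is used. Below is a sketch of the same reverse-dictionary idea with a tuple key, which keeps the fields separate. The function names and the two records are illustrative assumptions, not data from the question.

```python
def concat_key(inner_d):
    """Key built as in the answer above: first elements joined with no separator."""
    return ''.join(inner_d[k][0] for k in ['dst', 'src', 'alias'] if k in inner_d)


def tuple_key(inner_d):
    """Alternative key: one tuple slot per field, None when the field is absent."""
    return tuple(
        tuple(inner_d[k]) if k in inner_d else None
        for k in ['dst', 'src', 'alias']
    )


# Two different (hypothetical) records whose concatenated keys coincide.
a = {'dst': ['1.1.1.1'], 'src': ['11.1.1.1']}
b = {'dst': ['1.1.1.11'], 'src': ['1.1.1.1']}

same_concat = concat_key(a) == concat_key(b)   # both are '1.1.1.111.1.1.1'
same_tuple = tuple_key(a) == tuple_key(b)      # the tuples differ
```

With the tuple key, reverse_d[tuple_key(inner_d)] can be used exactly as reverse_d[key_str] is used above.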

eyquem · Answer 3 · 2012-01-06T01:40:14.007

input_raw = {112762853378:  {'dst': ['10.121.4.136'],
                             'src': ['1.2.3.4'],
                             'alias': ['www.example.com']    },
             112762853385:  {'dst': ['10.121.4.136'],
                             'src': ['1.2.3.4'],
                             'alias': ['www.example.com']    },
             112760496444:  {'dst': ['10.121.4.299'],
                             'src': ['1.2.3.4']    },
             112760496502:  {'dst': ['10.122.195.34'],
                             'src': ['4.3.2.1']    },
             112758601487:  {'src': ['1.2.3.4'],
                             'alias': ['www.example.com'],
                             'dst': ['10.121.4.136']},
             112757412898:  {'dst': ['10.122.195.34'],
                             'src': ['4.3.2.1']    },
             112757354733:  {'dst': ['124.12.13.14'],
                             'src': ['8.5.6.0']},             
             }

for x in input_raw.iteritems():
    print x
print '\n---------------------------\n'

seen = []

for k,val in input_raw.items():
    if val in seen:
        del input_raw[k]
    else:
        seen.append(val)


for x in input_raw.iteritems():
    print x

result

(112762853385L, {'src': ['1.2.3.4'], 'dst': ['10.121.4.136'], 'alias': ['www.example.com']})
(112757354733L, {'src': ['8.5.6.0'], 'dst': ['124.12.13.14']})
(112758601487L, {'src': ['1.2.3.4'], 'dst': ['10.121.4.136'], 'alias': ['www.example.com']})
(112757412898L, {'src': ['4.3.2.1'], 'dst': ['10.122.195.34']})
(112760496502L, {'src': ['4.3.2.1'], 'dst': ['10.122.195.34']})
(112760496444L, {'src': ['1.2.3.4'], 'dst': ['10.121.4.299']})
(112762853378L, {'src': ['1.2.3.4'], 'dst': ['10.121.4.136'], 'alias': ['www.example.com']})

---------------------------

(112762853385L, {'src': ['1.2.3.4'], 'dst': ['10.121.4.136'], 'alias': ['www.example.com']})
(112757354733L, {'src': ['8.5.6.0'], 'dst': ['124.12.13.14']})
(112757412898L, {'src': ['4.3.2.1'], 'dst': ['10.122.195.34']})
(112760496444L, {'src': ['1.2.3.4'], 'dst': ['10.121.4.299']})

This solution has two drawbacks: it first creates a list with input_raw.items() (as in Andrew Cox's answer), and it needs a growing list, seen.
But the first can't be avoided (deleting entries while looping over iteritems() doesn't work), and the second is less heavy than re-creating the list result.values() from the growing dict result on each turn of the loop.
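
To illustrate why a snapshot of the items is needed: deleting from a dict while iterating over it directly raises RuntimeError. The following is a small sketch on made-up data (the names records, probe and failed are illustrative). Here list(records.items()) plays the role of Python 2's items(), so the code also runs on Python 3.

```python
records = {1: {'src': ['a']}, 2: {'src': ['a']}, 3: {'src': ['b']}}

# Deleting while iterating over the live dict is rejected by Python.
failed = False
probe = dict(records)
try:
    for key in probe:
        del probe[key]
except RuntimeError:  # "dictionary changed size during iteration"
    failed = True

# Iterating over a snapshot of the items makes in-place deletion safe.
seen = []
for key, value in list(records.items()):
    if value in seen:
        del records[key]
    else:
        seen.append(value)
```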

ekhumoro · Answer 4 · 2012-01-06T13:17:37.277

Another reverse dict variation:

>>> import pprint
>>> 
>>> data = {
...   112762853378: 
...    {'dst': ['10.121.4.136'], 
...     'src': ['1.2.3.4'], 
...     'alias': ['www.example.com']
...    },
...  112762853385: 
...    {'dst': ['10.121.4.136'], 
...     'src': ['1.2.3.4'], 
...     'alias': ['www.example.com']
...    },
...  112760496444: 
...    {'dst': ['10.121.4.136'], 
...     'src': ['1.2.3.4']
...    },
...  112760496502: 
...    {'dst': ['10.122.195.34'], 
...     'src': ['4.3.2.1']
...    },
... }
>>> 
>>> keep = set({repr(sorted(value.items())):key
...             for key,value in data.iteritems()}.values())
>>> 
>>> for key in data.keys():
...     if key not in keep:
...         del data[key]
... 
>>> 
>>> pprint.pprint(data)
{112760496444L: {'dst': ['10.121.4.136'], 'src': ['1.2.3.4']},
 112760496502L: {'dst': ['10.122.195.34'], 'src': ['4.3.2.1']},
 112762853378L: {'alias': ['www.example.com'],
                 'dst': ['10.121.4.136'],
                 'src': ['1.2.3.4']}}

Seems like this would count `{'src':['1.2.3.4'], 'dst':['10.121.3.1236']}` and `{'src':['10.121.3.1236'], 'dst':['1.2.3.4']}` as duplicates of one another... — senderle, Jan 06 '12 at 07:30
@senderle. Well spotted! Fixed that now, FWIW. I should probably also point out that this solution, although compact, is pretty inefficient compared to some of the others. — ekhumoro, Jan 06 '12 at 13:20
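
The items are sorted before taking repr because two equal dicts can list their keys in different orders. Here is a brief sketch of that idea on hypothetical records (canonical is an illustrative helper name, not from the answer):

```python
def canonical(record):
    """Order-independent string form of a record, usable as a dict key."""
    return repr(sorted(record.items()))


r1 = {'dst': ['10.121.4.136'], 'src': ['1.2.3.4']}
r2 = {'src': ['1.2.3.4'], 'dst': ['10.121.4.136']}   # same data, other order
r3 = {'dst': ['1.2.3.4'], 'src': ['10.121.4.136']}   # src and dst swapped

same_record = canonical(r1) == canonical(r2)
swapped_differs = canonical(r1) != canonical(r3)
```

Because the field names are part of the key, swapping src and dst produces a different key.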

jsbueno · Answer 5 · 2012-01-05T20:46:13.197

A dictionary is the natural tool for enforcing uniqueness, since each key can appear only once. So the way to go is to create a reversed dict whose keys are composed from your values, then recreate a "de-reversed" dictionary from that intermediate result.

dct = {112762853378: 
   {'dst': ['10.121.4.136'], 
    'src': ['1.2.3.4'], 
    'alias': ['www.example.com']
   },
 112762853385: 
   {'dst': ['10.121.4.136'], 
    'src': ['1.2.3.4'], 
    'alias': ['www.example.com']
   },
 112760496444: 
   {'dst': ['10.121.4.136'], 
    'src': ['1.2.3.4']
   },
 112760496502: 
   {'dst': ['10.122.195.34'], 
    'src': ['4.3.2.1']
   },
   }

def remove_dups (dct):
    reversed_dct = {}
    for key, val in dct.items():
        new_key = tuple(val["dst"]) + tuple(val["src"]) + (tuple(val["alias"]) if "alias" in val else (None,) ) 
        reversed_dct[new_key] = key
    result_dct = {}
    for key, val in reversed_dct.items():
        result_dct[val] = dct[val]
    return result_dct

result = remove_dups(dct)
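
A side effect of the reversed-dict approach is that a later duplicate overwrites an earlier one, so the key that survives is the last one iterated. If the first should be kept instead, dict.setdefault can be used. This is a minimal sketch under that assumption: remove_dups_keep_first is a hypothetical variant, not the author's function, and the "first" key is only predictable where dicts preserve insertion order (Python 3.7+).

```python
def remove_dups_keep_first(dct):
    """Like remove_dups above, but the first key seen for each record wins."""
    reversed_dct = {}
    for key, val in dct.items():
        new_key = (
            tuple(val["dst"]),
            tuple(val["src"]),
            tuple(val["alias"]) if "alias" in val else None,
        )
        # setdefault stores the key only if new_key is not present yet.
        reversed_dct.setdefault(new_key, key)
    return dict((key, dct[key]) for key in reversed_dct.values())


sample = {
    112762853378: {'dst': ['10.121.4.136'], 'src': ['1.2.3.4'],
                   'alias': ['www.example.com']},
    112762853385: {'dst': ['10.121.4.136'], 'src': ['1.2.3.4'],
                   'alias': ['www.example.com']},
    112760496444: {'dst': ['10.121.4.136'], 'src': ['1.2.3.4']},
}
kept = remove_dups_keep_first(sample)
```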

joel3000 · Answer 6 · 2012-01-06T03:50:57.100

2

dups={}

for key,val in dct.iteritems():
    if val.get('alias') != None:
        ref = "%s%s%s" % (val['dst'] , val['src'] ,val['alias'])# a simple hash
        dups.setdefault(ref,[]) 
        dups[ref].append(key)

for k,v in dups.iteritems():
    if len(v) > 1:
        for key in v:
            del dct[key]

edited Jan 06 '12 at 03:50

answered Jan 05 '12 at 20:56

joel3000


Had to update this. Should work now, if I understand the question correctly. – joel3000 Jan 05 '12 at 21:40

score 1 · Answer 7 · edited Mar 06 '20 at 20:28

I solved it using a dictionary comprehension:

dic = {112762853378: 
    {'dst': ['10.121.4.136'], 
     'src': ['1.2.3.4'], 
     'alias': ['www.example.com']
    },
112762853385: 
    {'dst': ['10.121.4.136'], 
     'src': ['1.2.3.4'], 
     'alias': ['www.example.com']
    },
112760496444: 
    {'dst': ['10.121.4.136'], 
     'src': ['1.2.3.4']
    },
112760496502: 
    {'dst': ['10.122.195.34'], 
     'src': ['4.3.2.1']
    }
}

result = {k:v for k,v in dic.items() if list(dic.values()).count(v)==1}
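
Note that this comprehension keeps only records whose value occurs exactly once, so every copy of a duplicated record is dropped, and it rescans all values for each item. Below is a sketch of the same filter that counts each record once with collections.Counter. The helper record_key is an illustrative assumption (lists are unhashable, so each record is first reduced to a tuple).

```python
from collections import Counter


def record_key(record):
    """Hashable stand-in for a record: sorted (field, tuple(values)) pairs."""
    return tuple(sorted((field, tuple(values)) for field, values in record.items()))


dic = {
    112762853378: {'dst': ['10.121.4.136'], 'src': ['1.2.3.4'],
                   'alias': ['www.example.com']},
    112762853385: {'dst': ['10.121.4.136'], 'src': ['1.2.3.4'],
                   'alias': ['www.example.com']},
    112760496444: {'dst': ['10.121.4.136'], 'src': ['1.2.3.4']},
    112760496502: {'dst': ['10.122.195.34'], 'src': ['4.3.2.1']},
}

# Count how often each distinct record occurs, in a single pass.
counts = Counter(record_key(v) for v in dic.values())

# Keep only the records that occur exactly once (both duplicates are dropped).
result = {k: v for k, v in dic.items() if counts[record_key(v)] == 1}
```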

edited Mar 06 '20 at 20:28

kaya3


answered Mar 06 '20 at 19:19

Heriberto Ortiz Hernandez


But surely it removes *all* occurrences of keys with duplicated values... – mirekphd Apr 20 '22 at 15:21

reclosedev · Answer 8 · 2012-01-05T21:25:09.127

1

from collections import defaultdict

dups = defaultdict(lambda : defaultdict(list))

for key, entry in data.iteritems():
    dups[tuple(entry.keys())][tuple([v[0] for v in entry.values()])].append(key)

for dup_indexes in dups.values():
    for keys in dup_indexes.values():
        for key in keys[1:]:
            if key in data:
                del data[key]

edited Jan 05 '12 at 21:25

answered Jan 05 '12 at 20:43

reclosedev


The complexity of this is O(n^3)! – FaCoffee May 04 '18 at 14:16

score 0 · Answer 9 · answered Aug 28 '19 at 03:25

I would just make a set of the list of keys then iterate over them into a new dict:

input_raw = {112762853378: 
   {'dst': ['10.121.4.136'], 
    'src': ['1.2.3.4'], 
    'alias': ['www.example.com']
   },
 112762853385: 
   {'dst': ['10.121.4.136'], 
    'src': ['1.2.3.4'], 
    'alias': ['www.example.com']
   },
 112760496444: 
   {'dst': ['10.121.4.136'], 
    'src': ['1.2.3.4']
   },
 112760496502: 
   {'dst': ['10.122.195.34'], 
    'src': ['4.3.2.1']
   }
}

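# Keys of a dict are already unique, so this set holds every key once.
unique_keys = list(set(input_raw.keys()))

fixedlist = {}

# Copy each record into the new dict under its original key.
for i in unique_keys:
    fixedlist[i] = input_raw[i]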

score -1 · Answer 10 · edited Jun 18 '19 at 18:48

You can use

set(dictionary)

to solve your problem.
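
For reference, here is a quick sketch of what set() does here, on a made-up two-record dictionary: set(dictionary) collects the keys only, and building a set from the inner dicts fails because dicts are unhashable.

```python
dictionary = {
    1: {'dst': ['10.121.4.136'], 'src': ['1.2.3.4']},
    2: {'dst': ['10.121.4.136'], 'src': ['1.2.3.4']},
}

# Iterating a dict yields its keys, so this is a set of keys, not of records.
keys_only = set(dictionary)

try:
    set(dictionary.values())
    values_hashable = True
except TypeError:  # unhashable type: 'dict'
    values_hashable = False
```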

edited Jun 18 '19 at 18:48

Bhargav Rao


answered Jun 18 '19 at 18:45

Joao Vitor Deon


this might cause: TypeError: unhashable type: 'dict' – Barak Schoster Oct 23 '19 at 10:54

score -3 · Answer 11 · edited Jan 31 '18 at 14:09

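example = {
    'id1': {'name': 'jay', 'age': 22},
    'id2': {'name': 'salman', 'age': 52},
    'id3': {'name': 'Ranveer', 'age': 26},
    'id4': {'name': 'jay', 'age': 22},
}
# Loop over snapshots of the keys so entries can be deleted safely.
for item in list(example):
    if item not in example:  # already removed as a duplicate
        continue
    for value in list(example):
        # Delete any other entry whose record equals this one.
        if item != value and example[item] == example[value]:
            del example[value]
print "example", example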

edited Jan 31 '18 at 14:09

Gianluca


answered Jan 31 '18 at 12:34

chandresh thakor


Please format your answer with the `{}` button, format matters in Python. And it is a very bad idea to modify lists or dictionaries, while iterating over them. Very bad. – Mr. T Jan 31 '18 at 12:44
Welcome to StackOverflow: if you post code, XML or data samples, please highlight those lines in the text editor and click on the "code samples" button ( { } ) on the editor toolbar or using Ctrl+K on your keyboard to nicely format and syntax highlight it! – WhatsThePoint Jan 31 '18 at 12:58
... and nested `for` loops in any scripting language is a bad idea... It is also time to move on from python 2.x – mirekphd Apr 20 '22 at 15:25
