43

I have the following Python 2.7 dictionary data structure (I do not control source data - comes from another system as is):

{112762853378: 
   {'dst': ['10.121.4.136'], 
    'src': ['1.2.3.4'], 
    'alias': ['www.example.com']
   },
 112762853385: 
   {'dst': ['10.121.4.136'], 
    'src': ['1.2.3.4'], 
    'alias': ['www.example.com']
   },
 112760496444: 
   {'dst': ['10.121.4.136'], 
    'src': ['1.2.3.4']
   },
 112760496502: 
   {'dst': ['10.122.195.34'], 
    'src': ['4.3.2.1']
   },
 112765083670: ...
}

The dictionary keys will always be unique. Dst, src, and alias can be duplicates. All records will always have a dst and src but not every record will necessarily have an alias as seen in the third record.

In the sample data either of the first two records would be removed (doesn't matter to me which one). The third record would be considered unique since although dst and src are the same it is missing alias.

My goal is to remove all records where the dst, src, and alias have all been duplicated - regardless of the key.

How does this rookie accomplish this?

Also, my limited understanding of Python interprets the data structure as a dictionary with the values stored in dictionaries... a dict of dicts, is this correct?

Ken Y-N
Bit Bucket

11 Answers

53

You could go through each of the items (the key-value pairs) in the dictionary and add them to a result dictionary if the value is not already in the result dictionary.

input_raw = {112762853378: 
   {'dst': ['10.121.4.136'], 
    'src': ['1.2.3.4'], 
    'alias': ['www.example.com']
   },
 112762853385: 
   {'dst': ['10.121.4.136'], 
    'src': ['1.2.3.4'], 
    'alias': ['www.example.com']
   },
 112760496444: 
   {'dst': ['10.121.4.136'], 
    'src': ['1.2.3.4']
   },
 112760496502: 
   {'dst': ['10.122.195.34'], 
    'src': ['4.3.2.1']
   }
}

result = {}

for key,value in input_raw.items():
    if value not in result.values():
        result[key] = value

print result
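The loop above rebuilds result.values() and scans it on every iteration. A minimal sketch of a faster variant follows, assuming the inner lists only hold hashable items such as strings. The helper names freeze and dedupe are hypothetical, not part of the original answer; the idea is to remember a hashable fingerprint of each record in a set so the membership test is constant time on average.

```python
def freeze(record):
    """Return a hashable, order-independent fingerprint of one record.

    Assumes each value is a list of hashable items (strings here).
    """
    return frozenset((field, tuple(values)) for field, values in record.items())


def dedupe(records):
    """Keep one record per distinct fingerprint (hypothetical helper)."""
    seen = set()
    result = {}
    for key, record in records.items():
        fingerprint = freeze(record)
        if fingerprint not in seen:
            seen.add(fingerprint)
            result[key] = record
    return result


input_raw = {
    112762853378: {'dst': ['10.121.4.136'], 'src': ['1.2.3.4'],
                   'alias': ['www.example.com']},
    112762853385: {'dst': ['10.121.4.136'], 'src': ['1.2.3.4'],
                   'alias': ['www.example.com']},
    112760496444: {'dst': ['10.121.4.136'], 'src': ['1.2.3.4']},
    112760496502: {'dst': ['10.122.195.34'], 'src': ['4.3.2.1']},
}

result = dedupe(input_raw)
print(len(result))  # 3
```

Which of the two identical records survives depends on iteration order, which matches the "doesn't matter which one" requirement in the question.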
Andrew Cox
  • 15
    This is a good starting point, but I feel compelled to point out that it will be slow for large collections of data, because with every loop, it creates a new list of values and does a linear search over it. – senderle Jan 05 '12 at 21:23
  • @senderle: I appreciate your thought and comment regarding the speed and will take this into consideration if necessary. Do you care to expand on this answer to increase performance? – Bit Bucket Jan 05 '12 at 21:27
  • This doesn't answer the question as posed. – joel3000 Jan 05 '12 at 21:45
  • Looks fine to me Joel. What's wrong with it? *edit:* I think I see where your concern is, but `in` will work fine here. – Brigand Jan 05 '12 at 21:54
  • @FakeRainBrigand - it does not delete only the entries for which dst, src, and alias are duplicates. 'My goal is to remove all records where the dst, src, and alias have all been duplicated - regardless of the key.' My answer and some of the others below do that. – joel3000 Jan 05 '12 at 22:18
  • @joel3000 I still don't see why Andrew Cox's answer doesn't do what is asked. It is not a good answer because of its algorithm, as senderle pointed out, but I think it works correctly. Your answer is just as complicated and also suffers from poor performance, in my opinion. – eyquem Jan 06 '12 at 01:08
  • @eyquem. The current solution given by joel3000, doesn't actually work at all. It tries to use lists as keys - but even if they are converted to tuples, the output is still completely wrong. – ekhumoro Jan 06 '12 at 01:59
  • @ekhumoro OK, thank you. I didn't see that; I didn't execute his code or look at his answer in detail. – eyquem Jan 06 '12 at 02:05
  • @ekhumoro - the question asks for this - "My goal is to remove all records where the dst, src, and alias have all been duplicated - regardless of the key." That means the first two entries should be deleted and only the last two shown. Andrew's does not do that. Mine does. I fixed the bug. FWIW. – joel3000 Jan 08 '12 at 05:30
6

One simple approach would be to create a reverse dictionary using the concatenation of the string data in each inner dictionary as a key. So say you have the above data in a dictionary, d:

>>> import collections
>>> reverse_d = collections.defaultdict(list)
>>> for key, inner_d in d.iteritems():
...     key_str = ''.join(inner_d[k][0] for k in ['dst', 'src', 'alias'] if k in inner_d)
...     reverse_d[key_str].append(key)
... 
>>> duplicates = [keys for key_str, keys in reverse_d.iteritems() if len(keys) > 1]
>>> duplicates
[[112762853385, 112762853378]]

If you don't want a list of duplicates or anything like that, but just want to create a duplicate-less dict, you could just use a regular dictionary instead of a defaultdict and re-reverse it like so:

>>> reverse_d = {}  # a regular dict: later keys overwrite earlier ones
>>> for key, inner_d in d.iteritems():
...     key_str = ''.join(inner_d[k][0] for k in ['dst', 'src', 'alias'] if k in inner_d)
...     reverse_d[key_str] = key
... 
>>> new_d = dict((val, d[val]) for val in reverse_d.itervalues())
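One caveat with plain string concatenation is that two different records can join to the same string. The sketch below swaps in a tuple key to avoid that; it is a variation, not the method given above, and the names joined and record_key are hypothetical helpers introduced for illustration. It assumes, like the original, that each inner list holds strings.

```python
FIELDS = ('dst', 'src', 'alias')


def joined(record):
    """The string key used above: first element of each present field."""
    return ''.join(record[k][0] for k in FIELDS if k in record)


def record_key(record):
    """Tuple key: field boundaries are kept, a missing field becomes None."""
    return tuple(tuple(record[k]) if k in record else None for k in FIELDS)


# Two different records whose concatenated strings happen to coincide.
a = {'dst': ['1.2.3.4'], 'src': ['11.2.3.4']}
b = {'dst': ['1.2.3.41'], 'src': ['1.2.3.4']}
print(joined(a) == joined(b))          # True: the string keys collide
print(record_key(a) == record_key(b))  # False: the tuple keys do not

d = {
    112762853378: {'dst': ['10.121.4.136'], 'src': ['1.2.3.4'],
                   'alias': ['www.example.com']},
    112762853385: {'dst': ['10.121.4.136'], 'src': ['1.2.3.4'],
                   'alias': ['www.example.com']},
    112760496444: {'dst': ['10.121.4.136'], 'src': ['1.2.3.4']},
    112760496502: {'dst': ['10.122.195.34'], 'src': ['4.3.2.1']},
}

# Reverse, then re-reverse, exactly as above but with tuple keys.
reverse_d = {}
for key, inner_d in d.items():
    reverse_d[record_key(inner_d)] = key
new_d = dict((key, d[key]) for key in reverse_d.values())
```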
senderle
4
input_raw = {112762853378:  {'dst': ['10.121.4.136'],
                             'src': ['1.2.3.4'],
                             'alias': ['www.example.com']    },
             112762853385:  {'dst': ['10.121.4.136'],
                             'src': ['1.2.3.4'],
                             'alias': ['www.example.com']    },
             112760496444:  {'dst': ['10.121.4.299'],
                             'src': ['1.2.3.4']    },
             112760496502:  {'dst': ['10.122.195.34'],
                             'src': ['4.3.2.1']    },
             112758601487:  {'src': ['1.2.3.4'],
                             'alias': ['www.example.com'],
                             'dst': ['10.121.4.136']},
             112757412898:  {'dst': ['10.122.195.34'],
                             'src': ['4.3.2.1']    },
             112757354733:  {'dst': ['124.12.13.14'],
                             'src': ['8.5.6.0']},             
             }

for x in input_raw.iteritems():
    print x
print '\n---------------------------\n'

seen = []

for k,val in input_raw.items():
    if val in seen:
        del input_raw[k]
    else:
        seen.append(val)


for x in input_raw.iteritems():
    print x

Result:

(112762853385L, {'src': ['1.2.3.4'], 'dst': ['10.121.4.136'], 'alias': ['www.example.com']})
(112757354733L, {'src': ['8.5.6.0'], 'dst': ['124.12.13.14']})
(112758601487L, {'src': ['1.2.3.4'], 'dst': ['10.121.4.136'], 'alias': ['www.example.com']})
(112757412898L, {'src': ['4.3.2.1'], 'dst': ['10.122.195.34']})
(112760496502L, {'src': ['4.3.2.1'], 'dst': ['10.122.195.34']})
(112760496444L, {'src': ['1.2.3.4'], 'dst': ['10.121.4.299']})
(112762853378L, {'src': ['1.2.3.4'], 'dst': ['10.121.4.136'], 'alias': ['www.example.com']})

---------------------------

(112762853385L, {'src': ['1.2.3.4'], 'dst': ['10.121.4.136'], 'alias': ['www.example.com']})
(112757354733L, {'src': ['8.5.6.0'], 'dst': ['124.12.13.14']})
(112757412898L, {'src': ['4.3.2.1'], 'dst': ['10.122.195.34']})
(112760496444L, {'src': ['1.2.3.4'], 'dst': ['10.121.4.299']})

This solution has two drawbacks: it first creates a list with input_raw.items() (as in Andrew Cox's answer), and it requires a growing list, seen.
But the first can't be avoided (iterating with iteritems() doesn't work, because the dictionary is modified during the loop), and the second is less costly than re-creating the list result.values() from the growing dict result on each pass through the loop.
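A small sketch of why the snapshot is needed, using toy data that is not from the question: deleting from a dictionary while iterating over it directly (which is what iteritems() does in Python 2, and what items() does in Python 3) raises a RuntimeError, while iterating over a list copy of the items is safe.

```python
records = {1: 'a', 2: 'a', 3: 'b'}

# Deleting during live iteration fails once the dict changes size.
try:
    for key in records:
        if records[key] == 'a':
            del records[key]
    error = None
except RuntimeError as exc:
    error = exc
print(error is not None)  # True

# Iterating over a snapshot of the items makes in-place deletion safe.
records = {1: 'a', 2: 'a', 3: 'b'}
seen = []
for key, value in list(records.items()):
    if value in seen:
        del records[key]
    else:
        seen.append(value)
print(sorted(records.values()))  # ['a', 'b']
```

In Python 2, input_raw.items() already returns a list, so the explicit list(...) call only matters on Python 3.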

eyquem
3

Another reverse dict variation:

>>> import pprint
>>> 
>>> data = {
...   112762853378: 
...    {'dst': ['10.121.4.136'], 
...     'src': ['1.2.3.4'], 
...     'alias': ['www.example.com']
...    },
...  112762853385: 
...    {'dst': ['10.121.4.136'], 
...     'src': ['1.2.3.4'], 
...     'alias': ['www.example.com']
...    },
...  112760496444: 
...    {'dst': ['10.121.4.136'], 
...     'src': ['1.2.3.4']
...    },
...  112760496502: 
...    {'dst': ['10.122.195.34'], 
...     'src': ['4.3.2.1']
...    },
... }
>>> 
>>> keep = set({repr(sorted(value.items())):key
...             for key,value in data.iteritems()}.values())
>>> 
>>> for key in data.keys():
...     if key not in keep:
...         del data[key]
... 
>>> 
>>> pprint.pprint(data)
{112760496444L: {'dst': ['10.121.4.136'], 'src': ['1.2.3.4']},
 112760496502L: {'dst': ['10.122.195.34'], 'src': ['4.3.2.1']},
 112762853378L: {'alias': ['www.example.com'],
                 'dst': ['10.121.4.136'],
                 'src': ['1.2.3.4']}}
ekhumoro
  • 3
    Fine but complicated in my opinion – eyquem Jan 06 '12 at 01:24
  • Seems like this would count `{'src':['1.2.3.4'], 'dst':['10.121.3.1236']}` and `{'src':['10.121.3.1236'], 'dst':['1.2.3.4']}` as duplicates of one another... – senderle Jan 06 '12 at 07:30
  • @senderle. Well spotted! Fixed that now, FWIW. I should probably also point out that this solution, although compact, is pretty inefficient compared to some of the others. – ekhumoro Jan 06 '12 at 13:20
2

Dictionaries enforce uniqueness on their keys, so the way to go is to build a reversed dict whose keys are composed from your values, then recreate a "de-reversed" dictionary from that intermediate result.

dct = {112762853378: 
   {'dst': ['10.121.4.136'], 
    'src': ['1.2.3.4'], 
    'alias': ['www.example.com']
   },
 112762853385: 
   {'dst': ['10.121.4.136'], 
    'src': ['1.2.3.4'], 
    'alias': ['www.example.com']
   },
 112760496444: 
   {'dst': ['10.121.4.136'], 
    'src': ['1.2.3.4']
   },
 112760496502: 
   {'dst': ['10.122.195.34'], 
    'src': ['4.3.2.1']
   },
   }

def remove_dups (dct):
    reversed_dct = {}
    for key, val in dct.items():
        new_key = tuple(val["dst"]) + tuple(val["src"]) + (tuple(val["alias"]) if "alias" in val else (None,) ) 
        reversed_dct[new_key] = key
    result_dct = {}
    for key, val in reversed_dct.items():
        result_dct[val] = dct[val]
    return result_dct

result = remove_dups(dct)
jsbueno
2
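dups = {}

# dct is the input dictionary; group its keys by dst/src/alias contents.
for key, val in dct.iteritems():
    if val.get('alias') is not None:
        ref = "%s%s%s" % (val['dst'], val['src'], val['alias'])  # a simple hash
        dups.setdefault(ref, []).append(key)

# Delete every record whose dst/src/alias combination occurs more than once.
for k, v in dups.iteritems():
    if len(v) > 1:
        for key in v:
            del dct[key]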
joel3000
1

I solved it using a dictionary comprehension (note that it drops every copy of a duplicated record rather than keeping one of them):

dic = {112762853378: 
    {'dst': ['10.121.4.136'], 
     'src': ['1.2.3.4'], 
     'alias': ['www.example.com']
    },
112762853385: 
    {'dst': ['10.121.4.136'], 
     'src': ['1.2.3.4'], 
     'alias': ['www.example.com']
    },
112760496444: 
    {'dst': ['10.121.4.136'], 
     'src': ['1.2.3.4']
    },
112760496502: 
    {'dst': ['10.122.195.34'], 
     'src': ['4.3.2.1']
    }
}

result = {k:v for k,v in dic.items() if list(dic.values()).count(v)==1}
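The comprehension above calls list(dic.values()).count(v) once per record, so the work grows quadratically with the number of records. A sketch of the same filter done in two passes with collections.Counter follows. The helper name freeze is hypothetical, and the sketch assumes the inner lists hold hashable items such as strings. Like the original, it keeps only records whose contents occur exactly once.

```python
from collections import Counter


def freeze(record):
    """Hashable fingerprint of a record (hypothetical helper)."""
    return frozenset((field, tuple(values)) for field, values in record.items())


dic = {
    112762853378: {'dst': ['10.121.4.136'], 'src': ['1.2.3.4'],
                   'alias': ['www.example.com']},
    112762853385: {'dst': ['10.121.4.136'], 'src': ['1.2.3.4'],
                   'alias': ['www.example.com']},
    112760496444: {'dst': ['10.121.4.136'], 'src': ['1.2.3.4']},
    112760496502: {'dst': ['10.122.195.34'], 'src': ['4.3.2.1']},
}

# First pass: count how often each distinct record occurs.
counts = Counter(freeze(v) for v in dic.values())

# Second pass: keep only the records that occur exactly once.
result = {k: v for k, v in dic.items() if counts[freeze(v)] == 1}
print(sorted(result))  # [112760496444, 112760496502]
```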
kaya3
1
from collections import defaultdict

dups = defaultdict(lambda : defaultdict(list))

for key, entry in data.iteritems():
    dups[tuple(entry.keys())][tuple([v[0] for v in entry.values()])].append(key)

for dup_indexes in dups.values():
    for keys in dup_indexes.values():
        for key in keys[1:]:
            if key in data:
                del data[key]
reclosedev
0

I would just make a set of the list of keys then iterate over them into a new dict:

input_raw = {112762853378: 
   {'dst': ['10.121.4.136'], 
    'src': ['1.2.3.4'], 
    'alias': ['www.example.com']
   },
 112762853385: 
   {'dst': ['10.121.4.136'], 
    'src': ['1.2.3.4'], 
    'alias': ['www.example.com']
   },
 112760496444: 
   {'dst': ['10.121.4.136'], 
    'src': ['1.2.3.4']
   },
 112760496502: 
   {'dst': ['10.122.195.34'], 
    'src': ['4.3.2.1']
   }
}

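# Named unique_keys to avoid shadowing the built-in filter().
# Note: dictionary keys are already unique, so this copies every record.
unique_keys = list(set(input_raw.keys()))

fixedlist = {}

for i in unique_keys:
    fixedlist[i] = input_raw[i]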
-1

You can use

set(dictionary) 

to solve your problem.

Bhargav Rao
Joao Vitor Deon
-3
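example = {
    'id1': {'name': 'jay', 'age': 22},
    'id2': {'name': 'salman', 'age': 52},
    'id3': {'name': 'Ranveer', 'age': 26},
    'id4': {'name': 'jay', 'age': 22},
}
# Iterate over snapshots of the keys so that deleting entries is safe.
for item in list(example):
    if item not in example:
        continue  # already removed as a duplicate of an earlier entry
    for value in list(example):
        if item != value and example[item] == example[value]:
            del example[value]
print "example", example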
Gianluca
  • 1
    Please format your answer with the `{}` button, format matters in Python. And it is a very bad idea to modify lists or dictionaries, while iterating over them. Very bad. – Mr. T Jan 31 '18 at 12:44
  • Welcome to StackOverflow: if you post code, XML or data samples, please highlight those lines in the text editor and click on the "code samples" button ( { } ) on the editor toolbar or using Ctrl+K on your keyboard to nicely format and syntax highlight it! – WhatsThePoint Jan 31 '18 at 12:58
  • ... and nested `for` loops in any scripting language is a bad idea... It is also time to move on from python 2.x – mirekphd Apr 20 '22 at 15:25