10

I have a list of dicts in which the same dict is repeated multiple times, and I would like to remove the duplicates.

My list:

te = [
      {
        "Name": "Bala",
        "phone": "None"
      },
      {
        "Name": "Bala",
        "phone": "None"
      },
      {
        "Name": "Bala",
        "phone": "None"
      },
      {
        "Name": "Bala",
        "phone": "None"
      }
    ]

My function to remove duplicate values:

def removeduplicate(it):
    seen = set()
    for x in it:
        if x not in seen:
            yield x
            seen.add(x)

When I call this function I get a generator object:

<generator object removeduplicate at 0x0170B6E8>

When I try to iterate over the generator I get `TypeError: unhashable type: 'dict'`.

Is there a way to remove the duplicate values, or to iterate over the generator?

Tony Roczz
  • You cannot add a dictionary to a set, for things to be added to a set they must be hashable. – Panda Nov 27 '15 at 10:35
  • As a side note: this is not a "list of JSON objects", it's a list of dicts. __There's no such thing as a JSON object__ - JSON is a text format, not a type of objects... – bruno desthuilliers Nov 27 '15 at 10:51

4 Answers

36

You can remove the duplicates with a dictionary comprehension keyed on "Name", since a dictionary cannot hold duplicate keys:

te = [
      {
        "Name": "Bala",
        "phone": "None"
      },
      {
        "Name": "Bala",
        "phone": "None"
      },
      {
        "Name": "Bala",
        "phone": "None"
      },
      {
        "Name": "Bala",
        "phone": "None"
      },
      {
          "Name": "Bala1",
          "phone": "None"
      }      
    ]

unique = list({each['Name']: each for each in te}.values())

print(unique)

Output (Python 3.7+, where dicts keep insertion order):

[{'Name': 'Bala', 'phone': 'None'}, {'Name': 'Bala1', 'phone': 'None'}]
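
Keying on "Name" alone treats any two dicts with the same name as duplicates, even when their other values differ. If exact duplicates are what should be removed, one possible variation is to key on the whole content instead. The following is only a sketch: the helper name `unique_by_content` is made up for illustration, and it assumes every value in the dicts is hashable.

```python
def unique_by_content(dicts):
    """Return the dicts with exact duplicates removed, keeping first occurrences."""
    by_content = {}
    for d in dicts:
        # A frozenset of the items is hashable and ignores key order.
        # Assumption: all values in d are themselves hashable.
        by_content.setdefault(frozenset(d.items()), d)
    return list(by_content.values())


te = [
    {"Name": "Bala", "phone": "None"},
    {"Name": "Bala", "phone": "None"},
    {"Name": "Bala", "phone": "1234"},
    {"Name": "Bala1", "phone": "None"},
]

result = unique_by_content(te)
print(result)
```

Here the two "Bala" entries with different phone values are both kept, and only the exact repeat is dropped.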
Learner
  • Really nice, I'll keep that in my back pocket. OTOH please note this is not exactly the same as the OP's function, as he's checking the full dict; in your case you'll discard any dict that has the same Name, whether it is different or not. – Thomas Guyot-Sionnest Nov 27 '15 at 10:43
  • Actually, after testing, this would be more like it: `unique = { repr(each): each for each in te }.values()` – Thomas Guyot-Sionnest Nov 27 '15 at 10:48
  • The OP has accepted it, but I am not sure that this answer is correct considering that it replaces (from list `te`) previous dicts with later dicts, i.e. it loses data. E.g. if `te` contained another dict `{'Name': 'Bala', 'phone': '1234'}`, only the last item in `te` with name `Bala` will be retained. – mhawke Nov 27 '15 at 11:22
7

This happens because you can't add a dict to a set. From this question:

You're trying to use a dict as a key to another dict or in a set. That does not work because the keys have to be hashable.

As a general rule, only immutable objects (strings, integers, floats, frozensets, tuples of immutables) are hashable (though exceptions are possible).

>>> foo = dict()
>>> bar = set()
>>> bar.add(foo)
Traceback (most recent call last):
  File "<input>", line 1, in <module>
TypeError: unhashable type: 'dict'
>>> 
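
The rule of thumb quoted above can be checked with the built-in `hash()`. This is a small sketch; `is_hashable` is a hypothetical helper written for this illustration, not a standard function.

```python
def is_hashable(obj):
    """Return True if hash(obj) succeeds, False if it raises TypeError."""
    try:
        hash(obj)
    except TypeError:
        return False
    return True


checks = {
    "str": is_hashable("Bala"),
    "tuple of immutables": is_hashable(("Bala", "None")),
    "frozenset": is_hashable(frozenset({1, 2})),
    "dict": is_hashable({"Name": "Bala"}),
    "list": is_hashable(["Bala"]),
    # A tuple is only hashable if everything inside it is hashable.
    "tuple containing a list": is_hashable(("Bala", ["None"])),
}

for label, ok in checks.items():
    print(label, "->", ok)
```

The first three report True and the last three report False, which is why a dict cannot be a set member or a dict key.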

Since you're already testing `if x not in seen`, just use a list instead of a set:

>>> te = [
...       {
...         "Name": "Bala",
...         "phone": "None"
...       },
...       {
...         "Name": "Bala",
...         "phone": "None"
...       },
...       {
...         "Name": "Bala",
...         "phone": "None"
...       },
...       {
...         "Name": "Bala",
...         "phone": "None"
...       }
...     ]

>>> def removeduplicate(it):
...     seen = []
...     for x in it:
...         if x not in seen:
...             yield x
...             seen.append(x)

>>> removeduplicate(te)
<generator object removeduplicate at 0x7f3578c71ca8>

>>> list(removeduplicate(te))
[{'phone': 'None', 'Name': 'Bala'}]
>>> 
Remi Guan
1

You can still use a set for duplicate detection; you just need to convert each dictionary into something hashable, such as a tuple. Your dictionaries can be converted to tuples with tuple(d.items()), where d is a dictionary. Applying that to your generator function:

def removeduplicate(it):
    seen = set()
    for x in it:
        t = tuple(x.items())
        if t not in seen:
            yield x
            seen.add(t)

>>> for d in removeduplicate(te):
...    print(d)
{'phone': 'None', 'Name': 'Bala'}

>>> te.append({'Name': 'Bala', 'phone': '1234567890'})
>>> te.append({'Name': 'Someone', 'phone': '1234567890'})

>>> for d in removeduplicate(te):
...    print(d)
{'phone': 'None', 'Name': 'Bala'}
{'phone': '1234567890', 'Name': 'Bala'}
{'phone': '1234567890', 'Name': 'Someone'}

This provides faster lookup (O(1) on average) than a "seen" list (O(n)). Whether it is worth the extra cost of converting every dict into a tuple depends on how many dictionaries you have and how many of them are distinct. If there are many distinct dicts, a "seen" list will grow quite large, and testing whether a dict has already been seen could become an expensive operation. This might justify the tuple conversion, but you would have to test/profile it.
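
One caveat worth keeping in mind: tuple(x.items()) follows the order in which the keys were inserted, so on Python 3.7+ two equal dicts built in a different key order produce different tuples and would not be detected as duplicates. A possible variation, sketched below, sorts the items first. The name `removeduplicate_sorted` is made up for this example, and the sketch assumes the keys can be compared with each other (e.g. all strings) and that all values are hashable.

```python
def removeduplicate_sorted(it):
    """Yield each distinct dict once, regardless of key insertion order."""
    seen = set()
    for x in it:
        # Sorting the (key, value) pairs gives a canonical, hashable form.
        # Assumption: keys are mutually comparable and values are hashable.
        key = tuple(sorted(x.items()))
        if key not in seen:
            seen.add(key)
            yield x


a = {"Name": "Bala", "phone": "None"}
b = {"phone": "None", "Name": "Bala"}  # equal to a, different insertion order

# On Python 3.7+ the unsorted tuples differ even though a == b.
plain_keys_match = tuple(a.items()) == tuple(b.items())

result = list(removeduplicate_sorted([a, b]))
print(result)
```

frozenset(x.items()) would be another order-independent choice and avoids the sort; which one is faster in practice would again need profiling.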

mhawke
1

I just serialize each item to JSON and compare the MD5 hashes.

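import hashlib
import json

filtered_json = []
md5_seen = set()  # a set gives O(1) average membership tests

for item in json_fin:  # json_fin is the input list of dicts
    # sort_keys=True makes dicts with equal content serialize identically
    serialized = json.dumps(item, sort_keys=True, separators=(',', ':'))
    md5_result = hashlib.md5(serialized.encode("utf-8")).hexdigest()
    if md5_result not in md5_seen:
        md5_seen.add(md5_result)
        filtered_json.append(item)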
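
A related sketch, under the assumption that every item is JSON-serializable: the serialized string is itself hashable, so it can serve as the set member directly and the MD5 step becomes optional (hashing mainly saves memory when items are large). Unlike the tuple-based approaches, this also copes with nested lists or dicts as values. The names `dedupe_by_json` and `records` are made up for this illustration.

```python
import json


def dedupe_by_json(items):
    """Return items with duplicates removed, comparing canonical JSON text."""
    seen = set()
    result = []
    for item in items:
        # sort_keys=True gives the same text for dicts with equal content,
        # whatever order their keys were inserted in.
        key = json.dumps(item, sort_keys=True)
        if key not in seen:
            seen.add(key)
            result.append(item)
    return result


records = [
    {"Name": "Bala", "phones": ["123", "456"]},
    {"phones": ["123", "456"], "Name": "Bala"},  # same content, other key order
    {"Name": "Bala", "phones": ["456", "123"]},  # list order differs: distinct
]

unique_records = dedupe_by_json(records)
print(len(unique_records))
```

The second record is dropped as a duplicate of the first, while the third is kept because list order is significant in JSON.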
Benny