How do I compare two lists of dictionaries?

Question

Working on a problem comparing two lists of dictionaries,

a = [{"colA":"red", "colB":"red", "colC":1},{"colA":"grape", "colB":"orange", "colC":4},{"colA":"tan", "colB":"mustard", "colC":3}]  
b =  [{"colA":"red", "colB":"red", "colC":1},{"colA":"red", "colB":"red", "colC":1},{"colA":"red", "colB":"red", "colC":1, "colD": 3}]

1.) What's an efficient way to compare the two lists to see how many dictionaries in "a" match dictionaries in "b"? (I might have 1 million dictionaries in the list.)

2.) For a single list, how do I check how many duplicate dictionaries it contains?

using "in" isn't efficient for large lists, for my case I might have 1 million dictionaries within that list, but I would consider it if I can't find other options — Just an engineer, Aug 18 '22 at 14:29
@Justanengineer one way is to convert both to dataframe and then look for rows that are common between two — Himanshu Poddar, Aug 18 '22 at 14:37
I would like to use plain python for this, not sure if there's a more optimized way — Just an engineer, Aug 18 '22 at 14:39
If a dictionary is duplicated in `a` or `b` do you need to know how many times a duplicated dict appears in the other list (or the opposite)? — Ben, Aug 18 '22 at 14:41
Maybe this can help. https://stackoverflow.com/questions/9845369/comparing-2-lists-consisting-of-dictionaries-with-unique-keys-in-python — Albert_coding, Aug 18 '22 at 14:43
for #2, I just want to know how many times a duplicated dict appears in its own list — Just an engineer, Aug 18 '22 at 14:44
With such a large list you may need some _actual context_ to make it efficient. For example, can you retain some _order_ in that list, such that you could use [bisection](https://docs.python.org/3/library/bisect.html) to find a given element (or its absence) more quickly? — jonrsharpe, Aug 18 '22 at 14:44
so do you want an exact match or a partial match? e.g. if a dictionary in a has 3 keys and a dictionary in b has 4 keys, but all 3 keys match in b, do you count that one? — sahasrara62, Aug 18 '22 at 14:48
I would want both as separate answers if that makes sense, for my project I have to do partial match above a given % as well as a perfect match — Just an engineer, Aug 18 '22 at 14:51

kmontocam · Answer 1 · 2022-08-18T15:16:51.580

Python sets are a feasible way to solve this problem. Convert each list of dictionaries into a set of tuples (they have to be tuples, since the dict_items object that a dictionary's items() method returns is not hashable and so cannot be stored in a set):

set_a = {tuple(dict_.items()) for dict_ in a}
set_b = {tuple(dict_.items()) for dict_ in b}

To see the dictionaries of a that are in b (dictionaries in the form of a tuple of tuples):

set_a.intersection(set_b)

To check how many duplicates are within one list:

len(a) - len(set_a)

Sets do not store repeated entries, so if a contains any repeated item, the difference will be greater than 0.
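Two caveats apply to the snippets above: `tuple(dict_.items())` depends on key insertion order, and `len(a) - len(set_a)` only gives the number of surplus entries, not which dictionaries repeat. A minimal sketch that addresses both, assuming every dictionary value is hashable, uses `frozenset` keys with `collections.Counter` (the helper name `freeze` is made up for illustration):

```python
from collections import Counter


def freeze(d):
    # frozenset of (key, value) pairs: hashable and independent of key order.
    # Assumption: every value in the dict is itself hashable.
    return frozenset(d.items())


a = [{"colA": "red", "colB": "red", "colC": 1},
     {"colA": "grape", "colB": "orange", "colC": 4},
     {"colA": "tan", "colB": "mustard", "colC": 3}]
b = [{"colA": "red", "colB": "red", "colC": 1},
     {"colA": "red", "colB": "red", "colC": 1},
     {"colA": "red", "colB": "red", "colC": 1, "colD": 3}]

count_a = Counter(freeze(d) for d in a)
count_b = Counter(freeze(d) for d in b)

# Entries of a that have an exact match in b (repeats within a are counted).
matches_in_b = sum(n for key, n in count_a.items() if key in count_b)

# Dictionaries that occur more than once within b, with their counts.
duplicates_in_b = {key: n for key, n in count_b.items() if n > 1}

print(matches_in_b)
print(duplicates_in_b)
```

Building each `Counter` is a single pass over its list, so the whole comparison stays roughly linear in the combined size of the two lists.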

I guess `tuple(sorted(dict_.items()))` would be the way to make this invariant to dict key order. At quite some computational cost though. — jez, Aug 18 '22 at 17:29

score 1 · Answer 2 · answered Aug 18 '22 at 15:06

Maybe try this. It handles exact matches; for partial matches you would need to modify the dictionary-matching function.

a = [{"colA":"red", "colB":"red", "colC":1},{"colA":"grape", "colB":"orange", "colC":4},{"colA":"tan", "colB":"mustard", "colC":3}]  
b =  [{"colA":"red", "colB":"red", "colC":1},{"colA":"red", "colB":"red", "colC":1},{"colA":"red", "colB":"red", "colC":1, "colD": 3}]

def modify(data):
    # Count how often each dictionary occurs, keyed by a tuple of its
    # keys and values taken in sorted key order.
    result = {}
    for i in data:
        key = sorted(i.keys())
        values = []
        for k in key:
            values.extend([k, i[k]])
        values = tuple(values)
        if values not in result:
            result[values] = 0

        result[values] += 1
    return result


modified_a = modify(a)
modified_b = modify(b)

common = sum(min(modified_a[i], modified_b[i]) for i in modified_a if i in modified_b)
print(common)
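For the partial match mentioned above, one possible reading is the share of a dictionary's key/value pairs that also appear in a dictionary from the other list. The sketch below assumes that definition; `match_ratio`, `count_partial_matches`, and the 0.75 threshold are illustrative names and values, not something taken from the question.

```python
def match_ratio(d1, d2):
    # Fraction of d1's (key, value) pairs that are also present in d2.
    # An empty d1 is treated as a full match, which is an arbitrary choice.
    if not d1:
        return 1.0
    shared = sum(1 for k, v in d1.items() if k in d2 and d2[k] == v)
    return shared / len(d1)


def count_partial_matches(list_a, list_b, threshold):
    # Number of dicts in list_a that reach the threshold against at least
    # one dict in list_b. A threshold of 1.0 means "d1 is contained in d2",
    # which is weaker than exact equality.
    return sum(
        1 for d1 in list_a
        if any(match_ratio(d1, d2) >= threshold for d2 in list_b)
    )


a = [{"colA": "red", "colB": "red", "colC": 1},
     {"colA": "grape", "colB": "orange", "colC": 4},
     {"colA": "tan", "colB": "mustard", "colC": 3}]
b = [{"colA": "red", "colB": "red", "colC": 1},
     {"colA": "red", "colB": "red", "colC": 1},
     {"colA": "red", "colB": "red", "colC": 1, "colD": 3}]

print(count_partial_matches(a, b, 1.0))
print(count_partial_matches(a, b, 0.75))
```

This compares every pair, so it is O(N*M). For lists with a million entries it would likely need some pre-grouping first, for example bucketing the dictionaries by one key they are required to share.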

score 0 · Answer 3 · answered Aug 18 '22 at 14:45

Based on the information given, here's an answer (albeit primitive) that I put together.

a = [
        { "colA": "red", "colB": "red", "colC": 1 },
        { "colA": "grape", "colB": "orange", "colC": 4 },
        { "colA": "tan", "colB": "mustard", "colC": 3 }
    ]  
b =  [
        { "colA": "red", "colB": "red", "colC": 1 },
        { "colA": "red", "colB": "red", "colC": 1 },
        { "colA": "red", "colB": "red", "colC": 1, "colD": 3}
    ]

a_to_b_matches: list = []
for entry in a:
    if entry in b:
        a_to_b_matches.append(entry)

a_list_dict_duplicates: list = []
a_temp: list = []
for entry in a:
    if entry in a_temp:
        a_list_dict_duplicates.append(entry)
    else:
        a_temp.append(entry)

b_list_dict_duplicates: list = []
b_temp: list = []
for entry in b:
    if entry in b_temp:
        b_list_dict_duplicates.append(entry)
    else:
        b_temp.append(entry)

Using `in` to search for an element in a `list` is an O(N) operation, so getting `a_to_b_matches` takes O(N*M) time here, where N is the size of a and M is the size of b — sahasrara62, Aug 18 '22 at 14:52
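The quadratic cost described in the comment comes from scanning a list on every lookup. A rough sketch of the same loops with set lookups instead, assuming all dictionary values are hashable (`as_key` and `find_duplicates` are hypothetical helpers):

```python
def as_key(d):
    # Hashable, key-order-independent stand-in for a dict.
    # Assumption: every value in the dict is hashable.
    return frozenset(d.items())


def find_duplicates(entries):
    # Return each entry that repeats an earlier one, in encounter order.
    seen = set()
    duplicates = []
    for d in entries:
        key = as_key(d)
        if key in seen:
            duplicates.append(d)
        else:
            seen.add(key)
    return duplicates


a = [{"colA": "red", "colB": "red", "colC": 1},
     {"colA": "grape", "colB": "orange", "colC": 4},
     {"colA": "tan", "colB": "mustard", "colC": 3}]
b = [{"colA": "red", "colB": "red", "colC": 1},
     {"colA": "red", "colB": "red", "colC": 1},
     {"colA": "red", "colB": "red", "colC": 1, "colD": 3}]

# Build the lookup set once, then each membership test is O(1) on average.
b_keys = {as_key(d) for d in b}
a_to_b_matches = [d for d in a if as_key(d) in b_keys]

print(a_to_b_matches)
print(find_duplicates(b))
```

The results are still plain lists of the original dictionaries, so the output shape matches the loops above.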

score 0 · Answer 4 · answered Aug 18 '22 at 14:54

If your data is very large, I think using pandas is a good idea:

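import pandas as pd

df_a = pd.DataFrame(a)
df_b = pd.DataFrame(b)
# Compare only the columns that both frames share
cols = list(set(df_a.columns.values) & set(df_b.columns.values))
# Boolean Series: True for each row of df_a whose values on cols also occur in df_b
df_a[cols].apply(tuple, axis=1).isin(df_b[cols].apply(tuple, axis=1))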

MoRe


If I do use pandas and something is null, it comes through as NaN, which turns the values into floats, and I can't have that for my problem – Just an engineer Aug 18 '22 at 14:56
What if the order of keys is different? This will fail – sahasrara62 Aug 18 '22 at 15:00

score 0 · Answer 5 · answered Aug 18 '22 at 15:22

def to_list_of_lists(lst):
    return list(map(lambda el:[el], lst))

a = [{"colA":"red", "colB":"red", "colC":1},
     {"colA":"grape", "colB":"orange", "colC":4}, 
     {"colA":"tan", "colB":"mustard", "colC":3}]  

b =  [{"colA":"red", "colB":"red", "colC":1},
      {"colA":"red", "colB":"red", "colC":1},
      {"colA":"red", "colB":"red", "colC":1, "colD": 3}] 

a_list_of_lists = to_list_of_lists(a)
b_list_of_lists = to_list_of_lists(b)

Find the items of list a that also appear in list b:

duplicates = [i for i in a_list_of_lists if i in b_list_of_lists]
print(duplicates)

Output:

[[{'colA': 'red', 'colB': 'red', 'colC': 1}]]

Find the number of occurrences of those duplicated dictionaries:

occurrences = {}
for i in range(len(duplicates)):
    occurrences[str(duplicates[i])] =  b_list_of_lists.count(duplicates[i])
    
print(occurrences)

Output: The following dictionary is duplicated two times in list b

{"[{'colA': 'red', 'colB': 'red', 'colC': 1}]": 2}
