1

Working on a problem comparing two lists of dictionaries,

a = [{"colA":"red", "colB":"red", "colC":1},{"colA":"grape", "colB":"orange", "colC":4},{"colA":"tan", "colB":"mustard", "colC":3}]  
b =  [{"colA":"red", "colB":"red", "colC":1},{"colA":"red", "colB":"red", "colC":1},{"colA":"red", "colB":"red", "colC":1, "colD": 3}] 

1.) What's an efficient way to compare the two lists and count how many dictionaries in "a" match dictionaries in "b"? (I might have 1 million dictionaries in the list.)

2.) For a single list, how can I count how many duplicate dictionaries it contains?

  • How about `for d in a: if d in b: do_something()`? – Fiddling Bits Aug 18 '22 at 14:26
  • Using "in" isn't efficient for large lists. In my case I might have 1 million dictionaries in that list, but I would consider it if I can't find other options. – Just an engineer Aug 18 '22 at 14:29
  • @Justanengineer One way is to convert both to dataframes and then look for rows that are common between the two. – Himanshu Poddar Aug 18 '22 at 14:37
  • I would like to use plain Python for this; I'm not sure if there's a more optimized way. – Just an engineer Aug 18 '22 at 14:39
  • If a dictionary is duplicated in `a` or `b`, do you need to know how many times a duplicated dict appears in the other list (or the opposite)? – Ben Aug 18 '22 at 14:41
  • Maybe this can help. https://stackoverflow.com/questions/9845369/comparing-2-lists-consisting-of-dictionaries-with-unique-keys-in-python – Albert_coding Aug 18 '22 at 14:43
  • For #2, I just want to know how many times a duplicated dict appears in its own list. – Just an engineer Aug 18 '22 at 14:44
  • With such a large list you may need some _actual context_ to make it efficient. For example, can you retain some _order_ in that list, such that you could use [bisection](https://docs.python.org/3/library/bisect.html) to find a given element (or its absence) more quickly? – jonrsharpe Aug 18 '22 at 14:44
  • So do you want an exact match or a partial match? E.g. if a dictionary in a has 3 keys and a dictionary in b has 4 keys, but all 3 keys match in b, do you count that one? – sahasrara62 Aug 18 '22 at 14:48
  • I would want both as separate answers, if that makes sense. For my project I have to do a partial match above a given % as well as a perfect match. – Just an engineer Aug 18 '22 at 14:51

5 Answers

2

Python sets are a feasible way to solve this problem. Convert each list of dictionaries into a set of tuples. Tuples are required because a set can only hold hashable items, and the dict_items object returned by a dictionary's items() method is not hashable. This also assumes that every value in the dictionaries is hashable.

set_a = {tuple(dict_.items()) for dict_ in a}
set_b = {tuple(dict_.items()) for dict_ in b}

To see the dictionaries of a that are in b (dictionaries in the form of a tuple of tuples):

set_a.intersection(set_b)

To check how many duplicates are within one list:

len(a) - len(set_a)

Sets do not store repeated entries, so if a contains any repeated item, the difference will be greater than 0.
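If the per-dictionary counts matter, or equal dictionaries may list their keys in a different order, a `collections.Counter` keyed on a `frozenset` of the items is one possible variation on the set idea. This is only a sketch: `freeze` is a hypothetical helper name, and it assumes every value in the dictionaries is hashable.

```python
from collections import Counter


def freeze(d):
    # Hypothetical helper: a hashable, key-order-independent form of a dict.
    # Assumes all values are hashable.
    return frozenset(d.items())


a = [{"colA": "red", "colB": "red", "colC": 1},
     {"colA": "grape", "colB": "orange", "colC": 4},
     {"colA": "tan", "colB": "mustard", "colC": 3}]
b = [{"colA": "red", "colB": "red", "colC": 1},
     {"colA": "red", "colB": "red", "colC": 1},
     {"colA": "red", "colB": "red", "colC": 1, "colD": 3}]

count_a = Counter(freeze(d) for d in a)
count_b = Counter(freeze(d) for d in b)

# How many dictionaries of a also occur (exactly) somewhere in b
matches_in_b = sum(n for key, n in count_a.items() if key in count_b)

# How many surplus copies each list contains
duplicates_in_a = sum(n - 1 for n in count_a.values())
duplicates_in_b = sum(n - 1 for n in count_b.values())

print(matches_in_b, duplicates_in_a, duplicates_in_b)
```

Each list is traversed once and every lookup is a hash lookup, so the work grows roughly linearly with the list sizes.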

kmontocam
  • I guess `tuple(sorted(dict_.items()))` would be the way to make this invariant to dict key order. At quite some computational cost though. – jez Aug 18 '22 at 17:29
1

Maybe try this. It handles exact matches; for partial matches you would need to modify the dictionary-matching function.

a = [{"colA":"red", "colB":"red", "colC":1},{"colA":"grape", "colB":"orange", "colC":4},{"colA":"tan", "colB":"mustard", "colC":3}]  
b =  [{"colA":"red", "colB":"red", "colC":1},{"colA":"red", "colB":"red", "colC":1},{"colA":"red", "colB":"red", "colC":1, "colD": 3}]

def modify(data):
    """Count how often each dictionary occurs in data."""
    result = {}
    for i in data:
        # Sort the keys so that key order does not affect the comparison.
        values = []
        for k in sorted(i.keys()):
            values.extend([k, i[k]])
        values = tuple(values)
        if values not in result:
            result[values] = 0

        result[values] += 1
    return result


modified_a = modify(a)
modified_b = modify(b)

# A dictionary of a can only be matched as often as it occurs in b.
common = sum(min(modified_a[i], modified_b[i]) for i in modified_a if i in modified_b)
print(common)
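For the partial-match case, one possible matching function is sketched below. The definition of the percentage is an assumption, since the question does not pin it down: here it is the share of key-value pairs of the first dictionary that also appear in the second. `match_ratio` and `count_partial_matches` are hypothetical names, and the values must be hashable for the set operation on `items()` to work.

```python
def match_ratio(d1, d2):
    """Fraction of d1's key-value pairs that also appear in d2."""
    if not d1:
        return 0.0
    # dict item views support set intersection when the values are hashable
    return len(d1.items() & d2.items()) / len(d1)


def count_partial_matches(a, b, threshold):
    """Count dicts in a that reach the threshold against at least one dict in b."""
    return sum(
        1 for d1 in a
        if any(match_ratio(d1, d2) >= threshold for d2 in b)
    )


b = [{"colA": "red", "colB": "red", "colC": 1},
     {"colA": "red", "colB": "red", "colC": 1},
     {"colA": "red", "colB": "red", "colC": 1, "colD": 3}]

# Two of the three key-value pairs agree with the dictionaries in b
query = {"colA": "red", "colB": "blue", "colC": 1}
print(count_partial_matches([query], b, 0.6))
```

Unlike the exact-match count, this compares every pair of dictionaries, so it is O(N*M) and would likely need some pre-filtering (for example, grouping by one key) to be practical for a million entries.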
sahasrara62
0

Based on the information given, here's an answer (albeit primitive) that I put together.

a = [
        { "colA": "red", "colB": "red", "colC": 1 },
        { "colA": "grape", "colB": "orange", "colC": 4 },
        { "colA": "tan", "colB": "mustard", "colC": 3 }
    ]  
b =  [
        { "colA": "red", "colB": "red", "colC": 1 },
        { "colA": "red", "colB": "red", "colC": 1 },
        { "colA": "red", "colB": "red", "colC": 1, "colD": 3}
    ]

# Dictionaries of a that also appear in b
a_to_b_matches: list = []
for entry in a:
    if entry in b:
        a_to_b_matches.append(entry)

# Dictionaries that appear more than once in a
a_list_dict_duplicates: list = []
a_temp: list = []
for entry in a:
    if entry in a_temp:
        a_list_dict_duplicates.append(entry)
    else:
        a_temp.append(entry)

# Dictionaries that appear more than once in b
b_list_dict_duplicates: list = []
b_temp: list = []
for entry in b:
    if entry in b_temp:
        b_list_dict_duplicates.append(entry)
    else:
        b_temp.append(entry)
TannerWA
  • Using `in` to search for an element in a `list` is an O(N) operation, so computing `a_to_b_matches` takes O(N*M) time here, where N is the size of a and M is the size of b. – sahasrara62 Aug 18 '22 at 14:52
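The quadratic cost raised in the comment can be avoided by hashing each dictionary once and testing membership against a set. The sketch below keeps the same outputs as the answer (lists of the original dictionaries); `freeze` and `find_duplicates` are hypothetical helper names, and all values are assumed to be hashable.

```python
def freeze(d):
    # Hypothetical helper: hashable stand-in for a dict (values must be hashable)
    return frozenset(d.items())


def find_duplicates(dicts):
    """Return every dict that repeats an earlier entry of the same list."""
    seen = set()
    duplicates = []
    for d in dicts:
        key = freeze(d)
        if key in seen:
            duplicates.append(d)
        else:
            seen.add(key)
    return duplicates


a = [{"colA": "red", "colB": "red", "colC": 1},
     {"colA": "grape", "colB": "orange", "colC": 4},
     {"colA": "tan", "colB": "mustard", "colC": 3}]
b = [{"colA": "red", "colB": "red", "colC": 1},
     {"colA": "red", "colB": "red", "colC": 1},
     {"colA": "red", "colB": "red", "colC": 1, "colD": 3}]

# Build the lookup set for b once, then scan a with O(1) average lookups
keys_in_b = {freeze(d) for d in b}
a_to_b_matches = [d for d in a if freeze(d) in keys_in_b]

a_list_dict_duplicates = find_duplicates(a)
b_list_dict_duplicates = find_duplicates(b)
```

With this layout the matching step costs roughly O(N + M) instead of O(N*M), at the price of holding one frozen key per distinct dictionary in memory.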
0

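If your data is very large, I think pandas is a good option. This compares the rows on the columns the two lists have in common:

import pandas as pd

df_a = pd.DataFrame(a)
df_b = pd.DataFrame(b)
cols = list(set(df_a.columns.values) & set(df_b.columns.values))
# Boolean Series: True for each row of a that also appears in b
df_a[cols].apply(tuple, axis=1).isin(df_b[cols].apply(tuple, axis=1))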
MoRe
0
def to_list_of_lists(lst):
    return list(map(lambda el:[el], lst))

a = [{"colA":"red", "colB":"red", "colC":1},
     {"colA":"grape", "colB":"orange", "colC":4}, 
     {"colA":"tan", "colB":"mustard", "colC":3}]  

b =  [{"colA":"red", "colB":"red", "colC":1},
      {"colA":"red", "colB":"red", "colC":1},
      {"colA":"red", "colB":"red", "colC":1, "colD": 3}] 

a_list_of_lists = to_list_of_lists(a)
b_list_of_lists = to_list_of_lists(b)

Find the dictionaries of list a that also appear in list b:

duplicates = [i for i in a_list_of_lists if i in b_list_of_lists]
print(duplicates)

Output:

[[{'colA': 'red', 'colB': 'red', 'colC': 1}]]

Find the number of occurrences of those duplicated dictionaries:

occurrences = {}
for dup in duplicates:
    occurrences[str(dup)] = b_list_of_lists.count(dup)

print(occurrences)

Output (the following dictionary appears twice in list b):

{"[{'colA': 'red', 'colB': 'red', 'colC': 1}]": 2}
Ali