Comparing three (or more) dictionaries and finding a match if at least two are equal

Question

I'm faced with an issue similar to this one. However, that SO question is strictly focused on three variables. I am looking for a solution that would work for more than three as well.

Here's my code for two variables:

for track_a in collection_a:
    for track_b in collection_b:

        t1 = track_a["tempo"]
        t2 = track_b["tempo"]
        k1 = track_a["key"]
        k2 = track_b["key"]
        m1 = track_a["mode"]
        m2 = track_b["mode"]

        if (t1 == t2) and (k1 == k2) and (m1 == m2):
            collection_c.append((track_a, track_b))

Here's my solution for three variables:

for track_a in collection_a:
    for track_b in collection_b:
        for track_c in collection_c:

            t1 = track_a["tempo"]
            t2 = track_b["tempo"]
            t3 = track_c["tempo"]
            k1 = track_a["key"]
            k2 = track_b["key"]
            k3 = track_c["key"]
            m1 = track_a["mode"]
            m2 = track_b["mode"]
            m3 = track_c["mode"]

            a = (t1 == t2) and (k1 == k2) and (m1 == m2)
            b = (t2 == t3) and (k2 == k3) and (m2 == m3)
            c = (t3 == t1) and (k3 == k1) and (m3 == m1)

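            # Collect results in a separate list (here called matches):
            # appending to collection_c while iterating over it is unsafe.
            if a: matches.append((track_a, track_b))
            if b: matches.append((track_b, track_c))
            if c: matches.append((track_c, track_a))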

Obviously, this solution is neither scalable nor fast. Since every combination has to be checked, I doubt it can ever be fast, but could I at least make it scale (to at least 5 collections)? If possible, it should also allow more comparison characteristics to be added later.

do you want to append matches to a completely different list, is my understanding? — gold_cy, Feb 11 '19 at 20:10
I don't see how this problem is similar to the one you linked to. You seem to want to match pairs across _n_ collections, but I think that your triple loop does the same as running your double loop for (a, b), then (b, c) and then (c, a), since you never match all three against each other. — M Oehm, Feb 11 '19 at 20:10
@aws_apprentice Yes. The new list is a combination of matching pairs. — Alex Osheter, Feb 11 '19 at 21:24

blhsing · Accepted Answer · 2019-02-11T21:09:44.900

An efficient approach that solves the problem in linear time is to convert each dict to a frozen set of key-value tuples (over the keys used for the equality test). Frozen sets are hashable, so they can serve as dict keys (signatures), and a dict of sets can then group the tracks by signature:

groups = {}
for track in collections: # collections is all of your collections concatenated into one iterable of tracks
    groups.setdefault(frozenset((k, track[k]) for k in ('tempo', 'key', 'mode')), set()).add(track['name'])

so that:

[group for group in groups.values() if len(group) >= 3]

will return a list of sets of track names, one set for each signature shared by at least 3 tracks. Change the threshold to 2 to report any pair of matching tracks.
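
As a minimal, self-contained sketch of the snippet above, the grouping can be run end to end as follows. The helper name `group_by_signature`, the `min_size` parameter, and the sample tracks are made up for illustration, and each track is assumed to carry a unique `name` and hashable values:

```python
def group_by_signature(tracks, keys=('tempo', 'key', 'mode'), min_size=2):
    """Group track names by the values of `keys`; keep groups of at least `min_size`."""
    groups = {}
    for track in tracks:
        # The frozenset of (key, value) pairs is hashable, so it can serve as a dict key.
        signature = frozenset((k, track[k]) for k in keys)
        groups.setdefault(signature, set()).add(track['name'])
    return [group for group in groups.values() if len(group) >= min_size]


# Hypothetical sample data: all collections concatenated into one list.
all_tracks = [
    {'name': 'a1', 'tempo': 120, 'key': 'A', 'mode': 'minor'},
    {'name': 'b1', 'tempo': 120, 'key': 'A', 'mode': 'minor'},
    {'name': 'b2', 'tempo': 90, 'key': 'C', 'mode': 'major'},
    {'name': 'c1', 'tempo': 120, 'key': 'A', 'mode': 'minor'},
    {'name': 'c2', 'tempo': 90, 'key': 'C', 'mode': 'major'},
    {'name': 'c3', 'tempo': 100, 'key': 'D', 'mode': 'major'},
]

matches = group_by_signature(all_tracks)
print(matches)
```

Adding another comparison characteristic only means passing a longer `keys` tuple. Note that this sketch also groups tracks that come from the same collection, which the nested loops in the question do not.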

edited Feb 11 '19 at 21:09

answered Feb 11 '19 at 20:26

This is straight up magic. Is there any resource where I can find more info on the topic? (Why did you decide to use frozen sets, how this idea even came to use the signature as a key, etc). I'm marking this as the correct answer because it's short, sweet, scalable, and offers plenty of information/theory. – Alex Osheter Feb 11 '19 at 22:24
Glad to be of help. It's a common practice to use reasonably unique hash keys to represent complex data structures so that their contents can be efficiently compared to one another. In Python, mutable objects such as dicts and sets are not hashable, while immutable objects such as tuples and frozen sets are, so the workaround is to convert dicts into frozen sets of tuples of key-value pairs so that they can be hashable and used as dict keys. – blhsing Feb 11 '19 at 23:29
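
The hashability point in the comment above can be checked directly. This is a small illustrative sketch with a made-up track; it only shows why a dict cannot be a dict key while a frozen set of its items can:

```python
track = {'tempo': 120, 'key': 'A', 'mode': 'minor'}

# A dict is mutable and therefore unhashable: using it as a dict key fails.
try:
    {track: 'x'}
    dict_is_hashable = True
except TypeError:
    dict_is_hashable = False

# A frozenset of (key, value) tuples is immutable and hashable.
signature = frozenset(track.items())
# Same content, different insertion order: the frozensets still compare equal.
same_content = frozenset({'mode': 'minor', 'key': 'A', 'tempo': 120}.items())

lookup = {signature: 'group 1'}
print(dict_is_hashable)           # False
print(signature == same_content)  # True
print(lookup[same_content])       # group 1
```

Because equal frozen sets hash equally, two tracks with the same values land in the same dict slot regardless of the order in which their keys were inserted.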

score 0 · Answer 2 · answered Feb 11 '19 at 20:23

Here is a logically scalable solution that, for n dictionaries compared on m values, takes O(n*m) time to evaluate.

Note that if three tracks match, I return one group of 3. It is easy enough to then expand that into all matching pairs, but doing so can return something of size n*n. I show what both look like.

def group_on(variables, *tracks):
    # Build a trie first.
    trie = {}
    for track in tracks:
        this_path = trie
        for variable in variables:
            value = track[variable]
            if value not in this_path:
                this_path[value] = {}
            this_path = this_path[value]
        if 'final' not in this_path:
            this_path['final'] = [track]
        else:
            this_path['final'].append(track)

    def find_groups(this_path, count):
        if 0 == count:
            if 1 < len(this_path['final']):
                yield this_path['final']
        else:
            for next_path in this_path.values():
                for group in find_groups(next_path, count-1):
                    yield group

    for group in find_groups(trie, len(variables)):
        yield group

def group_to_pairs(group):
    for i in range(len(group)-1):
        for j in range(i+1, len(group)):
            yield (group[i], group[j])

print('Efficient version')

for group in group_on(['tempo', 'key', 'mode'],
        {'track': 1, 'tempo': 1, 'key': 'A', 'mode': 'minor'},
        {'track': 2, 'tempo': 1, 'key': 'A', 'mode': 'major'},
        {'track': 3, 'tempo': 1, 'key': 'A', 'mode': 'minor'},
        {'track': 4, 'tempo': 1, 'key': 'A', 'mode': 'major'},
        {'track': 5, 'tempo': 1, 'key': 'A', 'mode': 'minor'},
        ):
    print(group)

print('Versus')

for group in group_on(['tempo', 'key', 'mode'],
        {'track': 1, 'tempo': 1, 'key': 'A', 'mode': 'minor'},
        {'track': 2, 'tempo': 1, 'key': 'A', 'mode': 'major'},
        {'track': 3, 'tempo': 1, 'key': 'A', 'mode': 'minor'},
        {'track': 4, 'tempo': 1, 'key': 'A', 'mode': 'major'},
        {'track': 5, 'tempo': 1, 'key': 'A', 'mode': 'minor'},
        ):
    for pair in group_to_pairs(group):
        print(pair)
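
As a side note, the pair expansion can also be sketched with `itertools.combinations`, which yields each unordered pair exactly once. The function name `pairs_from_group` and the string stand-ins for tracks are hypothetical; a group of k items produces k*(k-1)/2 pairs, which is where the quadratic blow-up comes from:

```python
from itertools import combinations


def pairs_from_group(group):
    # combinations(group, 2) yields every unordered pair once, in index order.
    return combinations(group, 2)


group = ['t1', 't3', 't5']  # stand-in for one group of matching tracks
pairs = list(pairs_from_group(group))
print(pairs)
```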

score 0 · Answer 3 · answered Feb 11 '19 at 20:32

You may find something useful in itertools; I'm not sure whether this is exactly what you want:

from itertools import product, combinations

all_collections = [collection_a, collection_b, collection_c] # d, e, f, ...
for collections in combinations(all_collections, 2):         # Pick 2 (or any number) collections from all collections
    for tracks in product(*collections):                     # Cartesian product of collections or equivalent to for track1 in collection1: for track2 in collection2: ...
        if True:                                             # check if all tracks are matched
            print(*tracks)                                   # or append them to another collection
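
One way to fill in the `if True:` placeholder is sketched below. The helper `all_match`, the `KEYS` tuple, and the sample collections are assumptions for illustration, not part of the answer as posted:

```python
from itertools import combinations, product

KEYS = ('tempo', 'key', 'mode')  # assumed comparison keys, taken from the question


def all_match(tracks, keys=KEYS):
    """Return True if every track has the same values for all `keys`."""
    first = tracks[0]
    return all(track[k] == first[k] for track in tracks[1:] for k in keys)


# Hypothetical sample collections.
collection_a = [{'name': 'a1', 'tempo': 120, 'key': 'A', 'mode': 'minor'}]
collection_b = [{'name': 'b1', 'tempo': 120, 'key': 'A', 'mode': 'minor'},
                {'name': 'b2', 'tempo': 90, 'key': 'C', 'mode': 'major'}]
collection_c = [{'name': 'c1', 'tempo': 90, 'key': 'C', 'mode': 'major'}]

matches = []
for pair_of_collections in combinations([collection_a, collection_b, collection_c], 2):
    for tracks in product(*pair_of_collections):
        if all_match(tracks):
            matches.append(tuple(t['name'] for t in tracks))

print(matches)
```

This still compares every track of one collection with every track of the other, so it generalizes the nested loops to any number of collections without making them faster.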
