Using Python, what is the fastest way to compare two large dictionaries while returning keys (as frozensets) that match past a certain threshold?

Question

Assume I have two very large dictionaries: bigDictA and bigDictB, so something like below.

bigDictA = {frozenset("a1","b2","c3"): [some floats in a list], ...}
bigDictB = {frozenset("a1","b2"): [some floats in a list], ...}

Now, the algorithm I wrote which I need help optimizing looks like this:

setOfMatches = set()
for bigDictAkey, bigDictAval in bigDictA.items():
    for bigDictBkey, bigDictBval in bigDictB.items():
        itemsInCommon = list(frozenset.intersection(bigDictAkey,bigDictBkey))
        numberOfItemsInCommon = len(itemsInCommon)
        valForComparison = THRESHOLD*float(len(list(bigDictAkey)))
        if (numberOfItemsInCommon >= valForComparison):
            setOfMatches.add(bigDictAkey)

So if THRESHOLD was 0.3 frozenset("a1","b2","c3") would be added to setOfMatches, but if THRESHOLD was 0.7 it wouldn't be added.

I realize this isn't efficient, but I am definitely open to any suggestions including converting the key datatype to some other datatype for a speed up. I have also looked into using tools like Numba and Cython as well (I prefer to keep it in pure python though). It needs to be crazy fast!

Any help is really appreciated! Thanks!

So, as pointed out by @blhsing, I could have several immediate improvements (doh!), but I am still open for any further ways to speed this up! — D00dood, Jul 19 '18 at 02:51

score 2 · Accepted Answer · answered Jul 19 '18 at 02:19

2

You aren't using the values of bigDictA and bigDictB, so you don't need to use the dict.items() method.
len() works on sets too, so no need to convert a set to a list in order to get its size.
As soon as there's one match above the threshold and you add bigDictAkey to setOfMatches, there is no need to test the rest of the items in bigDictB and you should immediately break the inner loop to check the next item in bigDictA.

The improved code as follows:

setOfMatches = set()
for bigDictAkey in bigDictA:
    for bigDictBkey in bigDictB:
        numberOfItemsInCommon = len(bigDictAkey & bigDictBkey)
        valForComparison = THRESHOLD*len(bigDictAkey)
        if (numberOfItemsInCommon >= valForComparison):
            setOfMatches.add(bigDictAkey)
            break

answered Jul 19 '18 at 02:19

blhsing

91,368
6
71
106

1

If there happens to be an even quicker way, I will accept that. If no one else comments then you got it! :) Thanks again. So silly of me. – D00dood Jul 19 '18 at 02:24
One thing though, I have to mention. I don't think the break is a good idea, because there may be more matches after one of them is made for a given key from bigDictA, so we should still test the rest of the keys from bigDictB. – D00dood Jul 19 '18 at 02:35
1

But even if there are more matches in the rest of `bigDictB`, you are still going to add the same `bigDictAkey` to the `setOfMatches` set, which already has the current `bigDictAkey`, and that's why you should `break` as soon as you find a match. – blhsing Jul 19 '18 at 02:37
1

Oh I see! Perfect! ( ͡ᵔ ͜ʖ ͡ᵔ ) – D00dood Jul 19 '18 at 02:43

Using Python, what is the fastest way to compare two large dictionaries while returning keys (as frozensets) that match past a certain threshold?

1 Answers1