Right way to calculate the cosine similarity of two word-frequency-dictionaries in python?

Question

I'm trying to iterate through a file containing text and calculate the cosine similarity between the current line and a query the user raised. I have already tokenized the query and the line and saved the union of their words into a set.

Example:

line_tokenized = ['Karl', 'Donald', 'Ifwerson']

query_tokenized = ['Donald', 'Trump']

word_set = ['Karl', 'Donald', 'Ifwerson', 'Trump']

Now I have to create one dictionary each for the line and the query, containing word-frequency pairs. I thought about something like this:

line_dict = {'Karl': 1, 'Donald': 1, 'Ifwerson': 1, 'Trump': 0}
query_dict = {'Karl': 0, 'Donald': 1, 'Ifwerson': 0, 'Trump': 1}

But the cosine similarity won't be calculated properly as the key-value pairs are unordered. I came across OrderedDict(), but I don't understand how to implement some things as its elements are stored as tuples.

So my questions are:

How can I set the key-value pairs and have access to them afterwards?
How can I increment the value of a certain key?
Or is there an easier way to do this?

What do you mean by "key-value pairs are unordered"? How would you expect them to be ordered? — bluesummers, Jan 24 '17 at 12:17
I'd like them to stay in the order I add them to the dictionary. — lvcasco, Jan 24 '17 at 12:19
I don't understand what you mean by that, I don't see in your code which order you refer to, I have a good answer for you, just explain to me the order — bluesummers, Jan 24 '17 at 12:20
@LucaIonescu: a dictionary is a hashtable and thus has no inherent order. You can indeed use datastructures to enforce order, but to calculate the cos sim, that is not necessary. — Willem Van Onsem, Jan 24 '17 at 12:24

score 3 · Accepted Answer · edited Dec 10 '20 at 12:17

You do not need to order the dictionaries to compute cosine similarity; a simple lookup is sufficient:

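import math

def cosine_dic(dic1, dic2):
    numerator = 0  # dot product of the two frequency vectors
    dena = 0       # squared norm of dic1
    for key1, val1 in dic1.items():
        # keys missing from dic2 contribute 0 to the dot product
        numerator += val1 * dic2.get(key1, 0.0)
        dena += val1 * val1
    denb = 0       # squared norm of dic2
    for val2 in dic2.values():
        denb += val2 * val2
    if dena == 0 or denb == 0:
        # assumption: an empty or all-zero vector is given similarity 0
        # instead of raising ZeroDivisionError
        return 0.0
    return numerator / math.sqrt(dena * denb)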

The .get(key1, 0.0) call looks up the key and, if it does not exist, falls back to 0.0. As a result, neither dic1 nor dic2 needs to store entries whose value is 0.
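To illustrate that point, here is a minimal sketch that calls a compact variant of the function on dictionaries holding only the words that actually occur. The name cosine_sparse and the choice to return 0.0 for an empty vector are my own assumptions, not part of the code above.

```python
import math

def cosine_sparse(freq_a, freq_b):
    """Cosine similarity of two word -> count dicts; missing words count as 0."""
    dot = sum(count * freq_b.get(word, 0) for word, count in freq_a.items())
    norm_a = math.sqrt(sum(c * c for c in freq_a.values()))
    norm_b = math.sqrt(sum(c * c for c in freq_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # assumption: an empty vector means "no similarity"
    return dot / (norm_a * norm_b)

# No zero entries are stored: 'Trump' is absent from the line,
# 'Karl' and 'Ifwerson' are absent from the query.
line_dict = {'Karl': 1, 'Donald': 1, 'Ifwerson': 1}
query_dict = {'Donald': 1, 'Trump': 1}

similarity = cosine_sparse(line_dict, query_dict)
print(similarity)  # equals 1 / sqrt(6), roughly 0.408
```

Only 'Donald' is shared, so the dot product is 1 and the norms are sqrt(3) and sqrt(2).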

To answer your additional questions:

How can I set the key-value pairs and have access to them afterwards?

You simply state:

dic[key] = value

How can I increment the value of a certain key?

If you know for sure that the key is already part of the dictionary:

dic[key] += 1

otherwise you can use:

dic[key] = dic.get(key,0)+1
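Put together, the increment pattern above is enough to build a frequency dictionary from a token list. A small sketch follows; the helper name word_frequencies is hypothetical, and I repeated 'Donald' in the sample tokens so the increment is visible.

```python
def word_frequencies(tokens):
    """Count how often each token occurs using plain dict operations."""
    freq = {}
    for token in tokens:
        # .get returns 0 for a token seen for the first time
        freq[token] = freq.get(token, 0) + 1
    return freq

# Sample tokens (illustrative; 'Donald' appears twice on purpose)
line_tokenized = ['Karl', 'Donald', 'Ifwerson', 'Donald']
line_dict = word_frequencies(line_tokenized)
print(line_dict)  # {'Karl': 1, 'Donald': 2, 'Ifwerson': 1}
```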

Or is there an easier way to do this?

You can use a Counter (from the collections module), which is basically a dictionary with some added functionality.
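As a rough sketch of how that could look: collections.Counter counts the tokens for you, and indexing a missing key returns 0 without inserting it, so no .get call is needed. The variable names are just for illustration.

```python
import math
from collections import Counter

# Counter builds the word -> frequency mapping directly from the tokens
line_counts = Counter(['Karl', 'Donald', 'Ifwerson'])
query_counts = Counter(['Donald', 'Trump'])

# query_counts[word] is 0 for words the query does not contain
dot = sum(count * query_counts[word] for word, count in line_counts.items())
norm_line = math.sqrt(sum(c * c for c in line_counts.values()))
norm_query = math.sqrt(sum(c * c for c in query_counts.values()))

similarity = dot / (norm_line * norm_query)
print(similarity)  # equals 1 / sqrt(6), roughly 0.408
```

Incrementing also works for keys that are not present yet, e.g. line_counts['Trump'] += 1.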

score 1 · Answer 2 · answered Jan 24 '17 at 12:24

Using pandas and scipy

import pandas as pd
from scipy.spatial.distance import cosine

line_dict = {'Karl': 1, 'Donald': 1, 'Ifwerson': 1, 'Trump': 0}
query_dict = {'Karl': 0, 'Donald': 1, 'Ifwerson': 0, 'Trump': 1}

line_s = pd.Series(line_dict)
query_s = pd.Series(query_dict)

print(1 - cosine(line_s, query_s))

This code will output 0.40824829046386291

I didn't understand what you meant by "order" so I haven't dealt with that, but this code should be a good start for you.
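One possible reading of the "order" concern: a function that compares vectors position by position, as scipy's cosine does, needs both vectors to list the words in the same order. A minimal sketch of one way to guarantee that without pandas is to build both vectors from a single sorted vocabulary. The names vocabulary, line_vec and query_vec are my own, and this is only an assumption about what was meant.

```python
import math

# The two dicts may have different keys and different insertion order
line_dict = {'Karl': 1, 'Donald': 1, 'Ifwerson': 1}
query_dict = {'Trump': 1, 'Donald': 1}

# One shared, sorted vocabulary fixes the position of every word
vocabulary = sorted(set(line_dict) | set(query_dict))
line_vec = [line_dict.get(word, 0) for word in vocabulary]
query_vec = [query_dict.get(word, 0) for word in vocabulary]

dot = sum(a * b for a, b in zip(line_vec, query_vec))
norm_line = math.sqrt(sum(a * a for a in line_vec))
norm_query = math.sqrt(sum(b * b for b in query_vec))
similarity = dot / (norm_line * norm_query)

print(vocabulary)  # ['Donald', 'Ifwerson', 'Karl', 'Trump']
print(line_vec)    # [1, 1, 1, 0]
print(query_vec)   # [1, 0, 0, 1]
```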
