I'm trying to iterate through a file containing text and calculate the cosine similarity between the current line and a query the user raised. I have already tokenized the query and the line and saved the union of their words into a set.
Example:
line_tokenized = ['Karl', 'Donald', 'Ifwerson']
query_tokenized = ['Donald', 'Trump']
word_set = ['Karl', 'Donald', 'Ifwerson', 'Trump']
Now I have to create one dictionary each for the line and the query, containing word-frequency pairs. I thought about something like this:
line_dict = {'Karl': 1, 'Donald': 1, 'Ifwerson': 1, 'Trump': 0}
query_dict = {'Karl': 0, 'Donald': 1, 'Ifwerson': 0, 'Trump': 1}
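For reference, here is a minimal sketch of how dictionaries like these could be built from the token lists. The helper name `build_freq_dict` is hypothetical, and the sketch assumes the vocabulary is kept as a list so that both dictionaries are filled in the same word order:

```python
from collections import Counter

def build_freq_dict(tokens, vocabulary):
    """Map every word in vocabulary to its count in tokens (0 if absent)."""
    counts = Counter(tokens)  # a Counter returns 0 for words it has not seen
    return {word: counts[word] for word in vocabulary}

line_tokenized = ['Karl', 'Donald', 'Ifwerson']
query_tokenized = ['Donald', 'Trump']
word_set = ['Karl', 'Donald', 'Ifwerson', 'Trump']

# Both dictionaries are built by walking the same vocabulary list.
line_dict = build_freq_dict(line_tokenized, word_set)
query_dict = build_freq_dict(query_tokenized, word_set)
```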
But the cosine similarity won't be calculated properly, as the key-value pairs are unordered. I came across OrderedDict(), but I don't understand how to implement some things, as its elements are stored as tuples.
So my questions are:
- How can I set the key-value pairs and have access to them afterwards?
- How can I increment the value of a certain key?
- Or is there an easier way to do this?