Decided to delete and ask again, was just easier! Please do not vote down as have taken on board what people have been saying.
I have two nested dictionaries:-
wordFrequency = {'bit':{1:3,2:4,3:19,4:0},'red':{1:0,2:0,3:15,4:0},'dog':{1:3,2:0,3:4,4:5}}
search = {1:{'bit':1},2:{'red':1,'dog':1},3:{'bit':2,'red':3}}
The first dictionary links words a file number and the number of times they appear in that file. The second contains searches linking a word to the number of times it appears in the current search.
I want to extract certain values so that for each search I can calculate the scalar product between the number of times words appear in a file and number of times they appear in a search divided by their magnitudes, then see which file is most similar to the current search i.e. (word 1 appearances in search * word 1 appearances in file) + (word 2 appearances in search * word 2 appearances in file) etc. And then return a dictionary of searches to list of file numbers, most similar first, least similar last.
Expected output is a dictionary:
{1:[4,3,1,2],2:[1,2,4,3]}
etc.
The key is the search number, the value is a list of files most relevant first.
(These may not actually be right.)
This is what I have:-
def retrieve():
results = {}
for word in search:
numberOfAppearances = wordFrequency.get(word).values()
for appearances in numberOfAppearances:
results[fileNumber] = numberOfAppearances.dot()
return sorted (results.iteritems(), key=lambda (fileNumber, appearances): appearances, reverse=True)
Sorry no it just says wdir = and then the directory the .py file is in.
- Edit
The entire Retrieve.py file:
from collections import Counter
def retrieve():
wordFrequency = {'bit':{1:3,2:4,3:19,4:0},'red':{1:0,2:0,3:15,4:0},'dog': {1:3,2:0,3:4,4:5}}
search = {1:{'bit':1},2:{'red':1,'dog':1},3:{'bit':2,'red':3}}
results = {}
for search_number, words in search.iteritems():
file_relevancy = Counter()
for word, num_appearances in words.iteritems():
for file_id, appear_in_file in wordFrequency.get(word, {}).iteritems():
file_relevancy[file_id] += num_appearances * appear_in_file
results[search_number] = [file_id for (file_id, count) in file_relevancy.most_common()]
return results
I am using the Spyder GUI / IDE for Anaconda Python 2.7, just press the green play button and output is:
wdir='/Users/danny/Desktop'
- Edit 2
In regards to the magnitude, for example, for search number 3 and file 1 it would be:
sqrt (2^2 + 3^2 + 0^2) * sqrt (3^2 + 0^2 + 3^2)