I decided to delete my earlier question and ask again, as that was just easier. Please do not vote this down; I have taken on board what people have been saying.

I have two nested dictionaries:-

wordFrequency = {'bit':{1:3,2:4,3:19,4:0},'red':{1:0,2:0,3:15,4:0},'dog':{1:3,2:0,3:4,4:5}}

search = {1:{'bit':1},2:{'red':1,'dog':1},3:{'bit':2,'red':3}}

The first dictionary links each word to a file number and the number of times the word appears in that file. The second contains searches, linking each word to the number of times it appears in that search.

I want to extract certain values so that, for each search, I can calculate the scalar product between the number of times words appear in a file and the number of times they appear in the search, divided by the product of their magnitudes, and then see which file is most similar to the current search. The scalar product is (word 1 appearances in search * word 1 appearances in file) + (word 2 appearances in search * word 2 appearances in file), and so on. I then want to return a dictionary mapping each search to a list of file numbers, most similar first, least similar last.

Expected output is a dictionary:

{1:[4,3,1,2],2:[1,2,4,3]}

etc.

The key is the search number, the value is a list of files most relevant first.

(These may not actually be right.)

This is what I have:-

def retrieve():
    results = {}
    for word in search:
        numberOfAppearances = wordFrequency.get(word).values()
        for appearances in numberOfAppearances:
            results[fileNumber] = numberOfAppearances.dot()
    return sorted(results.iteritems(), key=lambda (fileNumber, appearances): appearances, reverse=True)

Sorry, no, it just says wdir = and then the directory the .py file is in.

  • Edit

The entire Retrieve.py file:

from collections import Counter

def retrieve():

    wordFrequency = {'bit':{1:3,2:4,3:19,4:0},'red':{1:0,2:0,3:15,4:0},'dog':{1:3,2:0,3:4,4:5}}
    search = {1:{'bit':1},2:{'red':1,'dog':1},3:{'bit':2,'red':3}}


    results = {}
    for search_number, words in search.iteritems():
        file_relevancy = Counter()
        for word, num_appearances in words.iteritems():
            for file_id, appear_in_file in wordFrequency.get(word, {}).iteritems():
                file_relevancy[file_id] += num_appearances * appear_in_file

        results[search_number] = [file_id for (file_id, count) in file_relevancy.most_common()]

    return results

I am using the Spyder IDE for Anaconda Python 2.7. I just press the green play button, and the output is:

wdir='/Users/danny/Desktop'

  • Edit 2

With regard to the magnitudes: for example, for search number 3 and file 1, their product would be:

sqrt (2^2 + 3^2 + 0^2) * sqrt (3^2 + 0^2 + 3^2)
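
To make the arithmetic concrete, here is a minimal sketch of the full similarity for that same pair (search 3 against file 1). It assumes the vector components are ordered as bit, red, dog; the variable names are only illustrative, and the syntax is chosen to run on both Python 2.7 and Python 3.

```python
from math import sqrt

# Assumed component order: bit, red, dog
search_vector = [2, 3, 0]   # search 3: 'bit' twice, 'red' three times, no 'dog'
file_vector = [3, 0, 3]     # file 1: 'bit' 3 times, 'red' 0 times, 'dog' 3 times

# Scalar (dot) product: 2*3 + 3*0 + 0*3
dot = sum(s * f for s, f in zip(search_vector, file_vector))

# Product of the two magnitudes: sqrt(13) * sqrt(18)
magnitudes = (sqrt(sum(s * s for s in search_vector)) *
              sqrt(sum(f * f for f in file_vector)))

similarity = dot / magnitudes
print(round(similarity, 4))  # 0.3922
```

So the dot product is 6, and dividing by sqrt(13) * sqrt(18) gives roughly 0.39 for this pair.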

DannyBoy

1 Answer

Here is a start:

from collections import Counter
def retrieve():
    results = {}
    for search_number, words in search.iteritems():
        file_relevancy = Counter()
        for word, num_appearances in words.iteritems():
            for file_id, appear_in_file in wordFrequency.get(word, {}).iteritems():
                file_relevancy[file_id] += num_appearances * appear_in_file

        results[search_number] = [file_id for (file_id, count) in file_relevancy.most_common()]

    return results

print retrieve()
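
The code above ranks files by the raw dot product only. One possible way to continue it, so that each dot product is divided by the product of the search and file magnitudes as the question describes, is sketched below. This is an assumption about what is wanted rather than a tested part of the answer: the function name `retrieve_cosine` is hypothetical, a file whose vector is all zeros is given a score of 0.0, and `.items()` with `print(...)` is used instead of `.iteritems()` and the print statement so that the sketch runs on both Python 2.7 and Python 3.

```python
from math import sqrt

wordFrequency = {'bit': {1: 3, 2: 4, 3: 19, 4: 0},
                 'red': {1: 0, 2: 0, 3: 15, 4: 0},
                 'dog': {1: 3, 2: 0, 3: 4, 4: 5}}
search = {1: {'bit': 1}, 2: {'red': 1, 'dog': 1}, 3: {'bit': 2, 'red': 3}}


def retrieve_cosine(search, wordFrequency):
    # Every file number mentioned for any word.
    file_ids = sorted(set(f for counts in wordFrequency.values() for f in counts))

    # Magnitude of each file vector, taken over all known words.
    file_magnitude = {}
    for file_id in file_ids:
        file_magnitude[file_id] = sqrt(
            sum(counts.get(file_id, 0) ** 2 for counts in wordFrequency.values()))

    results = {}
    for search_number, words in search.items():
        search_magnitude = sqrt(sum(n * n for n in words.values()))
        scores = {}
        for file_id in file_ids:
            # Dot product between the search vector and this file's vector.
            dot = sum(n * wordFrequency.get(word, {}).get(file_id, 0)
                      for word, n in words.items())
            denominator = search_magnitude * file_magnitude[file_id]
            # Assumption: a zero-length vector scores 0.0 instead of dividing by zero.
            scores[file_id] = dot / denominator if denominator else 0.0
        # Most similar file first.
        results[search_number] = sorted(file_ids, key=lambda f: scores[f], reverse=True)
    return results


print(retrieve_cosine(search, wordFrequency))
```

With the sample data from the question, this ranks the files as [2, 3, 1, 4] for search 1, [4, 3, 1, 2] for search 2 and [3, 2, 1, 4] for search 3.
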
Tzach
  • I still do not get any output, only wdir = again? – DannyBoy Nov 20 '14 at 22:54
  • I tested this code and it works. What is `wdir =`? Did you print the results? – Tzach Nov 20 '14 at 22:56
  • It just says wdir = '' (the file path is in the quotations). Print the results? – DannyBoy Nov 20 '14 at 23:01
  • How exactly are you running the script? Can you copy the entire file contents to the question? – Tzach Nov 20 '14 at 23:02
  • Shown in original question :) – DannyBoy Nov 20 '14 at 23:08
  • Try to run the file without spyder first, just from the console by typing `python ` – Tzach Nov 20 '14 at 23:15
  • It doesn't do anything? – DannyBoy Nov 20 '14 at 23:20
  • You should run the function; just add `print retrieve()` to the end of the file. – Tzach Nov 20 '14 at 23:22
  • Yeah it works now, that's amazing thank you so much! One more thing though. In order to calculate vector similarity the dot product needs to be divided by the product of the magnitudes of the search and file vectors, if that makes sense? I realise line 8 is taking the dot product but it doesn't appear to be dividing by the product of the 2 magnitudes? – DannyBoy Nov 20 '14 at 23:46
  • That's why I wrote "Here is a start". You should continue to improve my code until you get exactly what you wanted. – Tzach Nov 20 '14 at 23:48
  • Okay cool, but does that make sense to you what I mean? I made an example in the original question above?! – DannyBoy Nov 21 '14 at 00:06