Different search models for simple search engine program Binary Search

Question

This is similar to, but not the same as, a previous question. Rather, it builds upon it (although technically it is easier).

I'll link it here so you can get an idea:-

Still using my two nested dictionaries:-

wordFrequency = {'bit':{1:3,2:4,3:19,4:0},'red':{1:0,2:0,3:15,4:0},'dog':{1:3,2:0,3:4,4:5}}
search = {1:{'bit':1},2:{'red':1,'dog':1},3:{'bit':2,'red':3}}

The first dictionary maps each word to the file numbers and the number of times the word appears in each file. The second maps each search number to the words in that search and the number of times each word appears in it.
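To make the layout of the two dictionaries concrete, here is a small lookup sketch. The variable names `bit_in_file_3` and `red_in_search_3` are made up for illustration; the data is copied from above.

```python
wordFrequency = {'bit': {1: 3, 2: 4, 3: 19, 4: 0},
                 'red': {1: 0, 2: 0, 3: 15, 4: 0},
                 'dog': {1: 3, 2: 0, 3: 4, 4: 5}}
search = {1: {'bit': 1}, 2: {'red': 1, 'dog': 1}, 3: {'bit': 2, 'red': 3}}

# Outer key is the word, inner key is the file number:
# how many times 'bit' appears in file 3.
bit_in_file_3 = wordFrequency['bit'][3]

# Outer key is the search number, inner key is the word:
# how many times 'red' appears in search 3.
red_in_search_3 = search[3]['red']
```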

I want to use the Binary Independence Model; a good explanation is here:-

http://en.wikipedia.org/wiki/Binary_Independence_Model

It is simpler than my previous model because the number of times a particular word appears in a search or file is irrelevant; only its presence or absence matters, i.e. it is boolean. So the approach is similar, but if a word appears more than once in a search or file, it is still treated as just 1.
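As a rough sketch of what "boolean" means here, the counts can be collapsed to 1 (present) or 0 (absent). The helper name `to_binary` and the variable `binary_frequency` are hypothetical, and this only shows the presence/absence step, not the probabilistic weighting that the full Binary Independence Model describes.

```python
wordFrequency = {'bit': {1: 3, 2: 4, 3: 19, 4: 0},
                 'red': {1: 0, 2: 0, 3: 15, 4: 0},
                 'dog': {1: 3, 2: 0, 3: 4, 4: 5}}


def to_binary(counts):
    """Map each key to 1 if its count is positive, otherwise 0."""
    return {key: 1 if count > 0 else 0 for key, count in counts.items()}


# Apply the same collapse to every word's per-file counts.
binary_frequency = {word: to_binary(files)
                    for word, files in wordFrequency.items()}
```

Note that a count of 0 must stay 0: a file in which the word never appears does not contain it, even though the file number is still a key in the dictionary.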

The expected output is again a dictionary:-

{1: [3, 2, 1, 4], 2: [3, 4, 1, 2], 3: [3, 2, 1, 4]}

This is the output from the previous program, which used the vector space model, so the output for this model may be different.

My code so far:-

from collections import Counter

def retrieve():

    wordFrequency = {'bit':{1:3,2:4,3:19,4:0},'red':{1:0,2:0,3:15,4:0},'dog':{1:3,2:0,3:4,4:5}}
    search = {1:{'bit':1},2:{'red':1,'dog':1},3:{'bit':2,'red':3}}

    results = {}
    for search_number, words in search.iteritems():
        file_relevancy = Counter()
        for word, num_appearances in words.iteritems():
            num_appearances = 1
            for file_id, appear_in_file in wordFrequency.get(word, {}).iteritems():
                appear_in_file = 1
                file_relevancy[file_id] += num_appearances * appear_in_file

        results[search_number] = [file_id for (file_id, count) in file_relevancy.most_common()]

    return results

print retrieve()

However, I am only getting an output of

{1: [1, 2, 3, 4], 2: [1, 2, 3, 4], 3: [1, 2, 3, 4]}

This is not correct; it is just returning the files in numeric order.

Oh come on, how is this question worth a down vote?! At least say why! — DannyBoy, Nov 22 '14 at 16:32
Well I've just come to it now, but I don't yet understand what you're asking. You say what your expected output is, but I have no idea why that's your expected output. And indeed, you say "this output may be different" -- why? How? Then I tried to understand your code, but I don't know what `search` is. I don't know what `wordFrequency` is. This is a very confusing question! — senderle, Nov 22 '14 at 16:44
My suggestion is to break this down into smaller parts. It's short, but it's not simple; your code is nested four levels deep. Linus Torvalds, in the [style guide](https://www.kernel.org/doc/Documentation/CodingStyle) for the _linux kernel_, says "if you need more than 3 levels of indentation, you're screwed." That's for an operating system kernel! Simplify this code. Write individual functions for each of these loops, and test them to make sure they do what you expect. Then combine them and test the combinations. — senderle, Nov 22 '14 at 16:47
Ah, I see what `search` and `wordFrequency` are now -- OK. But it took me some time to put that together because the question isn't well organized. — senderle, Nov 22 '14 at 16:50
I see what the problem is now. You're missing some `if` statements. I won't say any more because you really should do what I said above. Otherwise you'll run into problems like these time and time again. — senderle, Nov 22 '14 at 17:02
Sorry but I do not see how if statements can help this to work? — DannyBoy, Nov 22 '14 at 17:48
Yes, but I'd ask you to try breaking this into smaller pieces and posting the resulting code first. I bet you'll be able to answer your own question in the process! — senderle, Nov 22 '14 at 19:03
I'm not sure how to break it up though? It is based off Tzach's response to my previous question and that uses 3 nested for and it makes sense to me now as well? I would like to keep the coding style consistent. — DannyBoy, Nov 22 '14 at 20:20
Write several small functions with one loop each. They should take the relevant input, create a list with the correct output, and return it. If you can't figure out how to do that, then you need to practice with a simpler problem than this. — senderle, Nov 22 '14 at 23:44
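One possible reading of the comments above, sketched as small single-purpose functions. This is an assumption about what the missing `if` might be (a check that the count is greater than zero before a file is credited), not something the commenter confirmed. The function names `files_containing`, `score_files` and `rank_files` are invented for illustration, `retrieve` is changed to take its data as arguments, and the score is simply the number of distinct query words a file contains rather than the probabilistic weights of the full Binary Independence Model. Ties are broken by ascending file number, which is also an assumption.

```python
def files_containing(word, word_frequency):
    """Return the set of file ids in which `word` appears at least once."""
    counts = word_frequency.get(word, {})
    return {file_id for file_id, count in counts.items() if count > 0}


def score_files(query_words, word_frequency, all_files):
    """For each file, count how many distinct query words it contains."""
    scores = {file_id: 0 for file_id in all_files}
    for word in query_words:
        for file_id in files_containing(word, word_frequency):
            scores[file_id] += 1
    return scores


def rank_files(scores):
    """Order files by descending score, ties by ascending file id."""
    return sorted(scores, key=lambda file_id: (-scores[file_id], file_id))


def retrieve(search, word_frequency):
    """Rank every known file for every search."""
    all_files = sorted({file_id
                        for files in word_frequency.values()
                        for file_id in files})
    return {number: rank_files(score_files(words, word_frequency, all_files))
            for number, words in search.items()}


wordFrequency = {'bit': {1: 3, 2: 4, 3: 19, 4: 0},
                 'red': {1: 0, 2: 0, 3: 15, 4: 0},
                 'dog': {1: 3, 2: 0, 3: 4, 4: 5}}
search = {1: {'bit': 1}, 2: {'red': 1, 'dog': 1}, 3: {'bit': 2, 'red': 3}}

results = retrieve(search, wordFrequency)
# results == {1: [1, 2, 3, 4], 2: [3, 1, 4, 2], 3: [3, 1, 2, 4]}
```

Under these assumptions search 1 still comes out as `[1, 2, 3, 4]`, but for a different reason than in the question's code: files 1, 2 and 3 each contain 'bit' and tie, and file 4 does not contain it. In the original loop every file listed under a word receives a score of 1, including files where the count is 0, so all files tie for every search.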
