Hi, I'm trying to write a small program that indexes documents from an XML collection, using tf-idf weighting. When my program reads a query, it returns a list of (tf-idf, docid) tuples, one for each query word in each document.

This is an example:

Query: "Dog water"

Documents: [(0.212,1),(0.334,1),(0.111,2),(0,2)]  

In this case, document 2 contains only one of the two query words (hence the weight of 0).

My question is: I know that I have to compute the dot product between those documents and the query, but how do I do that? How can I translate the query into a vector of weights?

Thank you.

1 Answer

If your question is "how do I build a docid: [weight, ...] dict from this list?", the answer is quite simple:

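from collections import defaultdict

def transform(query_results):
    """Group (weight, docid) tuples into a {docid: [weight, ...]} mapping."""
    revindex = defaultdict(list)
    for weight, docid in query_results:
        # Weights keep the order in which they appear in the input list.
        revindex[docid].append(weight)
    return revindex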

Otherwise, please explain further and, if possible, give an example of the expected output.
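
Applied to the example list from the question, the function groups the weights by document id. This is a short usage sketch. The variable names `results` and `grouped` are only illustrative, and `transform` is repeated so the snippet runs on its own:

```python
from collections import defaultdict

def transform(query_results):
    """Group (weight, docid) tuples into a {docid: [weight, ...]} mapping."""
    revindex = defaultdict(list)
    for weight, docid in query_results:
        revindex[docid].append(weight)
    return revindex

# Example list taken from the question (query: "Dog water").
results = [(0.212, 1), (0.334, 1), (0.111, 2), (0, 2)]
grouped = transform(results)

print(dict(grouped))  # {1: [0.212, 0.334], 2: [0.111, 0]}
```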

bruno desthuilliers
  • No, sorry. I would like to understand how to interpret the query as a list of weights, so that I can then compute a dot product between this list and the list of documents I returned. Sorry for my English. – Dancing Flowerz Sep 08 '14 at 16:58
  • Please post the expected output. – bruno desthuilliers Sep 09 '14 at 08:49
  • It's not about the output. I need to understand how to interpret the query as a vector of weights so that I can compute the dot product with my document vectors, as in cosine similarity. – Dancing Flowerz Sep 09 '14 at 12:13
  • If you cannot come up with a clear explanation and/or a *concrete* example of what you want, then no one will be able to help you. Note that you already have 3 votes for closing, all for the same reason: "unclear what you are asking". – bruno desthuilliers Sep 09 '14 at 15:23
  • OK, I'll try to be more specific. When a user writes a query, I return a list of documents as I said. This list contains tuples (tf*idf, docid). At this point I could simply sum the weights with the same docid, for example (0.2132, 4) and (0.33, 4). In the end I would have tuples with different document ids, each one representing the final weight of the document. Instead of this solution, I can compute the cosine similarity between the list of documents I have and the query written by the user. To do this I have to transform the query "word1 word2 word3" into an equivalent list of weights. – Dancing Flowerz Sep 09 '14 at 19:41
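
One common way to do what the last comment describes is to treat the query as a very short document: each distinct query term gets the weight (count of the term in the query) × (idf of the term in the collection). The sketch below is not taken from the thread and rests on several assumptions. The `idf` dictionary and its values are made up for illustration. The tuples are assumed to arrive in query-term order for each document, as in the question's example. The document norm is computed only over the query terms, which is a simplification; a full cosine similarity would use the norm of the entire document vector.

```python
import math
from collections import Counter, defaultdict

def query_vector(query_terms, idf):
    """Return (distinct terms, weights), with weight = tf in query * idf."""
    tf = Counter(query_terms)
    terms = list(dict.fromkeys(query_terms))  # distinct terms, original order
    return terms, [tf[t] * idf.get(t, 0.0) for t in terms]

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def rank(query_results, q_weights):
    """Group (weight, docid) tuples per document and score each against the query."""
    doc_vectors = defaultdict(list)
    for weight, docid in query_results:
        doc_vectors[docid].append(weight)
    scores = {d: cosine(q_weights, vec) for d, vec in doc_vectors.items()}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

# Hypothetical idf values; in practice they come from the indexed collection.
idf = {"dog": 1.0, "water": 2.0}
terms, q_weights = query_vector("dog water".split(), idf)

# Example list from the question: two weights per document, in query-term order.
results = [(0.212, 1), (0.334, 1), (0.111, 2), (0, 2)]
ranked = rank(results, q_weights)
print(ranked)  # document 1 scores higher than document 2
```

With these made-up idf values, document 1 ranks first because it matches both query terms, while document 2 matches only "dog". Summing the weights per docid, as mentioned in the comment, gives a simpler score that ignores the query weights and the length normalisation.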