
Wikipedia gives a very nice explanation of the vector space model.

http://en.wikipedia.org/wiki/Vector_space_model

However, it skips one part that is not self-explanatory to me: the definition of the query vector. The text starts with

d_j = ( w_{1,j} ,w_{2,j} , .... ,w_{t,j} )   // document vector
q = ( w_{1,q} ,w_{2,q} , ... ,w_{t,q} )    // query vector

and proceeds to explain how d_j is defined in terms of tf-idf for a document in a corpus. That's all fine, but I am not able to carry that explanation over to the query vector. In the idf part, how would you apply

| {d' ∈ D | t ∈ d'} | ?

In the case of the query vector, even though a term is part of the query, the query itself is not a document in the corpus, so the normalization term above seems to have no equivalent.

Can any experts in the vector space model clarify?

hivert
bhomass
  • The more I think about it, the more it seems that the query is simply treated as an additional document. Since the number of documents containing a particular query term tends to be high, adding one more document makes a negligible difference to the normalization term. – bhomass Feb 01 '14 at 22:16

1 Answer


One of the key ideas behind VSM is that we treat both queries and documents simply as "bags of words" that live in the same space. This means that to create the query vector we can treat the query like a document: its term frequencies come from the query text, while the idf values are the ones already computed from the corpus.
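
To make that concrete, here is a minimal sketch in Python. The toy corpus, the helper names (`idf`, `tfidf_vector`, `cosine`) and the choice to give zero weight to terms that never occur in the corpus are all my own assumptions for illustration, not something taken from the Wikipedia article. The point is only that the query is weighted with the same corpus idf as the documents and is never added to the corpus itself.

```python
import math
from collections import Counter

# Hypothetical toy corpus; any list of tokenized documents would do.
corpus = [
    "the cat sat on the mat".split(),
    "the dog chased the cat".split(),
    "dogs and cats make good pets".split(),
]

N = len(corpus)
# Document frequency: number of corpus documents that contain each term.
df = Counter(term for doc in corpus for term in set(doc))
vocab = sorted(df)  # one vector dimension per corpus term


def idf(term):
    # log(N / df). A term that never occurs in the corpus gets weight 0,
    # which is an assumption made for this sketch.
    return math.log(N / df[term]) if df[term] else 0.0


def tfidf_vector(tokens):
    # Works for documents and queries alike: tf comes from `tokens`,
    # idf always comes from the corpus statistics.
    tf = Counter(tokens)
    return [tf[t] * idf(t) for t in vocab]


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


doc_vectors = [tfidf_vector(doc) for doc in corpus]
query_vector = tfidf_vector("cat mat".split())  # the query is not in the corpus
scores = [cosine(query_vector, d) for d in doc_vectors]
```

With this corpus the first document, which contains both query terms, ranks above the second, which contains only "cat", and the third scores zero.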

It's important to note that there are various scoring schemes, and the scheme used for query vectors doesn't have to match the one used for document vectors.
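
As an illustration of mixing schemes, the sketch below follows the combination the IR book calls lnc.ltc in SMART notation: documents use log-scaled tf with no idf, queries use log-scaled tf times idf, and both are cosine-normalized. The toy corpus and all function names are hypothetical, and the code assumes every dimension comes from the corpus vocabulary, so query terms unseen in the corpus are simply dropped.

```python
import math
from collections import Counter

# Hypothetical toy corpus used only for this sketch.
corpus = [
    "the cat sat on the mat".split(),
    "the dog chased the cat".split(),
    "dogs and cats make good pets".split(),
]

N = len(corpus)
df = Counter(term for doc in corpus for term in set(doc))
vocab = sorted(df)


def log_tf(count):
    # Logarithmic term frequency: 1 + ln(tf) for tf > 0, else 0.
    return 1 + math.log(count) if count > 0 else 0.0


def normalize(vec):
    # Cosine (unit-length) normalization.
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else vec


def doc_vector(tokens):
    # "lnc": log tf, no idf, cosine normalization.
    tf = Counter(tokens)
    return normalize([log_tf(tf[t]) for t in vocab])


def query_vector(tokens):
    # "ltc": log tf, corpus idf, cosine normalization.
    tf = Counter(tokens)
    return normalize([log_tf(tf[t]) * math.log(N / df[t]) for t in vocab])


def score(q, d):
    # Both vectors are unit length, so the dot product is the cosine.
    return sum(a * b for a, b in zip(q, d))


q = query_vector("cat mat".split())
scores = [score(q, doc_vector(doc)) for doc in corpus]
```

Here idf enters the score only through the query side, yet rare query terms still count for more than common ones.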

Here's a good explanation: http://nlp.stanford.edu/IR-book/html/htmledition/queries-as-vectors-1.html

I think reading the whole chapter 6 is very helpful in understanding VSM, and there are more advanced topics in later chapters if you're interested.

aiguofer