
I have been trying to implement the Rocchio algorithm. I understand the basic idea behind it, but I struggle to put it into concrete terms. I previously calculated tf-idf for each document that contains at least one of the query terms, as a vector whose length equals the number of query terms. Now I feel that I cannot represent a document as a vector in the space formed only by the query terms, because that would not let me "discover" other terms that the relevant documents have in common. Should I instead represent the query vector and the document vectors in a vector space of all the tokens found in the currently returned set of documents?

vcucu
Yes, the dimension of the vectors (both docs and queries) is the vocabulary size of the collection, so these vectors are extremely sparse (most entries being zeroes). – Debasis Mar 18 '20 at 09:18

1 Answer


"Yes, the dimension of the vectors (both docs and queries) is the vocabulary size of the collection, so these vectors are extremely sparse (most entries being zeroes)."

Yes, as @Debasis said in the comment quoted above, that is the correct answer.
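
For anyone who wants to see this concretely, here is a minimal sketch of how it might look, not a definitive implementation. It assumes documents are already tokenized, stores each vector sparsely as a dict keyed by term (so the "vocabulary-sized" vector never has to be materialized), and uses the usual Rocchio form: alpha times the original query, plus beta times the centroid of the relevant documents, minus gamma times the centroid of the non-relevant ones. The function names, the toy documents, the idf variant and the default weights (1.0, 0.75, 0.15) are all my own illustrative choices. Dropping terms whose weight ends up non-positive is a common convention, not something required by the algorithm.

```python
import math
from collections import Counter, defaultdict


def tfidf_vectors(docs):
    """Return sparse {term: weight} vectors over the whole vocabulary, plus the idf table."""
    n = len(docs)
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for doc in docs for term in set(doc))
    # Assumed idf variant: log(N / df) + 1, so no term gets weight zero.
    idf = {term: math.log(n / df[term]) + 1.0 for term in df}
    vectors = [
        {term: tf * idf[term] for term, tf in Counter(doc).items()}
        for doc in docs
    ]
    return vectors, idf


def rocchio(query, relevant, non_relevant, alpha=1.0, beta=0.75, gamma=0.15):
    """Move the query towards the relevant centroid and away from the non-relevant one."""
    updated = defaultdict(float)
    for term, weight in query.items():
        updated[term] += alpha * weight
    for group, coeff in ((relevant, beta), (non_relevant, -gamma)):
        if not group:
            continue  # an empty feedback set contributes nothing
        for vec in group:
            for term, weight in vec.items():
                updated[term] += coeff * weight / len(group)
    # Convention assumed here: keep only terms with positive weight.
    return {term: w for term, w in updated.items() if w > 0}


# Toy collection (hypothetical): the first two documents are judged relevant.
docs = [
    ["rocchio", "relevance", "feedback"],
    ["relevance", "feedback", "query", "expansion"],
    ["cooking", "pasta", "query"],
]
vectors, idf = tfidf_vectors(docs)
query = {"query": idf["query"]}  # one-term query in the same vocabulary space
expanded = rocchio(query, relevant=vectors[:2], non_relevant=vectors[2:])
print(sorted(expanded, key=expanded.get, reverse=True))
```

Because the vectors live in the full vocabulary space, terms such as "relevance" and "feedback" enter the expanded query even though the original query never mentioned them, which is exactly what a query-terms-only representation would prevent.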

vcucu