I have been trying to implement the Rocchio algorithm. I understand the basic idea, but I struggle to put it into concrete terms. Previously I computed tf-idf so that each document containing at least one query term was represented by a vector whose length equals the number of query terms. Now I feel that I cannot represent documents in the space spanned by the query terms alone, because that would not let me "discover" other terms that the relevant documents have in common. Should I instead represent the query and the documents in a vector space over all the tokens found in the currently returned set of documents?
yes the dimension of the vectors (both docs and queries) is the vocabulary size of the collection... so these vectors are extremely sparse (most entries being zeroes)... – Debasis Mar 18 '20 at 09:18
1 Answer
"yes the dimension of the vectors (both docs and queries) is the vocabulary size of the collection... so these vectors are extremely sparse (most entries being zeroes)..."
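As @Debasis said in the comments, this is the correct answer: the query and the documents are both represented over the whole vocabulary, so Rocchio can assign weight to terms that were not in the original query.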
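To make this concrete, here is a minimal sketch of a Rocchio update over sparse vectors, assuming each vector is a plain dict mapping a term to its tf-idf weight (terms that are absent count as zero). The function name `rocchio`, the toy weights, and the values `alpha=1.0`, `beta=0.75`, `gamma=0.15` are my own illustrative choices, not something from the question, and the negative weights are simply dropped, which is a common convention rather than a requirement.

```python
from collections import defaultdict


def rocchio(query, relevant, non_relevant, alpha=1.0, beta=0.75, gamma=0.15):
    """Return an updated sparse query vector (term -> weight).

    All vectors live in the full vocabulary space; a missing key means 0.
    alpha/beta/gamma are illustrative defaults, tune them for your data.
    """
    updated = defaultdict(float)

    # Keep the original query, scaled by alpha.
    for term, weight in query.items():
        updated[term] += alpha * weight

    # Move towards the centroid of the relevant documents.
    for doc in relevant:
        for term, weight in doc.items():
            updated[term] += beta * weight / len(relevant)

    # Move away from the centroid of the non-relevant documents.
    for doc in non_relevant:
        for term, weight in doc.items():
            updated[term] -= gamma * weight / len(non_relevant)

    # Drop terms whose weight ended up zero or negative.
    return {term: w for term, w in updated.items() if w > 0}


# Toy tf-idf vectors (made-up weights, for illustration only).
query = {"jaguar": 1.0}
relevant = [
    {"jaguar": 0.5, "cat": 0.8},
    {"jaguar": 0.4, "cat": 0.6, "jungle": 0.3},
]
non_relevant = [{"jaguar": 0.6, "car": 0.9}]

new_query = rocchio(query, relevant, non_relevant)
```

In this toy example the updated query picks up "cat" and "jungle", which were not query terms, while "car" is pushed below zero and removed. That only works because the documents are represented over the full vocabulary and not just over the query terms.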

vcucu