Document Query similarity for very short documents

Question

I am working on a project that incorporates a basic implementation of the vector space model. A collection of documents d1...dn forms the columns of the term-document matrix, and the rows represent the words in the collection. I use standard tf-idf scoring with cosine similarity to calculate the similarity between a query and a document.
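For readers who want something concrete, the setup described above could be sketched roughly as follows. This is a minimal illustration, not my actual code: the whitespace tokenizer, the raw-count tf, the log(N/df) idf, the toy documents, and the helper names `tfidf_vector`, `build_index` and `cosine` are all assumptions made for the example.

```python
import math
from collections import Counter

def tfidf_vector(tokens, idf):
    """Sparse tf-idf vector: term -> raw count * idf (assumed weighting)."""
    tf = Counter(tokens)
    return {term: count * idf[term] for term, count in tf.items() if term in idf}

def build_index(docs):
    """Return the idf table and one sparse vector per document.

    Each vector corresponds to one column of the term-document matrix.
    """
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents containing each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    idf = {term: math.log(n_docs / count) for term, count in df.items()}
    return idf, [tfidf_vector(tokens, idf) for tokens in tokenized]

def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

# Toy collection, purely illustrative.
docs = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "dogs and cats make good pets",
]
idf, doc_vectors = build_index(docs)
query_vector = tfidf_vector("cat dog".split(), idf)
scores = [cosine(query_vector, d) for d in doc_vectors]
```

Here `scores` holds one similarity per document, and ranking is just a sort on those values.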

My question is: which distance metric can "tackle" similarity between short documents? For example, a document containing a single word that is part of the query will score very high under cosine similarity, since the norm of such a document is very small. How can I "punish" such documents, which are obviously irrelevant?
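The effect can be reproduced with a few hand-picked weights. The sketch below uses made-up tf-idf values (the terms, the weights, and the `cosine` helper are assumptions chosen for illustration) to show a single-word page outranking a longer page that matches both query terms.

```python
import math

def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

# Hypothetical two-term query with equal weights.
query = {"python": 1.0, "tutorial": 1.0}

# A page consisting of a single word that happens to be a query term.
short_doc = {"python": 2.0}

# A longer page that matches both query terms but also has eight other terms.
long_doc = {"python": 2.0, "tutorial": 2.0}
long_doc.update({"other%d" % i: 2.0 for i in range(8)})

short_score = cosine(query, short_doc)  # 1 / sqrt(2), about 0.707
long_score = cosine(query, long_doc)    # 4 / (sqrt(2) * sqrt(40)), about 0.447
```

Because cosine normalizes by the document norm, the single-word page's score does not depend on the size of its one weight: its vector lies entirely along a query axis, so it scores 1/sqrt(2) however little content the page has.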

Question: should a single-word document be considered a document? If so, why? Another question: how big is your data set, and what percentage of it is single-word/"short" documents? Yet another question: if I have two documents, one that says "the dog" and the other "a canine", should they be similar in your document similarity task? — alvas, Jul 10 '13 at 07:35
Answers: 1.) A single-word document is still considered a document. The reason is that a document in my context is actually a webpage, which has other "features" besides raw HTML text. 2.) The dataset includes a few thousand documents, of which ~10% are short. 3.) Words such as "dog" and "canine" need not be similar for my application, although this would be nice. I believe such lexical connections could be taken into account using WordNet, although in a "web" context there is a lot of slang, so in my opinion this is a completely different problem. — Leeor, Jul 10 '13 at 07:53
Can you give a few examples of the short documents in your dataset? — alvas, Jul 10 '13 at 08:04
It should penalize both examples, and not (or less so) long pages (such as a blog). — Leeor, Jul 10 '13 at 09:07
Do you really need a distance function (doc1, doc2) => dist? Or would a classifier (doc) => cluster_of_docs be sufficient for your app? — Blacksad, Jul 10 '13 at 17:23
