I am working on a project that incorporates a basic implementation of the vector space model. A collection of documents d1...dn forms the columns of the term-document matrix, and the rows represent the words in the collection. I use standard tf-idf weighting with cosine similarity to measure how close a document is to a query.
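
To make the setup concrete, here is a minimal sketch of it in plain Python. The helper names (`build_tfidf`, `query_vector`, `column`, `cosine`), the whitespace tokenizer, and the `log(N / df)` idf variant are assumptions chosen for illustration, not necessarily what my project uses.

```python
import math
from collections import Counter

def build_tfidf(docs):
    """Return (vocab, idf, matrix); matrix[i][j] is the tf-idf weight of term i in document j."""
    tokenized = [doc.lower().split() for doc in docs]   # naive whitespace tokenizer (assumption)
    vocab = sorted({t for toks in tokenized for t in toks})
    n = len(docs)
    # Document frequency: number of documents that contain the term.
    df = {t: sum(1 for toks in tokenized if t in toks) for t in vocab}
    idf = {t: math.log(n / df[t]) for t in vocab}       # one common idf variant (assumption)
    counts = [Counter(toks) for toks in tokenized]
    # Rows are terms, columns are documents.
    matrix = [[counts[j][t] * idf[t] for j in range(n)] for t in vocab]
    return vocab, idf, matrix

def query_vector(query, vocab, idf):
    """Project a query onto the same term axes; terms outside the vocabulary are ignored."""
    counts = Counter(query.lower().split())
    return [counts[t] * idf[t] for t in vocab]

def column(matrix, j):
    """Extract document j as a vector over the vocabulary."""
    return [row[j] for row in matrix]

def cosine(u, v):
    """Cosine similarity; defined as 0.0 when either vector has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    if nu == 0 or nv == 0:
        return 0.0
    return dot / (nu * nv)

docs = [
    "this domain is for sale",
    "parked domain",
    "a long blog post about cooking and recipes",
]
vocab, idf, matrix = build_tfidf(docs)
q = query_vector("cooking recipes", vocab, idf)
scores = [cosine(q, column(matrix, j)) for j in range(len(docs))]
```

Here `scores[j]` is the similarity between the query and document `j`; only the third document shares terms with the query, so it is the only one with a non-zero score.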

My question is: which distance metric can handle similarity for short documents? For example, a document containing a single word that also appears in the query will score very high under cosine similarity, because the norm of such a document is very small. How can I penalize such documents, which are obviously irrelevant?
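
The effect can be reproduced with toy vectors. The following is only an illustrative sketch: the four-term vocabulary, the unit weights, and the names `short_doc` and `long_doc` are made up, and real tf-idf weights would differ.

```python
import math

def cosine(u, v):
    """Cosine similarity; defined as 0.0 when either vector has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    if nu == 0 or nv == 0:
        return 0.0
    return dot / (nu * nv)

# Hypothetical term axes: ["domain", "blog", "post", "recipes"]
query     = [1.0, 0.0, 0.0, 0.0]   # query consisting of the single term "domain"
short_doc = [1.0, 0.0, 0.0, 0.0]   # one-word page containing only "domain"
long_doc  = [1.0, 1.0, 1.0, 1.0]   # longer page that mentions "domain" once

short_score = cosine(query, short_doc)   # 1.0: the tiny norm makes the match look perfect
long_score  = cosine(query, long_doc)    # 0.5: the same match is diluted by the larger norm

# Cosine ignores magnitude: repeating the single word does not change the score.
repeated_score = cosine(query, [5.0, 0.0, 0.0, 0.0])   # still 1.0
```

The one-word page outranks the longer page even though both contain the query term exactly once, which is the behaviour I would like to avoid.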

Leeor
  • Question: should a single-word document be considered a document? If so, why? Another question: how big is your data set, and what percentage of it consists of single-word/"short" documents? And one more: if I have two documents, one saying "the dog" and the other "a canine", should they be similar in your document similarity task? – alvas Jul 10 '13 at 07:35
  • Answers: 1.) A single-word document is still considered a document. The reason is that a document in my context is actually a webpage, which has other "features" besides raw HTML text. 2.) The dataset includes a few thousand documents, ~10% of which are short. 3.) Words such as "dog" and "canine" need not be similar for my application, although this would be nice. I believe such lexical connections could be taken into account using WordNet, but web text contains a lot of slang, so in my opinion that is a completely different problem. – Leeor Jul 10 '13 at 07:53
  • can you give a few examples of the short documents in your dataset? – alvas Jul 10 '13 at 08:04
  • examples: "this domain is for sale", "parked domain" – Leeor Jul 10 '13 at 08:18
  • should your system penalize "parked domain" in this case? – alvas Jul 10 '13 at 08:53
  • it should penalize both examples and not (or less so) long pages (such as a blog) – Leeor Jul 10 '13 at 09:07
  • Do you really need a distance function (doc1, doc2) => dist ? Or would a classifier (doc) => cluster_of_docs be sufficient for your app ? – Blacksad Jul 10 '13 at 17:23
  • I need a distance function... – Leeor Jul 11 '13 at 04:29
