
I have a requirement to rank keywords in a document. I have only one document, so I don't know how much TF-IDF would help. I would like to rank the keywords based on their proximity and relevance to the document. Could I use term vectors for this, and if so, how?

Thanks

Yogi
  • If you have only one document, tf-idf will not, in general, help. The only way to find important terms is to understand the discourse in the document. And that is not an easy thing to do (unless you want to implement a bunch of state-of-the-art research methods). – Chthonic Project Feb 01 '14 at 16:14
  • Where are you storing the documents? SQL Server 2012 now has a Semantic Index that can parse different document types. – SteveB Feb 10 '14 at 10:25

1 Answer


In general, to obtain the "proximity" between several documents based on their terms, or between terms based on several documents, you could use a latent semantic space. Look up Latent Semantic Analysis (LSA).

However, given that you have only one document, you cannot do that, because you have nothing to compare it against. It is like asking how many standard deviations a value lies from the mean when that value is the only number you have. One way around this is to obtain more data: if the topic of your document isn't too obscure, you could try scraping related documents off the internet.
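To make the "obtain new data" idea concrete, here is a minimal sketch, assuming you have gathered a few reference documents by whatever means. It ranks the terms of your single document by plain TF-IDF, with the document frequencies estimated from the reference set plus the document itself. The function names and the toy texts are hypothetical, and this is ordinary TF-IDF rather than LSA, so treat it as a starting point only.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and keep runs of ASCII letters as tokens."""
    return re.findall(r"[a-z]+", text.lower())


def rank_keywords(document, reference_docs):
    """Rank the terms of `document` by TF-IDF.

    IDF is estimated from `reference_docs` plus the document itself,
    so every term has a document frequency of at least 1.
    """
    doc_tokens = tokenize(document)
    tf = Counter(doc_tokens)
    corpus = [set(doc_tokens)] + [set(tokenize(d)) for d in reference_docs]
    n_docs = len(corpus)

    scores = {}
    for term, count in tf.items():
        df = sum(1 for doc in corpus if term in doc)
        idf = math.log(n_docs / df)
        scores[term] = (count / len(doc_tokens)) * idf

    # Highest score first; ties broken alphabetically for a stable order.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))


# Hypothetical toy data: one document of interest and three reference texts.
document = "the solar panel converts sunlight and the solar cell stores energy"
reference_docs = [
    "the cat sat on the mat and the dog barked",
    "the stock market fell and energy prices rose",
    "the team won the match and the crowd cheered",
]

ranking = rank_keywords(document, reference_docs)
for term, score in ranking[:5]:
    print(f"{term}: {score:.3f}")
```

Words that appear in every reference text (such as "the") get a score of zero, while terms specific to your document rise to the top. The quality of the ranking depends heavily on how well the reference documents represent general language or your domain.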

If this isn't what you are looking for, perhaps you could explain the problem more specifically in terms of your desired outcome, rather than the method you think may be applicable.

Cheers

IVR