
I want to rank 100 documents by similarity. For example, 10 documents might be similar to each other (A, A', A'', A''', ...), and another set of 10 might be similar to each other (B, B', B'', B''', ...). The documents should then be ranked as A, A', A'', A''', ..., B, B', B'', B''', ... and so on.

The similarity metric is based on word usage. The use case after ranking is to arrange the documents for reading, so that similar documents are read together: A, A', A'', ..., B, B', B'', ..., Z, Z', Z''.

Can I use TF-IDF to achieve this ranking? Is there a C library for doing this?

Hemanthkumar

1 Answer


A couple of questions:

  1. What type of similarity metric are you using?
  2. Can a document appear in A and B?

One metric you can use is the words each document contains. You can calculate TF-IDF weights for every document and then query the collection with key phrases.

For example, to find the set of documents that talk about programming, you can search all the documents with the query:

programming code coding

The resulting set will be the documents that are similar via these keywords. Note that the same document can appear in the results of more than one query.
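To make the query idea concrete, here is a minimal sketch in plain Python (standard library only). The function names `tokenize`, `tf_idf_vectors`, and `search` are hypothetical, the tokenizer is a naive whitespace split, and the weighting is one common TF-IDF variant (relative term frequency times `log(N / df)`); other variants would work too.

```python
import math
from collections import Counter


def tokenize(text):
    # Naive tokenizer (assumption): lowercase and split on whitespace.
    return text.lower().split()


def tf_idf_vectors(docs):
    """Return one {word: tf-idf weight} dict per document."""
    tokenized = [tokenize(d) for d in docs]
    n = len(tokenized)
    # Document frequency: in how many documents each word occurs.
    df = Counter(w for toks in tokenized for w in set(toks))
    vectors = []
    for toks in tokenized:
        counts = Counter(toks)
        vectors.append(
            {w: (c / len(toks)) * math.log(n / df[w]) for w, c in counts.items()}
        )
    return vectors


def search(query, docs):
    """Return indices of matching documents, best score first."""
    vectors = tf_idf_vectors(docs)
    terms = tokenize(query)
    scored = [(sum(vec.get(t, 0.0) for t in terms), i) for i, vec in enumerate(vectors)]
    # Sort by descending score, then by index; drop documents that score zero.
    return [i for score, i in sorted(scored, key=lambda s: (-s[0], s[1])) if score > 0]
```

Calling `search("programming code coding", docs)` returns the positions of the documents that mention any of those words, ordered by the summed TF-IDF weight of the matching terms. A word that occurs in every document gets weight zero under this variant, so it never contributes to a score.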

I'm not too sure about C libraries, but in Python you can use TextBlob to help calculate TF-IDF. The computation is simple enough that you could probably build it from scratch.
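As a rough illustration of building this from scratch for the reading-order use case, the sketch below computes TF-IDF vectors, compares documents with cosine similarity, and then greedily picks the most similar unread document next. The greedy ordering is an assumption added here for illustration, not something proposed in this answer (clustering the vectors would be another option), and the names `tf_idf`, `cosine`, and `reading_order` are hypothetical. It uses only the Python standard library, so the same logic should port to C without much trouble.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {word: tf-idf weight} dict per document."""
    tokenized = [d.lower().split() for d in docs]  # naive whitespace tokenizer
    n = len(tokenized)
    df = Counter(w for toks in tokenized for w in set(toks))
    return [
        {w: (c / len(toks)) * math.log(n / df[w]) for w, c in Counter(toks).items()}
        for toks in tokenized
    ]


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * b.get(w, 0.0) for w, weight in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def reading_order(docs):
    """Greedy ordering: start at document 0, then always read the
    remaining document most similar to the one just read."""
    vectors = tf_idf(docs)
    if not vectors:
        return []
    order = [0]
    remaining = set(range(1, len(docs)))
    while remaining:
        last = vectors[order[-1]]
        # Ties are broken in favour of the lowest index.
        nxt = max(sorted(remaining), key=lambda i: cosine(last, vectors[i]))
        order.append(nxt)
        remaining.remove(nxt)
    return order


docs = [
    "cats purr cats nap",
    "stocks fell markets closed",
    "kittens cats purr",
    "markets rallied stocks rose",
]
print(reading_order(docs))  # [0, 2, 1, 3]
```

Each document appears exactly once in the result, and the two cat documents end up next to each other, followed by the two market documents. The greedy walk costs O(n²) similarity computations, which is negligible for 100 documents.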

Warden
1) The similarity metric is based on word usage. 2) No, a document can appear only once. After ranking, the use case is to arrange the documents for reading so that similar documents are read together, like A, A', A'', ..., B, B', B'', ..., Z, Z', Z''. – Hemanthkumar Feb 24 '16 at 06:01