
I'm trying to build software that compares two text documents intelligently, i.e. checking how closely the text matches, not like diff. I have searched quite a bit on Google and found two things: graphs and TF-IDF.

But I'm confused between the two: I don't know which one is better. Also, is there any other technique for matching text documents?

Akshay Chordiya

1 Answer


Have you looked at measuring document similarity by cosine distance? Cosine similarity is a measure of similarity between two vectors of an inner product space that measures the cosine of the angle between them: http://en.wikipedia.org/wiki/Cosine_similarity

If you have documents A and B, you can create a term vector for each. Term vector A contains the words from document A together with each word's frequency in that document; instead of raw word frequency you can use TF-IDF weighting. The same goes for document B. Once you have term vectors A and B, you can calculate the cosine similarity between them, which represents the similarity of documents A and B. Before creating the term vectors, you should do some pre-processing, such as filtering out stop-words.
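The steps above (tokenize, drop stop-words, build term-frequency vectors, then take the cosine of the angle between them) can be sketched in plain Python. The tiny stop-word list and the sample documents here are made-up placeholders, not part of the original answer:

```python
import math
import re
from collections import Counter

# Toy stop-word list for illustration; real systems use a much larger one.
STOP_WORDS = {"a", "an", "the", "is", "of", "and", "or", "to", "in"}

def term_vector(text):
    """Tokenize, filter stop-words, and count raw term frequencies."""
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(w for w in words if w not in STOP_WORDS)

def cosine_similarity(vec_a, vec_b):
    """Cosine of the angle between two sparse term vectors."""
    common = set(vec_a) & set(vec_b)
    dot = sum(vec_a[t] * vec_b[t] for t in common)
    norm_a = math.sqrt(sum(f * f for f in vec_a.values()))
    norm_b = math.sqrt(sum(f * f for f in vec_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

doc_a = "The cat sat on the mat"
doc_b = "The cat lay on the mat"
print(cosine_similarity(term_vector(doc_a), term_vector(doc_b)))  # → 0.75
```

The same two functions work unchanged if you replace the raw counts in the vectors with TF-IDF weights; only the vector-building step changes.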

Kasun
  • Excellent answer. But can we use AI? – Akshay Chordiya Feb 20 '15 at 11:46
  • Do you mean measuring semantic similarity? i.e. similarity of two documents based on meaning or semantic content? You can measure semantic similarity by using ontologies to define the distance between terms/concepts which are inside the documents. – Kasun Feb 23 '15 at 03:40
  • Can we use TF-IDF? Because IDF is log(total docs / no. of docs with the term), so in a case with two documents it's log(2/2), which is 0, right? – Ravindu Jul 03 '18 at 05:53
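Ravindu's observation is correct for the plain log(N/df) formula: a term that appears in every document of a two-document collection gets IDF = log(2/2) = 0. A common workaround, which scikit-learn's TfidfVectorizer applies by default (its smooth_idf option), is to smooth the formula so the weight never collapses to zero:

```python
import math

def smoothed_idf(n_docs, doc_freq):
    """Smoothed IDF: log((1 + N) / (1 + df)) + 1, as used by
    scikit-learn's TfidfVectorizer with smooth_idf=True."""
    return math.log((1 + n_docs) / (1 + doc_freq)) + 1

# Plain log(2/2) would be 0, but the smoothed variant stays positive:
print(smoothed_idf(2, 2))  # → 1.0   (log(3/3) + 1)
print(smoothed_idf(2, 1))  # ≈ 1.405 (log(3/2) + 1)
```

With this smoothing, TF-IDF-weighted cosine similarity remains usable even when comparing just two documents, although with so few documents the IDF factor carries little discriminating information.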