
I'm trying to build software that compares two text documents intelligently, i.e. checking how closely the text matches, not like diff. I have searched quite a bit on Google and found two things: graphs and TF-IDF.

But I'm confused between the two: I don't know which one is better. Also, is there any other technique for matching text documents?

Akshay Chordiya

1 Answer


Have you looked at measuring document similarity by cosine distance? Cosine similarity is a measure of similarity between two vectors of an inner product space that measures the cosine of the angle between them: http://en.wikipedia.org/wiki/Cosine_similarity

If you have documents A and B, you can create a term vector for each. Term vector A contains the words from document A together with each word's frequency in that document; instead of raw word frequency you can use TF-IDF weighting. The same goes for document B. Once you have term vectors A and B, you can calculate the cosine similarity between them, which represents the similarity of documents A and B. Before creating the term vectors, you should do some pre-processing, such as filtering out stop-words.
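The steps above (tokenize, drop stop-words, build term-frequency vectors, then take the cosine of the angle between them) can be sketched in plain Python. The tiny stop-word list and the sample documents here are made-up placeholders, not part of the original answer:

```python
import math
import re
from collections import Counter

# Toy stop-word list for illustration; real systems use a much larger one.
STOP_WORDS = {"a", "an", "the", "is", "of", "and", "or", "to", "in"}

def term_vector(text):
    """Tokenize, filter stop-words, and count raw term frequencies."""
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(w for w in words if w not in STOP_WORDS)

def cosine_similarity(vec_a, vec_b):
    """Cosine of the angle between two sparse term vectors."""
    common = set(vec_a) & set(vec_b)
    dot = sum(vec_a[t] * vec_b[t] for t in common)
    norm_a = math.sqrt(sum(f * f for f in vec_a.values()))
    norm_b = math.sqrt(sum(f * f for f in vec_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

doc_a = "The cat sat on the mat"
doc_b = "The cat lay on the mat"
print(cosine_similarity(term_vector(doc_a), term_vector(doc_b)))  # → 0.75
```

The same two functions work unchanged if you replace the raw counts in the vectors with TF-IDF weights; only the vector-building step changes.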

Kasun
  • Excellent answer. But can we use AI? – Akshay Chordiya Feb 20 '15 at 11:46
  • Do you mean measuring semantic similarity? i.e. similarity of two documents based on meaning or semantic content? You can measure semantic similarity by using ontologies to define the distance between terms/concepts which are inside the documents. – Kasun Feb 23 '15 at 03:40
  • Can we use TF-IDF? Because IDF is log(total docs / no. of docs with the term), so in a case with two documents it's log(2/2), which is 0, right? – Ravindu Jul 03 '18 at 05:53
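Ravindu's observation is correct for the plain log(N/df) formula: a term that appears in every document of a two-document collection gets IDF = log(2/2) = 0. A common workaround, which scikit-learn's TfidfVectorizer applies by default (its smooth_idf option), is to smooth the formula so the weight never collapses to zero:

```python
import math

def smoothed_idf(n_docs, doc_freq):
    """Smoothed IDF: log((1 + N) / (1 + df)) + 1, as used by
    scikit-learn's TfidfVectorizer with smooth_idf=True."""
    return math.log((1 + n_docs) / (1 + doc_freq)) + 1

# Plain log(2/2) would be 0, but the smoothed variant stays positive:
print(smoothed_idf(2, 2))  # → 1.0   (log(3/3) + 1)
print(smoothed_idf(2, 1))  # ≈ 1.405 (log(3/2) + 1)
```

With this smoothing, TF-IDF-weighted cosine similarity remains usable even when comparing just two documents, although with so few documents the IDF factor carries little discriminating information.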