6

I am looking for ways to improve the accuracy of the TF-IDF weighting scheme in string matching (similarity). The main issue is that TF-IDF is sensitive to typographical errors in strings, and most large datasets contain typos. I realised that variants of edit distance (character-based similarity metrics such as Levenshtein, affine gap, Jaro and Jaro-Winkler) are suitable for computing similarity between strings that contain typographical errors, but not when the words in the strings are out of order.
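To make the trade-off concrete, here is a minimal sketch in Python. It assumes whitespace tokenization and uses cosine similarity over raw token counts as a simplified stand-in for TF-IDF (the IDF weighting is omitted, since the failure mode comes from exact token matching). The function names and example strings are illustrative only.

```python
import math
from collections import Counter


def token_cosine(a, b):
    """Cosine similarity over raw token counts (tokens must match exactly)."""
    ca, cb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(ca[t] * cb[t] for t in ca)
    norm = (math.sqrt(sum(v * v for v in ca.values()))
            * math.sqrt(sum(v * v for v in cb.values())))
    return dot / norm if norm else 0.0


def levenshtein(a, b):
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, 1):
        cur = [i]
        for j, ch_b in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                   # deletion
                           cur[j - 1] + 1,                # insertion
                           prev[j - 1] + (ch_a != ch_b))) # substitution
        prev = cur
    return prev[-1]


def edit_similarity(a, b):
    """Edit distance rescaled to [0, 1], where 1 means identical strings."""
    longest = max(len(a), len(b))
    return 1.0 - levenshtein(a, b) / longest if longest else 1.0


# Token-based matching ignores word order but gives no credit for typos;
# character-based matching tolerates typos but is penalised by reordering.
print(token_cosine("john smith", "smith john"), token_cosine("john smith", "jhon smiht"))
print(edit_similarity("john smith", "smith john"), edit_similarity("john smith", "jhon smiht"))
```

With these inputs the token-based score is unaffected by the reordering but drops to zero for the misspelled pair, while the edit-based score ranks the misspelled pair above the reordered one.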

Hence I would like to use the typo tolerance of edit distance to enhance the accuracy of TF-IDF.

Any ideas on how to address this challenge will be highly appreciated.

Thanks in advance.

user2274879

1 Answer

0

A 2003 paper by CMU researchers explains how to combine TF-IDF with Jaro-Winkler (a hybrid measure they call SoftTFIDF): https://www.cs.cmu.edu/~pradeepr/papers/ijcai03.pdf

Their Java code is also available on SourceForge as the SecondString project: https://sourceforge.net/projects/secondstring/

Here is a link to Javadocs: http://secondstring.sourceforge.net/javadoc/

The secondString project page: http://secondstring.sourceforge.net/
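The general idea of combining the two can be sketched in a few lines of Python. This is a rough illustration under stated assumptions, not the SecondString implementation: the helper names are hypothetical, `difflib.SequenceMatcher.ratio()` from the standard library is swapped in for Jaro-Winkler, the IDF smoothing `log(1 + N/df)` is my own choice so that a tiny corpus does not produce zero weights, and the threshold is a tunable parameter. Tokens contribute to the score when their best match in the other string is similar enough, rather than only when they match exactly.

```python
import math
from collections import Counter
from difflib import SequenceMatcher


def tokenize(s):
    return s.lower().split()


def build_idf(corpus):
    """Smoothed IDF per token: log(1 + N / df). The smoothing is an assumption."""
    n = len(corpus)
    df = Counter(tok for doc in corpus for tok in set(tokenize(doc)))
    return {tok: math.log(1.0 + n / count) for tok, count in df.items()}


def tfidf_vector(s, idf, default_idf):
    """Unit-length TF-IDF vector; unseen tokens get default_idf."""
    tf = Counter(tokenize(s))
    weights = {t: math.log(1.0 + c) * idf.get(t, default_idf) for t, c in tf.items()}
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {t: w / norm for t, w in weights.items()} if norm else {}


def token_sim(a, b):
    """Character-level similarity in [0, 1]; stand-in for Jaro-Winkler."""
    return SequenceMatcher(None, a, b).ratio()


def soft_tfidf(s, t, idf, threshold=0.8):
    """Sum weight(w, s) * weight(v, t) * sim(w, v) over tokens w in s whose
    closest token v in t has sim(w, v) >= threshold."""
    default_idf = max(idf.values(), default=1.0)  # treat unseen tokens as rare
    vs = tfidf_vector(s, idf, default_idf)
    vt = tfidf_vector(t, idf, default_idf)
    score = 0.0
    for w, weight_s in vs.items():
        best_tok, best_sim = None, 0.0
        for v in vt:
            sim = token_sim(w, v)
            if sim > best_sim:
                best_tok, best_sim = v, sim
        if best_tok is not None and best_sim >= threshold:
            score += weight_s * vt[best_tok] * best_sim
    return score


corpus = ["john smith", "jane doe", "acme corp"]
idf = build_idf(corpus)
print(soft_tfidf("john smith", "smith john", idf))                 # reordered words
print(soft_tfidf("john smith", "jhon smiht", idf, threshold=0.7))  # typos
```

Setting `threshold=1.0` reduces this to exact-token matching, which is a quick way to compare against plain TF-IDF cosine on your own data. Note that the score as written iterates over the tokens of the first string only, so it is not guaranteed to be symmetric.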

Amin