How to combine TF-IDF with edit distance or Jaro-Winkler distance

Question

I am looking for ways to improve the accuracy of the TF-IDF weighting scheme in string matching (similarity). The main issue is that TF-IDF is sensitive to typographical errors in strings, and most large datasets tend to have typos. I realised that variants of edit distance (character-based similarity metrics such as Levenshtein, affine gap, Jaro and Jaro-Winkler) are suitable for computing similarity between strings that contain typographical errors, but not suitable when the words in the strings are out of order.

Hence I would like to use the typo tolerance of edit distance to enhance the accuracy of TF-IDF.

Any ideas on how to address this challenge will be highly appreciated.

Thanks in advance.

https://code.google.com/p/pupsniffer/source/browse/PupSniffer/src/com/wcohen/ss/SoftTFIDF.java?r=75 — Neil McGuigan, Sep 23 '14 at 22:37

score 0 · Answer 1 · answered Sep 08 '16 at 21:00

CMU researchers published a paper in 2003 that explains how to combine TF-IDF with Jaro-Winkler (their hybrid measure is called SoftTFIDF): https://www.cs.cmu.edu/~pradeepr/papers/ijcai03.pdf

Their Java code is also available on SourceForge as the SecondString project: https://sourceforge.net/projects/secondstring/

Here is a link to the Javadocs: http://secondstring.sourceforge.net/javadoc/

The SecondString project page: http://secondstring.sourceforge.net/
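To give a rough idea of how the combination works, here is a minimal Python sketch of a SoftTFIDF-style score. It is not the SecondString code: the function names (`jaro_winkler`, `tfidf_weights`, `soft_tfidf`), the smoothed IDF formula, and the toy corpus are my own assumptions, and the threshold of 0.9 is just a common starting point. The idea is that each token in the first string is paired with its most similar token in the second string by Jaro-Winkler, and the pair contributes to the TF-IDF dot product only if that similarity exceeds the threshold.

```python
import math
from collections import Counter

def jaro(s, t):
    """Plain Jaro similarity between two strings."""
    if s == t:
        return 1.0
    if not s or not t:
        return 0.0
    window = max(max(len(s), len(t)) // 2 - 1, 0)
    s_hit, t_hit = [False] * len(s), [False] * len(t)
    matches = 0
    for i, ch in enumerate(s):
        for j in range(max(0, i - window), min(len(t), i + window + 1)):
            if not t_hit[j] and t[j] == ch:
                s_hit[i] = t_hit[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    s_seq = [c for c, hit in zip(s, s_hit) if hit]
    t_seq = [c for c, hit in zip(t, t_hit) if hit]
    transpositions = sum(a != b for a, b in zip(s_seq, t_seq)) // 2
    return (matches / len(s) + matches / len(t)
            + (matches - transpositions) / matches) / 3

def jaro_winkler(s, t, p=0.1):
    """Jaro similarity boosted for a shared prefix of up to 4 characters."""
    j = jaro(s, t)
    prefix = 0
    for a, b in zip(s[:4], t[:4]):
        if a != b:
            break
        prefix += 1
    return j + prefix * p * (1 - j)

def document_frequencies(corpus):
    """Count in how many token lists each token occurs."""
    df = Counter()
    for tokens in corpus:
        df.update(set(tokens))
    return df

def tfidf_weights(tokens, doc_freq, n_docs):
    """Unit-length TF-IDF vector. The smoothed IDF is an assumption of this sketch."""
    raw = {w: math.log(c + 1) * math.log(1 + n_docs / doc_freq.get(w, 1))
           for w, c in Counter(tokens).items()}
    norm = math.sqrt(sum(v * v for v in raw.values()))
    return {w: v / norm for w, v in raw.items()} if norm else {}

def soft_tfidf(s_tokens, t_tokens, doc_freq, n_docs, theta=0.9):
    """TF-IDF dot product where tokens match softly via Jaro-Winkler."""
    vs = tfidf_weights(s_tokens, doc_freq, n_docs)
    vt = tfidf_weights(t_tokens, doc_freq, n_docs)
    if not vs or not vt:
        return 0.0
    score = 0.0
    for w, w_weight in vs.items():
        best = max(vt, key=lambda v: jaro_winkler(w, v))  # closest token in T
        d = jaro_winkler(w, best)
        if d > theta:  # only sufficiently close tokens contribute
            score += w_weight * vt[best] * d
    return score

# Toy corpus of whitespace-tokenized names (made up for illustration).
corpus = [["john", "smith"], ["smith", "john"], ["jonh", "smith"],
          ["mary", "jones"], ["peter", "brown"]]
df, n = document_frequencies(corpus), len(corpus)
print(soft_tfidf(["jonh", "smith"], ["smith", "john"], df, n))
```

Because the score is computed over token weights, word order does not matter, and because token matching goes through Jaro-Winkler, a typo such as "jonh" still counts towards the match. Note that the score as written is not symmetric in its two arguments, so you may want to average both directions depending on your use case.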
