
I've been asked to guess the user's intention when part of the expected data is missing. For example, if I expect either "very well" or "not very well" but receive only "not", then I should flag it as "not very well".

The Levenshtein distance between "not" and "very well" is 9, while the distance between "not" and "not very well" is 10, so the wrong candidate comes out as the closer match. I think I'm actually trying to drive a screw with a wrench, but our team has already agreed to use Levenshtein for this case.

Given the problem above, is there any way I can make some sense out of it by changing the insertion, replacement, and deletion costs?

P.S. I'm not looking for a hack for this particular example. I want something that generally works as expected and also gives a better result in cases like this one.

Mahdi
  • Soundex might be a better algorithm: https://en.wikipedia.org/wiki/Soundex. Both "not" and "cup" have the same Levenshtein distance. IMO, "if (str.match(/^\s*[nN]/)) {str='not very well'} else {str='very well'}" is simpler. – glenn jackman Feb 12 '14 at 14:56
  • @glennjackman I agree with you 100%. That's what I proposed, but the argument was that it might not work as expected with languages other than English. Thanks anyway, I will bring it up again with our team. – Mahdi Feb 12 '14 at 15:34

1 Answer


With a cost of 1 per insertion or deletion and 2 per replacement, the Levenshtein distance between "not" and "very well" is actually 12. The alignment is:

------not
very well

So there are 6 insertions with a total cost of 6 (cost 1 for each insertion), and 3 replacements with a total cost of 6 (cost 2 for each replacement). The total cost is 12.

The Levenshtein distance between "not" and "not very well" is 10. The alignment is:

not----------
not very well

This involves only 10 insertions and no replacements, so you can choose "not very well" as the best match.

The cost and alignment can be computed with htql for Python:

import htql
a=htql.Align()
a.align('not', 'very well')
# (12.0, ['------not', 'very well'])
a.align('not', 'not very well')
# (10.0, ['not----------', 'not very well'])
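
If htql is not available, the same costs can be reproduced with a plain dynamic-programming sketch. This is only an illustration, not htql's implementation: the names weighted_levenshtein and best_match, and the parameters ins_cost, del_cost and sub_cost, are made up for this example. The defaults (1 for insertion and deletion, 2 for replacement) are the costs assumed in the alignments above.

```python
def weighted_levenshtein(source, target, ins_cost=1, del_cost=1, sub_cost=2):
    """Edit distance from source to target with configurable operation costs.

    Hypothetical helper for illustration; defaults mirror the costs used above.
    """
    # prev[j] = cost of turning the already-processed prefix of source into target[:j]
    prev = [j * ins_cost for j in range(len(target) + 1)]
    for i, s_char in enumerate(source, start=1):
        curr = [i * del_cost]  # turning source[:i] into the empty string
        for j, t_char in enumerate(target, start=1):
            # A match costs nothing; a mismatch costs one replacement.
            match_or_sub = prev[j - 1] + (0 if s_char == t_char else sub_cost)
            curr.append(min(match_or_sub,
                            prev[j] + del_cost,       # delete s_char
                            curr[j - 1] + ins_cost))  # insert t_char
        prev = curr
    return prev[-1]


def best_match(received, candidates, **costs):
    """Return the candidate with the lowest weighted distance to received.

    On a tie, the candidate listed first wins (behaviour of min()).
    """
    return min(candidates, key=lambda c: weighted_levenshtein(received, c, **costs))


candidates = ['very well', 'not very well']
print(weighted_levenshtein('not', 'very well'))      # 12
print(weighted_levenshtein('not', 'not very well'))  # 10
print(best_match('not', candidates))                 # not very well
# With unit replacement cost the ranking flips, as in the question:
print(best_match('not', candidates, sub_cost=1))     # very well
```

Setting the replacement cost to the sum of the insertion and deletion costs means a replacement is never cheaper than deleting one character and inserting another, so only characters that really match reduce the distance. Whether that holds up for other inputs and languages would still need to be checked against real data.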
seagulf