
I have 2 queries:

    query1:你好世界
    query2:你好

When I run this code using the Python library Levenshtein:

    from Levenshtein import distance, hamming, median
    lev_edit_dist = distance(query1, query2)
    print lev_edit_dist

I get an output of 12. How is the value 12 derived?

In terms of stroke differences, there's definitely more than 12.

jxn

1 Answer


According to its documentation, it supports Unicode:

It supports both normal and Unicode strings, but can't mix them, all arguments to a function (method) have to be of the same type (or its subclasses).

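You need to make sure the Chinese characters are passed as unicode objects rather than byte strings, though. In Python 2 a plain string literal is a byte string, and in UTF-8 each of these Chinese characters takes three bytes, so the distance is counted in bytes instead of characters: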

In [1]: from Levenshtein import distance, hamming, median

In [2]: query1 = '你好世界'

In [3]: query2 = '你好'

In [4]: print distance(query1,query2)
6

In [5]: print distance(query1.decode('utf8'),query2.decode('utf8'))
2
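
To see where the two numbers come from, here is a minimal sketch in Python 3 (the session above is Python 2). Note that `edit_distance` is a hand-written helper for illustration only, not the library's implementation; it is assumed here just so the example runs without installing anything. It compares the same two queries once as UTF-8 bytes and once as characters:

```python
def edit_distance(a, b):
    """Textbook dynamic-programming Levenshtein distance over two sequences.

    Illustrative helper only (an assumption for this sketch), not the
    code used by the Levenshtein library.
    """
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, x in enumerate(a, start=1):
        curr = [i]  # deleting the first i items of a
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


query1 = '你好世界'
query2 = '你好'

# UTF-8 encodes each of these characters as 3 bytes.
raw1 = query1.encode('utf-8')
raw2 = query2.encode('utf-8')

print(len(raw1), len(raw2))           # 12 6
print(edit_distance(raw1, raw2))      # 6 (six bytes deleted)
print(edit_distance(query1, query2))  # 2 (two characters deleted)
```

The byte-level result counts the six bytes that make up 世界, while the character-level result counts the two characters themselves. Neither measures strokes; the distance only counts insertions, deletions and substitutions of whole sequence items.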
Fabricator