Calculate Levenshtein distance using pandas DataFrames

Question

I'm trying to calculate the Levenshtein distance for the following pandas DataFrame. I'm using the jellyfish package for it.

In [22]: df = pd.DataFrame({'id' : [1,2,3,4,5,6,7],
                'path'  : ["abc,cde,eg,ba","abc,cde,ba","abc,yz,zx,eg","abc,cde,eg,ba","abc,cde","abc","cde,eg,ba"]})

In [23]: df
Out[23]: 
   id           path
0   1  abc,cde,eg,ba
1   2     abc,cde,ba
2   3   abc,yz,zx,eg
3   4  abc,cde,eg,ba
4   5        abc,cde
5   6            abc
6   7      cde,eg,ba

The following is my implementation.

In [18]: d = {'abc':'1', 'cde':'2', 'eg':'3', 'ba':'4', 'yz':'5', 'zx':'6'}

In [19]: d
Out[19]: {'abc': '1', 'ba': '4', 'cde': '2', 'eg': '3', 'yz': '5', 'zx': '6'}

In [20]: a = [jellyfish.levenshtein_distance(*map(d.get, item)) for item in itertools.combinations(d,2)]

In [21]: a
Out[21]: [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]

Why doesn't it compare the strings in the pairs listed below, and why does it return 1 for every pair?

In [22]: list(itertools.combinations(d,2))
Out[22]: 
[('cde', 'abc'),
 ('cde', 'ba'),
 ('cde', 'eg'),
 ('cde', 'yz'),
 ('cde', 'zx'),
 ('abc', 'ba'),
 ('abc', 'eg'),
 ('abc', 'yz'),
 ('abc', 'zx'),
 ('ba', 'eg'),
 ('ba', 'yz'),
 ('ba', 'zx'),
 ('eg', 'yz'),
 ('eg', 'zx'),
 ('yz', 'zx')]

Are you working in a dataframe or just a regular dictionary? — Woody Pride, May 25 '14 at 04:21
Well I don't see the relationship between the dataframe column 'path' and the distance you want to compute. Moreover, do you want to compute distance between keys of the dictionary or its associated values. — Guillaume Jacquenot, May 25 '14 at 10:04

score 0 · Answer 1 · answered May 25 '14 at 04:24

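I don't really understand the relationship between your DataFrame and the implementation, but the list comprehension is not doing what you expect. `map(d.get, item)` replaces each key in the pair with its value from `d`, so you are comparing the one-character strings '1' to '6' rather than the keys; any two different single characters are exactly one substitution apart, which is why every distance is 1. Is the following what you are looking for?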

a = [jellyfish.levenshtein_distance(x[0], x[1]) for x in itertools.combinations(d, 2)]
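
To see the difference between comparing values and comparing keys without installing anything, here is a minimal sketch. It assumes a hand-written `levenshtein` helper (hypothetical, not part of jellyfish) standing in for `jellyfish.levenshtein_distance`; with jellyfish installed, the library call could be substituted for it.

```python
import itertools


def levenshtein(a: str, b: str) -> int:
    """Dynamic-programming edit distance.

    Hypothetical stand-in for jellyfish.levenshtein_distance, written
    here only so the example runs with the standard library alone.
    """
    prev = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


d = {'abc': '1', 'cde': '2', 'eg': '3', 'ba': '4', 'yz': '5', 'zx': '6'}

# The question's approach: d.get maps each key to its one-character value,
# so every comparison is between two different single characters.
by_value = [levenshtein(*map(d.get, pair))
            for pair in itertools.combinations(d, 2)]

# Comparing the keys themselves, as in the answer's list comprehension.
by_key = {pair: levenshtein(*pair)
          for pair in itertools.combinations(d, 2)}

print(by_value)
print(by_key)
```

Storing the key-based results in a dictionary keyed by the pair keeps each distance next to the strings it was computed from, which a bare list does not.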
