I'm trying to calculate the Levenshtein distance between the paths in the following pandas DataFrame. I'm using the jellyfish package for it.
In [22]: df = pd.DataFrame({'id' : [1,2,3,4,5,6,7],
'path' : ["abc,cde,eg,ba","abc,cde,ba","abc,yz,zx,eg","abc,cde,eg,ba","abc,cde","abc","cde,eg,ba"]})
In [23]: df
Out[23]:
   id           path
0   1  abc,cde,eg,ba
1   2     abc,cde,ba
2   3   abc,yz,zx,eg
3   4  abc,cde,eg,ba
4   5        abc,cde
5   6            abc
6   7      cde,eg,ba
The following is my implementation:
In [18]: d = {'abc':'1', 'cde':'2', 'eg':'3', 'ba':'4', 'yz':'5', 'zx':'6'}
In [19]: d
Out[19]: {'abc': '1', 'ba': '4', 'cde': '2', 'eg': '3', 'yz': '5', 'zx': '6'}
In [20]: a = [jellyfish.levenshtein_distance(*map(d.get, item)) for item in itertools.combinations(d,2)]
In [21]: a
Out[21]: [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
Why does it not compare the strings pairwise, as in the combinations listed below, and why is every result just 1?
In [22]: list(itertools.combinations(d,2))
Out[22]:
[('cde', 'abc'),
('cde', 'ba'),
('cde', 'eg'),
('cde', 'yz'),
('cde', 'zx'),
('abc', 'ba'),
('abc', 'eg'),
('abc', 'yz'),
('abc', 'zx'),
('ba', 'eg'),
('ba', 'yz'),
('ba', 'zx'),
('eg', 'yz'),
('eg', 'zx'),
('yz', 'zx')]
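To narrow the question down, here is a minimal sketch of what I suspect is happening. It assumes that `map(d.get, item)` replaces each key with its one-character value before the distance is computed, so the call never sees strings like 'abc' or 'cde'. The `levenshtein` helper is my own stand-in (not part of jellyfish) so that the snippet runs without the package; `on_values` and `on_keys` are names made up for this illustration.

```python
import itertools


def levenshtein(a, b):
    """Plain dynamic-programming edit distance.

    Hypothetical stand-in for jellyfish.levenshtein_distance, so the
    example runs without installing the package.
    """
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


d = {'abc': '1', 'cde': '2', 'eg': '3', 'ba': '4', 'yz': '5', 'zx': '6'}

# What In [20] appears to compare: the mapped values ('1', '2', ...),
# which are distinct single characters.
on_values = [levenshtein(*map(d.get, pair))
             for pair in itertools.combinations(d, 2)]

# What I expected to be compared: the keys themselves.
on_keys = {pair: levenshtein(*pair)
           for pair in itertools.combinations(d, 2)}

print(on_values)
print(on_keys)
```

If this reading is right, every pair in `on_values` is two different one-character strings, which would explain the constant 1, while `on_keys` gives distances that vary with the actual strings.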