
I'm familiar with Python's nltk.metrics.distance module, which is commonly used to compute the edit distance of two strings.

I am interested in a function that computes such a distance not char-wise, as usual, but token-wise. By that I mean that you can only replace/add/delete whole tokens (instead of chars).

Example of regular edit distance and my desired tokenized version:

> char_dist("aa bbbb cc",
            "aa b cc")
3                              # add the 'b' character three times

> token_dist("aa bbbb cc",
             "aa b cc")
1                              # replace 'bbbb' token with 'b' token
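
In other words, token_dist is just the ordinary Levenshtein distance computed over the token sequence instead of the character sequence. A minimal sketch of what I mean (illustrative only; token_dist is my own name, not an existing API, and as said I'd prefer a tested library over this):

def token_dist(s1, s2):
    # Standard Levenshtein DP, run over whitespace tokens instead of chars
    t1, t2 = s1.split(), s2.split()
    # dp[i][j] = distance between the first i tokens of t1 and the first j of t2
    dp = [[0] * (len(t2) + 1) for _ in range(len(t1) + 1)]
    for i in range(len(t1) + 1):
        dp[i][0] = i                          # delete i tokens
    for j in range(len(t2) + 1):
        dp[0][j] = j                          # insert j tokens
    for i in range(1, len(t1) + 1):
        for j in range(1, len(t2) + 1):
            cost = 0 if t1[i - 1] == t2[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # substitution/match
    return dp[len(t1)][len(t2)]

token_dist("aa bbbb cc", "aa b cc")
> 1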

Is there already some function that can compute token_dist in Python? I'd rather use something already implemented and tested than write my own piece of code. Thanks for any tips.

petrbel

2 Answers


NLTK's edit_distance appears to work just as well with lists as with strings:

nltk.edit_distance("aa bbbb cc", "aa b cc")
> 3
nltk.edit_distance("aa bbbb cc".split(), "aa b cc".split())
> 1
dadamson
  • This answer explains it better. To put it in words: if you pass two strings to the `editdistance` function, it'll return the character-level edit distance between the strings. If you pass two lists of strings, the function returns the token/word-level edit distance. – NightFury13 Apr 30 '21 at 11:44

First, install the following:

pip install editdistance

Then the following will give you the token-wise edit distance:

import editdistance
editdistance.eval(list1, list2)

Example:

import editdistance
tokens1 = ['aa', 'bb', 'cc']
tokens2 = ['a' , 'bb', 'cc']
editdistance.eval(tokens1, tokens2)
> 1
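
Applied to the example from the question (assuming plain whitespace tokenization with split()), the same call gives the token-wise distance directly:

import editdistance
editdistance.eval("aa bbbb cc".split(), "aa b cc".split())
> 1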

For more information, please refer to:

https://github.com/aflc/editdistance

CentAu