By way of introduction: I am fairly new to Python and mainly know how to use pandas for data analysis.

I currently have 2 lists of 100+ entries, "Keywords" and "Groups".

I would like to generate an output (ideally a pandas DataFrame) in which every entry of the list "Keywords" is assigned the closest entry of the list "Groups", as measured by Levenshtein distance.

Thank you for your support!

Roberto Bertinetti

1 Answer

from editdistance import eval as levenshtein
import pandas as pd

keywords = ["foo", "foe", "bar", "baz"]
groups = ["foo", "bar"]

assigned_groups = [min(groups, key=lambda g: levenshtein(g, k))
                   for k in keywords]

df = pd.DataFrame({"Keyword": keywords, "Group": assigned_groups})
# With current pandas, columns keep the dict's insertion order:
#   Keyword Group
# 0     foo   foo
# 1     foe   foo
# 2     bar   bar
# 3     baz   bar

This uses the editdistance package, which you can install with pip install editdistance.

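Note that this approach performs O(mn) distance computations, where m is the number of keywords and n the number of groups. Each individual computation in turn takes time proportional to the product of the two string lengths.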
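If installing a third-party package is not an option, the same idea can be sketched with the standard library only. This is a minimal illustration, not what editdistance does internally: the levenshtein function below is a plain two-row dynamic-programming version, and closest_groups and the "Distance" field are names made up for this example.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance between a and b (insertions, deletions, substitutions)."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to the empty prefix of b
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # delete ca
                current[j - 1] + 1,            # insert cb
                previous[j - 1] + (ca != cb),  # substitute (free if equal)
            ))
        previous = current
    return previous[-1]


def closest_groups(keywords, groups):
    """Hypothetical helper: pair each keyword with its nearest group."""
    rows = []
    for k in keywords:
        best = min(groups, key=lambda g: levenshtein(g, k))
        rows.append({"Keyword": k, "Group": best,
                     "Distance": levenshtein(best, k)})
    return rows


rows = closest_groups(["foo", "foe", "bar", "baz"], ["foo", "bar"])
```

The resulting list of dicts can be passed straight to pd.DataFrame(rows). On ties, min returns the first group with the smallest distance, so the order of the groups list decides which one wins. For 100+ entries per list this pure-Python version should be fast enough, though a compiled package such as editdistance will be noticeably quicker.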

Graipher