I want to add a new column to my dataframe containing the word from a list that is most similar to the value in an existing pandas column (in this case col1). My actual dataframe is:

[Image: current dataframe]

And I have the following list:

['Product_A1', 'Product_B1', 'Product_C']

And my output should be:

[Image: expected output]

For that I am using the following code (I'm just printing the results):

import pandas as pd
import difflib
d = {'col1': ['Product_Z1', 'Product_A', 'Product_B'], 'col2': [1, 2, 3]}
df = pd.DataFrame(data=d)
products_list = ['Product_A', 'Product_B', 'Product_C']
print(difflib.get_close_matches(df['col1'], products_list))

However, I always get an empty list...

What am I doing wrong?

Thanks!

Pedro Alves
  • If there is a certain pattern, you could compare a subset of the strings or use a regex. – jimfawkes Nov 22 '19 at 20:29
  • Answer below by Hugo Salvador has embedded in it the answer to your "what am I doing wrong" question, but just for clarification: the first arg to `get_close_matches` needs to be a string, not a list of strings. – RishiG Nov 22 '19 at 20:37

1 Answer

Try this:

df['col3'] = df['col1'].apply(lambda x : difflib.get_close_matches(x, products_list, cutoff=0.9))
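Note that the line above stores a list in each row (an empty one when nothing reaches the cutoff). If you would rather have a single string per row, here is a minimal sketch. The helper name `best_match` and the choice of `None` for "no match" are my own assumptions, not something from the question:

```python
import difflib
import pandas as pd

df = pd.DataFrame({'col1': ['Product_Z1', 'Product_A', 'Product_B'],
                   'col2': [1, 2, 3]})
products_list = ['Product_A', 'Product_B', 'Product_C']

def best_match(word, candidates, cutoff=0.9):
    """Hypothetical helper: closest candidate for one string, or None."""
    # n=1 asks difflib for at most one result, the highest-scoring one
    matches = difflib.get_close_matches(word, candidates, n=1, cutoff=cutoff)
    return matches[0] if matches else None

# apply() calls the helper once per cell, so difflib always sees a single string
df['col3'] = df['col1'].apply(lambda x: best_match(x, products_list))
print(df)
```

With `cutoff=0.9`, 'Product_Z1' gets no match because its similarity to each candidate is below the threshold. Lowering the cutoff toward the default of 0.6 makes the matching more permissive.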

The problem with your code is that get_close_matches receives the whole column as the word to look up, so it searches products_list for an element similar to the entire column rather than to each string in it. Add the lines below to see this:

In [8]: products_list2 = [['Product_Z1', 'Product_A', 'Product_B'], ['test']]

In [9]: difflib.get_close_matches(df['col1'], products_list2)
Out[9]: [['Product_Z1', 'Product_A', 'Product_B']]
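The same effect can be reproduced without pandas. This is only a sketch: a plain list (`words`, a name I made up) stands in for the column, and the default cutoff of 0.6 is used:

```python
import difflib

products_list = ['Product_A', 'Product_B', 'Product_C']
words = ['Product_Z1', 'Product_A', 'Product_B']  # stand-in for df['col1']

# Passing the whole sequence: its elements (whole strings) are compared with
# the individual characters of each candidate, so nothing is similar enough.
as_sequence = difflib.get_close_matches(words, products_list)
print(as_sequence)

# Passing one string at a time is what get_close_matches expects.
per_word = {w: difflib.get_close_matches(w, products_list, n=1) for w in words}
print(per_word)
```

The first call prints an empty list, which is what the question ran into. The second gives one candidate per word.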
Hugo Salvador
  • Good answer as far as code goes. Could be slightly improved by explicitly answering the question "what went wrong?" and including a link to the `difflib.get_close_matches` documentation. – RishiG Nov 22 '19 at 20:39