Finding all similar values in pandas using SequenceMatcher Python

Question

I'm trying to filter on a specific value in pandas in a column but also allow for typing mistakes. I thought using SequenceMatcher was a good solution but I don't know what the best way is to apply it within a DataFrame. Let's say the headers are 'number' and 'location'.

df1 = [[1, Amsterdam], [2, amsterdam], [3, rotterdam], [4, amstrdam], [5, Berlin]]

If I want to filter on 'amsterdam' with a certain ratio, let's say 0.6. The output probably would be like this.

df1 = [[1, Amsterdam], [2, amsterdam], [4, amstrdam]]

What would be the best way to get this done? I was thinking about using an filter option but that didn't work. Do I need to first run an apply function to add a column with the ratio and then be able to filter on it? Or is there a smarter way?

df2 = df1[SequenceMatcher(None, location, df1.location).ratio() > 0.6]

score 2 · Accepted Answer · answered Jun 30 '20 at 08:28

2

You were on the right track, using apply and loc you can filter the df. I've put the ratio to 0.7 otherwise rotterdam would also be matched.

import difflib
import pandas as pd

df = pd.DataFrame([[1, 'Amsterdam'], [2, 'amsterdam'], [3, 'rotterdam'], [4, 'amstrdam'], [5, 'Berlin']])
df.columns = ['number', 'location']

df = df.loc[df.apply(lambda x: difflib.SequenceMatcher(None, 'Amsterdam', x.location).ratio() > 0.7, axis=1)]

print(df)

   number   location
0       1  Amsterdam
1       2  amsterdam
3       4   amstrdam

answered Jun 30 '20 at 08:28

ScootCork

3,411
12
22

Ah, this solved the problem. Thank you. Could you maybe elaborate how the lambda function works in this case? I seem to not get quite the hang of it. – Hestaron Jun 30 '20 at 08:37
1

With `apply` you apply a function on each row (axis=1) or column (axis=0) of the dataframe. Apply passes the row to the first parameter of the function, in our case the `x` parameter of our lambda function. The lambda then does the sequencematch with the given city 'Amsterdam' and the location from the current row `x.location`. It returns `True` or `False` based on the `ratio > 0.7` comparison. These True/False values are then used by loc to determine which rows to keep/filter. – ScootCork Jun 30 '20 at 08:42
Ah I understand. The apply function 'uses' the whole DataFrame and the lambda function is smart enough to cut that DataFrame into pieces row or columnwise depending on the axis parameter provided. And every 'cut' piece is seen as an x value which is passed to the SequenceMatcher function. Thank you for elaborating! – Hestaron Jun 30 '20 at 08:53
1

Almost, apply calls the lambda function for each row (axis=1 is a parameter of apply). The lambda function just takes the given row and returns True or False, it is not 'smart'. – ScootCork Jun 30 '20 at 09:00

Finding all similar values in pandas using SequenceMatcher Python

1 Answers1