Python closest match between two string columns

Question

I am looking to get the closest match between two columns of string data type in two separate tables. I don't think the content matters too much. Some values can be matched by pre-processing the data (lowercasing, removing spaces and stop words, etc.) and then doing a join. However, that only gives me around 80 matches out of more than 350. Note that the two tables have different lengths.
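For illustration, the pre-processing and exact join described above might look roughly like the sketch below. The stop-word list, the `normalize` helper and the sample breed names are all made up for the example and are not the actual data or code; the point is only that an exact join pairs rows whose normalized keys are identical, so anything spelled differently falls through.

```python
import re

# Hypothetical stop words; the real list depends on the data.
STOP_WORDS = {"de", "du", "the", "of"}


def normalize(name):
    """Lowercase, drop stop words, and remove spaces and punctuation."""
    words = re.findall(r"\w+", name.lower())
    return "".join(w for w in words if w not in STOP_WORDS)


# Invented sample values standing in for the two columns.
left = ["Berger Allemand", "Bouvier  de Berne", "Caniche"]
right = ["berger allemand", "Bouvier Berne", "Pudel"]

# Exact join on the normalized key: only identical keys pair up.
right_by_key = {normalize(r): r for r in right}
exact = {name: right_by_key.get(normalize(name)) for name in left}
```

Entries that come back as `None` are the ones that need a fuzzy (closest-match) comparison instead.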

I did try to use some code I found online but it isn't working:

def Races_chien(df1,df2):

    myList = []
    total = len(df1)
    possibilities = list(df2['Rasse'])

    s = SequenceMatcher(isjunk=None, autojunk=False)

    for idx1, df1_str in enumerate(df1['Race']):
        my_str = ('Progress : ' + str(round((idx1 / total) * 100, 3)) + '%')
        sys.stdout.write('\r' + str(my_str))
        sys.stdout.flush()

        # get 1 best match that has a ratio of at least 0.7
        best_match = get_close_matches(df1_str, possibilities, 1, 0.7)

        s.set_seq2(df1_str, best_match)
        myList.append([df1_str, best_match, s.ratio()])

        return myList

It says: TypeError: set_seq2() takes 2 positional arguments but 3 were given

How can I make this work?

score 1 · Answer 1 · answered Apr 20 '22 at 12:13

I think you need `s.set_seqs(df1_str, best_match)` instead of `s.set_seq2(df1_str, best_match)`: `set_seq2()` accepts only one sequence, while `set_seqs()` sets both (see the `difflib` docs).
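A minimal sketch of how the function from the question could look with that change applied, using only the standard-library `difflib`. It also assumes two further adjustments that are not part of the suggestion above: `get_close_matches()` returns a list (possibly empty), so the first element is taken before comparing, and the `return` is moved out of the loop so that every row is processed. The function name `closest_matches` and the sample strings are hypothetical.

```python
from difflib import SequenceMatcher, get_close_matches


def closest_matches(names, candidates, cutoff=0.7):
    """Return [name, best_match_or_None, ratio] for every entry in names."""
    possibilities = list(candidates)
    matcher = SequenceMatcher(isjunk=None, autojunk=False)
    results = []
    for name in names:
        # get_close_matches returns a list (possibly empty), not a string.
        found = get_close_matches(name, possibilities, n=1, cutoff=cutoff)
        if found:
            best = found[0]
            matcher.set_seqs(name, best)  # set_seqs takes both sequences
            results.append([name, best, matcher.ratio()])
        else:
            results.append([name, None, 0.0])
    # Return after the loop so that all rows are covered.
    return results


# Invented sample values; any iterable of strings works.
rows = closest_matches(
    ["labrador", "poodle"],
    ["Labrador", "labrador retriever", "pudel"],
)
```

With the data frames from the question, the call would presumably be `closest_matches(df1['Race'], df2['Rasse'])`, since both columns are iterables of strings.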

svfat

I get this: `AttributeError: 'SequenceMatcher' object has no attribute 'set_seq'` – Daniel Apr 20 '22 at 13:20
Please check your code: you need `set_seqs`, not `set_seq` – svfat Apr 20 '22 at 13:22
There is nothing more than what I put at the top, plus some SQL connections to get the two data frames and this line to run the function: `Base_tests["raceChienSV"] = Races_chien(Base_tests, Base_tests2)` – Daniel Apr 20 '22 at 13:26
I am talking about the error you mentioned, `'SequenceMatcher' object has no attribute 'set_seq'`. It looks like you've made a typo: you have `set_seq` in the code, but that line needs to look exactly like `s.set_seqs(df1_str, best_match)` – svfat Apr 20 '22 at 13:28

score 1 · Accepted Answer · answered Apr 29 '22 at 09:54

Here is the solution I finally arrived at:

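import pandas as pd
from fuzzywuzzy import process

value = []
similarity = []
for i in df1.col:
    # best single match from df2.col; each result starts with (match, score)
    ratio = process.extract(i, df2.col, limit=1)
    value.append(ratio[0][0])
    similarity.append(ratio[0][1])

# pass df1.index so the new columns align even if df1 is not indexed 0..n-1
df1['value'] = pd.Series(value, index=df1.index)
df1['similarity'] = pd.Series(similarity, index=df1.index)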

This adds to df1 the closest-matching value from df2, together with its similarity score (0 to 100).

score 0 · Answer 3 · answered Apr 20 '22 at 10:13

You can use the jellyfish library, which has useful tools for comparing how similar two strings are, if that is what you are looking for.

wogisha

score 0 · Answer 4 · answered Apr 20 '22 at 10:14

Try changing:

s = SequenceMatcher(isjunk=None, autojunk=False)

To:

s = SequenceMatcher(None, isjunk=None, autojunk=False)

Bob

I get: `TypeError: __init__() got multiple values for argument 'isjunk'` – Daniel Apr 20 '22 at 11:48
@svfat's answer looks right. – Bob Apr 20 '22 at 12:33
