I have some data containing spelling errors. I'm correcting them and scoring how close the spelling is using the following code:

 import pandas as pd
 import difflib

 Li_A = ["potato", "tomato", "squash", "apple", "pear"]

 Q    = {'one' : pd.Series(["potat0", "toma3o", "s5uash", "ap8le", "pea7"], index=['a', 'b', 'c', 'd', 'e']),
         'two' : pd.Series(["po1ato", "2omato", "squ0sh", "2pple", "p3ar"], index=['a', 'b', 'c', 'd', 'e'])}

 df_Q = pd.DataFrame(Q)

 # Define the function that Corrects & Scores the Spelling
 def Spelling(ask):
     a = difflib.get_close_matches(ask, Li_A, n=5, cutoff=0.1)

     # List comprehension for all values of a
     b = [difflib.SequenceMatcher(None, ask, x).ratio() for x in a]
     return pd.Series(a + b)

 # Apply the function that Corrects & Scores the Spelling
 df_A = df_Q['one'].apply(Spelling)

 # Get the column names on the A dataframe
 c = len(df_A.columns) // 2
 df_A.columns = ['Spelling_{}'.format(x) for x in range(c)] + \
                ['Score_{}'.format(y)    for y in range(c)]

 # Join the Q & A dataframes
 df_QA = df_Q.join(df_A)

This gives the result:

 df_QA
       one     two Spelling_0 Spelling_1 Spelling_2 Spelling_3 Spelling_4  \
 a  potat0  po1ato     potato     tomato       pear      apple     squash   
 b  toma3o  2omato     tomato     potato       pear      apple     squash   
 c  s5uash  squ0sh     squash       pear      apple     tomato     potato   
 d   ap8le   2pple      apple       pear     tomato     squash     potato   
 e    pea7    p3ar       pear     potato      apple     tomato     squash   

     Score_0   Score_1   Score_2   Score_3   Score_4  
 a  0.833333  0.500000  0.400000  0.181818  0.166667  
 b  0.833333  0.333333  0.200000  0.181818  0.166667  
 c  0.833333  0.200000  0.181818  0.166667  0.166667  
 d  0.800000  0.222222  0.181818  0.181818  0.181818  
 e  0.750000  0.400000  0.444444  0.200000  0.200000  

For row "e", "potato" is in column Spelling_1 and "apple" is in column Spelling_2. However, "apple" got a higher score (0.444444) than "potato" (0.400000). This is the wrong way round for my application.

How do I get the higher-scoring results to be consistently to the left, please?

Edit 1: I tried simpler code:

 import difflib
 Li_A = ["potato", "tomato", "squash", "apple", "pear"]
 Q    = "pea7"
 A = difflib.get_close_matches(Q, Li_A, n=5, cutoff=0.1)

and got the same ordering as in row "e":

 A: ['pear', 'potato', 'apple', 'tomato', 'squash']

I also tried a simpler scoring code:

 import difflib
 S1 = difflib.SequenceMatcher(None, "pea7", "potato")
 R1 = S1.ratio()
 S2 = difflib.SequenceMatcher(None, "pea7", "apple")
 R2 = S2.ratio()

and again I got the same scores:

 R1: 0.4
 R2: 0.444

Edit 2: I tried it with fuzzywuzzy. I got the same result again, since fuzzywuzzy depends on difflib:

 from fuzzywuzzy import fuzz
 R1 = fuzz.ratio("pea7", "potato")
 R2 = fuzz.ratio("pea7", "apple")
R. Cox
  • Hmm, that is weird. I tried `print(a)` and it returns bad values; maybe there is some problem in `a = difflib.get_close_matches(ask, Li_A, n=5, cutoff=0.1)`? – jezrael Feb 20 '18 at 17:02
  • Thanks jezrael. I tried changing it so that there is only one thing called "a": `c = len(df_A.columns) // 2` and `df_A.columns = ['Spelling_{}'.format(x) for x in range(c)] + ['Score_{}'.format(y) for y in range(c)]`. "a" is now not a value and "c" is now 5. – R. Cox Feb 21 '18 at 09:21
  • And is it working now? – jezrael Feb 21 '18 at 09:50
  • No, it hasn't changed the result. – R. Cox Feb 24 '18 at 09:48
  • OK, so the problem is why the function `difflib.get_close_matches(ask, Li_A, n=5, cutoff=0.1)` returns wrong outputs? – jezrael Feb 24 '18 at 10:14
  • Or might SequenceMatcher have a different opinion from get_close_matches on how close the results are? – R. Cox Feb 24 '18 at 10:21
  • I really would like to help you, but I have no idea. – jezrael Feb 24 '18 at 10:22
  • I feel that get_close_matches might be right; that "potato" is closer to "pea7" than "apple" is. That would put the error in the court of SequenceMatcher... – R. Cox Feb 24 '18 at 10:24
  • Maybe I should try fuzzywuzzy? – R. Cox Feb 24 '18 at 10:26
  • I have only some basic experience, so I don't know, unfortunately :( But obviously if one solution fails, the best thing is to try another one. So try `fuzzywuzzy`. – jezrael Feb 24 '18 at 10:32
  • @R.Cox You can get a better idea of what's happening by looking at `list(difflib.SequenceMatcher(None, 'pea7', 'potato').get_opcodes())` for each... – Jon Clements Feb 24 '18 at 11:32
  • @JonClements thanks. I tried that and got [('equal', 0, 1, 0, 1), ('replace', 1, 2, 1, 3), ('equal', 2, 3, 3, 4), ('replace', 3, 4, 4, 6)] what does that mean please? – R. Cox Feb 26 '18 at 15:19
  • - this describes how to turn 'pea7' into 'potato' – R. Cox Feb 26 '18 at 15:26
  • @R.Cox err... I don't recall off the top of my head but the get_close_matches (you'd need to look at the code for that and SequenceMatcher - they are in Python) calculate the ratio based on the number of steps and type of steps taken to translate one string into another... so it'll give you 1) an understanding of how the ratio is calculated and 2) (possibly) some ideas to tweak the algorithm so that for your cases it's closer to what you expect. It might be involved or even not worth your while - I'm just throwing it out there as an idea. – Jon Clements Feb 26 '18 at 15:27
  • @JonClements thanks I think I'll try to do that. I can see that get_opcodes is telling me what SequenceMatcher is doing. Is there a way to see what get_close_matches is doing? I think that it must be doing something different because it is giving the results a different order. – R. Cox Feb 28 '18 at 09:59
  • SequenceMatcher is correctly calculating the ratio using the method described by Ratcliff and Metzener, 1988. That is, for the number of characters found in common (CC) and the total number of characters in the two strings (CT): ratio = 2 * CC / CT. So it looks like the issue is with get_close_matches. – R. Cox Apr 20 '18 at 09:32

1 Answer

SequenceMatcher is correctly calculating the ratio using the method described by Ratcliff and Metzener, 1988. That is, for the number of characters found in common (CC) and the total number of characters in the two strings (CT):

ratio = 2 * CC / CT
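To check this formula against the scores in the question, here is a minimal sketch that recomputes the ratio by hand. It assumes that CC is the total size of the blocks reported by `get_matching_blocks()`; `ratio_from_blocks` is a hypothetical helper name, not part of difflib.

```python
import difflib


def ratio_from_blocks(a, b):
    """Hypothetical helper: return (2 * CC / CT, SequenceMatcher.ratio())."""
    sm = difflib.SequenceMatcher(None, a, b)
    # CC: characters in common, summed over all matching blocks
    # (the final block is a zero-length sentinel, so it adds nothing).
    cc = sum(block.size for block in sm.get_matching_blocks())
    # CT: total number of characters in both strings.
    ct = len(a) + len(b)
    manual = 2 * cc / ct if ct else 1.0
    return manual, sm.ratio()


for word in ("potato", "apple"):
    manual, builtin = ratio_from_blocks("pea7", word)
    print(word, round(manual, 3), round(builtin, 3))
```

For "pea7" against "potato" this finds 2 common characters out of 10, and against "apple" 2 out of 9, which matches R1 and R2 in the question.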

So it looks like the issue is with get_close_matches.
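One possible explanation, offered as an assumption based on reading the CPython difflib source rather than as a confirmed diagnosis: get_close_matches appears to score each candidate with the candidate as the first sequence and the query as the second, while the code in the question passes the query first. The difflib documentation warns that the result of `ratio()` may depend on the order of the arguments. The sketch below compares both orders for "pea7", then ranks the candidates by the same score that is displayed, so the columns cannot disagree. `ranked_matches` is a hypothetical helper, not a difflib function.

```python
import difflib

candidates = ["potato", "tomato", "squash", "apple", "pear"]
query = "pea7"

# Order used in the question: query first, candidate second.
forward = {c: difflib.SequenceMatcher(None, query, c).ratio() for c in candidates}
# Swapped order: candidate first, query second.
swapped = {c: difflib.SequenceMatcher(None, c, query).ratio() for c in candidates}

for c in ("potato", "apple"):
    print(c, round(forward[c], 3), round(swapped[c], 3))


def ranked_matches(ask, choices, n=5, cutoff=0.1):
    """Hypothetical helper: score each choice once and sort by that score."""
    scored = [(difflib.SequenceMatcher(None, ask, c).ratio(), c) for c in choices]
    scored = [(s, c) for s, c in scored if s >= cutoff]
    # Stable sort on the score only, highest first.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    top = scored[:n]
    return [c for _, c in top], [s for s, _ in top]


words, scores = ranked_matches(query, candidates)
print(words)
print([round(s, 3) for s in scores])
```

In the question's `Spelling` function, returning `pd.Series(words + scores)` from a helper like this (or, alternatively, keeping get_close_matches and swapping the arguments to `SequenceMatcher(None, x, ask)`) should keep the higher scores to the left.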

R. Cox