
I'm using fuzzywuzzy to find near matches in a CSV of company names. I'm comparing manually matched strings with the unmatched strings in the hope of finding some useful close matches, but I'm getting a "string or buffer" error from within fuzzywuzzy. My code is:

from fuzzywuzzy import process
from pandas import read_csv

if __name__ == '__main__':
    df = read_csv("usm_clean.csv", encoding = "ISO-8859-1")
    df_false = df[df['match_manual'].isnull()]  
    df_true = df[df['match_manual'].notnull()]
    sss_false = df_false['sss'].values.tolist()
    sss_true = df_true['sss'].values.tolist()


    for sssf in sss_false:
        mmm = process.extractOne(sssf, sss_true) # find best choice
        print sssf + str(tuple(mmm))

This creates the following error:

Traceback (most recent call last):
  File "fuzzywuzzy_usm2_csv_test.py", line 21, in <module>
    mmm = process.extractOne(sssf, sss_true) # find best choice
  File "/usr/local/lib/python2.7/site-packages/fuzzywuzzy/process.py", line 123, in extractOne
    best_list = extract(query, choices, processor, scorer, limit=1)
  File "/usr/local/lib/python2.7/site-packages/fuzzywuzzy/process.py", line 84, in extract
    processed = processor(choice)
  File "/usr/local/lib/python2.7/site-packages/fuzzywuzzy/utils.py", line 63, in full_process
    string_out = StringProcessor.replace_non_letters_non_numbers_with_whitespace(s)
  File "/usr/local/lib/python2.7/site-packages/fuzzywuzzy/string_processing.py", line 25, in replace_non_letters_non_numbers_with_whitespace
    return cls.regex.sub(u" ", a_string)
TypeError: expected string or buffer

This seems to have something to do with importing into pandas with an encoding specified, which I added to prevent UnicodeDecodeErrors but which had the knock-on effect of causing this error. I've tried to coerce the value using str(sssf), but that doesn't work.

So, I've isolated a line that is causing the error: #N/A,,,,,, (line 29 in the data pasted below). I assumed it was the # that was causing the error, but strangely it's not; it's the A character that is causing the problem, because the file works when it is removed. What is strange to me is that the string two rows below is N/A, which parses fine, whereas row 29 won't parse when I delete the # symbol, even though the field then appears identical to the field below.

sss,sid,match_manual,notes,match_date,source,match_by
N20 KIDS,1095543_cha,,,2014-10-12,,20140429_fuzzy_match.ktr (stream 3)
N21 FESTIVAL,08190588_com,,,2014-10-12,,20140429_fuzzy_match.ktr (stream 3)
N21 LTD,,,,,,
N21 LTD.,04615294_com,true,,2014-12-02,,OpenCorps
N2 CHECK,08105000_com,,,2014-10-12,,20140429_fuzzy_match.ktr (stream 3)
N2 CHECK LIMITED,06139690_com,true,,2014-12-02,,OpenCorps
N2CHECK LIMITED,08184223_com,,,2014-05-05,,20140429_fuzzy_match.ktr (stream 3)
N 2 CHECK LTD,05729595_com,,,2014-05-05,,20140429_fuzzy_match.ktr (stream 2)
N2 CHECK LTD,06139690_com,true,,2014-12-02,,OpenCorps
N2CHECK LTD,05729595_com,,,2014-05-05,,20140429_fuzzy_match.ktr (stream 2)
N2E & BACK LTD,05218805_com,,,2014-05-05,,20140429_fuzzy_match.ktr (stream 2)
N2 GROUP LLC,04627044_com,,,2014-10-12,,20140429_fuzzy_match.ktr (stream 3)
N2 GROUP LTD,04475764_com,true,,2014-05-05,data taken from u_supplier_match,20140429_fuzzy_match.ktr (stream 2)
N2R PRODUCTIONS,SC266951_com,,,2014-10-12,,20140429_fuzzy_match.ktr (stream 3)
N2 VISUAL COMMUNICATIONS LIMITED,,,,,,
N2 VISUAL COMMUNICATIONS LTD,03144224_com,true,,2014-12-02,data taken from u_supplier_match,OpenCorps
N2WEB,07636689_com,,,2014-10-12,,20140429_fuzzy_match.ktr (stream 3)
N3 DISPLAY GRAPHICS LTD,04008480_com,true,,2014-12-02,data taken from u_supplier_match,OpenCorps
N3O LIMITED,06561158_com,true,,2014-12-02,,OpenCorps
N3O LTD,,,,,,
N400138,,,,,,
N400360,,,,,,
N4K LTD,07054740_com,,,2014-05-05,,20140429_fuzzy_match.ktr (stream 2)
N51 LTD,,,,,,
N68 LTD,,,,,,
N8 LTD,,,,,,
N9 DESIGN,07342091_com,true,,2015-02-07,openrefine/opencorporates,IM
#N/A,,,,,,
N A,,,,,,
N/A,red_general_xtr,true,Matches done manually,2015-04-16,manual matching,IM
(N) A & A BUILDERS LTD,,,,,,
woodbine
  • Could you put up some sample data? It's hard to work on it without any. – Gregory Arenius Jun 04 '15 at 04:30
  • I've tried to test for the occurrence of a string by adding an `isinstance` test, which hasn't worked. My csv is 800k lines, so I'm going to go through a process of splitting down the csv to isolate the offending line (sigh). Will post offending data when I find it. – woodbine Jun 04 '15 at 07:52

2 Answers


By default, pandas.read_csv parses a set of marker strings, including 'N/A' and '#N/A', as missing values (NaN).

In your case, that means that you end up with a nan value (a float) rather than a string. In your sample data set, this happens in two places:

The '#N/A' line (the line you highlight in the question) has no match_manual value, so sss_false[-3] is nan.

The 'N/A' line has match_manual set to true, so sss_true[-1] is nan.
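The default parsing can be reproduced on a tiny in-memory sample. This is only a sketch, written for Python 3 rather than the Python 2 used in the question; the three-row `csv_text` is a made-up cut-down of the question's file, not the real data.

```python
import io

import pandas as pd

# Hypothetical cut-down sample in the same layout as the question's CSV.
csv_text = (
    "sss,sid,match_manual\n"
    "N9 DESIGN,07342091_com,true\n"
    "#N/A,,\n"
    "N/A,red_general_xtr,true\n"
)

# Default settings: '#N/A' and 'N/A' are both on pandas' default NA list.
df = pd.read_csv(io.StringIO(csv_text))
values = df["sss"].tolist()

# Show the Python type behind each value in the 'sss' column.
for value in values:
    print(repr(value), type(value).__name__)
```

The first value stays a string, while the other two come back as float nan, which is what fuzzywuzzy later chokes on.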


Option 1

If you want to parse the string 'N/A' as a string instead of nan, the way to do this is to replace

df = read_csv("usm_clean.csv", encoding = "ISO-8859-1")

with

df = read_csv("usm_clean.csv", encoding = "ISO-8859-1", keep_default_na=False, na_values='')

The meaning of these extra options is described in the pandas docs.

na_values : list-like or dict, default None

Additional strings to recognize as NA/NaN. If dict passed, specific per-column NA values

keep_default_na : bool, default True

If na_values are specified and keep_default_na is False the default NaN values are overridden, otherwise they’re appended to

So, the above modification tells pandas to recognize only the empty string as NA and to drop all of the default NA strings, which include both 'N/A' and '#N/A'.
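As a rough check of Option 1, the same kind of in-memory sample can be read with both options set. Again this is a Python 3 sketch; `csv_text` is invented sample data, and `na_values` is passed as a one-element list here purely for readability.

```python
import io

import pandas as pd

# Hypothetical cut-down sample in the same layout as the question's CSV.
csv_text = (
    "sss,sid,match_manual\n"
    "N9 DESIGN,07342091_com,true\n"
    "#N/A,,\n"
    "N/A,red_general_xtr,true\n"
)

# Drop the default NA strings and treat only the empty string as missing.
df = pd.read_csv(
    io.StringIO(csv_text),
    keep_default_na=False,
    na_values=[""],
)

names = df["sss"].tolist()       # company names, now all plain strings
sid_missing = df["sid"].isnull().tolist()  # empty cells are still NA

print(names)
print(sid_missing)
```

With these settings every entry in the `sss` column is a string again, while genuinely empty cells (such as the blank `sid` on the `#N/A` row) are still flagged as missing, so the `match_manual` split in the question keeps working.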


Option 2

If you want to discard lines with 'N/A' in the first column, you need to remove the nan members from sss_true and sss_false. One way to do this is:

# keep only real strings; nan is a float (on Python 3, use str instead of basestring)
sss_true = [x for x in sss_true if isinstance(x, basestring)]
sss_false = [x for x in sss_false if isinstance(x, basestring)]
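An alternative for Option 2 is to drop the missing values in pandas before building the lists, using `Series.dropna()`. The sketch below is Python 3 and runs on a made-up four-row sample (`csv_text`), so the variable contents are illustrative only.

```python
import io

import pandas as pd

# Hypothetical cut-down sample in the same layout as the question's CSV.
csv_text = (
    "sss,sid,match_manual\n"
    "N9 DESIGN,07342091_com,true\n"
    "N8 LTD,,\n"
    "#N/A,,\n"
    "N/A,red_general_xtr,true\n"
)

df = pd.read_csv(io.StringIO(csv_text))

# Same split as in the question: matched rows vs. unmatched rows.
df_true = df[df["match_manual"].notnull()]
df_false = df[df["match_manual"].isnull()]

# dropna() removes the nan entries before the values reach fuzzywuzzy.
sss_true = df_true["sss"].dropna().tolist()
sss_false = df_false["sss"].dropna().tolist()

print(sss_true)
print(sss_false)
```

Filtering at this stage means neither the `query` nor the `choices` passed to `extractOne()` can contain a nan.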
J Richard Snape
  • Thanks so much, really helpful. One question, why would `#N/A` (third from bottom) trigger the error and `N/A` not trigger it? (I'm running the same code and both lines trigger a `nan` return). – woodbine Jun 15 '15 at 14:40
  • The `'#N/A'` puts the `nan` into the `sss_false` list, whereas the `'N/A'` on the bottom line puts the nan into the `sss_true` list (due to the value in `'match_manual'` column). Therefore, both will put `nan` into a list, but the effects will be different depending on whether it's in the value passed as `query` to `extractOne()`, or as `choices`. I think you may have slightly misdiagnosed your error. If I remove the `'#N/A'` line from the file, I still get the error. If I remove the last line, I get a different error. If I remove both, or fix as per answer, it works OK – J Richard Snape Jun 15 '15 at 15:05
  • I hope that clarifies things - I can put in references to the github codebase if you like - a `nan` in choices causes failure [here](https://github.com/seatgeek/fuzzywuzzy/blob/master/fuzzywuzzy/string_processing.py#L25) via [this call](https://github.com/seatgeek/fuzzywuzzy/blob/master/fuzzywuzzy/process.py#L103). A nan in `query` causes failure only when you do `print sssf + str(tuple(mmm))` at which point you are trying to concatenate a `float` and a `str`. Would appreciate an *accept* if that clears it all up :) – J Richard Snape Jun 15 '15 at 15:05
  • No, that's perfect (and obvious) I was forgetting that we're treating the strings differently at different times. – woodbine Jun 15 '15 at 16:00
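The two failure modes described in the comments above can be imitated without fuzzywuzzy. This is a Python 3 sketch: the `\W` pattern is only a stand-in for whatever regex fuzzywuzzy applies internally, and `describe_failure` is a hypothetical helper written for this illustration.

```python
import re

nan = float("nan")

# Stand-in for the library's cleaning regex (an assumption, not its real pattern).
pattern = re.compile(r"\W")


def describe_failure(action):
    """Run action() and return the name of the TypeError it raises, if any."""
    try:
        action()
    except TypeError as exc:
        return type(exc).__name__
    return "no error"


# nan among the choices: the regex substitution is handed a float.
in_choices = describe_failure(lambda: pattern.sub(" ", nan))

# nan as the query: the failure only appears when concatenating for printing.
in_query = describe_failure(lambda: nan + str(("N9 DESIGN", 90)))

print(in_choices, in_query)
```

Both cases raise a TypeError, but from different places, which is why removing only one of the two lines from the file changes the error rather than fixing it.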

Your sss_true variable contains:

[
    u'N21 LTD.',
    u'N2 CHECK LIMITED',
    u'N2 CHECK LTD',
    u'N2 GROUP LTD',
    u'N2 VISUAL COMMUNICATIONS LTD',
    u'N3 DISPLAY GRAPHICS LTD',
    u'N3O LIMITED',
    u'N9 DESIGN',
    nan              # <---- note this
]

Once you get rid of that not-a-number value everything starts to work as expected.

dlask