
I am comparing one list of universities against 12 other lists, finding fuzzy string matches and writing all results to a CSV. I am not matching against one big combined list because I need to know which list each match came from. Example of the lists:

data = [["1-00000", "MIT"], ["1-00001", "Stanford"], ...]

data1 = [['MASSACHUSETTS INSTITUTE OF TECHNOLOGY (MIT)'], ['STANFORD UNIVERSITY'], ...]

With StackOverflow's help I got as far as:

for uni in data:
    hit = process.extractOne(str(uni[1]), data10, scorer = fuzz.token_set_ratio, score_cutoff = 90)
    try:
        if float(hit[1]) >= 94:
            with open(filename, mode='a', newline="") as csv_file:
                fieldnames = ['bwbnr', 'uni_name', 'match', 'points']
                writer = csv.DictWriter(csv_file, fieldnames=fieldnames, delimiter=';')
                writer.writerow({'bwbnr': str(uni[0]), 'uni_name': str(uni[1]), 'match': str(hit), 'points': 10})

    except:
        hit1 = process.extractOne(str(uni[1]), data11, scorer = fuzz.token_set_ratio, score_cutoff = 90)
        try:
            if float(hit1[1]) >= 94:
                with open(filename, mode='a', newline="") as csv_file:
                    fieldnames = ['bwbnr', 'uni_name', 'match', 'points']
                    writer = csv.DictWriter(csv_file, fieldnames=fieldnames, delimiter=';')
                    writer.writerow({'bwbnr': str(uni[0]), 'uni_name': str(uni[1]), 'match': str(hit), 'points': 5})

This pattern continues down the 12 lists until the last except blocks, where I include matches with scores below 94 and finish with a "not found" row:

    except:
        hit12 = process.extractOne(str(uni[1]), data9, scorer = fuzz.token_set_ratio)
        try:
            if float(hit12[1]) < 94:
                with open(filename, mode='a', newline="") as csv_file:
                    fieldnames = ['bwbnr', 'uni_name', 'match', 'points']
                    writer = csv.DictWriter(csv_file, fieldnames=fieldnames, delimiter=';')
                    writer.writerow({'bwbnr': str(uni[0]), 'uni_name': str(uni[1]), 'match': str(hit), 'points': 3})
        except:
            with open(filename, mode='a', newline="") as csv_file:
                fieldnames = ['bwbnr', 'uni_name', 'match', 'points']
                writer = csv.DictWriter(csv_file, fieldnames=fieldnames, delimiter=';')
                writer.writerow({'bwbnr': str(uni[0]), 'uni_name': str(uni[1]), 'match': str(hit), 'points': 3})

However, I get back only 2854 rows instead of the 3175 entries in my original list (all of which need to be checked and written to the new CSV).

When I throw all my lists together and do my extractOne I do get 3175 results:

scored_testdata = []
for uni in data:
    hit = process.extractOne(str(uni[1]), big_list, scorer = fuzz.token_set_ratio, score_cutoff = 90)
    scored_testdata.append(hit)
print(len(scored_testdata))

What am I missing here? I get the feeling results returning "None" in the process.extractOne are being dropped for some reason. Any help would be much appreciated.

The full code can be found here.
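
A minimal sketch of where rows can go missing in one level of the nested try/except pattern above. It assumes that extractOne returns None only when no choice reaches score_cutoff, and it replaces the real lookup with a hypothetical stand-in (fake_extract_one) so the control flow can be run without fuzzywuzzy or the real lists. rows_written is likewise a name made up for this illustration.

```python
def fake_extract_one(score, score_cutoff=0):
    """Hypothetical stand-in for process.extractOne: None below the cutoff."""
    if score < score_cutoff:
        return None
    return ("SOME UNIVERSITY", score)


def rows_written(score):
    """Mimic one try/except level from the question and collect what it writes."""
    written = []
    hit = fake_extract_one(score, score_cutoff=90)
    try:
        if float(hit[1]) >= 94:
            written.append(("match", score))
    except TypeError:
        # Reached only when hit is None, i.e. the score was below the cutoff.
        written.append(("fallback", score))
    return written


for score in (96, 92, 80):
    print(score, rows_written(score))
```

Under these assumptions, a score of 90 to 93 passes the cutoff, fails the `>= 94` test, and raises no exception, so neither branch writes a row. That gap may account for some of the missing entries, though the full code would need to be checked to confirm it.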

Uralan
  • please fix your indentation - your code is no [mcve] -it is hard to see what you do... – Patrick Artner Jan 31 '19 at 15:40
  • Why do you write _nothing_ into your file for `if float(hit[1]) >= 94: ...` ? you only write empty strings in it... - why the try: except: at all? ... the code makes not much sense to me - sorry – Patrick Artner Jan 31 '19 at 15:41
  • Please reduce your code to a meaningful [mcve] that contains demodata and replicates your problems. – Patrick Artner Jan 31 '19 at 15:43
  • Apologies, I dropped what's written to the csv as it made the code messy and long, focused too much on the "minimal" requirement. Added a link to the full code. Editing the question now. Thanks! – Uralan Jan 31 '19 at 15:49

1 Answer


The final try-except should have checked all the lists combined (big_list) and called extractOne without score_cutoff, so that it always returns a best match:

except:
    hit12 = process.extractOne(str(uni[1]), big_list, scorer = fuzz.token_set_ratio)
    with open(filename, mode='a', newline="") as csv_file:
        fieldnames = ['bwbnr', 'uni_name', 'match', 'confidence', 'points']
        writer = csv.DictWriter(csv_file, fieldnames=fieldnames, delimiter=';')
        writer.writerow({'bwbnr': str(uni[0]), 'uni_name': str(uni[1]), 'match': "CHECK AGAIN " + str(hit12[0]), 'confidence': str(hit12[1]), 'points': 3})
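
The same idea can also be sketched as a loop over the lists instead of nested try/except blocks, so that every university produces exactly one row. This is only an illustration under assumptions: best_match is a hypothetical stand-in built on the standard library's difflib so the example runs on its own (in the real script it would be process.extractOne with fuzz.token_set_ratio and no score_cutoff), and match_row, write_results, the tiers structure and the demo data are all invented here.

```python
import csv
import io
from difflib import SequenceMatcher

FIELDNAMES = ['bwbnr', 'uni_name', 'match', 'confidence', 'points']


def best_match(name, choices):
    """Stand-in scorer (assumption): best (choice, score 0-100), or None if no choices."""
    scored = [(c, round(100 * SequenceMatcher(None, name.lower(), c.lower()).ratio()))
              for c in choices]
    return max(scored, key=lambda pair: pair[1]) if scored else None


def match_row(uni, tiers, threshold=94):
    """Return exactly one row: the first tier at/above threshold, else a flagged fallback."""
    bwbnr, name = str(uni[0]), str(uni[1])
    best = None
    for choices, points in tiers:
        hit = best_match(name, choices)
        if hit is None:
            continue
        if hit[1] >= threshold:
            return {'bwbnr': bwbnr, 'uni_name': name, 'match': hit[0],
                    'confidence': hit[1], 'points': points}
        if best is None or hit[1] > best[1]:
            best = hit  # remember the strongest weak match across all lists
    if best is None:
        return {'bwbnr': bwbnr, 'uni_name': name, 'match': 'NOT FOUND',
                'confidence': 0, 'points': 3}
    return {'bwbnr': bwbnr, 'uni_name': name, 'match': 'CHECK AGAIN ' + best[0],
            'confidence': best[1], 'points': 3}


def write_results(data, tiers, csv_file):
    """Write one row per entry in data; the file is opened once by the caller."""
    writer = csv.DictWriter(csv_file, fieldnames=FIELDNAMES, delimiter=';')
    rows = [match_row(uni, tiers) for uni in data]
    writer.writerows(rows)
    return rows


# Demo data (made up): each tier pairs a list of names with its points value.
demo_data = [["1-00000", "Stanford University"], ["1-00001", "Zzz"]]
demo_tiers = [(["STANFORD UNIVERSITY", "HARVARD UNIVERSITY"], 10),
              (["UNIVERSITY OF OXFORD"], 5)]
buffer = io.StringIO()
demo_rows = write_results(demo_data, demo_tiers, buffer)
print(len(demo_rows) == len(demo_data))
```

Because match_row always returns a row, the number of rows written equals the number of input entries, and the points value still records which list a confident match came from.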
Uralan