I'm trying to discern the string similarity between two strings (using Jaro). Each string resides in a separate column in my dataframe.

String 1 = df['name_one'] 

String 2 = df['name_two']

When I try to run my string similarity logic:

from pyjarowinkler import distance
df['distance'] = df.apply(lambda d: distance.get_jaro_distance(str(d['name_one']),str(d['name_two']),winkler=True,scaling=0.1), axis=1)

I get the following error:

 **error: JaroDistanceException: Cannot calculate distance from NoneType (str, str)**

Great, so there is a NoneType somewhere in the columns. The first thing I do is check for it:

maskone = df['name_one'] == None
df[maskone]

masktwo = df['name_two'] == None
df[masktwo]

This finds no None values. I'm scratching my head at this point, but I proceed to clean the two columns anyway.

df['name_one'] = df['name_one'].fillna('').astype(str)
df['name_two'] = df['name_two'].fillna('').astype(str) 

And yet, I'm still getting:

error: JaroDistanceException: Cannot calculate distance from NoneType (str, str)

Am I removing NoneTypes correctly?

ecoplaneteer
mikelowry

1 Answer

Problem

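The issue isn't only None values: empty strings also raise this exception, despite the "NoneType" wording in the message. The (str, str) at the end of the message shows that both arguments were strings, and after fillna('') every missing value becomes an empty string. You can see the check in the implementation of distance.get_jaro_distance: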

if not first or not second:
    raise JaroDistanceException("Cannot calculate distance from NoneType ({0}, {1})".format(
        first.__class__.__name__,
        second.__class__.__name__))
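To make the truthiness check above concrete, here is a minimal sketch in plain Python. The helper name trips_guard is hypothetical and not part of pyjarowinkler; it only mirrors the `if not first or not second` condition quoted above.

```python
def trips_guard(first, second):
    """Hypothetical helper mirroring the `if not first or not second` check."""
    return not first or not second


# str(None) is the four-character string 'None', which is truthy,
# so wrapping a value in str() never trips the guard by itself.
print(trips_guard(str(None), "smith"))

# An empty string is falsy, so it trips the guard even though it is a str.
print(trips_guard("", "smith"))

# A real None trips the guard as well.
print(trips_guard(None, "smith"))
```

This is why wrapping the values in str() or calling fillna('') does not make the error go away: the first hides None behind the text 'None', and the second produces exactly the empty strings that the guard rejects.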

Option 1

Try replacing your None values and/or empty strings with 'NA', or filter those rows out of your dataset.
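A rough sketch of both variants of Option 1 follows, using a small made-up dataframe (the sample names are assumptions for illustration only). It uses isna() to find missing values; as far as I know, a comparison such as df['name_one'] == None is evaluated element-wise and does not flag them, which would explain why the masks in the question came back empty.

```python
import pandas as pd

# Hypothetical sample data: one missing value and one empty string.
df = pd.DataFrame({
    "name_one": ["martha", None, "dixon"],
    "name_two": ["marhta", "jones", ""],
})

# Rows where either name is missing or empty would raise the exception.
bad = (df["name_one"].isna() | df["name_one"].eq("")
       | df["name_two"].isna() | df["name_two"].eq(""))

# Variant A: drop those rows before computing the distance.
filtered = df[~bad]

# Variant B: keep every row and substitute the placeholder 'NA'.
replaced = df.fillna("NA").replace("", "NA")

print(filtered)
print(replaced)
```

Keep in mind that with variant B the placeholder is scored like any other string, so the resulting distance for those rows has no real meaning.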

Option 2

Use a flag value as the distance for rows that would raise this exception. In the example below, I use 999:

from pyjarowinkler import distance

# Flag rows with an empty name as 999; score the rest with Jaro-Winkler.
df['distance'] = df.apply(lambda d: 999 if not str(d['name_one']) or not str(d['name_two'])
                          else distance.get_jaro_distance(str(d['name_one']), str(d['name_two']), winkler=True, scaling=0.1),
                          axis=1)
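As a variation on Option 2, the guard can live in a small named function that checks the raw value before converting it to a string. This is only a sketch: safe_similarity and stand_in_scorer are hypothetical names, and the stand-in scorer uses difflib from the standard library, which is not Jaro, purely so the example runs without pyjarowinkler installed. It assumes missing values appear as None, a float NaN, or ''.

```python
import math
from difflib import SequenceMatcher


def stand_in_scorer(first, second):
    # NOT Jaro: difflib's ratio is used only so the sketch runs anywhere.
    return SequenceMatcher(None, first, second).ratio()


def safe_similarity(first, second, scorer=stand_in_scorer, flag=math.nan):
    """Return `flag` for unusable input, otherwise the scorer's result."""
    for value in (first, second):
        is_nan = isinstance(value, float) and math.isnan(value)
        if value is None or is_nan or value == "":
            return flag
    return scorer(str(first), str(second))


# Hypothetical name pairs, including a None and an empty string.
pairs = [("martha", "marhta"), (None, "jones"), ("dixon", "")]
scores = [safe_similarity(a, b) for a, b in pairs]
print(scores)
```

To use the real metric, pass scorer=lambda a, b: distance.get_jaro_distance(a, b, winkler=True, scaling=0.1). The default flag here is NaN rather than 999 because pandas skips NaN in aggregations such as mean() by default, whereas 999 lies outside the 0 to 1 range of the score and would skew them; pass flag=999 to keep the original behaviour.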
ggordon