I'm trying to use some code that runs the Jaro-Winkler function to compare the similarity of two strings. If I just hard-code two values, john and jon, the logic below works without problems. However, what I want is to read a csv file and compare all of the names. When I try that, I get:

ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().

# Python3 implementation of above approach
from math import floor
import pandas as pd

# Function to calculate the
# Jaro Similarity of two strings
def jaro_distance(s1, s2):
    # If the strings are equal
    if (s1 == s2):
        return 1.0;

    # Length of two strings
    len1 = len(s1);
    len2 = len(s2);

    if (len1 == 0 or len2 == 0):
        return 0.0;

    # Maximum distance upto which matching
    # is allowed
    max_dist = (max(len(s1), len(s2)) // 2) - 1;

    # Count of matches
    match = 0;

    # Hash for matches
    hash_s1 = [0] * len(s1);
    hash_s2 = [0] * len(s2);

    # Traverse through the first string
    for i in range(len1):

        # Check if there is any matches
        for j in range(max(0, i - max_dist),
                       min(len2, i + max_dist + 1)):

            # If there is a match
            if (s1[i] == s2[j] and hash_s2[j] == 0):
                hash_s1[i] = 1;
                hash_s2[j] = 1;
                match += 1;
                break;

    # If there is no match
    if (match == 0):
        return 0.0;

    # Number of transpositions
    t = 0;

    point = 0;

    # Count number of occurrences
    # where two characters match but
    # there is a third matched character
    # in between the indices
    for i in range(len1):
        if (hash_s1[i]):

            # Find the next matched character
            # in second string
            while (hash_s2[point] == 0):
                point += 1;

            if (s1[i] != s2[point]):
                point += 1;
                t += 1;
            else:
                point += 1;

    # Each transposition was counted twice above, so halve once after the loop
    t //= 2;

    # Return the Jaro Similarity
    return ((match / len1 + match / len2 +
             (match - t) / match) / 3.0);


# Jaro Winkler Similarity
def jaro_Winkler(s1, s2):
    jaro_dist = jaro_distance(s1, s2);

    # If the jaro Similarity is above a threshold
    if (jaro_dist > 0.7):

        # Find the length of common prefix
        prefix = 0;

        for i in range(min(len(s1), len(s2))):

            # If the characters match
            if (s1[i] == s2[i]):
                prefix += 1;

            # Else break
            else:
                break;

        # Maximum of 4 characters are allowed in prefix
        prefix = min(4, prefix);

        # Calculate jaro winkler Similarity
        jaro_dist += 0.1 * prefix * (1 - jaro_dist);

    return jaro_dist;


# Driver code
if __name__ == "__main__":
    df = pd.read_csv('names.csv')
    # s1 = 'john'  # this works
    # s2 = 'jon'   # this works
    s1 = df['name1']  # this doesn't; the csv has a header row (name1, name2) and a few rows under each
    s2 = df['name2']  # this doesn't

    print("Jaro-Winkler Similarity =", jaro_Winkler(s1, s2));

Traceback (most recent call last):
  File "C:\Users\john\PycharmProjects\heatMap\Jaro.py", line 113, in <module>
    print("Jaro-Winkler Similarity =", jaro_Winkler(s1, s2));
  File "C:\Users\john\PycharmProjects\heatMap\Jaro.py", line 80, in jaro_Winkler
    jaro_dist = jaro_distance(s1, s2);
  File "C:\Users\john\PycharmProjects\heatMap\Jaro.py", line 9, in jaro_distance
    if (s1 == s2):
  File "C:\Users\john\PycharmProjects\heatMap\venv\lib\site-packages\pandas\core\generic.py", line 1537, in __nonzero__
    raise ValueError(
ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().

Process finished with exit code 1
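
For context, the traceback points at `if (s1 == s2)`: comparing two Series gives an elementwise boolean Series, which `if` cannot reduce to a single True/False. A minimal sketch of that behaviour, using a made-up two-row frame in place of names.csv (the values and the helper `truth_value_error` are assumptions for illustration):

```python
import pandas as pd

# Hypothetical stand-in for names.csv (the real file's contents are not shown)
df = pd.DataFrame({"name1": ["john", "mary"], "name2": ["jon", "mary"]})

# Elementwise comparison returns a boolean Series, not a single bool
eq = df["name1"] == df["name2"]


def truth_value_error(series):
    """Return the ValueError message raised by bool(series), or None."""
    try:
        bool(series)
    except ValueError as exc:
        return str(exc)
    return None


message = truth_value_error(eq)
print(eq.tolist())
print(message)
```

The same thing happens inside `jaro_distance` as soon as it receives whole columns instead of two plain strings.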

Sample from csv: (image not available)

  • Will you please add the full traceback? It's much harder to tell where the error is from _just_ by reading your code. –  Dec 17 '21 at 18:52
  • Your function expects two strings. You are trying to pass it two pandas series. That will not work. You may be able to use `apply` to do this, although I don't know how that would work with TWO columns. You may end up having to iterate through the rows. – Tim Roberts Dec 17 '21 at 18:55
  • Can you provide a sample row from that names.csv? – Antony Hatchkins Dec 17 '21 at 19:08
  • Does this answer your question? [How to apply a function on every row on a dataframe?](https://stackoverflow.com/questions/33518124/how-to-apply-a-function-on-every-row-on-a-dataframe) – Nick ODell Dec 17 '21 at 19:51
  • I know that's the canned text, but I would go further: YES, this answers the question. – Tim Roberts Dec 17 '21 at 19:52
  • please provide a small sample of your data – anon01 Dec 17 '21 at 19:56
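
The suggestions in the comments (use `apply`, or iterate through the rows) could look roughly like the sketch below. It assumes the csv really has `name1` and `name2` columns. The `similarity` function is only a placeholder built on the standard library's `difflib.SequenceMatcher` so the snippet runs on its own; it is not Jaro-Winkler, and the `jaro_Winkler` function from the question would be dropped in at that spot. The sample rows are made up.

```python
from difflib import SequenceMatcher

import pandas as pd


def similarity(s1, s2):
    # Placeholder for jaro_Winkler(s1, s2) from the question; any function
    # that takes two plain strings can be used here instead.
    return SequenceMatcher(None, s1, s2).ratio()


# Hypothetical data standing in for pd.read_csv('names.csv')
df = pd.DataFrame({"name1": ["john", "mary"], "name2": ["jon", "mary"]})

# Option 1: DataFrame.apply with axis=1 hands the function one row at a time
df["score_apply"] = df.apply(
    lambda row: similarity(row["name1"], row["name2"]), axis=1
)

# Option 2: walk the two columns in parallel and score each pair of strings
df["score_zip"] = [similarity(a, b) for a, b in zip(df["name1"], df["name2"])]

print(df)
```

Both options call the function once per row with two ordinary strings, so the `s1 == s2` check inside it evaluates to a plain bool.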