1

I have dataframe contains a lot of typo name, it have shape like this

Col A Col B Col C Col D
A 1 Daniel Sunday
A 1 Dan Sunday
A 1 Danil Sunday
A 2 Charles Sunday
A 2 Charls Monday
B 1 Andi Sunday
B 1 Andy Sunday

I want to fuzzywuzzy data in every Col C if Col A and Col B have similar value, I mean A1 just fuzzywuzzy Daniel Dan and Danil, A2 just fuzzywuzzy Charles and Charls, B1 just fuzzywuzzy Andi and Andy then if the score is low the id will be different

Col A Col B Col C Col D score id
A 1 Daniel Sunday 1 1
A 1 Dan Sunday 0.7 2
A 1 Danil Sunday 0.9 1
A 2 Charles Sunday 1 3
A 2 Charls Monday 0.8 4
B 1 Andi Sunday 1 5
B 1 Andy Sunday 0.9 5

I try to see the fuzzywuzzy score in every row in specific datafrmae column based on some condition

Arthur
  • 17
  • 6

1 Answers1

1

I reproduced the experiment using the fuzzywuzzy library and deduced logic that I will explain step by step. At the end of the answer I will put the full code.

FuzzyWuzzy library

First you need to install this library:

pip install fuzzywuzzy

In the fuzzywuzzy library there is a process function that allows the score to be obtained by comparing an initial string (called a query) and a list of strings.


Extract data and aggregation

  1. I extracted the data from your question in json form.
  2. That way, with pandas.DataFrame.from_dict(), i can create the exact dataframe.

Following your aggregation logic, we can use a pandas groupby to get the word groups to compare:

df.groupby(['Col A', 'Col B'])

Full code

At this point you simply iterate over the retrieved groups applying the id generation logic you requested and append the results to the initial dataframe.

Below is the complete code with comments:

import pandas as pd
from fuzzywuzzy import process


data = [
    {
        "Col A": "A",
        "Col B": "1",
        "Col C": "Daniel",
        "Col D": "Sunday"
    },
    {
        "Col A": "A",
        "Col B": "1",
        "Col C": "Dan",
        "Col D": "Sunday"
    },
    {
        "Col A": "A",
        "Col B": "1",
        "Col C": "Danil",
        "Col D": "Sunday"
    },
    {
        "Col A": "A",
        "Col B": "2",
        "Col C": "Charles",
        "Col D": "Sunday"
    },
    {
        "Col A": "A",
        "Col B": "2",
        "Col C": "Charls",
        "Col D": "Monday"
    },
    {
        "Col A": "B",
        "Col B": "1",
        "Col C": "Andi",
        "Col D": "Sunday"
    },
    {
        "Col A": "B",
        "Col B": "1",
        "Col C": "Andy",
        "Col D": "Sunday"
    }
]

df = pd.DataFrame.from_dict(data)  # create dataframe from dict

id_counter = 1 
score_th = 0.91  # your score threshold
scores, ids = [], []

for group_id, group_df in df.groupby(['Col A', 'Col B']):

    values = group_df['Col C'].tolist()  # extract 'Col C' values as a list

    query = values[0]  # first element of list is the good string
    choices = values[1:]  # others remaining elements are wrongs

    scores.append(1)  # first element has score 1
    ids.append(id_counter)  # first element has the current iteration id

    res = process.extract(query, choices)  # make fuzzywuzzy

    for word, score in res:
        score = score / 100
        if score < score_th:  # if score is lower then threshold, increment counter
            id_counter += 1

        scores.append(score)
        ids.append(id_counter)

df['score'] = scores
df['id'] = ids

print(df)

Output will be:

  Col A Col B    Col C   Col D  score  id
0     A     1   Daniel  Sunday   1.00   1
1     A     1      Dan  Sunday   0.91   1
2     A     1    Danil  Sunday   0.90   2
3     A     2  Charles  Sunday   1.00   2
4     A     2   Charls  Monday   0.92   2
5     B     1     Andi  Sunday   1.00   2
6     B     1     Andy  Sunday   0.75   3

P.S.: Note that the score does not come out exactly as you proposed. Maybe you need to investigate in the library if there is some parameter to set about it.

Giuseppe La Gualano
  • 1,491
  • 1
  • 4
  • 24