I reproduced the experiment using the fuzzywuzzy library and deduced logic that I will explain step by step. At the end of the answer I will put the full code.
FuzzyWuzzy library
First you need to install this library:
pip install fuzzywuzzy
In the fuzzywuzzy library there is a process
function that allows the score to be obtained by comparing an initial string (called a query) and a list of strings.
Extract data and aggregation
- I extracted the data from your question in json form.
- That way, with pandas.DataFrame.from_dict(), i can create the exact dataframe.
Following your aggregation logic, we can use a pandas groupby to get the word groups to compare:
df.groupby(['Col A', 'Col B'])
Full code
At this point you simply iterate over the retrieved groups applying the id generation logic you requested and append the results to the initial dataframe.
Below is the complete code with comments:
import pandas as pd
from fuzzywuzzy import process
data = [
{
"Col A": "A",
"Col B": "1",
"Col C": "Daniel",
"Col D": "Sunday"
},
{
"Col A": "A",
"Col B": "1",
"Col C": "Dan",
"Col D": "Sunday"
},
{
"Col A": "A",
"Col B": "1",
"Col C": "Danil",
"Col D": "Sunday"
},
{
"Col A": "A",
"Col B": "2",
"Col C": "Charles",
"Col D": "Sunday"
},
{
"Col A": "A",
"Col B": "2",
"Col C": "Charls",
"Col D": "Monday"
},
{
"Col A": "B",
"Col B": "1",
"Col C": "Andi",
"Col D": "Sunday"
},
{
"Col A": "B",
"Col B": "1",
"Col C": "Andy",
"Col D": "Sunday"
}
]
df = pd.DataFrame.from_dict(data) # create dataframe from dict
id_counter = 1
score_th = 0.91 # your score threshold
scores, ids = [], []
for group_id, group_df in df.groupby(['Col A', 'Col B']):
values = group_df['Col C'].tolist() # extract 'Col C' values as a list
query = values[0] # first element of list is the good string
choices = values[1:] # others remaining elements are wrongs
scores.append(1) # first element has score 1
ids.append(id_counter) # first element has the current iteration id
res = process.extract(query, choices) # make fuzzywuzzy
for word, score in res:
score = score / 100
if score < score_th: # if score is lower then threshold, increment counter
id_counter += 1
scores.append(score)
ids.append(id_counter)
df['score'] = scores
df['id'] = ids
print(df)
Output will be:
Col A Col B Col C Col D score id
0 A 1 Daniel Sunday 1.00 1
1 A 1 Dan Sunday 0.91 1
2 A 1 Danil Sunday 0.90 2
3 A 2 Charles Sunday 1.00 2
4 A 2 Charls Monday 0.92 2
5 B 1 Andi Sunday 1.00 2
6 B 1 Andy Sunday 0.75 3
P.S.: Note that the score does not come out exactly as you proposed. Maybe you need to investigate in the library if there is some parameter to set about it.