Below is a sample of my df
name
A S BITO
A S KIGEL
A S NATURENERGI
A S NATURENERGIE
A S NATURENERGIE
A S P BU SERVICE POWER P
A S P BU SERVICE POWER P
A S P BU SERVICE POWER PETER GMBH
A S P GMBH
A RESE LAND
A RITTER WITH SA
A RITTER WITH SA
A RITTER WITH SA
A RITTER SA CO
A RITTER SA CO
A RITTER SA CO
A RITTER SA CO
A RITTER WITH MASCHINE
A RITTER WITH MASCHINE SA CO
A RITTER WITH MASCHINE SA CO
The aim is to replace the name by unique value with most occurence
Below is a list of unique values
name occurences
A S BITO 1
A S KIGEL 1
A S NATURENERGI 1
A S NATURENERGIE 2
A S P BU SERVICE POWER P 2
A S P BU SERVICE POWER PETER GMBH 1
A S P GMBH 1
A RESE LAND 1
A RITTER WITH SA 3
A RITTER SA CO 4
A RITTER WITH MASCHINE 1
A RITTER WITH MASCHINE SA CO 2
As you can see with the DF, some names can be grouped.
However, due to misspelling, there aren't.
The desired output would look like this
name
A S BITO
A S KIGEL
A S NATURENERGIE
A S NATURENERGIE
A S NATURENERGIE
A S P BU SERVICE POWER P
A S P BU SERVICE POWER P
A S P BU SERVICE POWER P
A S P GMBH
A RESE LAND
A RITTER SA CO
A RITTER SA CO
A RITTER SA CO
A RITTER SA CO
A RITTER SA CO
A RITTER SA CO
A RITTER SA CO
A RITTER SA CO
A RITTER SA CO
A RITTER SA CO
Below is the code tried
df['name'] = df['name'].replace('A S NATURENERGI', 'A S NATURENERGIE')
df['name'] = df['name'].replace('A S P BU SERVICE POWER PETER GMBH', 'A S P BU SERVICE POWER P')
df['name'] = df['name'].replace('A RITTER WITH SA', 'A RITTER SA CO')
df['name'] = df['name'].replace('A RITTER WITH MASCHINE', 'A RITTER SA CO')
df['name'] = df['name'].replace('A RITTER WITH MASCHINE SA CO ', 'A RITTER SA CO')
However, it might not be the best way to deal with that.
Thus, I was thinking about using difflib and computing a matching score.
The next step will be to replace the name with the highest score match.
f = partial(difflib.get_close_matches, possibilities= df['name'].tolist(), n=1) #
matches = df['name'].map(f).str[0].fillna('')
scores = [difflib.SequenceMatcher(None, x, y).ratio() for x, y in zip(matches, df['name'])]
df_diff = df.assign(best=matches, score=scores)
The downside of this method is that I will retrieve exactly the same name...
So, if anyone has some ideas, thanks a lot!