2

I have a dataset of random words and names and I am trying to group all of the similar words and names. So given the dataframe below:

     Name           ID            Value
0    James           1             10
1    James 2         2             142
2    Bike            3             1
3    Bicycle         4             1197
4    James Marsh     5             12
5    Ants            6             54
6    Job             7             6
7    Michael         8             80007  
8    Arm             9             47 
9    Mike K          10            9
10   Michael k       11            1

My pseudo code would be something like:

import pandas as pd
from fuzzywuzzy import fuzz

minratio = 95
for idx1, name1 in df['Name'].iteritems():
   for idx2, name2 in df['Name'].iteritems():
      ratio = fuzz.WRatio(name1, name2)
      if ratio > minratio:
          grouped = df.groupby(['Name', 'ID'])['Value']\
                        .agg(Total_Value='sum', Group_Size='count')

This would then give me the desired output:

print(grouped)
     Name           ID            Total_Value          Group_Size
0    James           1             164                     3 # All James' grouped
2    Bike            3             1198                    2 # Bike's and Bicycles grouped
5    Ants            6             54                      1 
6    Job             7             6                       1
7    Michael         8             80017                   3 # Mike's and Michael's grouped
8    Arm             9             47                      1

Obviously this doesn't work, and honestly, I am not sure if this is even possible, but this is what I'm trying to accomplish. Any advice that could get me on the right track would be useful.

DrakeMurdoch
  • 765
  • 11
  • 26

1 Answers1

10

Using affinity propagation clustering (not perfect but maybe a starting point):

import pandas as pd
import numpy as np
import io
from fuzzywuzzy import fuzz
from scipy import spatial
import sklearn.cluster

s="""Name           ID            Value
0    James           1             10
1    James 2         2             142
2    Bike            3             1
3    Bicycle         4             1197
4    James Marsh     5             12
5    Ants            6             54
6    Job             7             6
7    Michael         8             80007  
8    Arm             9             47 
9    Mike K          10            9
10   Michael k       11            1"""
df = pd.read_csv(io.StringIO(s),sep='\s\s+',engine='python')

names = df.Name.values
sim = spatial.distance.pdist(names.reshape((-1,1)), lambda x,y: fuzz.WRatio(x,y))
affprop = sklearn.cluster.AffinityPropagation(affinity="precomputed", random_state=None)
affprop.fit(spatial.distance.squareform(sim))

res = df.groupby(affprop.labels_).agg(
        Names=('Name',','.join),
        First_ID=('ID','first'),
        Total_Value=('Value','sum'),
        Group_Size=('Value','count')
        )

Result

                                Names  First_ID  Total_Value  Group_Size
0  James,James 2,James Marsh,Ants,Arm         1          265           5
1                        Bike,Bicycle         3         1198           2
2                                 Job         7            6           1
3            Michael,Mike K,Michael k         8        80017           3
Stef
  • 28,728
  • 2
  • 24
  • 52
  • 1
    Why does this put `Ants` and `Arm` in the same category as `James`? – DrakeMurdoch May 26 '20 at 17:40
  • 1
    This doesn't seem to take the `WRatio` into account. – DrakeMurdoch May 26 '20 at 17:47
  • 1
    As I said, it's not perfect. Maybe it can be improved by tuning some parameters or applying a different distance measure (where e.g. Ants and James are further apart). Why do you think it doesn't take `WRatio` into account? In fact it is the basis for the clustering. You can print `spatial.distance.squareform(sim)` to see the distance matrix. – Stef May 26 '20 at 18:58
  • 2
    for instance, using Damerau-Levenshtein distance (https://pypi.org/project/pyxDamerauLevenshtein), we get the following groups: `'James,James 2,James Marsh', 'Bike,Bicycle,Mike K', 'Ants,Job,Arm', 'Michael,Michael k'`. – Stef May 26 '20 at 19:26