I would like to ask how to remove approximate duplicates using fuzzy matching in Python, or ANY METHOD that is feasible. I have an Excel file that contains names that are approximately similar. I would like to remove the names that are highly similar to one another and keep only one of them.
For instance, here is the input (an Excel file). There are 6 rows, each with a name column followed by 5 data columns:
|-------------------|-----|-----|-----|-----|-----|
| abby_john | abc | abc | abc | abc | abc |
|-------------------|-----|-----|-----|-----|-----|
| abby_johnny | def | def | def | def | def |
|-------------------|-----|-----|-----|-----|-----|
| a_j | ghi | ghi | ghi | ghi | ghi |
|-------------------|-----|-----|-----|-----|-----|
| abby_(john) | abc | abc | abc | abc | abc |
|-------------------|-----|-----|-----|-----|-----|
| john_abby_doe | def | def | def | def | def |
|-------------------|-----|-----|-----|-----|-----|
| aby_/_John_Doedy | ghi | ghi | ghi | ghi | ghi |
|-------------------|-----|-----|-----|-----|-----|
Although all of the names above look different, they are actually the same. How can Python recognize that they are all the same, remove the duplicated names, and keep ANY ONE of the names together with its entire row? By the way, the input file is in Excel format (.xlsx).
Desired output:
|-------------------|-----|-----|-----|-----|-----|
| abby_john | abc | abc | abc | abc | abc |
|-------------------|-----|-----|-----|-----|-----|
Since the underscore is not very important, it can be replaced with a space, so the following output is also acceptable.
Another desired output:
|-------------------|-----|-----|-----|-----|-----|
| abby john | abc | abc | abc | abc | abc |
|-------------------|-----|-----|-----|-----|-----|
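One possible direction, as a minimal sketch rather than a definitive solution: normalize each name (lowercase, treat underscores and punctuation as spaces, sort the words), score pairs with the standard library's `difflib.SequenceMatcher`, and keep only the first row of each group of similar names. The helper names `normalize`, `similarity`, and `dedupe_rows`, the 0.8 threshold, and the shortened two-column rows are all assumptions made for illustration. `difflib` is used here only to keep the example self-contained; a dedicated fuzzy-matching library could be swapped in for the scoring step.

```python
import re
from difflib import SequenceMatcher


def normalize(name):
    """Lowercase, split on anything that is not a letter or digit, sort the words."""
    tokens = re.findall(r"[a-z0-9]+", name.lower())
    return " ".join(sorted(tokens))


def similarity(a, b):
    """Character-level similarity (0.0 to 1.0) of the two normalized names."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


def dedupe_rows(rows, threshold=0.8):
    """Greedy pass: keep a row only if its name (first cell) is not
    similar to the name of any row that was already kept."""
    kept = []
    for row in rows:
        if all(similarity(row[0], k[0]) < threshold for k in kept):
            kept.append(row)
    return kept


# Sample rows from the question, shortened to two columns (hypothetical layout).
rows = [
    ["abby_john", "abc"],
    ["abby_johnny", "def"],
    ["a_j", "ghi"],
    ["abby_(john)", "abc"],
    ["john_abby_doe", "def"],
    ["aby_/_John_Doedy", "ghi"],
]

result = dedupe_rows(rows)
for row in result:
    print(row)
```

With this threshold, `abby_johnny` and `abby_(john)` are dropped as duplicates of `abby_john`, but `a_j` is kept: initials score poorly against full names under a plain character ratio, so collapsing every row into one would need a looser threshold or an extra rule for initials. To work on the real file, the rows could be loaded with something like `pandas.read_excel(...)` and converted to lists first, assuming pandas is available.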
I would appreciate it a lot if anyone could help me out. Thanks!