3

I am trying to clean data using fuzzy matching. The DataFrame looks like this:

category description
1        almnd
1        almond
2        choc
2        choco

I want all similar descriptions within the same category to be replaced by a single value, like this:

category description
1        almnd
1        almnd
2        choc
2        choc
pythonic833

2 Answers

2

FuzzyWuzzy might not be up to such a task. You basically need to cluster words by similarity. You can find a few suggestions and code examples here:

https://stats.stackexchange.com/questions/123060/clustering-a-long-list-of-strings-words-into-similarity-groups

If you find the number of words and ideas there excessive, try Gensim's most_similar function for an easy solution:

Python: clustering similar words based on word2vec
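
To make the idea of clustering by similarity concrete, here is a minimal sketch that uses only the standard library's difflib instead of FuzzyWuzzy or Gensim, so it is a swapped-in technique rather than what the links above describe. The function name canonicalize and the 0.8 threshold are assumptions chosen for illustration; the threshold would likely need tuning on real data. Within each category, every description is replaced by the first earlier description it resembles closely enough.

```python
from difflib import SequenceMatcher


def canonicalize(descriptions, threshold=0.8):
    """Replace each string with the first earlier string that is similar enough."""
    representatives = []  # one canonical spelling per cluster found so far
    cleaned = []
    for text in descriptions:
        match = next(
            (rep for rep in representatives
             if SequenceMatcher(None, rep, text).ratio() >= threshold),
            None,
        )
        if match is None:
            # Nothing close enough yet: this string starts a new cluster.
            representatives.append(text)
            match = text
        cleaned.append(match)
    return cleaned


# Sample rows from the question, as (category, description) pairs.
rows = [(1, "almnd"), (1, "almond"), (2, "choc"), (2, "choco")]

# Group descriptions by category so matching never crosses categories.
by_category = {}
for category, description in rows:
    by_category.setdefault(category, []).append(description)

cleaned_by_category = {
    category: canonicalize(descriptions)
    for category, descriptions in by_category.items()
}
```

This greedy pass depends on row order, since the first spelling seen in a category becomes the canonical one. That happens to match the output asked for in the question, but a proper clustering method may group large or noisy data differently.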

Serge
1

Convert your dataframe to a dictionary, and remap.

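# Build a category -> description lookup (for duplicate categories the last row wins)
dico = dict(df.to_dict('split')['data'])
# Overwrite each description with the one stored for its category
df['description'] = df['category'].map(dico)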

If your dataframe actually has more than these two columns, check the accepted answer on dictionary extraction:

dataframe to dict such that one column is the key and the other is the value
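
As a possible alternative to the dictionary remap, the sketch below uses groupby with transform("first") to copy the first description of each category onto every row in that category. It assumes, as in the question's sample, that each category should collapse to exactly one description and that the first row holds the spelling to keep. The sample DataFrame is rebuilt here only to keep the example self-contained.

```python
import pandas as pd

# Sample data from the question.
df = pd.DataFrame({
    "category": [1, 1, 2, 2],
    "description": ["almnd", "almond", "choc", "choco"],
})

# Broadcast the first description seen in each category to every row of it.
df["description"] = df.groupby("category")["description"].transform("first")
```

Unlike the dictionary approach, where the last row of each category wins, this keeps the first one. It also leaves any other columns in the dataframe untouched.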

Serge