0

I have a huge data table with millions of rows that consists of Merchandise code with its description. I want to assign a category to each group (based on the combination of code and description). The problem is that the description is spelled in different ways and I want to convert all the similar names into a single one. Here is an illustrative example:

ibrary(data.table)
dt <- data.table(code = c(rep(1,2),rep(2,2),rep(3,2)), name = c('McDonalds','Mc 
Dnald','Macys','macy','Comcast','Com-cats'))
dt[,cat:='NA']
setkeyv(dt,c('code','name'))
dt[.(1,'McDonalds'),cat:='Restaurant']
dt[.(1,'Mc Dnald'),cat:='Restaurant']
dt[.(1,'Macys'),cat:='Department Store']

Of course in the real case, it is impossible to go through all the spelling that refer to the same word and fix them manually. Is there a way to detect all the similar words and convert them to a single (correct) spelling?

Thanks in advance

Wiktor Stribiżew
  • 607,720
  • 39
  • 448
  • 563
H_A
  • 667
  • 2
  • 6
  • 13
  • 1
    Maybe has a try on package `stringdist`. You won't get a precise answer unless go through manually though, – Frank Zhang Mar 21 '21 at 09:36
  • 1
    I second the thought of @FrankZhang. I'm doing exactly, the same in one of my real world cases. Idea is that you a) create a lookup list of the most common spellings of a company, like "MCD", "Macys" etc. and assign your category as a secodn column. b) you'll use the stringdist package to calculate distance measures of your input words to your lookup list. Note that this will take long if you really have millions of rows (and it will become very slow if your lookup list is incredibly long. c) you decide on a distance threshold until which you'd assign your input words to a certain category... – deschen Mar 21 '21 at 10:12
  • ...I had good experiences with a distance of 0.1-0.2 and using teh jaro-winkler algorithm. d) in the end you should probably scan through at least an example of your data to check if the assignment of your categories was correct. Of course, there always will be fals positives and false negatives. – deschen Mar 21 '21 at 10:13
  • A computational more efficient idea could be to first assign categories to all the correct spellings and then only use the stringdist approach for those cases that have wrong spellings. – deschen Mar 21 '21 at 10:15
  • Thanks a lot. I am discovering this option – H_A Mar 21 '21 at 10:28

0 Answers0