R: Consolidate different spellings of the same entry into one

Question

I have a data set that is sorted by company names. Sometimes the names are misspelled and show as unique entries:

Name
ABC Company
ABc Company
DEF Company
def compANY
Ddf Cmpany
abC comPany

In fact, these entries are variations of just two company names. This is clearly a problem with my initial data set, but I need to fix it to process my data correctly. The result I am after is:

Name
ABC Company
DEF Company

I don't know how I can approach this, other than long loops that test modified versions of the words against a dictionary-like data structure. Is there a library for spellchecking (and would that even make sense for company names)?

I'd appreciate any help and don't have a preference for any package. Thank you.

How does one know that there is no company called `Ddf Cmpany`? Do you have a list of all possible companies? — G5W, Jun 15 '20 at 16:06

score 2 · Accepted Answer · answered Jun 15 '20 at 16:18

You can use `adist` to compute approximate string distances between the names, pass those distances to `hclust` to build a hierarchical clustering, and then split the tree into groups with `cutree`.

hc <- hclust(as.dist(adist(Name, ignore.case = TRUE)))
Name[!duplicated(cutree(hc, k = 2))] # first name in each of the two groups
#[1] "ABC Company" "DEF Company"
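
If the number of companies is not known in advance, `cutree` also accepts a height `h` in place of `k`. The following is only a sketch: the threshold of 2 edits is an assumption chosen to fit this example, and it would need tuning on real data, since two genuinely different companies can have names that are only a few edits apart.

```r
# Same example names as in the question.
Name <- c("ABC Company", "ABc Company", "DEF Company",
          "def compANY", "Ddf Cmpany", "abC comPany")

# Case-insensitive edit distances, then hierarchical clustering
# (hclust uses complete linkage by default).
hc <- hclust(as.dist(adist(Name, ignore.case = TRUE)))

# Assumption: clusters that merge at a distance of at most 2 edits
# are treated as the same company.
groups <- cutree(hc, h = 2)

# One representative per group: the first name seen in each.
representatives <- Name[!duplicated(groups)]
print(representatives)
```

Plotting the tree with `plot(hc)` can help in picking a sensible cut height before committing to one.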

Data:

Name <- c("ABC Company","ABc Company","DEF Company","def compANY","Ddf Cmpany","abC comPany")
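
To consolidate the column rather than just list the distinct companies, each entry can be mapped to a representative of its group. This is a minimal sketch; taking the first spelling that appears in each group as the canonical one is an assumption, and `canonical` is just an illustrative variable name.

```r
Name <- c("ABC Company", "ABc Company", "DEF Company",
          "def compANY", "Ddf Cmpany", "abC comPany")

hc <- hclust(as.dist(adist(Name, ignore.case = TRUE)))
groups <- cutree(hc, k = 2)

# match(groups, groups) returns, for every entry, the index of the
# first entry that belongs to the same group.
canonical <- Name[match(groups, groups)]

# Original and consolidated spellings side by side.
print(data.frame(Name, canonical))
```

Another rule, such as the most frequent spelling within each group, could be used in place of the first occurrence.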

answered Jun 15 '20 at 16:18

GKi


Thanks @GKi, that helps! – questionmark Jun 15 '20 at 18:57
