R: Correct strings by distance measure (stringdistmatrix)

Question

I need to count the unique names of people in a string while allowing for slight typos. My idea is to treat strings whose distance falls below a certain threshold (e.g. a Levenshtein distance below 2) as equal. So far I can calculate the string distances, but I don't know how to use them to adjust my input string so that I get the correct number of unique names.

library(stringdist)
library(stringr)
names <- "Michael, Liz, Miichael, Maria"
names_split <- strsplit(names, ", ")[[1]]
stringdistmatrix(names_split, names_split)
     [,1] [,2] [,3] [,4]
[1,]    0    6    1    5
[2,]    6    0    7    4
[3,]    1    7    0    6
[4,]    5    4    6    0
(number_of_people <- str_count(names, ",") + 1)
[1] 4

The correct value of number_of_people should, of course, be 3.

As I am only interested in the number of unique names, I don't mind whether "Michael" is replaced by "Miichael" or the other way round.

Not sure the problem is well defined. Consider these names: Maria, Mara, Sara, Sarah. Maria and Sarah have a distance >2, but each successive pair has a distance 1. Also, most people would think that that name list contains 3 unique names. — Claus Wilke, Dec 16 '17 at 19:37
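To make the comment's point concrete, here is a minimal sketch. It uses base R's adist() (Levenshtein distance) instead of stringdist so that it runs without extra packages. The variable names are illustrative, and single-linkage clustering is an assumption chosen to stand in for "merge any two names within the threshold".

```r
# Names from the comment above
x <- c("Maria", "Mara", "Sara", "Sarah")

# Base R Levenshtein distances, labelled for readability
d <- adist(x)
dimnames(d) <- list(x, x)
d

# Neighbouring names are 1 edit apart, but the two ends of the chain are further apart
d["Maria", "Sarah"]

# Single linkage merges anything reachable through a chain of close pairs
chain_groups <- cutree(hclust(as.dist(d), method = "single"), h = 1)
length(unique(chain_groups))
```

With a cutoff of 1, every name ends up in the same group even though "Maria" and "Sarah" are three edits apart, so the resulting count depends on how close pairs are chained together and not only on the threshold.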

score 0 · Answer 1 · answered Jan 04 '18 at 10:40

One option is to try to cluster the names based on their distance matrix:

library(stringdist)
# create a 'dist' object (=lower triangular part of distance matrix)
d <- stringdistmatrix(names_split,method="osa")
# use hierarchical clustering to group nearest neighbors
hc <- hclust(d)
# visual inspection: y-axis labels the distance value
plot(hc)
# decide what distance value you find acceptable for grouping.
cutree(hc, h=3)

Depending on your actual data, you will need to experiment with the distance type (q-gram or cosine distances may be useful, or the Jaro-Winkler distance in the case of names).
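To get from the clustering to the count asked for in the question, one possibility is to count the distinct cluster labels returned by cutree(). This is only a sketch: it assumes the cutoff h = 3 used above suits your data, and the name `groups` is introduced here for illustration.

```r
library(stringdist)

# Input from the question
names <- "Michael, Liz, Miichael, Maria"
names_split <- strsplit(names, ", ")[[1]]

# Cluster on the OSA distance and cut the tree at the chosen height
d <- stringdistmatrix(names_split, method = "osa")
hc <- hclust(d)
groups <- cutree(hc, h = 3)

# Inspect which spellings were grouped together
split(names_split, groups)

# Each cluster label stands for one person
number_of_people <- length(unique(groups))
number_of_people
```

For the example string, "Michael" and "Miichael" fall into one cluster while "Liz" and "Maria" stay separate, so number_of_people is 3. A different cutoff or distance method can change the grouping, so it is worth checking the output of split() on real data.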
