I am dealing with the problem that I need to count unique names of people in a string, but taking into consideration that there may be slight typos. My thought was to set strings below a certain threshold (e.g. levenshtein distance below 2) as being equal. Right now I manage to calculate the string distances, but not making any changes to my input string that would get me the correct number of unique names.
library(stringdist);library(stringr)
names<-"Michael, Liz, Miichael, Maria"
names_split<-strsplit(names, ", ")[[1]]
stringdistmatrix(names_split,names_split)
[,1] [,2] [,3] [,4]
[1,] 0 6 1 5
[2,] 6 0 7 4
[3,] 1 7 0 6
[4,] 5 4 6 0
(number_of_people<-str_count(names, ",")+1)
[1] 4
The correct value of number_of_people should be, of course, 3.
As I am only interested in the number of uniques names, I am not concerned if "Michael" becomes replaced by "Miichael" or the other way round.