I have some code that I would like to vectorize, but I am not sure how. The following code creates some example data consisting of names and addresses.
name <- c("holiday inn", "geico", "zgf", "morton phillips")
address <- c("400 lafayette pl tupelo ms", "227 geico plaza chevy chase md",
"811 quincy st washington dc", "1911 1st st rockville md")
source1 <- data.frame(name, address)
name <- c("williams sonoma", "mamas bbq", "davis polk", "hop a long diner",
"joes crag shack", "mike lowry place", "holiday inn", "zummer")
name2 <- c(NA, NA, NA, NA, NA, NA, "hi express", "zummer gunsul frasca")
address <- c("2 reads way new castle de", "248 w 4th st newark de",
"1100 21st st nw washington dc", "1804 w 5th st wilmington de",
"1208 kenwood parkway holdridge nb", "4203 ocean drive miami fl",
"400 lafayette pl tupelo ms", "811 quincy st washington dc")
source2 <- data.frame(name, name2, address)
This block calculates the Levenshtein distance between two columns of text via R's native adist function and then applies the min function.
dist.name<- adist(source1$name,source2$name, partial = TRUE, ignore.case = TRUE)
dist.address <- adist(source1$address, source2$address, partial = TRUE, ignore.case = TRUE)
min.name<-apply(dist.name, 2, min)
min.address <- apply(dist.address, 2, min)
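A quick shape check may help when reading the block above. This is only a sketch using the name vectors from the example data (the variables `x`, `y`, and `d` are illustrative names, not part of the original code). `adist(x, y)` returns a matrix with one row per element of `x` and one column per element of `y`, so `apply(d, 2, min)` gives one minimum per source2 entry, while `apply(d, 1, min)` gives one minimum per source1 entry.

```r
# Illustrative copies of the two name columns from the example data.
x <- c("holiday inn", "geico", "zgf", "morton phillips")
y <- c("williams sonoma", "mamas bbq", "davis polk", "hop a long diner",
       "joes crag shack", "mike lowry place", "holiday inn", "zummer")

d <- adist(x, y, partial = TRUE, ignore.case = TRUE)

d.dim   <- dim(d)                    # rows follow x, columns follow y
per.col <- length(apply(d, 2, min))  # one minimum per element of y
per.row <- length(apply(d, 1, min))  # one minimum per element of x
```

Since the desired result has one row per source1 entry, the row-wise minimum (margin 1) is probably the one that lines up with indexing by `i` over `nrow(dist.name)`.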
I would like to do the following:
1. Match source1$name with source2$name based on the minimum Levenshtein distance.
2. If the results of 1 yield an NA, match based on source1$address and source2$address using Levenshtein distance.
I have tried using a for loop, which works fine for 1 but not 2. Here is the code I used to try and incorporate both:
match.s1.s2 <- NULL
for (i in 1:nrow(dist.name)) {
  for (j in 1:nrow(dist.address)) {
    if (is.na(match(min.name[i], dist.name[i, ]))) {
      s2.i <- match(min.address[j], dist.address[j, ])
      s1.i <- i
      match.s1.s2 <- rbind(data.frame(s2.i = s2.i, s1.i = s1.i,
                                      s2name = source2[s2.i, ]$name,
                                      s1name = source1[s1.i, ]$name,
                                      adist = min.name[j],
                                      s1.i.address = source1[s1.i, ]$address,
                                      s2.i.address = source2[s2.i, ]$address),
                           match.s1.s2)
    } else {
      s2.i <- match(min.name[i], dist.name[i, ])
      s1.i <- i
      match.s1.s2 <- rbind(data.frame(s2.i = s2.i, s1.i = s1.i,
                                      s2name = source2[s2.i, ]$name,
                                      s1name = source1[s1.i, ]$name,
                                      adist = min.name[i],
                                      s1.i.address = source1[s1.i, ]$address,
                                      s2.i.address = source2[s2.i, ]$address),
                           match.s1.s2)
    }
  }
}
My problem is that it's slow and ends up producing a data frame that is much too large. The end result, the data frame match.s1.s2, should have the same number of rows as source1. Any advice or help would be much appreciated. Thanks.
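
For reference, the nested loop appends one row for every (i, j) pair, which is 4 × 4 = 16 rows for the example data, so the output cannot have one row per source1 entry. Below is a minimal sketch of the two-step rule without the nested loop or the repeated rbind. It rests on assumptions: `best_col` is a hypothetical helper, and "yields an NA" is interpreted as "every name distance in that row is NA" (for instance, a missing name). The name2 column is left out because the sketch does not use it.

```r
# Example data copied from the question; strings are kept as character.
source1 <- data.frame(
  name = c("holiday inn", "geico", "zgf", "morton phillips"),
  address = c("400 lafayette pl tupelo ms", "227 geico plaza chevy chase md",
              "811 quincy st washington dc", "1911 1st st rockville md"),
  stringsAsFactors = FALSE
)
source2 <- data.frame(
  name = c("williams sonoma", "mamas bbq", "davis polk", "hop a long diner",
           "joes crag shack", "mike lowry place", "holiday inn", "zummer"),
  address = c("2 reads way new castle de", "248 w 4th st newark de",
              "1100 21st st nw washington dc", "1804 w 5th st wilmington de",
              "1208 kenwood parkway holdridge nb", "4203 ocean drive miami fl",
              "400 lafayette pl tupelo ms", "811 quincy st washington dc"),
  stringsAsFactors = FALSE
)

dist.name    <- adist(source1$name, source2$name,
                      partial = TRUE, ignore.case = TRUE)
dist.address <- adist(source1$address, source2$address,
                      partial = TRUE, ignore.case = TRUE)

# Hypothetical helper: for each row (source1 entry), the column index of the
# smallest distance, or NA when the whole row is NA.
best_col <- function(d) {
  apply(d, 1, function(r) if (all(is.na(r))) NA_integer_ else which.min(r))
}

by.name    <- best_col(dist.name)
by.address <- best_col(dist.address)

# Assumption: fall back to the address match only where the name match is NA.
use.address <- is.na(by.name)
s1.i <- seq_len(nrow(source1))
s2.i <- ifelse(use.address, by.address, by.name)
idx  <- cbind(s1.i, s2.i)  # (row, column) pairs for matrix indexing

match.s1.s2 <- data.frame(
  s2.i = s2.i,
  s1.i = s1.i,
  s2name = source2$name[s2.i],
  s1name = source1$name,
  adist = ifelse(use.address, dist.address[idx], dist.name[idx]),
  s1.i.address = source1$address,
  s2.i.address = source2$address[s2.i],
  stringsAsFactors = FALSE
)
```

Because every column is built once from vectors of length nrow(source1), the result has exactly one row per source1 entry. The helper still calls `apply` once per distance matrix, so this is a reduction in looping rather than a fully vectorized solution; ties are resolved by `which.min`, which takes the first minimum.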