I have a large vector of strings like this:
d <- c("herb", "market", "merchandise", "fun", "casket93", "old", "herbb", "basket", "bottle", "plastic", "baskket", "markket", "pasword", "plastik", "oldg", "mahagony", "mahaagoni", "sim23", "asket", "trump" )
For each string, I want to fetch the similar strings from the same vector d.
I am doing this by:
1. calculating, for each string, the edit distance to all other strings, subject to certain rules such as forcing exact matching if the string contains any digits or has fewer than 5 alphabetic characters.
2. putting the distances in a data frame Dist along with the strings.
3. subsetting Dist to rows with distance < 3.
4. collapsing the similar strings and adding them to the original data frame as a new column.
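To make the matching rule concrete, the steps above can be sketched for a single query string as follows. This is only a minimal illustration: `similar_to` is a hypothetical helper name, `strings` is a copy of the example vector, and the letter count uses base R rather than stringr.

```r
library(stringdist)

# Copy of the example vector, named `strings` here for illustration
strings <- c("herb", "market", "merchandise", "fun", "casket93", "old", "herbb",
             "basket", "bottle", "plastic", "baskket", "markket", "pasword",
             "plastik", "oldg", "mahagony", "mahaagoni", "sim23", "asket", "trump")

# Hypothetical helper: strings in `pool` considered similar to the query `x`
similar_to <- function(x, pool, max_dist = 2) {
  # exact matching if x contains a digit or has fewer than 5 letters
  n_alpha <- nchar(gsub("[^[:alpha:]]", "", x))
  exact <- grepl("[[:digit:]]", x) || n_alpha < 5
  # Levenshtein distance from x to every string in the pool
  dists <- stringdist(x, pool, method = "lv")
  limit <- if (exact) 0 else max_dist
  pool[dists <= limit]
}

short_hits <- similar_to("herb", strings)      # short string: exact match only
digit_hits <- similar_to("casket93", strings)  # contains digits: exact match only
fuzzy_hits <- similar_to("basket", strings)    # long enough: edit distance <= 2
```

Here `max_dist = 2` corresponds to the "distance < 3" cutoff in step 3.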
I am using the stringr and stringdist packages:
library(stringr)
library(stringdist)

d <- data.frame(d = d, stringsAsFactors = FALSE)
M <- nrow(d)
# working data frame: one row per string, distances refilled on every iteration
Dist <- data.frame(string = d$d, dist = NA_real_, stringsAsFactors = FALSE)
d$sim <- character(M)

for (i in seq_len(M)) {
  # if the string has digits or is short (< 5 letters), do exact matching
  if (grepl("[[:digit:]]", d[i, "d"]) || str_count(d[i, "d"], "[[:alpha:]]") < 5) {
    # maxDist as a fraction to force exact matching
    Dist$dist <- stringdist(d[i, "d"], d$d, method = "lv", maxDist = 0.000001)
  # otherwise do approximate matching
  } else {
    Dist$dist <- stringdist(d[i, "d"], d$d, method = "lv", maxDist = 3)
  }
  # subset similar strings (with edit distance < 3)
  subDist <- subset(Dist, dist < 3)
  # add to the original data frame d
  d[i, "sim"] <- paste(subDist$string, collapse = ", ")
}
Is it possible to vectorise this procedure instead of using a loop? I have a very large vector of strings, so calculating a distance matrix with stringdistmatrix on the entire vector can't be done due to memory restrictions. The loop works fine for large data, but it is very slow.
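For a sense of scale, here is a rough back-of-the-envelope estimate of the memory a full pairwise distance matrix would need. It assumes a hypothetical vector of one million strings (an illustrative figure only, not the actual size of the data) and 8 bytes per stored distance.

```r
n <- 1e6                                   # hypothetical number of strings
bytes_per_value <- 8                       # one double per distance

full_bytes  <- n^2 * bytes_per_value               # full n x n matrix
lower_bytes <- n * (n - 1) / 2 * bytes_per_value   # lower triangle only

full_tb  <- full_bytes / 1e12              # size in terabytes
lower_tb <- lower_bytes / 1e12
```

Even storing only the lower triangle, the requirement grows quadratically with the number of strings, whereas the loop above only ever holds one vector of M distances at a time.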