I am trying to create a distance matrix (to use for clustering) for strings, based on a customized distance function. I ran the code on a list of 6000 words and it has been running for the last 90 minutes. I have 8 GB of RAM and an Intel i5, so I assume the problem is in the code itself. Here is my code:
library(stringdist)
# Calculate distance between two monograms/bigrams
stringdist2 <- function(word1, word2)
{
  # for bigrams - phrases with two words
  if (grepl(" ", word1) == TRUE) {
    # "Hello World" and "World Hello" are not so different for me
    d = min(stringdist(word1, word2),
            stringdist(word1, gsub(word2,
                                   pattern = "(.*) (.*)",
                                   repl = "\\2,\\1")))
  }
  # for monograms (words)
  else {
    # add penalty of 5 points if first character is not same
    # brave and crave are more different than brave and bravery
    d = ifelse(substr(word1, 1, 1) == substr(word2, 1, 1),
               stringdist(word1, word2),
               stringdist(word1, word2) + 5)
  }
  d
}
# create distance matrix
stringdistmat2 = function(arr)
{
  mat = matrix(nrow = length(arr), ncol = length(arr))
  for (k in 1:(length(arr) - 1))
  {
    for (j in k:(length(arr) - 1))
    {
      mat[j + 1, k] = stringdist2(arr[k], arr[j + 1])
    }
  }
  as.dist(mat)
}
test = c("Hello World","World Hello", "Hello Word", "Cello Word")
mydmat = stringdistmat2(test)
> mydmat
  1 2 3
2 1
3 1 2
4 2 3 1
I think the issue could be that I used loops instead of apply, but I have read in many places that loops are not that inefficient. More importantly, I am not skilled enough to use apply here, because my loops are nested and of the form k in 1:n and j in k:n. I wonder if there are other things that can be optimized as well.
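For reference, here is a rough sketch of one direction that might avoid the nested loops altogether. For 6000 words the loops make about 18 million separate calls to stringdist2, each of which calls stringdist on single strings. The sketch instead calls stringdistmatrix from the same stringdist package on the whole vector and applies the two rules with matrix operations. The name stringdistmat2_vec is made up for illustration, and the code assumes the rules are meant exactly as written in stringdist2 above, including the comma in the swapped string. It builds a few full n x n matrices, so memory use grows with the square of the number of words.

```r
library(stringdist)

# Hypothetical vectorized variant of stringdistmat2 (the name is illustrative).
stringdistmat2_vec <- function(arr) {
  # All pairwise base distances in one call; base[r, c] = stringdist(arr[r], arr[c])
  base <- stringdistmatrix(arr, arr)

  # Same word swap as in stringdist2 (comma included), done once per string
  swapped <- gsub("(.*) (.*)", "\\2,\\1", arr)
  # swap[r, c] = stringdist(swapped[r], arr[c])
  swap <- stringdistmatrix(swapped, arr)

  # Monogram rule: 5-point penalty wherever the first characters differ
  first <- substr(arr, 1, 1)
  mat <- base + 5 * outer(first, first, "!=")

  # stringdist2 branches on word1 = arr[k], i.e. on the column of mat[j + 1, k],
  # so columns whose string contains a space use the bigram rule instead
  has_space <- grepl(" ", arr, fixed = TRUE)
  mat[, has_space] <- pmin(base, swap)[, has_space]

  as.dist(mat)  # keeps only the lower triangle, like the loop version
}

test <- c("Hello World", "World Hello", "Hello Word", "Cello Word")
mydmat_vec <- stringdistmat2_vec(test)
```

On the small test vector this is intended to give the same lower triangle as the loop version. I have not timed it on the full 6000-word list.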