6

I am trying to create a distance matrix (to use for clustering) for strings based on customized distance function. I ran the code on a list of 6000 words and it is still running since last 90 minutes. I have 8 GB RAM and Intel-i5, so the problem is with the code only. Here is my code:

library(stringdist)
#Calculate distance between two monograms/bigrams
stringdist2 <- function(word1, word2)
{
    #for bigrams - phrases with two words
    if (grepl(" ",word1)==TRUE) {
        #"Hello World" and "World Hello" are not so different for me
        d=min(stringdist(word1, word2),
        stringdist(word1, gsub(word2, 
                          pattern = "(.*) (.*)", 
                          repl="\\2,\\1")))
    }
    #for monograms(words)
    else{
        #add penalty of 5 points if first character is not same
        #brave and crave are more different than brave and bravery
        d=ifelse(substr(word1,1,1)==substr(word2,1,1),
                            stringdist(word1,word2),
                            stringdist(word1,word2)+5)
    }   
    d
}
#create distance matrix
stringdistmat2 = function(arr)
{
    mat = matrix(nrow = length(arr), ncol= length(arr))
    for (k in 1:(length(arr)-1))
    {
        for (j in k:(length(arr)-1))
        {           
            mat[j+1,k]  = stringdist2(arr[k],arr[j+1])      
        }
    }
    as.dist(mat)    
}

test = c("Hello World","World Hello", "Hello Word", "Cello Word")
mydmat = stringdistmat2(test)
> mydmat
  1 2 3
2 1    
3 1 2  
4 2 3 1

I think issue could be that I used loops instead of apply - but then I found at many places that loops are not that inefficient. More importantly I am not skilled enough to use apply for my loops are nested loops are like k in 1:n and j in k:n. I wonder if there are other things which can be optimized as well.

Gaurav Singhal
  • 998
  • 2
  • 10
  • 25
  • Waiting here and at R screen did not seem right so I opened one more R console and tried this: `arr1 = c("Hello World","World Hello", "Hello Word", "Cello Word") mytest = function(arr1){as.dist(sapply(arr1,stringdist,b=arr1))} mytest(arr1)` It gave me desired distance matrix. Now I am wondering how to change my function so that it works on vectors – Gaurav Singhal Sep 02 '15 at 09:36

3 Answers3

4

Interesting question. So going step by step:

1 - stringdist function is already vectorized:

#> stringdist("byye", c('bzyte','byte'))
#[1] 2 1

#> stringdist(c('doggy','gadgy'), 'dodgy')
#[1] 1 2

But giving two vectors with the same length, stringdist will result in looping parallelly on each vector (not resulting in a matrix with cross results), as Map would do:

#> stringdist(c("byye","alllla"), c('bzyte','byte'))
#[1] 2 6

2 - Rewrite your function so that your new function keeps this vectorized feature:

stringdistFast <- function(word1, word2)
{
    d1 = stringdist(word1, word2)
    d2 = stringdist(word1, gsub("(.+) (.+)", "\\2 \\1", word2))

    ifelse(d1==d2,d1+5*(substr(d1,1,1)!=substr(d2,1,1)),pmin(d1,d2))
}

It is indeed working the same way:

#> stringdistFast("byye", c('bzyte','byte'))
#[1] 2 1

#> stringdistFast("by ye", c('bzyte','byte','ye by'))
#[1] 3 2 0

3 - Rewrite the dismatrix function with only one loopy loop and only on a triangular part (no outer there, it's slow!):

stringdistmatFast <- function(test)
{
    m = diag(0, length(test))
    sapply(1:(length(test)-1), function(i)
    {
        m[,i] <<- c(rep(0,i), stringdistFast(test[i],test[(i+1):length(test)]))
    }) 

    `dimnames<-`(m + t(m), list(test,test))
}

4 - Use the function:

#> stringdistmatFast(test)
#            Hello World World Hello Hello Word Cello Word
#Hello World           0           0          1          2
#World Hello           0           0          1          2
#Hello Word            1           1          0          1
#Cello Word            2           2          1          0
Colonel Beauvel
  • 30,423
  • 11
  • 47
  • 87
3

Loops are indeed very inefficient, and here is a quick example that shows that:

x=rnorm(1000000)
system.time({y1=sum(x)})
system.time({
        y2=0
        for(i in 1:length(x)){
                y2=y2+x[i]
        }
})

This is a simple comparison of internal vectorised function sum(), that essentially just calculates sum of all elements in a cycle internally; second function does the same in R code, which makes it call another internal function + over and over, which is not very efficient.

First of all, you have a couple of mistakes/inconsistencies in your user defined function. This part: gsub(word2, pattern = "(.*) (.*)", repl="\\2,\\1") replaces all white spaces with comas, which automatically adds +1 to distance score (was it intended?) Second of all, you don't compare first letters for strings that have spaces in them, because then only the first part of the function is executed. That is true even if only the first of the compared words contains space, so "Hello " and "Cello" comparison would be calculated as closer distance than "Hello" and "Cello".

Other then that, your code seems to be easy vectorisable, because all the functions you use are already vectorised: stringdist(),grepl(),gsub(),substr() etc. Basically you perform 3 calculations for each word-pair: simple 'stringdist()', stringdist() of swapped words (if there is space in the first word), and simple comparison of first letters that adds +5 points if they are different.

Here is the code that reproduces your function in a vectorised manner, which gives around 50x speed up on calculating 300x300 matrix:

stringdist3<-function(words1,words2){
m1<-stringdist(words1,words2)
m2<-stringdist(words1,gsub(words2, 
                           pattern = "(.*) (.*)", 
                           repl="\\2,\\1"))
m=mapply(function(x,y) min(x,y),m1,m2)

m3<-5*(substr(words1,1,1)!=substr(words2,1,1) & !grepl(" ",words1))

m3+m
}
stringdistmat3 = function(arr){
        outer(arr,arr,function(x,y) stringdist3(x,y))
}
test = c("Hello World","World Hello", "Hello Word", "Cello Word")
arr=sample(test,size=300,replace=TRUE)
system.time({mat = stringdistmat2(arr)})
system.time({
        mat2=stringdistmat3(arr)
        })
Maksim Gayduk
  • 1,051
  • 6
  • 13
  • Ha I have a similar answer but think it's getting faster ;) Nice post anyway, +1 – Colonel Beauvel Sep 02 '15 at 11:04
  • Thanks for the answer, for pointing out the mistakes and for the education on the loops inefficiency. Comma instead of space was not intended (a typo) and yes I need to make better version for bigrams. I will try your version and let you know. I am also creating my own vectorized version and will compare that with yours. – Gaurav Singhal Sep 02 '15 at 11:53
  • 3
    Actually the main reason for your loop demonstration being slow is due to growing objects without pre allocating memory, rather due to the loop itself. `data.table` for example, gains lighting performance when combining `for` loops with the `set` function. – David Arenburg Sep 02 '15 at 12:02
  • @Maksim, your solution was faster on the test data, but it crashed R when I ran it on actual data. I think it may run successfully if I close everything, but as Colonel's solution worked on actual data and was faster on test data, I will go with that. Thanks for all the help – Gaurav Singhal Sep 03 '15 at 05:58
0

I was also trying to create an alternate way to improve my answer. Basically I removed the function to create distance and directly created matrix of distances.So here is what I came up with. I know this solution can be improved. So any suggestions are welcomed

strdistmat2 <- function(v1,v2,type="m"){
    #for monograms
    if (type=="m")  {
        penalty = sapply(substr(v1,1,1),stringdist,b=substr(v2,1,1)) * 5
        d = sum(sapply(v1,stringdist,b=v2),penalty)
    }
    #for bigrams
    else if(type=="b")  {       
        d1 = sapply(v1,stringdist,b=v2) 
        d2 = sapply(v1,stringdist,b=gsub(v2,pattern = "(.*) (.*)", repl="\\2 \\1"))
        d = pmin(d1,d2)
    }
    d
}

I have compared the times for various solutions below.

> test = c("Hello World","World Hello", "Hello Word", "Cello Word")
> arr=sample(test,size=6000,replace=TRUE)
> system.time({mat=strdistmat2(arr,arr,"b")})
   user  system elapsed 
  96.89    1.63   70.36 
> system.time({mat2=stringdistmat3(arr)})
   user  system elapsed 
 469.40    5.69  439.96 
> system.time({mat3=stringdistmatFast(arr)})
   user  system elapsed 
  57.34    0.72   41.22 

Therefore - Colonel answer is the fastest.

Also on the actual data, both mine and Maksim code crashed only colonel's answer worked. Here are the results

> system.time({mat3=stringdistmatFast(words)})
   user  system elapsed 
 314.63    1.78  291.94 

When I ran my solution on actual data - error message was - could not allocate a vector of 684 MB and on running Maksim's solution - R stopped working.

Gaurav Singhal
  • 998
  • 2
  • 10
  • 25