I would like to speed up a distance calculation. I have already put effort into parallelizing it; unfortunately, it still takes longer than an hour.
Basically, the distance between two vectors i and j is computed as a Manhattan distance. The distances between the possible values of the vector entries are given in the matrix Vardist: Vardist[i[1], j[1]] is the distance between the two values i[1] and j[1] (the matrix is indexed by the character values i[1] and j[1], respectively).
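For illustration, a toy value-distance matrix could look like this (the values "a", "b", "c" and the numbers are made up for the example, not my real data):

# Toy Vardist, indexed by character values; entries are the
# pairwise distances between the values themselves.
Vardist <- matrix(c(0, 1, 2,
                    1, 0, 1,
                    2, 1, 0),
                  nrow = 3,
                  dimnames = list(c("a", "b", "c"), c("a", "b", "c")))

i <- c("a", "b")     # example vectors of character values
j <- c("c", "a")
Vardist[i[1], j[1]]  # distance between the values "a" and "c": 2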
There is one more important addition to the distance computation: the distance between vectors i and j is the minimum over the Manhattan distances between vector i and every possible permutation of vector j. This is what makes it computationally heavy the way it is programmed. I have 1000 objects to compare with one another, and each object is a vector of length 5, so there are 5! = 120 permutations per vector. A minimal sketch of this definition for a single pair follows below.
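Written out for one pair of vectors (reusing the toy Vardist from above; pairDist is just a hypothetical helper name to state the definition, not part of my code):

library(combinat)  # provides permn()

# Minimum Manhattan distance between vi and all permutations of vj,
# looking each per-position value distance up in Vardist.
pairDist <- function(vi, vj, Vardist) {
  min(sapply(permn(vj), function(p) {
    sum(Vardist[cbind(vi, p)])  # one (row, column) lookup per position
  }))
}

pairDist(c("a", "b"), c("c", "a"), Vardist)  # 1 with the toy Vardist above

My actual parallelized loop: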
distMatrix <- foreach(i = 1:samplesize,
                      .combine = cbind,
                      .options.snow = opts,
                      .packages = c("combinat")) %dopar% {
  # initialize the distance column for customer i
  dist <- rep(0, samplesize)
  # get the values of customer i
  ValuesCi <- as.matrix(recodedData[i, ])
  # remove unnecessary entries from the value-distance matrix:
  # keep only the rows belonging to customer i's values
  mVardist <- Vardist[ValuesCi, ]
  for (j in i:samplesize) {
    # the distance between vector i and all permutations of vector j is
    # computed; the minimum of these is the distance between i and j
    dist[j] <- min(unlist(permn(recodedData[j, ], function(x) {
      pdist <- 0
      # nvariables is the length of each vector; the loop variable is
      # named k so it does not shadow the outer foreach index i
      for (k in 1:nvariables) {
        pdist <- pdist + mVardist[k, as.matrix(x)[k]]
      }
      pdist
    })))
  }
  dist
}
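For completeness, the parallel backend is registered roughly like this (the worker count and the progress bar are illustrative; opts is the list consumed via .options.snow above):

library(foreach)
library(doSNOW)

cl <- makeCluster(4)   # 4 workers is just an example
registerDoSNOW(cl)

# optional progress callback passed to foreach via .options.snow = opts
pb <- txtProgressBar(max = samplesize, style = 3)
opts <- list(progress = function(n) setTxtProgressBar(pb, n))

# ... foreach loop from above runs here ...

stopCluster(cl)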
Any tips or suggestions are greatly appreciated!