
Hi guys, I have the following problem: I have cluster centers in some number of dimensions (4-6 clusters) and a very large dataset in which each row needs to be assigned to the closest cluster. So it's not really a question of distance but of performance. My code is as follows:

# distance from each cluster center (rows) to each data row (columns)
distances <- matrix(NA, nrow = nrow(ClusterCenters), ncol = nrow(data))
calcData <- data[, colnames(ClusterCenters), drop = FALSE]
for(i in 1:nrow(ClusterCenters)) {
    distances[i,] <- (rowSums((matrix(unlist(apply(calcData, 1, function(x) {x - ClusterCenters[i,]})), ncol = ncol(calcData), byrow = TRUE))^2))^0.5
}
# pick the nearest cluster for every data row
ClusterMemberships <- vector(mode="numeric", length=nrow(calcData))
for(i in 1:nrow(calcData)) {
  ClusterMemberships[i] <- which.min(distances[,i])
}
return(ClusterMemberships)

Is there a way to speed it up? I work on Windows Server.

VoytechM

2 Answers


For a data matrix of 1 million rows with 50 values each, matched against six clusters of 50 values each, I get the result in about 3 seconds:

vals <- 50
clusts <- 6
clusters <- matrix(runif(vals * clusts), nrow=clusts)

data.count <- 1e6  # large number
data <- matrix(runif(data.count * vals), nrow=data.count)

system.time({
  dists <- apply(clusters, 1, function(x) rowSums((data - x) ^ 2) ^ .5)
  min.dist <- max.col(-dists, ties.method="first")
})
# user  system elapsed 
# 2.96    0.47    3.49 

The key thing is to make sure we limit the number of R function calls, as these get expensive. Notice how I apply over the clusters (of which there are only six) instead of over the data rows, of which there are a million. I then use recycling to compute the distance for each cluster against the entire set (note that data is transposed compared to your data: it has as many rows as there are items in a cluster, which is necessary for the recycling to work).
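Spelled out, the transposed layout that paragraph describes would look something like this (a minimal sketch, assuming the same clusters and data objects as in the snippet above; the transpose makes the recycled subtraction line up column by column):

# transpose so each column holds one data row; a length-50 cluster vector
# then recycles cleanly down the 50 rows of every column
tdata <- t(data)
dists <- apply(clusters, 1, function(x) sqrt(colSums((tdata - x) ^ 2)))
min.dist <- max.col(-dists, ties.method = "first")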

Credit to @user20650 for providing the max.col piece.

BrodieG
  • 'dists <- apply(clusters, 1, function(x) rowSums((data - x) ^ 2) ^ .5)' Doesn't return good values if you have more than 1 row in clusters – VoytechM Jan 30 '15 at 09:25
  • @wafflecop, provide a small sample data set with the expected output then so we can be on the same page. – BrodieG Jan 30 '15 at 13:21

There are several approaches to optimizing R's performance, such as the vectorization that BrodieG showed.

Alternatively, you can leverage the performance benefit of existing compute patterns, for instance matrix multiplication or sorting. Further reading on these patterns can be found in this book:

[Michael McCool et al., Structured Parallel Programming: Patterns for Efficient Computation].

We can also get a further performance benefit from parallel libraries that implement these patterns on a multicore CPU or a GPU. In this case, we can represent the computation as matrix operations or as KNN.

1. Profiling

Profiling this piece of code with system.time for a large input data set (1e6 rows), we can see that the first loop, which computes the distance between two vectors, accounts for about 95% of the total computing time. So our optimization will start from there.

# Original code
vals <- 50
clusts <- 6
ClusterCenters <- matrix(runif(vals * clusts), nrow=clusts)

data.count <- 1e6  # large number
calcData <- matrix(runif(data.count * vals), nrow=data.count)

dists <- matrix(NA, nrow = nrow(ClusterCenters), ncol = nrow(calcData))
system.time({
   for(i in 1:nrow(ClusterCenters)) {
     dists[i,] <- (rowSums((matrix(unlist(apply(calcData, 1, function(x) {x - ClusterCenters[i,]})), ncol = ncol(calcData), byrow = TRUE))^2))^0.5
   }
})
user  system elapsed 
71.62    1.13   73.13  

ClusterMemberships <- vector(mode="numeric", length=nrow(calcData))
system.time({
   for(i in 1:nrow(calcData)) {
     ClusterMemberships[i] <- which.min(dists[,i])
   }
})
user  system elapsed 
5.29    0.00    5.31 
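
As an optional cross-check of that 95% figure, base R's Rprof gives a function-level view of the same hotspot; a minimal sketch, reusing the objects defined above (the output file name is arbitrary):

# function-level profiling of the distance loop
Rprof("assign.prof")
dists <- matrix(NA, nrow = nrow(ClusterCenters), ncol = nrow(calcData))
for(i in 1:nrow(ClusterCenters)) {
  dists[i,] <- (rowSums((matrix(unlist(apply(calcData, 1, function(x) {x - ClusterCenters[i,]})), ncol = ncol(calcData), byrow = TRUE))^2))^0.5
}
Rprof(NULL)
head(summaryRprof("assign.prof")$by.total)   # shows which functions dominate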

2. Vectorization

Vectorization is a very useful way to accelerate R code, especially loops, as @BrodieG showed. By the way, I have modified his solution slightly to get the correct results, as below, and it gains about a 3-5X speedup over the original code.

#Vectorization:  From BrodieG
dists1 <- matrix(NA, nrow = nrow(ClusterCenters), ncol = nrow(calcData))
system.time({
  dists1 <- apply(ClusterCenters, 1, function(x) rowSums(sweep(calcData, 2, x, '-') ^ 2) ^ .5)
  min.dist.vec <- max.col(-dists1, ties.method="first")
})
user  system elapsed 
16.13    1.42   17.61  
all.equal(ClusterMemberships, min.dist.vec)
[1] TRUE

3. Matrix Pattern

Now, let's review the first loop: it computes the distance by summing, over the columns, (calcData[i,] - ClusterCenters[j,])^2.

So we can move this operation to matrix form by expanding the equation as below:

calcData[i, ]^2 - 2 * calcData[i, ] * ClusterCenters[j, ] + ClusterCenters[j, ]^2

Thus, for the first and third parts, we can do simple element-wise matrix multiplication, such as

calcData * calcData

And for the second term, we need a little trick with the matrix transpose; it becomes the matrix multiplication

ClusterCenters %*% t(calcData)
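
A quick sanity check of this expansion on a toy pair of vectors (values chosen arbitrarily):

a <- c(1, 2, 3); b <- c(4, 0, 2)
sum((a - b)^2)                          # 14
sum(a^2) - 2 * sum(a * b) + sum(b^2)    # 14, same value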

Finally, the whole code with matrix operations is as below:

# Pattern Representation 1: Matrix 
dists2 <- matrix(NA, nrow = nrow(ClusterCenters), ncol = nrow(calcData))
system.time({
  data2     <- rowSums(calcData*calcData)
  clusters2 <- rowSums(ClusterCenters*ClusterCenters)
  # NVIDIA GPU: nvBLAS can speed up this step
  # Further info on ParallelR.com
  ClustersXdata <- calcData %*% t(ClusterCenters)
  # compute distance 
  dists2 <- sweep(data2 - 2 * ClustersXdata, 2, clusters2, '+') ^ 0.5
  min.dist.matrix <- max.col(-dists2, ties.method="first")
})
user  system elapsed 
1.17    0.09    1.28
all.equal(ClusterMemberships, min.dist.matrix)
[1] TRUE

Now we can see that all three parts can be done by matrix operations. With this method, the computation time is almost linear from 10^3 to 10^7 rows, and it is about 50X faster than the original code for the 1e6-row calcData set.

[Figure: runtime comparison of the three algorithms]

4. KNN

In this case, the computation can be regarded as finding the 1st nearest neighbor of each row in the cluster data set, so it is the simplest kind of KNN, with k = 1. We can simply use the knn function from the class package, which is written in C, so it is also very efficient. Example code as below:

# Pattern Representation 2: KNN 
library("class")
system.time(
  min.dist.knn <- knn(ClusterCenters, calcData, cl = 1:nrow(ClusterCenters), k = 1)
)
user  system elapsed 
1.21    0.12    1.35 

all.equal(ClusterMemberships, as.integer(min.dist.knn))
[1] TRUE

The runtime of KNN is similar to our matrix operation code at 1e6 rows, but if we feed much bigger data to these two algorithms, the matrix algorithm still wins and is about 2X faster than KNN (15.9 vs. 29.1 seconds).
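
A sketch of how such a larger comparison can be set up (the 1e7-row size here is an assumption and needs several GB of memory; exact timings are machine-specific):

data.count.big <- 1e7
calcData.big <- matrix(runif(data.count.big * vals), nrow = data.count.big)

# matrix formulation
system.time({
  ClustersXdata.big <- calcData.big %*% t(ClusterCenters)
  dists.big <- sweep(rowSums(calcData.big * calcData.big) - 2 * ClustersXdata.big, 2,
                     rowSums(ClusterCenters * ClusterCenters), '+') ^ 0.5
  min.dist.big <- max.col(-dists.big, ties.method = "first")
})

# KNN with k = 1
system.time(
  min.dist.knn.big <- knn(ClusterCenters, calcData.big, cl = 1:nrow(ClusterCenters), k = 1)
)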

Finally, in this post I have just shown several ideas for performance tuning. We could continue to fine-tune this code, including architecture-specific optimizations and rewriting it in C/C++, or even parallelize the matrix operations and KNN on a multi-core CPU or an NVIDIA GPU; you can refer to ParallelR for more.
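
As one illustration of the multi-core direction (a sketch only, not benchmarked here; the worker count and chunking are assumptions), the matrix formulation splits naturally over row chunks of calcData with the base parallel package, whose PSOCK workers also run on Windows Server:

library(parallel)

n.workers <- 4                              # assumed core count
cl <- makeCluster(n.workers)                # PSOCK cluster, works on Windows Server
clusterExport(cl, "ClusterCenters")

# one chunk of calcData rows per worker
idx <- split(seq_len(nrow(calcData)), cut(seq_len(nrow(calcData)), n.workers, labels = FALSE))
chunks <- lapply(idx, function(i) calcData[i, , drop = FALSE])

min.dist.par <- unlist(parLapply(cl, chunks, function(chunk) {
  # same expansion as above; rowSums(chunk^2) is constant within a row,
  # so it can be dropped when only the argmin over clusters is needed
  cross <- chunk %*% t(ClusterCenters)
  max.col(2 * cross - matrix(rowSums(ClusterCenters ^ 2),
                             nrow(chunk), nrow(ClusterCenters), byrow = TRUE),
          ties.method = "first")
}), use.names = FALSE)
stopCluster(cl)

all.equal(ClusterMemberships, min.dist.par)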

Patric