There are several approaches to optimizing R's performance, such as the vectorization that @BrodieG has shown.
Alternatively, you can leverage the performance of existing compute patterns, for instance matrix multiplication or sorting. For further reading on patterns, see
[Michael McCool et al., Structured Parallel Programming: Patterns for Efficient Computation].
We can also get further performance benefits from parallel libraries that implement these patterns on multicore CPUs or GPUs. In this case, the computation can be represented either by matrix operations or by KNN.
1. Profiling
Profiling this piece of code with system.time for a large input (1e6 rows) shows that the first loop, which computes the distances between the vectors, accounts for 95% of the total computing time. So our optimization will start from there.
# Original code
vals <- 50
clusts <- 6
ClusterCenters <- matrix(runif(vals * clusts), nrow = clusts)
data.count <- 1e6 # large number
calcData <- matrix(runif(data.count * vals), nrow = data.count)
dists <- matrix(0, nrow = clusts, ncol = data.count)
system.time({
for(i in 1:nrow(ClusterCenters)) {
  dists[i,] <- (rowSums((matrix(unlist(apply(calcData, 1, function(x) {x - ClusterCenters[i,]})), ncol = ncol(calcData), byrow = TRUE))^2))^0.5
}
})
user system elapsed
71.62 1.13 73.13
ClusterMemberships <- integer(data.count)
system.time({
for(i in 1:nrow(calcData)) {
  ClusterMemberships[i] <- which.min(dists[,i])
}
})
user system elapsed
5.29 0.00 5.31
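As a complement to system.time, base R's sampling profiler Rprof can attribute the time to individual functions and confirm that the distance loop is the hot spot. A minimal sketch (the output file name is just an example):
Rprof("dist_loop.out") # start sampling the call stack
for(i in 1:nrow(ClusterCenters)) {
  dists[i,] <- (rowSums((matrix(unlist(apply(calcData, 1, function(x) {x - ClusterCenters[i,]})), ncol = ncol(calcData), byrow = TRUE))^2))^0.5
}
Rprof(NULL) # stop sampling
head(summaryRprof("dist_loop.out")$by.self) # heaviest functions first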
2. Vectorization
Vectorization is a very useful way to accelerate R code, especially loops, as @BrodieG has shown. By the way, I have modified his solution slightly to get the correct results, as below; it gains about a 3-5X speedup over the original code.
#Vectorization: From BrodieG
dists1 <- matrix(NA, nrow = nrow(ClusterCenters), ncol = nrow(calcData))
system.time({
dists1 <- apply(ClusterCenters, 1, function(x) rowSums(sweep(calcData, 2, x, '-') ^ 2) ^ .5)
min.dist.vec <- max.col(-dists1, ties.method="first")
})
user system elapsed
16.13 1.42 17.61
all.equal(ClusterMemberships, min.dist.vec)
[1] TRUE
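For readers unfamiliar with sweep, here is a tiny illustrative example of what it does inside the apply call above: with MARGIN = 2 it applies the operator column-wise, i.e. subtracts a vector from every row of a matrix.
m <- matrix(1:6, nrow = 2) # a small 2 x 3 example matrix
sweep(m, 2, c(1, 2, 3), '-') # subtracts c(1, 2, 3) from each row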
3. Matrix Pattern
And then, let's review the first loop: it computes the distance by summing, over the columns, the squared differences (calcData[i, ] - ClusterCenters[j, ])^2.
So we can recast this operation in matrix form by expanding the square:
calcData[i, ]^2 - 2 * calcData[i, ] * ClusterCenters[j, ] + ClusterCenters[j, ]^2
Thus, for the first and third terms, we can use simple element-wise matrix multiplication followed by rowSums, such as
calcData * calcData
And for the second (cross) term, we can use a matrix transpose trick: it turns into a single matrix multiplication,
calcData %*% t(ClusterCenters)
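As a quick sanity check of this identity, here is a small sketch on toy matrices (the names A and B are just for illustration), comparing the expanded form against the direct computation:
A <- matrix(runif(6), nrow = 3) # 3 points in 2 dimensions
B <- matrix(runif(4), nrow = 2) # 2 centers in 2 dimensions
d2 <- outer(rowSums(A^2), rowSums(B^2), '+') - 2 * A %*% t(B)
ref <- sqrt(apply(B, 1, function(b) rowSums(sweep(A, 2, b, '-')^2)))
all.equal(sqrt(d2), ref) # TRUE, up to floating-point error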
Finally, the whole code with matrix operations is below:
# Pattern Representation 1: Matrix
dists2 <- matrix(NA, nrow = nrow(ClusterCenters), ncol = nrow(calcData))
system.time({
data2 <- rowSums(calcData * calcData)
clusters2 <- rowSums(ClusterCenters * ClusterCenters)
# NVIDIA GPU: nvBLAS can speed up this step
# Further info on ParallelR.com
ClustersXdata <- calcData %*% t(ClusterCenters)
# compute distance
dists2 <- sweep(data2 - 2 * ClustersXdata, 2, clusters2, '+') ^ 0.5
min.dist.matrix <- max.col(-dists2, ties.method = "first")
})
user system elapsed
1.17 0.09 1.28
all.equal(ClusterMemberships, min.dist.matrix)
[1] TRUE
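One caveat with the expansion trick: floating-point cancellation can make a squared distance come out slightly negative when a point lies very close to a center, and the square root then yields NaN. A defensive variant of the distance line clamps at zero first (pmax is base R):
# clamp tiny negative squared distances to zero before the square root
dists2 <- pmax(sweep(data2 - 2 * ClustersXdata, 2, clusters2, '+'), 0) ^ 0.5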
Now we can see that all three parts can be done with matrix operations. With this method the computation time is almost linear in the input size from 10^3 to 10^7 rows, and it is about 50X faster than the original code for the 1e6-row calcData set.
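If you want to check the scaling behavior yourself, a minimal benchmarking sketch over increasing input sizes (the size grid here is illustrative) looks like this:
for (n in 10^(3:6)) {
  X <- matrix(runif(n * vals), nrow = n)
  t.mat <- system.time({
    cross <- X %*% t(ClusterCenters)
    d2 <- sweep(rowSums(X * X) - 2 * cross, 2, clusters2, '+')
    max.col(-d2, ties.method = "first")
  })["elapsed"]
  cat("n =", n, "elapsed =", t.mat, "\n")
}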

4. KNN
In this case, the computation can be regarded as finding the first nearest neighbor in the set of cluster centers, so it is the simplest kind of KNN, with k = 1. We can easily use the knn function from the class package, which is implemented in C and is also very efficient. The example code is below:
# Pattern Representation 2: KNN
library("class")
system.time(
min.dist.knn <- knn(ClusterCenters, calcData, cl = 1:nrow(ClusterCenters), k = 1)
)
user system elapsed
1.21 0.12 1.35
all.equal(ClusterMemberships, as.integer(min.dist.knn))
[1] TRUE
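As an aside, the FNN package (not used in the benchmarks above, so treat this as an assumption about your setup) provides get.knnx, which returns the neighbor indices directly and avoids the factor conversion:
library("FNN") # assumed installed; not part of the original benchmark
nn <- get.knnx(ClusterCenters, calcData, k = 1) # 1 nearest center per row
min.dist.fnn <- as.integer(nn$nn.index[, 1])
all.equal(ClusterMemberships, min.dist.fnn)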
The runtime of KNN is similar to that of our matrix-operation code for the 1e6-row input, but if we feed bigger data to the two algorithms, the matrix algorithm still wins: it is about 2X faster than KNN (15.9 vs. 29.1 seconds).
Finally, in this post I have only shown several ideas for performance tuning. We could continue to fine-tune this code, including architecture-specific optimizations and rewriting it in C/C++, or even parallelize the matrix operations and KNN on a multi-core CPU or an NVIDIA GPU; for that, you can refer to ParallelR.
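As one example of the multi-core direction, here is a minimal sketch of my own (not from the original benchmarks) that splits calcData into row chunks and runs the matrix version in parallel with the base parallel package; note that mclapply relies on forking, so it runs serially on Windows.
library(parallel)
n.cores <- detectCores()
# split the row indices of calcData into one chunk per core
idx <- split(seq_len(nrow(calcData)),
             cut(seq_len(nrow(calcData)), n.cores, labels = FALSE))
min.dist.par <- unlist(mclapply(idx, function(ix) {
  cross <- calcData[ix, , drop = FALSE] %*% t(ClusterCenters)
  d2 <- sweep(data2[ix] - 2 * cross, 2, clusters2, '+')
  max.col(-d2, ties.method = "first") # nearest center per row in this chunk
}, mc.cores = n.cores), use.names = FALSE)
all.equal(ClusterMemberships, min.dist.par)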