
I am trying to understand how to parallelize some of my code using R. In the following example, I want to use k-means to cluster data with 2, 3, 4, 5, and 6 centers, while using 20 iterations. Here is the code:

library(parallel)
library(BLR)

data(wheat)

parallel.function <- function(i) {
    kmeans( X[1:100,100], centers=?? , nstart=i )
}

out <- mclapply( c(5, 5, 5, 5), FUN=parallel.function )

How can we parallelize over both the iterations and the centers simultaneously? And how do I keep track of the outputs, assuming I want to keep all the outputs from k-means across all iterations and centers, just to learn how?

svick
hema
  • Another option is using the [biganalytics package](http://cran.r-project.org/web/packages/biganalytics/biganalytics.pdf). On page 4 you can find the `bigkmeans()` function. – marbel Jan 04 '14 at 23:46

3 Answers


This looked very simple to me at first ... and then I tried it. After a lot of monkey typing and face palming during my lunch break, however, I arrived at this:

library(parallel)
library(BLR)

data(wheat)

# Each element of 2:6 fills `centers`; x = X is fixed via mclapply's `...`
mc <- mclapply(2:6, function(x, centers) kmeans(x, centers), x = X)

It looks right though I didn't check how sensible the clustering was.

> summary(mc)
     Length Class  Mode
[1,] 9      kmeans list
[2,] 9      kmeans list
[3,] 9      kmeans list
[4,] 9      kmeans list
[5,] 9      kmeans list

On reflection the command syntax seems sensible - although a lot of other stuff that failed seemed reasonable too... The examples in the help documentation are maybe not that great.

Hope it helps.
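To keep track of which fit belongs to which k, one option (my own sketch, not from the answer above, with a small simulated matrix standing in for the wheat data's X) is to name the result list:

```r
library(parallel)

set.seed(1)
X <- matrix(rnorm(200 * 5), nrow = 200)  # stand-in for the wheat data's X

ks <- 2:6
fits <- setNames(
  mclapply(ks, function(k) kmeans(X, centers = k, nstart = 20)),
  paste0("k", ks)
)

# Each element is a full kmeans object, so e.g. the total
# within-cluster sum of squares per k is easy to pull out:
sapply(fits, function(f) f$tot.withinss)
```

Then `fits$k3` is the 3-center fit, and so on.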

EDIT: As requested, here is a version that varies both nstart and centers:

(pars = expand.grid(i=1:3, cent=2:4))

  i cent
1 1    2
2 2    2
3 3    2
4 1    3
5 2    3
6 3    3
7 1    4
8 2    4
9 3    4

L <- list()
# zikes horrible
pars2 <- apply(pars, 1, append, L)
mc <- mclapply(pars2, function(x, pars) kmeans(x, centers = pars$cent, nstart = pars$i), x = X)

> summary(mc)
      Length Class  Mode
 [1,] 9      kmeans list
 [2,] 9      kmeans list
 [3,] 9      kmeans list
 [4,] 9      kmeans list
 [5,] 9      kmeans list
 [6,] 9      kmeans list
 [7,] 9      kmeans list
 [8,] 9      kmeans list
 [9,] 9      kmeans list

How'd you like them apples?
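A tidier alternative (my own sketch, again with simulated data standing in for X) is `mcmapply`, which iterates over the two parameter columns directly and avoids the list-of-rows workaround:

```r
library(parallel)

set.seed(1)
X <- matrix(rnorm(200 * 5), nrow = 200)  # stand-in for the wheat data's X

pars <- expand.grid(i = 1:3, cent = 2:4)
mc <- mcmapply(function(i, cent) kmeans(X, centers = cent, nstart = i),
               pars$i, pars$cent,
               SIMPLIFY = FALSE)  # keep a plain list of 9 kmeans fits
```

Row j of `pars` corresponds to `mc[[j]]`, which makes matching results back to parameters straightforward.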

Stephen Henderson
  • 6,340
  • 3
  • 27
  • 33
  • Stephen Henderson, thank you so much for your answer -- However the challenge, at least for me, is to simultaneously parallelize the iterations and the number of clusters, i.e. "kmeans(x, centers, nstart=?)". Again thank you & I appreciate your help – hema Dec 06 '13 at 14:51
  • @hema Challenge Accepted! – Stephen Henderson Dec 06 '13 at 15:38
  • NB note for sensible speed up you should control how many cores you are actually using based on what you have and a bit of testing... – Stephen Henderson Dec 06 '13 at 15:49
  • Stephen Henderson: Very interesting answer, I learned something new from you today. I will apply your idea in one of my functions that required 2 for loops "takes forever". I will accept your answer later today. – hema Dec 06 '13 at 15:54
  • Stephen Henderson: can we exchange emails? I am trying to apply what you just did in my "real life function" -- it looks like I am missing something. Can I share what I did with you, then we can work this problem together -- Here is my email: ielbasyoni@gmail, I will understand if you don't have time. Thanks again – hema Dec 06 '13 at 17:18
  • @hema my mail is on my profile. If you send a brief reproducible version, I'll have a look.. No promises though, I haven't used mclapply much either. – Stephen Henderson Dec 06 '13 at 17:29

There's a CRAN package called knor that is derived from a research paper and improves performance using a memory-efficient variant of Elkan's pruning algorithm. It's an order of magnitude faster than everything in these answers.

install.packages("knor")
require(knor)
iris.mat <- as.matrix(iris[,1:4])
k <- length(unique(iris[, dim(iris)[2]])) # Number of unique classes
nthread <- 4
kms <- Kmeans(iris.mat, k, nthread=nthread)
quine
  • Thanks for pointing this out. Knor is fast! I highly recommend this for anyone reading this thread. Now blowing up far fewer HPC nodes for far less time. – zdebruine Oct 17 '20 at 19:41

You may use the `parallel` package to try K-means from different random starting points on multiple cores.

The code below is an example (K = number of clusters in K-means, N = number of random starting points, C = number of cores you would like to use).

suppressMessages( library("Matrix") )
suppressMessages( library("irlba") )
suppressMessages( library("stats") )
suppressMessages( library("cluster") )
suppressMessages( library("fpc") )
suppressMessages( library("parallel") )

# Calculate KMeans results
calcKMeans <- function(matrix, K, N, C){
  # Run in parallel from various random starting points (using C cores);
  # note rep(N %/% C, C) drops any remainder, so N should be divisible by C
  results <- mclapply(rep(N %/% C, C), FUN=function(nstart) kmeans(matrix, K, iter.max=15, nstart=nstart), mc.cores=C)
  # Find the solution with the smallest total within-cluster sum of squares
  tmp <- sapply(results, function(r){r[['tot.withinss']]})
  km <- results[[which.min(tmp)]]
  # Returns cluster, centers, totss, withinss, tot.withinss, betweenss, size
  return(km)
}

runKMeans <- function(fin_uf, K, N, C, 
                      #fout_center, fout_label, fout_size, 
                      fin_record=NULL, fout_prediction=NULL){
  uf = read.table(fin_uf)
  km = calcKMeans(uf, K, N, C)
  rm(uf)
  #write.table(km$cluster, file=fout_label, row.names=FALSE, col.names=FALSE)
  #write.table(km$center, file=fout_center, row.names=FALSE, col.names=FALSE)
  #write.table(km$size, file=fout_size, row.names=FALSE, col.names=FALSE)
  str(km)

  return(km$center)
}
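The pick-the-best idea in calcKMeans above can be sketched self-contained (my own example, on simulated data; N is assumed divisible by C, since `rep(N %/% C, C)` silently drops any remainder):

```r
library(parallel)

set.seed(1)
m <- matrix(rnorm(150 * 4), nrow = 150)
K <- 3; N <- 8; C <- 4  # 8 restarts split evenly over 4 workers

fits <- mclapply(rep(N %/% C, C),
                 function(nstart) kmeans(m, K, iter.max = 15, nstart = nstart),
                 mc.cores = C)

# Keep the run with the smallest total within-cluster sum of squares
best <- fits[[which.min(sapply(fits, `[[`, "tot.withinss"))]]
```

Each worker does its own multi-start search, and only the overall winner is kept, which is the same result you would get from a single serial run with nstart = N, up to random-seed differences.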

Hope it helps!

korolevbin