The following code helps determine the optimal number of clusters (the elbow method).

library(purrr)  # provides map_dbl()

set.seed(123)

# function to compute total within-cluster sum of square 
wss <- function(k) {
  kmeans(df, k, nstart = 10 )$tot.withinss
}

# Compute and plot wss for k = 1 to k = 15
k.values <- 1:15

# extract wss for 1-15 clusters
wss_values <- map_dbl(k.values, wss)

plot(k.values, wss_values,
       type="b", pch = 19, frame = FALSE, 
       xlab="Number of clusters K",
       ylab="Total within-clusters sum of squares")

Reference: https://uc-r.github.io/

The goal is to run this on multiple cores of a shared-memory machine so that it finishes faster. I tried fviz_nbclust, which uses this method, and it is extremely slow.

Approach/Attempt:

First, I created a wss function to be called from mclapply:

parallel.wss <- function(i, k) {
    set.seed(101)
    kmeans(df, k, nstart=i)$tot.withinss
}

Here i is the number of random starts and k is k.values, the range of cluster counts to try in order to find the optimum.

k.values <- 1:15

kmean_results <- mclapply(c(25,25,25,25), k.values, FUN=parallel.wss)

But I got the following warning:

Warning message:
In mclapply(c(25, 25, 25, 25), k.values, FUN = parallel.wss) :
  all scheduled cores encountered errors in user code

Looking at the kmean_results object:

head(kmean_results)
[[1]]
[1] "Error in kmeans(df, k, nstart = i) : \n must have same number of columns in 'x' and 'centers'\n"
attr(,"class")
[1] "try-error"
attr(,"condition")

add-semi-colons

1 Answer

With foreach, you can do

ncores <- parallel::detectCores(logical = FALSE)
cl <- parallel::makeCluster(ncores)
doParallel::registerDoParallel(cl)
library(foreach)
wss_values2 <- foreach(k = k.values, .combine = 'c') %dopar% {
  kmeans(df, k, nstart = 10)$tot.withinss
}
parallel::stopCluster(cl)

If you wrap the kmeans call in a function, you need to pass all the variables as arguments (df and k).
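
The same idea applies to the mclapply attempt in the question. There, k.values was passed as an extra argument, so every call received the whole vector 1:15 as `centers`, which kmeans treats as a matrix of centers (hence the column-count error). Below is a minimal sketch of one way to fix it, iterating over k.values and passing the data explicitly. The helper name `wss_par`, the `iris` example data, the choice of 2 cores, and the result name `wss_values3` are assumptions for illustration, not part of the original code.

```r
library(parallel)

# Assumption: example data; substitute your own numeric data frame.
df <- iris[-5]
k.values <- 1:15

# Hypothetical helper: receives both k and the data as arguments.
wss_par <- function(k, data, nstart = 10) {
  kmeans(data, centers = k, nstart = nstart)$tot.withinss
}

# mclapply relies on forking, so only one core can be used on Windows.
ncores <- if (.Platform$OS.type == "windows") 1L else 2L

# L'Ecuyer streams keep the random starts reproducible in forked workers.
RNGkind("L'Ecuyer-CMRG")
set.seed(123)

# Iterate over k (one kmeans fit per element), not over nstart.
wss_values3 <- unlist(
  mclapply(k.values, wss_par, data = df, mc.cores = ncores)
)
```

`wss_values3` should then hold one value per element of k.values and can be plotted in the same way as `wss_values` in the question.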

F. Privé
  • I got 5 values as the result in `wss_values2`. I'm trying to understand: do I now run `wss_values <- map_dbl(k.values, wss)` on a single core? – add-semi-colons Feb 16 '18 at 18:00
  • Not sure I understand? – F. Privé Feb 16 '18 at 18:25
  • The non-parallel code that I posted in the question should lead to the plot at `https://uc-r.github.io/kmeans_clustering`; I'm trying to figure out the results from the parallel version. Also, this ran in 30 seconds on a 30,000 data set – add-semi-colons Feb 16 '18 at 18:50
  • I should be getting 15 cluster center values, but I only get one value in `wss_values2` with my data set. I also tried the code with the iris data and still get only 1 value. Do you get 15 values in `wss_values2`? – add-semi-colons Feb 19 '18 at 15:47
  • Yes I do get 15 values – F. Privé Feb 19 '18 at 17:21
  • OK, that's the problem I guess; I am only getting 1 value. Trying to figure out the reason, thanks – add-semi-colons Feb 19 '18 at 17:36
  • Try running my code with `df <- iris[-5]` and `k.values <- 1:15`. – F. Privé Feb 19 '18 at 17:40
  • That's exactly what I did... this is strange. I am running `[1] "R version 3.4.3 (2017-11-30)"` and took the code directly from here... – add-semi-colons Feb 19 '18 at 17:50
  • Unbelievable: I was running on the command line, then fired up RStudio, ran the code, and 15 clusters showed up. I can't figure out why the terminal behaves differently for the same code. – add-semi-colons Feb 19 '18 at 17:57