K-Means: Lloyd,Forgy,MacQueen,Hartigan-Wong

Question

I'm working with the K-Means Algorithm in R and I want to figure out the differences of the 4 Algorithms Lloyd,Forgy,MacQueen and Hartigan-Wong which are available for the function "kmeans" in the stats package.

However I was notable to get a sufficient answer to this question.

I only found some rarely information: (Visit http://en.wikibooks.org/wiki/Data_Mining_Algorithms_In_R/Clustering/K-Means)

From this description Lloyd, Forgy, and Hartigan-Wong seem the same to me. Minimizing the within sum of squares or Minimizing the Euclidean Distance is the same.

MacQueen is different in the case that it updates the two involved clusters if a object is moved to another cluster if I'm right.

Nevertheless, I still do not see in which points these Algorithms are different.

You might get a lot more attention at the "Cross Validated" Stack Exchange site. — kdauria, Dec 08 '13 at 00:32

score 30 · Answer 1 · answered Dec 08 '13 at 06:53

R provides Lloyd's algorithm as an option to kmeans(); the default algorithm, by Hartigan and Wong (1979) is much smarter. Like MacQueen's algorithm (MacQueen, 1967), it updates the centroids any time a point is moved; it also makes clever (time-saving) choices in checking for the closest cluster. On the other hand Lloyd's k-means algorithm is the first and simplest of all these clustering algorithms.

Lloyd's algorithm (Lloyd, 1957) takes a set of observations or cases (think: rows of an nxp matrix, or points in Reals) and clusters them into k groups. It tries to minimize the within-cluster sum of squares enter image description here

where u_i is the mean of all the points in cluster S_i. The algorithm proceeds as follows (I'll spare you the formality of the exhaustive notation): enter image description here

There is a problem with R's implementation, however, and the problem arises when considering multiple starting points. I should note that it's generally prudent to consider several dierent starting points, because the algorithm is guaranteed to converge, but is not guaranteed to coverge to a global optima. This is particularly true for large, high-dimensional problems. I'll start with a simple example (large, not particularly dicult).

(Here I will paste some images as we can not write mathematical formulaswith latex)

enter image description here

Note that the solution is very similar to the one achieved earlier, although the ordering of the clusters is arbitrary. More importantly, the job only took 0.199 seconds in parallel! Surely this is too good to be true: using 3 processor cores should, at best, taken one third of the time of our first (sequential) run. Is this a problem? It sounds like free lunch. There is no problem with a free lunch once in a while, is there?

enter image description here

This doesn't always work with R functions, but sometimes we have a chance to look directly at the code. This is one of those times. I'm going to put this code into file, mykmeans.R, and edit it by hand, inserting cat() statements in various places. Here's a clever way to do this, using sink() (although this doesn't seem to work in Sweave, it will work interactively):

> sink("mykmeans.R")
> kmeans
> sink()

Now editing the file, changing the function name and adding cat() statements. Note that you also have to delete a trailing line: :

enter image description here

We can then repeat our explorations, but using mykmeans():

> source("mykmeans.R")
> start.kmeans <- proc.time()[3]
> ans.kmeans <- mykmeans(x, 4, nstart = 3, iter.max = 10, algorithm = "Lloyd")
JJJ statement 1: 0 elapsed time.
JJJ statement 5: 2.424 elapsed time.
JJJ statement 6: 2.425 elapsed time.
JJJ statement 7: 2.52 elapsed time.
JJJ statement 6: 2.52 elapsed time.
JJJ statement 7: 2.563 elapsed time.

enter image description here

Now we're in business: most of the time was consumed before statement 5 (I knew this of course, which is why statement 5 was 5 rather than 2)... You can keep on playing with it

Here is code:

#######################################################################
# kmeans()

N <- 100000
x <- matrix(0, N, 2)
x[seq(1,N,by=4),] <- rnorm(N/2)
x[seq(2,N,by=4),] <- rnorm(N/2, 3, 1)
x[seq(3,N,by=4),] <- rnorm(N/2, -3, 1)
x[seq(4,N,by=4),1] <- rnorm(N/4, 2, 1)
x[seq(4,N,by=4),2] <- rnorm(N/4, -2.5, 1)
start.kmeans <- proc.time()[3]
ans.kmeans <- kmeans(x, 4, nstart=3, iter.max=10, algorithm="Lloyd")
ans.kmeans$centers
end.kmeans <- proc.time()[3]
end.kmeans - start.kmeans

these <- sample(1:nrow(x), 10000)
plot(x[these,1], x[these,2], pch=".")
points(ans.kmeans$centers, pch=19, cex=2, col=1:4)

library(foreach)
library(doMC)
registerDoMC(3)
start.kmeans <- proc.time()[3]
ans.kmeans.par <- foreach(i=1:3) %dopar% {
  return(kmeans(x, 4, nstart=1, iter.max=10, algorithm="Lloyd"))
}
TSS <- sapply(ans.kmeans.par, function(a) return(sum(a$withinss)))
ans.kmeans.par <- ans.kmeans.par[[which.min(TSS)]]
ans.kmeans.par$centers
end.kmeans <- proc.time()[3]
end.kmeans - start.kmeans

sink("mykmeans.Rfake")
kmeans
sink()

source("mykmeans.R")
start.kmeans <- proc.time()[3]
ans.kmeans <- mykmeans(x, 4, nstart=3, iter.max=10, algorithm="Lloyd")
ans.kmeans$centers
end.kmeans <- proc.time()[3]
end.kmeans - start.kmeans

#######################################################################
# Diving

x <- read.csv("Diving2000.csv", header=TRUE, as.is=TRUE)
library(YaleToolkit)
whatis(x)

x[1:14,c(3,6:9)]

meancol <- function(scores) {
  temp <- matrix(scores, length(scores)/7, ncol=7)
  means <- apply(temp, 1, mean)
  ans <- rep(means,7)
  return(ans)
}
x$panelmean <- meancol(x$JScore)

x[1:14,c(3,6:9,11)]

meancol <- function(scores) {
  browser()
  temp <- matrix(scores, length(scores)/7, ncol=7)
  means <- apply(temp, 1, mean)
  ans <- rep(means,7)
  return(ans)
}

x$panelmean <- meancol(x$JScore)

Here is description:

Number of cases: 10,787 scores from 1,541 dives (7 judges score each
dive) performed in four events at the 2000 Olympic Games in Sydney,
Australia.

Number of variables: 10.

Description: A full description and analysis is available in an
article in The American Statistician (publication details to be
announced).

Variables:

Event       Four events, men's and women's 3M and 10m.
Round       Preliminary, semifinal, and final rounds.
Diver       The name of the diver.
Country     The country of the diver.
Rank        The final rank of the diver in the event.
DiveNo      The number of the dive in sequence within round.
Difficulty  The degree of difficulty of the dive.
JScore      The score provided for the judge on this dive.
Judge       The name of the judge.
JCountry    The country of the judge.

And dataset to experiment with it https://www.dropbox.com/s/urgzagv0a22114n/Diving2000.csv

Thanks for this elaborated answer! Especially the part with the runtime seems interesting. (Just one note:The kmeans function I get when checking it in R is slightly different to your extract, so maybe there already where realized some changes). However the focus of my question was different. I would be interested in the "clever (time saving) choices" Hartigan-Wong does or the difference between the Lloyd and the Forgy algorithm. But thank you for your answer anyway! — user2974776, Dec 08 '13 at 09:18

score 3 · Answer 2 · edited May 28 '20 at 19:48

3

The k-means clustering technique: General considerations and implementation in Mathematica, Morissette & Chartier (2013) (10.20982/tqmp.09.1.p015); PDF1, PDF2

Page 16 shows a clear explanation of Gorgy/Llyods, MacQueen, Hatigan-Wong algos.

Also this question contains a small description of the Lloyd, MacQueen, and Hartigan-Wong algorithms.

edited May 28 '20 at 19:48

Gabriel

40,504
73
230
404

answered Jul 24 '18 at 14:34

Arpit Sisodia

570
5
18

2

If this link dies, this answer becomes useless. Please explain the relevant items – Toshinou Kyouko Mar 31 '19 at 17:20

K-Means: Lloyd,Forgy,MacQueen,Hartigan-Wong

2 Answers2