k-means return value in R

Question

I am using the kmeans() function in R and would like to know the difference between the totss and tot.withinss components of the returned object. From the documentation they seem to return the same thing, but on my dataset totss is 66213.63 and tot.withinss is 6893.50. Please let me know if you are familiar with more details. Thank you!

Marius.

G. Grothendieck · Accepted Answer · 2016-10-19T10:27:22.870

Given the between-cluster sum of squares betweenss and the vector withinss of within-cluster sums of squares (one component per cluster), the formulas are these:

totss = tot.withinss + betweenss
tot.withinss = sum(withinss)

For example, if there were only one cluster then betweenss would be 0, there would be only one component in withinss and totss = tot.withinss = withinss.
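The one-cluster case is easy to check directly. The following is a minimal sketch on made-up data; the matrix x and the seed are assumptions for illustration and do not come from the question:

```r
# Illustrative data only: 50 random points in 2 dimensions.
set.seed(1)
x <- matrix(rnorm(100), ncol = 2)

# Fit a single cluster; its center is the vector of column means of x.
cl1 <- kmeans(x, centers = 1)

cl1$betweenss                            # zero, up to rounding error
length(cl1$withinss)                     # only one component
all.equal(cl1$totss, cl1$tot.withinss)   # the two totals should agree
all.equal(cl1$totss, cl1$withinss)       # and both equal the single withinss
```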

To make the meanings concrete, we can compute these quantities ourselves from the cluster assignments. Consider the data x and the cluster assignments cl$cluster from the example in help(kmeans). Define the sum of squares function as follows: it subtracts the mean of each column of x from that column and then sums the squares of the elements of the resulting matrix:

# or ss <- function(x) sum(apply(x, 2, function(x) x - mean(x))^2)
ss <- function(x) sum(scale(x, scale = FALSE)^2)

Then we have the following. Note that cl$centers[cl$cluster, ] are the fitted values, i.e. it is a matrix with one row per point such that the ith row is the center of the cluster that the ith point belongs to.

example(kmeans) # create x and cl

betweenss <- ss(cl$centers[cl$cluster,]) # or ss(fitted(cl))

withinss <- sapply(split(as.data.frame(x), cl$cluster), ss)
tot.withinss <- sum(withinss) # or  resid <- x - fitted(cl); ss(resid)

totss <- ss(x) # or tot.withinss + betweenss

cat("totss:", totss, "tot.withinss:", tot.withinss, 
  "betweenss:", betweenss, "\n")

# compare above to:

str(cl)
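If you prefer a check that does not depend on example(kmeans), here is a minimal sketch on simulated data (the data and the seed are assumptions for illustration). It recomputes each quantity by hand and compares it with the corresponding component of the kmeans object:

```r
ss <- function(x) sum(scale(x, scale = FALSE)^2)

# Illustrative data: two well-separated groups of 50 points each.
set.seed(42)
x <- rbind(matrix(rnorm(100, sd = 0.3), ncol = 2),
           matrix(rnorm(100, mean = 1, sd = 0.3), ncol = 2))
cl <- kmeans(x, 2)

# Recompute the three sums of squares by hand.
withinss  <- sapply(split(as.data.frame(x), cl$cluster), ss)
betweenss <- ss(cl$centers[cl$cluster, ])
totss     <- ss(x)

# Each comparison is expected to hold up to floating-point tolerance;
# stopifnot() raises an error if any of them fails.
stopifnot(
  isTRUE(all.equal(cl$withinss, unname(withinss))),
  isTRUE(all.equal(cl$tot.withinss, sum(withinss))),
  isTRUE(all.equal(cl$betweenss, betweenss)),
  isTRUE(all.equal(cl$totss, totss)),
  isTRUE(all.equal(totss, sum(withinss) + betweenss))
)
```

Using all.equal() rather than == avoids spurious failures from rounding, since kmeans() and ss() accumulate the sums in a different order.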

EDIT:

Since this question was answered, R has added further, similar examples to example(kmeans) and a new fitted.kmeans method. The comments trailing the code lines above now show how the fitted method fits in.
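To make the role of fitted() concrete, the sketch below (again on simulated data, which is an assumption for illustration) checks that fitted(cl) is the same matrix as cl$centers[cl$cluster, ]. It also checks that betweenss can equivalently be written as the size-weighted squared distance of each center from the overall mean:

```r
ss <- function(x) sum(scale(x, scale = FALSE)^2)

# Illustrative data: two groups of 50 points each.
set.seed(42)
x <- rbind(matrix(rnorm(100, sd = 0.3), ncol = 2),
           matrix(rnorm(100, mean = 1, sd = 0.3), ncol = 2))
cl <- kmeans(x, 2)

# One row per observation: the center of the cluster that observation is in.
fit <- cl$centers[cl$cluster, ]
same_as_fitted <- isTRUE(all.equal(fit, fitted(cl), check.attributes = FALSE))

# Between sum of squares as a weighted sum over clusters:
#   sum over k of n_k * ||center_k - overall mean||^2
dev <- sweep(cl$centers, 2, colMeans(x))
betweenss_weighted <- sum(cl$size * rowSums(dev^2))

same_as_fitted
all.equal(betweenss_weighted, ss(fit))
all.equal(betweenss_weighted, cl$betweenss)
```

Indexing the rows of cl$centers by cl$cluster repeats each center once for every point assigned to it, which is why ss() of that matrix weights each center by its cluster size.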

Ahum. So the *tot.withinss* should be the total within cluster variation and the *totss* should be the overall data variation, i.e. the total within cluster variation + the ss of the cluster centers. Right? – Marius, Dec 26 '11 at 17:50
So, if one wants to find out the total within cluster variation, then *tot.withinss* is the one. Thank you. – Marius, Dec 26 '11 at 18:50
That means the higher the between_SS / total_SS (%), the better the clustering, doesn't it? – ToNoY, Jul 18 '13 at 03:55
@ToNoY, Right. Try `kmeans(rep(1:2, 4), 2)` to see a perfect fit. – G. Grothendieck, Jul 18 '13 at 15:54
@G.Grothendieck, would you mind explaining what this line is doing: `betweenss <- ss(cl$centers[cl$cluster,])`? – user2117258, Oct 19 '16 at 03:51
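
The ratio discussed in the comments can be read straight off the returned object. A minimal sketch using the perfect-fit example suggested above (the variable names are chosen here for illustration):

```r
# Eight points taking only the values 1 and 2, split into two clusters.
cl <- kmeans(rep(1:2, 4), 2)

# Share of the total sum of squares explained by the cluster centers.
ratio <- cl$betweenss / cl$totss

cl$tot.withinss   # nothing is left within the clusters
ratio             # so the between/total ratio reaches its maximum of 1
```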

score 0 · Answer 2 · answered Dec 26 '11 at 16:46

I think you have spotted an error in the documentation ... which says:

withinss     The within-cluster sum of squares for each cluster.
totss        The total within-cluster sum of squares.
tot.withinss     Total within-cluster sum of squares, i.e., sum(withinss).

If you use the sample dataset in the help page example:

> kmeans(x,2)$tot.withinss
[1] 15.49669
> kmeans(x,2)$totss
[1] 65.92628
> kmeans(x,2)$withinss
[1] 7.450607 8.046079

I think someone should write a request to the r-devel mailing list asking that the help page be revised. I'm willing to do so if you don't want to.

IRTFM

Thanks for the quick reaction. I was thinking the same.. that there is an error in the doc.. unfortunately not the only one as I saw. You can write if you want a request to them. The main point is that I am also using a genetic k-means algorithm and I wanted to compare the results. Now I do not know which should be the one to take in consideration.. – Marius Dec 26 '11 at 17:11
Which one to do what with? (There are too many pronouns and adjectives, not enough nouns in either your statement of confusion or my counter-question.) – IRTFM Dec 26 '11 at 17:48
:/ if there is no programming language syntax involved you pick on grammar? I wanted to compare the results of the genetic k-means algorithm with the results of the kmeans function in R. The main point is to minimize the within cluster variation. The returned kmeans object in R has 2 attributes defined the same in the doc. There is only one result that is to be compared. – Marius Dec 26 '11 at 17:56
Semantics, not grammar. I cannot (even yet) figure out what sort of comparison you want to do. How am I supposed to know what the "genetic k-means" function, however it might be implemented, is returning? – IRTFM Dec 26 '11 at 18:13
Hehehe.. I was sure you're gonna say semantics :] Well the idea is that the kmeans algorithm is an optimization problem. Genetic algorithms have a pretty large applicability in this domain and the fitness function (in this case) is the minimization of the total within cluster variation, based on the sum of squares. So the genetic algorithm also returns the total within cluster variation but the kmeans object that the R function returns has two attributes that are supposed to deliver the same characteristic. I do not know which of the two is the right one that I can compare with my results. – Marius Dec 26 '11 at 18:25
If our terminology is shared then you probably want the `tot.withinss`. A global F-test would be comparing the tot.residual /n-classes to the SSresidual/residual-df where SSresidual = totss-tot.withinss. – IRTFM Dec 26 '11 at 18:31
Yes true, but I was hoping that for this particular comparison (with the kmeans in R) I do not need to look into the residuals. Seen that I have already enough info from the return value to form an idea. Thank you for your interest. Sorry if the discussion lost track for a moment :] – Marius Dec 26 '11 at 19:02
