Given the between-cluster sum of squares betweenss and the vector of within-cluster sums of squares withinss (one component per cluster), the formulas are these:
totss = tot.withinss + betweenss
tot.withinss = sum(withinss)
For example, if there were only one cluster then betweenss would be 0, there would be only one component in withinss, and totss = tot.withinss = withinss.
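As a quick check of the two formulas, here is a minimal sketch that verifies them on the components returned by kmeans itself. The simulated data mimics the example in help(kmeans); the seed and the names cl1 and cl2 are assumptions made only for this illustration.

```r
set.seed(1)  # arbitrary seed, only for reproducibility
# two Gaussian blobs, in the style of the help(kmeans) example
x <- rbind(matrix(rnorm(100, sd = 0.3), ncol = 2),
           matrix(rnorm(100, mean = 1, sd = 0.3), ncol = 2))

# two clusters: both identities hold up to floating-point rounding
cl2 <- kmeans(x, centers = 2)
stopifnot(
  isTRUE(all.equal(cl2$tot.withinss, sum(cl2$withinss))),
  isTRUE(all.equal(cl2$totss, cl2$tot.withinss + cl2$betweenss))
)

# one cluster: withinss has a single component and carries all of totss
cl1 <- kmeans(x, centers = 1)
stopifnot(
  length(cl1$withinss) == 1,
  isTRUE(all.equal(cl1$totss, cl1$tot.withinss)),
  abs(cl1$betweenss) < 1e-8 * cl1$totss  # zero up to rounding
)
```

If every stopifnot call passes silently, the identities hold for this data.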
To make the meanings concrete, we can compute these quantities ourselves from the cluster assignments. Consider the data x and the cluster assignments cl$cluster from the example in help(kmeans). Define the sum of squares function as follows. It subtracts the mean of each column of x from that column and then sums the squares of the elements of the resulting matrix:
# or ss <- function(x) sum(apply(x, 2, function(x) x - mean(x))^2)
ss <- function(x) sum(scale(x, scale = FALSE)^2)
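To see what ss computes, here is a small hand-checkable sketch. The matrix m and the name ss_apply (for the apply-based variant mentioned in the comment above) are introduced only for this illustration.

```r
ss <- function(x) sum(scale(x, scale = FALSE)^2)
ss_apply <- function(x) sum(apply(x, 2, function(col) col - mean(col))^2)

# column 1 is 1, 2, 3 (mean 2); column 2 is 2, 4, 6 (mean 4)
m <- matrix(c(1, 2, 3,
              2, 4, 6), ncol = 2)

# centered columns are -1, 0, 1 and -2, 0, 2, so the sum of squares is 2 + 8
ss(m)        # 10
ss_apply(m)  # 10
```

Both definitions center each column separately, so they agree on any numeric matrix.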
Then we have the following. Note that cl$centers[cl$cluster, ] gives the fitted values, i.e. it is a matrix with one row per point such that the ith row is the center of the cluster that the ith point belongs to.
example(kmeans) # create x and cl
betweenss <- ss(cl$centers[cl$cluster,]) # or ss(fitted(cl))
withinss <- sapply(split(as.data.frame(x), cl$cluster), ss)
tot.withinss <- sum(withinss) # or resid <- x - fitted(cl); ss(resid)
totss <- ss(x) # or tot.withinss + betweenss
cat("totss:", totss, "tot.withinss:", tot.withinss,
"betweenss:", betweenss, "\n")
# compare above to:
str(cl)
EDIT:
Since this question was answered, R has added further similar kmeans examples (example(kmeans)) and a new fitted.kmeans method. The comments trailing the code lines above now show how the fitted method fits into these calculations.
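The fitted-based route can also be written out in full. The following is a minimal sketch, assuming simulated data in the style of help(kmeans) with an arbitrary seed; the names fit and resid are chosen only for this illustration.

```r
set.seed(1)  # arbitrary seed, only for reproducibility
x <- rbind(matrix(rnorm(100, sd = 0.3), ncol = 2),
           matrix(rnorm(100, mean = 1, sd = 0.3), ncol = 2))
cl <- kmeans(x, 2)

ss <- function(x) sum(scale(x, scale = FALSE)^2)

fit <- fitted(cl)   # one row per point: the center of that point's cluster
resid <- x - fit    # deviation of each point from its cluster center

stopifnot(
  # fitted(cl) matches the manual indexing of the centers
  isTRUE(all.equal(fit, cl$centers[cl$cluster, ], check.attributes = FALSE)),
  isTRUE(all.equal(ss(fit), cl$betweenss)),       # between sum of squares
  isTRUE(all.equal(ss(resid), cl$tot.withinss)),  # total within sum of squares
  isTRUE(all.equal(ss(x), cl$totss))              # total sum of squares
)
```

The checks compare values with all.equal rather than ==, since the hand-computed sums and the kmeans components can differ by floating-point rounding.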