You expand the matrix start to have one row per cluster and one column per variable (dimension): the number of rows is the number of clusters you are attempting to identify, and the number of columns is the number of variables in the data set.
Here is an extension of the post you linked to, expanding the example to 3 dimensions (variables), x, y, and z:
set.seed(1)
dat <- data.frame(x = rnorm(99, mean = c(-5, 0, 5)),
                  y = rnorm(99, mean = c(-5, 0, 5)),
                  z = rnorm(99, mean = c(-5, 2, -4)))
plot(dat)
The result is a scatterplot matrix of x, y, and z.

Now we need to specify cluster centres for each of our three clusters. This is done via a matrix as before:
start <- matrix(c(-5, 0, 5, -5, 0, 5, -5, 2, -4), nrow = 3, ncol = 3)
> start
     [,1] [,2] [,3]
[1,]   -5   -5   -5
[2,]    0    0    2
[3,]    5    5   -4
The important thing to note here is that the clusters are in rows. Each column holds the coordinate, on that dimension, of the cluster centre specified in that row. Hence for cluster 1 we are specifying that the centroid is at (-5, -5, -5).
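As a quick sanity check of that layout, the sketch below labels the rows and columns of start and confirms that its shape matches the data. The labels are purely illustrative (an assumption of this sketch, not something kmeans() needs); what does matter is that the number of columns equals the number of variables.

```r
# Recreate the example data and starting centres from above
set.seed(1)
dat <- data.frame(x = rnorm(99, mean = c(-5, 0, 5)),
                  y = rnorm(99, mean = c(-5, 0, 5)),
                  z = rnorm(99, mean = c(-5, 2, -4)))
start <- matrix(c(-5, 0, 5, -5, 0, 5, -5, 2, -4), nrow = 3, ncol = 3)

# Illustrative labels: one row per cluster, one column per variable
dimnames(start) <- list(paste0("cluster", 1:3), names(dat))

# The number of columns must match the number of variables in the data
stopifnot(ncol(start) == ncol(dat))

start["cluster1", ]   # coordinates of the first starting centroid
```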
Calling kmeans()
kmeans(dat, start)
results in cluster centres very close to our initial starting points (as it should for this example):
> kmeans(dat, start)
K-means clustering with 3 clusters of sizes 33, 33, 33

Cluster means:
           x           y         z
1 -4.8371412 -4.98259934 -4.953537
2  0.2106241  0.07808787  2.073369
3  4.9708243  4.77465974 -4.047120

Clustering vector:
 [1] 1 2 3 1 2 3 1 2 3 1 2 3 1 2 3 1 2 3 1 2 3 1 2 3 1 2 3 1 2 3 1 2 3 1 2 3 1 2
[39] 3 1 2 3 1 2 3 1 2 3 1 2 3 1 2 3 1 2 3 1 2 3 1 2 3 1 2 3 1 2 3 1 2 3 1 2 3 1
[77] 2 3 1 2 3 1 2 3 1 2 3 1 2 3 1 2 3 1 2 3 1 2 3

Within cluster sum of squares by cluster:
[1] 117.78043  77.65203  77.00541
 (between_SS / total_SS =  93.8 %)

Available components:

[1] "cluster"      "centers"      "totss"        "withinss"     "tot.withinss"
[6] "betweenss"    "size"         "iter"         "ifault"
It is worth noting here the output for the cluster centres:
Cluster means:
           x           y         z
1 -4.8371412 -4.98259934 -4.953537
2  0.2106241  0.07808787  2.073369
3  4.9708243  4.77465974 -4.047120
This layout is exactly the same as that of the matrix start: one row per cluster, one column per variable.
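To check this programmatically, here is a minimal sketch: the fitted centres are stored in the centers component of the result, which has the same dimensions as start, so the two can be compared directly. The object name fit is just a name chosen for this example.

```r
# Recreate the example data and starting centres from above
set.seed(1)
dat <- data.frame(x = rnorm(99, mean = c(-5, 0, 5)),
                  y = rnorm(99, mean = c(-5, 0, 5)),
                  z = rnorm(99, mean = c(-5, 2, -4)))
start <- matrix(c(-5, 0, 5, -5, 0, 5, -5, 2, -4), nrow = 3, ncol = 3)

fit <- kmeans(dat, start)

dim(fit$centers)      # same shape as start: clusters in rows, variables in columns
fit$centers - start   # how far each centre moved from its starting point
table(fit$cluster)    # number of observations assigned to each cluster
```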
You don't have to build the matrix directly using matrix(), nor do you have to specify the centres column-wise. For example:
c1 <- c(-5, -5, -5)
c2 <- c( 0, 0, 2)
c3 <- c( 5, 5, -4)
start2 <- rbind(c1, c2, c3)
> start2
   [,1] [,2] [,3]
c1   -5   -5   -5
c2    0    0    2
c3    5    5   -4
Or
start3 <- matrix(c(-5, -5, -5,
                    0,  0,  2,
                    5,  5, -4), ncol = 3, nrow = 3, byrow = TRUE)
> start3
     [,1] [,2] [,3]
[1,]   -5   -5   -5
[2,]    0    0    2
[3,]    5    5   -4
Use whichever of these you find more comfortable.
The key thing to remember is that variables are in columns, cluster centres in the rows.
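The same rule carries over to any number of clusters and variables. As a sketch, the hypothetical helper make_start() below (not part of base R or of kmeans()) stacks a list of centre vectors into a matrix with one row per cluster, after checking that every centre has the same number of coordinates.

```r
# Hypothetical helper: build a starting matrix from a list of centre vectors
make_start <- function(centres) {
  # every centre needs exactly one coordinate per variable
  stopifnot(length(unique(lengths(centres))) == 1)
  do.call(rbind, centres)   # one row per cluster centre
}

start4 <- make_start(list(c(-5, -5, -5),
                          c( 0,  0,  2),
                          c( 5,  5, -4)))

dim(start4)   # rows = clusters, columns = variables
```

The result can be passed to kmeans() in the same way as start, provided it has one column per variable in the data.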