This post describes a method for initializing the centers for the K-means algorithm in R. However, the data used there is scalar (i.e. single numbers).

A variation on this question: what if the data has multiple dimensions? In that case, each center should be a vector, so start should be a vector of vectors. I tried something like:

C1<- c(1,2)
C2<- c(4,-5)

to have my two initial centers, and then use

kmeans(dat, c(C1,C2))

but it didn't work. I also tried cbind() instead of c(). Same result...

gorkem
JCBR
  • If you read the post carefully, the data there already have multiple dimensions (two), so the question is answered. If you want to discuss the initialization method (the one in the post is basic but "poor" and can be improved), that is another topic. – Colonel Beauvel Jun 30 '15 at 18:16
  • Thanks! You are right about the dimensions of the data. I tried this as well, but without success. I thought the problem could come from the colnames in my data.frame. I ran Results <- kmeans(data,3,...), then created a matrix with exactly the same column/row names and dimensions as Results$centers, and tried again, but it didn't work. Any suggestions? – JCBR Jun 30 '15 at 18:40
  • if you dput your data it's easier to give advice ;) – Colonel Beauvel Jun 30 '15 at 18:41

2 Answers

Expand the matrix start so that it has one row per cluster and one column per variable (dimension): the number of rows is the number of clusters you are trying to identify, and the number of columns is the number of variables in the data set.

Here is an extension of the post you linked to, expanding the example to 3 dimensions (variables), x, y, and z:

set.seed(1)
dat <- data.frame(x = rnorm(99, mean = c(-5, 0 , 5)),
                  y = rnorm(99, mean = c(-5, 0, 5)),
                  z = rnorm(99, mean = c(-5, 2, -4)))
plot(dat)

The plot is:

[Figure: scatterplot matrix of x, y and z produced by plot(dat)]

Now we need to specify cluster centres for each of our three clusters. This is done via a matrix as before:

start <- matrix(c(-5, 0, 5, -5, 0, 5, -5, 2, -4), nrow = 3, ncol = 3)

> start
     [,1] [,2] [,3]
[1,]   -5   -5   -5
[2,]    0    0    2
[3,]    5    5   -4

Here, the important thing to note is that the clusters are in rows. Each column holds the coordinate of the cluster centres on one dimension. Hence for cluster 1 we are specifying that the centroid is at (-5, -5, -5).
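One way to confirm that shape before calling kmeans() is to compare the dimensions of start with those of the data. This is only a sketch: the colnames() assignment is an optional addition for readability and is not something the original example uses.

```r
# Recreate the example data and starting matrix from above
set.seed(1)
dat <- data.frame(x = rnorm(99, mean = c(-5, 0, 5)),
                  y = rnorm(99, mean = c(-5, 0, 5)),
                  z = rnorm(99, mean = c(-5, 2, -4)))
start <- matrix(c(-5, 0, 5, -5, 0, 5, -5, 2, -4), nrow = 3, ncol = 3)

# Optional (an assumption, not in the original): label the columns
colnames(start) <- names(dat)

nrow(start)                # number of clusters being requested
ncol(start) == ncol(dat)   # should be TRUE: one column per variable
start[1, ]                 # starting centroid of cluster 1
```

If the second check is FALSE, the matrix is probably transposed or was flattened into a vector.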

Calling kmeans()

kmeans(dat, start)

results in it picking groups very close to our initial starting points (as it should for this example):

> kmeans(dat, start)
K-means clustering with 3 clusters of sizes 33, 33, 33

Cluster means:
           x           y         z
1 -4.8371412 -4.98259934 -4.953537
2  0.2106241  0.07808787  2.073369
3  4.9708243  4.77465974 -4.047120

Clustering vector:
 [1] 1 2 3 1 2 3 1 2 3 1 2 3 1 2 3 1 2 3 1 2 3 1 2 3 1 2 3 1 2 3 1 2 3 1 2 3 1 2
[39] 3 1 2 3 1 2 3 1 2 3 1 2 3 1 2 3 1 2 3 1 2 3 1 2 3 1 2 3 1 2 3 1 2 3 1 2 3 1
[77] 2 3 1 2 3 1 2 3 1 2 3 1 2 3 1 2 3 1 2 3 1 2 3

Within cluster sum of squares by cluster:
[1] 117.78043  77.65203  77.00541
 (between_SS / total_SS =  93.8 %)

Available components:

[1] "cluster"      "centers"      "totss"        "withinss"     "tot.withinss"
[6] "betweenss"    "size"         "iter"         "ifault"

It is worth noting here the output for the cluster centres:

Cluster means:
           x           y         z
1 -4.8371412 -4.98259934 -4.953537
2  0.2106241  0.07808787  2.073369
3  4.9708243  4.77465974 -4.047120

This layout is exactly the same as the matrix start.
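Because the layouts match, the fitted centres can themselves be passed back in as starting points. The following is a minimal sketch of that idea using the same simulated data; the names fit and refit are made up for this illustration.

```r
# Recreate the example data and starting matrix from above
set.seed(1)
dat <- data.frame(x = rnorm(99, mean = c(-5, 0, 5)),
                  y = rnorm(99, mean = c(-5, 0, 5)),
                  z = rnorm(99, mean = c(-5, 2, -4)))
start <- matrix(c(-5, 0, 5, -5, 0, 5, -5, 2, -4), nrow = 3, ncol = 3)

fit <- kmeans(dat, start)

# fit$centers has clusters in rows and variables in columns, like start
dim(fit$centers)
dim(start)

# So it can be used directly as the centers argument of another run
refit <- kmeans(dat, fit$centers)
refit$centers
```

This is the same pattern as taking Results$centers from an earlier run and supplying it as the initial centres of a later one.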

You don't have to build the matrix directly using matrix(), nor do you have to specify the centres column-wise. For example:

c1 <- c(-5, -5, -5)
c2 <- c( 0,  0,  2)
c3 <- c( 5,  5, -4)
start2 <- rbind(c1, c2, c3)

> start2
   [,1] [,2] [,3]
c1   -5   -5   -5
c2    0    0    2
c3    5    5   -4

Or

start3 <- matrix(c(-5, -5, -5,
                    0,  0,  2,
                    5,   5, -4), ncol = 3, nrow = 3, byrow = TRUE)

> start3
     [,1] [,2] [,3]
[1,]   -5   -5   -5
[2,]    0    0    2
[3,]    5    5   -4

Use whichever of these you find more comfortable.

The key thing to remember is that variables are in columns, cluster centres in the rows.
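Applied to the two centres from the question, this suggests why c(C1, C2) is not accepted: c() flattens the two centres into a single vector of length 4, which is not a 2 x 2 matrix. The sketch below uses made-up two-column data (dat2 is hypothetical, simulated around the two centres) and catches whatever error kmeans() raises instead of assuming its wording.

```r
C1 <- c(1, 2)
C2 <- c(4, -5)

# Hypothetical data: 50 points around each centre, two variables
set.seed(42)
dat2 <- rbind(cbind(rnorm(50, mean = 1), rnorm(50, mean = 2)),
              cbind(rnorm(50, mean = 4), rnorm(50, mean = -5)))

flat <- c(C1, C2)       # plain numeric vector, the centres are merged
length(flat)            # 4
dim(as.matrix(flat))    # one column, which does not match ncol(dat2)

# Capture the error message (if any) rather than stopping the script
msg <- tryCatch(kmeans(dat2, flat), error = function(e) conditionMessage(e))
msg

# cbind() puts the centres in columns; rbind() puts them in rows
cbind(C1, C2)
good <- rbind(C1, C2)   # 2 clusters (rows) x 2 variables (columns)
fit2 <- kmeans(dat2, good)
fit2$centers
```

Note that with two-dimensional data cbind(C1, C2) is also a 2 x 2 matrix, so it has a valid shape but describes different centres, namely (1, 4) and (2, -5), which is why rbind() or t(cbind(C1, C2)) is the safer choice.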

Gavin Simpson
  • Thanks a lot! It finally worked. I'm still not sure what my problem was, since I was entering a $3\times 9$ matrix to specify the three centers (9 variables). – JCBR Jun 30 '15 at 19:08
## Your centers
C1 <- c(1, 2)
C2 <- c(4, -5)

## Simulate some data with groups around these centers
library(MASS)
set.seed(0)
dat <- rbind(mvrnorm(100, mu=C1, Sigma = matrix(c(2,3,3,10), 2)),
             mvrnorm(100, mu=C2, Sigma = matrix(c(10,3,3,2), 2)))

clusts <- kmeans(dat, rbind(C1, C2))  # get clusters with your center starting points

## Look at them
plot(dat, col=clusts$cluster)

[Figure: scatter plot of dat with points coloured by cluster assignment]

Rorschach