
I need to cluster some data and I tried kmeans, pam, and clara with R.

The problem is that my data are in a column of a data frame and contain NAs.

I used na.omit() to get my clusters. But then how can I associate them with the original data? The functions return a vector of integers without the NAs and they don't retain any information about the original position.

Is there a clever way to associate the clusters to the original observations in the data frame? (or a way to intelligently perform clustering when NAs are present?)

Thanks

agenis
Bakaburg
  • Have you named your rows? I think kmeans and pam (at least) keep the row names, don't they? – agenis Dec 18 '14 at 12:01
  • I do it this way: kmeans(na.omit(x), k) – Bakaburg Dec 18 '14 at 12:12
  • The cluster vector (e.g. `clus$cluster`) corresponds to the non-`NA` elements of `x`. So the indices of `x` that the elements of `clus$cluster` correspond to are `which(!is.na(x))`. – jbaums Dec 18 '14 at 12:18

2 Answers


The output of kmeans corresponds to the elements of the object passed as argument x. In your case, you omit the NA elements, and so $cluster indicates the cluster that each element of na.omit(x) belongs to.

Here's a simple example:

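d <- data.frame(x=runif(100), cluster=NA)
d$x[sample(100, 10)] <- NA   # introduce 10 missing values
clus <- kmeans(na.omit(d$x), 5)   # cluster only the non-NA values

# write the labels back to the rows whose x is not NA
d$cluster[which(!is.na(d$x))] <- clus$cluster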

And in the plot below, colour indicates the cluster that each point belongs to.

plot(d$x, bg=d$cluster, pch=21)
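The same idea should carry over to clustering on several columns. Below is a minimal sketch, assuming a made-up second column `y` and using `complete.cases()` to build the row mask; the column names, the seed, and the choice of 3 centers are illustrative only.

```r
set.seed(1)  # only so the sketch is reproducible

# Hypothetical two-column data with NAs scattered in both columns
d <- data.frame(x = runif(100), y = runif(100))
d$x[sample(100, 10)] <- NA
d$y[sample(100, 5)] <- NA

# TRUE for rows that have no NA in the columns being clustered
ok <- complete.cases(d[, c("x", "y")])

# Cluster only the complete rows
clus <- kmeans(d[ok, c("x", "y")], centers = 3)

# Start with all-NA labels, then fill in the rows that were clustered
d$cluster <- NA_integer_
d$cluster[ok] <- clus$cluster
```

Rows with a missing value in either column keep `NA` in `d$cluster`, so the data frame keeps its original length and order.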


jbaums

This code works for me, starting with a matrix containing a whole row of NAs:

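DF <- matrix(rnorm(100), ncol=10)
row.names(DF) <- paste("r", 1:10, sep="")
DF[3,] <- NA   # one row made entirely of NAs
res <- kmeans(na.omit(DF), 3)$cluster   # named by the row names that were kept
res
DF <- cbind(DF, 'clus'=NA)
DF[names(res), "clus"] <- res   # match clusters back by row name
print(DF[,11])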
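If the rows are not named, one alternative is to read the `na.action` attribute that `na.omit()` attaches to its result, which records the indices of the rows it dropped. This is only a sketch under that assumption; the variable names `clean`, `dropped`, and `kept` are made up for illustration.

```r
set.seed(42)  # only so the sketch is reproducible

DF <- matrix(rnorm(100), ncol = 10)
DF[3, ] <- NA                      # one row made entirely of NAs

clean <- na.omit(DF)               # matrix without the NA row

# Row indices removed by na.omit(); empty if nothing was removed
dropped <- as.integer(attr(clean, "na.action"))
kept <- setdiff(seq_len(nrow(DF)), dropped)

# Full-length label vector, NA where the row was dropped
clus <- rep(NA_integer_, nrow(DF))
clus[kept] <- kmeans(clean, 3)$cluster
clus
```

The resulting `clus` has one entry per original row, so it can be bound to the matrix with `cbind()` without relying on row names.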
agenis