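library(klaR)  # provides kmodes()
dat <- data.frame(x = sample(letters[1:3], 20, TRUE), y = sample(LETTERS[7:9], 20, TRUE), stringsAsFactors = FALSE)
dat[c(1:5, 9, 17, 20), 1] <- NA  # 8 missing values in x
dat[c(8, 11), 2] <- NA           # 2 missing values in y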
dat
x y
1 <NA> H
2 <NA> I
3 <NA> G
4 <NA> H
5 <NA> I
6 c H
7 c G
8 a <NA>
9 <NA> G
10 c G
11 b <NA>
12 a G
13 a G
14 a G
15 b I
16 a G
17 <NA> H
18 a I
19 a G
20 <NA> G
I'm trying to cluster this categorical data with klaR::kmodes, but I'm having trouble dealing with the NAs.
A workaround I came up with is treating NAs as a new category:
dat[c(1:5,9,17,20),1] <- "NA";dat[c(8,11),2] <- "NA"
(cl <- kmodes(dat,modes=dat[c(6,7),]))
K-modes clustering with 2 clusters of sizes 11, 9
Cluster modes:
x y
1 "NA" "H"
2 "a" "G"
Clustering vector:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
1 1 1 1 1 1 2 2 1 2 1 2 2 2 1 2 1 2 2 1
Within cluster simple-matching distance by cluster:
[1] 10 4
Available components:
[1] "cluster" "size" "modes" "withindiff" "iterations" "weighted"
This is flawed because kmodes by default uses the simple-matching distance to measure the dissimilarity of two objects, so two "NA" values are counted as a match even though the underlying values are unknown.
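To make the problem concrete, here is a minimal base R sketch of the simple-matching distance. The helper `simple_matching` and the three example rows are made up for illustration; this is not klaR's internal code, only the counting rule as I understand it.

```r
# Simple-matching distance: the number of attributes on which two rows differ.
# Hypothetical helper for illustration; not taken from klaR.
simple_matching <- function(a, b) sum(a != b)

row_a <- c(x = "NA", y = "H")  # missing x recoded as the string "NA"
row_b <- c(x = "NA", y = "G")  # a different row whose x is also missing
row_c <- c(x = "a",  y = "G")  # fully observed row

d_ab <- simple_matching(row_a, row_b)  # the two "NA" strings agree on x
d_bc <- simple_matching(row_b, row_c)  # "NA" vs "a" disagree on x

print(c(d_ab = d_ab, d_bc = d_bc))
```

Both distances come out as 1, so two rows that share nothing but a missing x look exactly as close as two rows that share an observed y.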
Another thought is to treat every NA as different from every other value, including other NAs. For example, my data has 8 NAs in x, so could I treat them as 8 distinct categories?
dat[c(1:5,9,17,20),1] <- paste("NA",1:8,sep=""); dat[c(8,11),2] <- paste("NA",1:2,sep="")
(cl <- kmodes(dat,modes=dat[c(6,7),]))
K-modes clustering with 2 clusters of sizes 10, 10
Cluster modes:
x y
1 "c" "H"
2 "a" "G"
Clustering vector:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
1 1 2 1 1 1 1 2 2 1 1 2 2 2 1 2 1 2 2 2
Within cluster simple-matching distance by cluster:
[1] 13 5
Available components:
[1] "cluster" "size" "modes" "withindiff" "iterations" "weighted"
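The same recoding can be written without hard-coding the row indices. This is only a rough sketch: the helper name `unique_na_labels` is made up here, and it assumes every column is a character vector, as in my data.

```r
# Replace each NA in every column with its own label ("NA1", "NA2", ...),
# so that no two missing values can match each other.
# Hypothetical helper; assumes all columns are character vectors.
unique_na_labels <- function(df) {
  for (col in names(df)) {
    missing <- is.na(df[[col]])
    if (any(missing)) {
      df[[col]][missing] <- paste0("NA", seq_len(sum(missing)))
    }
  }
  df
}

# Small made-up example
toy <- data.frame(x = c(NA, "a", NA), y = c("G", NA, "H"),
                  stringsAsFactors = FALSE)
recoded <- unique_na_labels(toy)
print(recoded)
```

The labels are only unique within a column, which is enough here because simple matching compares values column by column.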
Any comments or new solutions are appreciated.