K-modes clustering in R for categorical data with NAs

Question

dat <- data.frame(x = sample(letters[1:3], 20, TRUE),
                  y = sample(LETTERS[7:9], 20, TRUE),
                  stringsAsFactors = FALSE)
# Introduce missing values in both columns
dat[c(1:5, 9, 17, 20), 1] <- NA
dat[c(8, 11), 2] <- NA
dat
      x    y
1  <NA>    H
2  <NA>    I
3  <NA>    G
4  <NA>    H
5  <NA>    I
6     c    H
7     c    G
8     a <NA>
9  <NA>    G
10    c    G
11    b <NA>
12    a    G
13    a    G
14    a    G
15    b    I
16    a    G
17 <NA>    H
18    a    I
19    a    G
20 <NA>    G

I'm trying to cluster this categorical data using klaR::kmodes, but I'm having trouble dealing with the NAs.

A workaround I came up with is treating NAs as a new category:

library(klaR)
# Recode missing values as the literal string "NA"
dat[c(1:5, 9, 17, 20), 1] <- "NA"
dat[c(8, 11), 2] <- "NA"
(cl <- kmodes(dat, modes = dat[c(6, 7), ]))
K-modes clustering with 2 clusters of sizes 11, 9

Cluster modes:
  x    y  
1 "NA" "H"
2 "a"  "G"

Clustering vector:
 1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 
 1  1  1  1  1  1  2  2  1  2  1  2  2  2  1  2  1  2  2  1 

Within cluster simple-matching distance by cluster:
[1] 10  4

Available components:
[1] "cluster"    "size"       "modes"      "withindiff" "iterations" "weighted"

This is flawed because kmodes by default uses simple-matching distance to measure the dissimilarity between two objects, so two NAs are counted as a match even though the underlying values are unknown.
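
To make the concern concrete, here is a minimal base-R sketch of simple-matching dissimilarity, i.e. the number of attributes on which two observations differ. The helper name simple_matching is hypothetical and is not klaR's internal code; it only illustrates why recoding NA as the string "NA" makes two missing values count as agreement.

```r
# Hypothetical helper, not part of klaR: simple-matching dissimilarity
# counts the attributes on which two observations differ.
simple_matching <- function(a, b) sum(a != b)

# Two rows whose x value is missing, recoded as the literal string "NA".
row1 <- c(x = "NA", y = "H")
row2 <- c(x = "NA", y = "I")

# Only y differs; the two "NA" strings are treated as equal.
d_recoded <- simple_matching(row1, row2)

# With real NA values the comparison itself is undefined.
d_missing <- simple_matching(c(x = NA, y = "H"), c(x = NA, y = "I"))

print(d_recoded)  # 1
print(d_missing)  # NA
```

The recoded rows end up at distance 1 rather than 2, which is the "NA matches NA" effect described above.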

Another idea is to treat every NA as distinct. For example, my data has 8 NAs in x, so could I treat them as 8 different categories (and the 2 NAs in y as 2 more)?

# Give every missing value its own label: NA1, NA2, ...
dat[c(1:5, 9, 17, 20), 1] <- paste0("NA", 1:8)
dat[c(8, 11), 2] <- paste0("NA", 1:2)
(cl <- kmodes(dat, modes = dat[c(6, 7), ]))
K-modes clustering with 2 clusters of sizes 10, 10

Cluster modes:
  x   y  
1 "c" "H"
2 "a" "G"

Clustering vector:
 1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 
 1  1  2  1  1  1  1  2  2  1  1  2  2  2  1  2  1  2  2  2 

Within cluster simple-matching distance by cluster:
[1] 13  5

Available components:
[1] "cluster"    "size"       "modes"      "withindiff" "iterations" "weighted"
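
The unique-label recoding above hard-codes the row indices. A minimal sketch of doing the same thing generically, assuming character columns that still contain real NA values (not the "NA" strings from the first workaround), might look like this. The helper name label_nas_uniquely and its prefix argument are hypothetical, and the sketch assumes no genuine category is already called "NA1", "NA2", and so on.

```r
# Hypothetical helper: replace each NA in a character vector with its own
# label (prefix followed by a running number), leaving other values as-is.
label_nas_uniquely <- function(v, prefix = "NA") {
  idx <- which(is.na(v))
  v[idx] <- paste0(prefix, seq_along(idx))
  v
}

x <- c(NA, NA, "c", "a", NA)
x_labelled <- label_nas_uniquely(x)
print(x_labelled)

# Applied to every column of a data frame of character columns:
# dat[] <- lapply(dat, label_nas_uniquely)
```

This only automates the recoding; it does not settle whether treating each NA as its own category is a sound way to cluster.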

Any comments or new solutions are appreciated.
