The k-means family of partitional clustering algorithms is built on computing means, so by its nature it accepts only numeric values. You are getting an error because the data frame contains both numeric and categorical values, which cmeans() cannot handle. Also, there is no need to convert the data frame to a matrix, because that is not the actual problem.
Alternative approach
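Encode the character variable as integer codes and then apply clustering. This way there is no need to drop any variable. Keep in mind that the codes impose an arbitrary order and spacing on the categories, and the clustering treats them like any other numeric column.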
# create the example data frame
df <- data.frame(
  a = c("aaaa", "bbbb", "cccc", "dddd", "eeee"),
  b = c(97, 90, 93, 97, 90),
  c = c(97, 90, 93, 97, 90),
  d = c(85, 91, 87, 91, 93),
  e = c(85, 91, 87, 91, 93),
  stringsAsFactors = FALSE
)
# show the dataframe
df
     a  b  c  d  e
1 aaaa 97 97 85 85
2 bbbb 90 90 91 91
3 cccc 93 93 87 87
4 dddd 97 97 91 91
5 eeee 90 90 93 93
# Encode the character variable as 0-based integer codes
df$a <- as.numeric(factor(df$a)) - 1
df
  a  b  c  d  e
1 0 97 97 85 85
2 1 90 90 91 91
3 2 93 93 87 87
4 3 97 97 91 91
5 4 90 90 93 93
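If the data frame has more than one character column, the same encoding can be applied to all of them in one pass. Below is a minimal sketch in base R. The helper name encode_chars and the toy data are made up for illustration and are not part of the question.

```r
# Hypothetical helper: replace every character or factor column
# with 0-based integer codes; other columns are left untouched.
encode_chars <- function(data) {
  data[] <- lapply(data, function(col) {
    if (is.character(col) || is.factor(col)) {
      as.numeric(factor(col)) - 1
    } else {
      col
    }
  })
  data
}

# Made-up example data with two character columns
toy <- data.frame(
  a = c("x", "y", "x"),
  g = c("lo", "hi", "lo"),
  b = c(97, 90, 93),
  stringsAsFactors = FALSE
)

encoded <- encode_chars(toy)
str(encoded)  # all columns are numeric now
```

Codes follow the sorted factor levels, so levels(factor(toy$g)) shows which label each code stands for.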
# Apply clustering
library(e1071)
cmeans(df, 2)
Fuzzy c-means clustering with 2 clusters

Cluster centers:
      a     b     c     d     e
1 1.406 95.72 95.72 87.18 87.18
2 2.510 90.36 90.36 91.85 91.85

Memberships:
           1       2
[1,] 0.92728 0.07272
[2,] 0.04014 0.95986
[3,] 0.80061 0.19939
[4,] 0.72009 0.27991
[5,] 0.03544 0.96456

Closest hard clustering:
[1] 1 2 1 1 2

Available components:
[1] "centers"     "size"        "cluster"     "membership"  "iter"
[6] "withinerror" "call"
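The components listed in the output can be pulled out of the returned object by name. Here is a short sketch, assuming the e1071 package is installed. The seed value and the name res are arbitrary choices for illustration. Since cmeans() picks its initial centers at random when given only a cluster count, the cluster labels can differ between runs unless a seed is fixed.

```r
library(e1071)

# Same encoded data as above
df <- data.frame(
  a = 0:4,
  b = c(97, 90, 93, 97, 90),
  c = c(97, 90, 93, 97, 90),
  d = c(85, 91, 87, 91, 93),
  e = c(85, 91, 87, 91, 93)
)

set.seed(1)            # arbitrary seed, only for reproducibility
res <- cmeans(df, 2)   # 2 clusters, as above

res$cluster      # hard assignment: one cluster label per row
res$membership   # fuzzy membership degrees, one row per observation
res$centers      # cluster centers, one row per cluster

# Attach the hard labels to the data
df$cluster <- res$cluster
```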