Problem with running c-means clustering on my data in R

Question

How can I fix the following error for this data?

> x=data.frame(v1=c("a" ,"b" ,"c" ,"d" ,"e"),
+ v2=c(97 ,90 ,93 ,97 ,90),
+ v3=c( 85 ,91 ,87 ,91 ,93))
> library(e1071)
> f <- cmeans(x, 2)
Error in cmeans(x, 2) : NA/NaN/Inf in foreign function call (arg 1)
In addition: Warning messages:
1: In cmeans(x, 2) : NAs introduced by coercion
2: In cmeans(x, 2) : NAs introduced by coercion
> f

I want to apply c-means to my data, as illustrated in the code above. The data contains three vectors: v1, v2, and v3. I want to run c-means and label the results by vector v1.

score 2 · Accepted Answer · answered May 09 '19 at 21:06

If we look at the documentation of ?cmeans,

x - The data matrix where columns correspond to variables and rows to observations.

So we can convert the data.frame to a matrix after removing the character column (the first column):

x1 <- as.matrix(x[-1])
row.names(x1) <- x[,1]
cmeans(x1, 2)
#Fuzzy c-means clustering with 2 clusters

#Cluster centers:
#        v2       v3
#1 90.30090 91.85191
#2 95.75436 87.22535

#Memberships:
#           1          2
#a 0.06614213 0.93385787
#b 0.98305641 0.01694359
#c 0.19855988 0.80144012
#d 0.25730888 0.74269112
#e 0.97924422 0.02075578

#Closest hard clustering:
#a b c d e 
#2 1 2 2 1 

#Available components:
#[1] "centers"     "size"        "cluster"     "membership"  "iter"        "withinerror" "call"
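
To label the result by v1 without hard-coding the column position, something like the following may work. It is only a sketch: the numeric columns are picked with `is.numeric`, and the call to `cmeans()` is guarded with `requireNamespace()` so the preprocessing part runs even where e1071 is not installed. The names `num_cols`, `fit`, and `labelled` are illustrative and not part of the answer above.

```r
x <- data.frame(v1 = c("a", "b", "c", "d", "e"),
                v2 = c(97, 90, 93, 97, 90),
                v3 = c(85, 91, 87, 91, 93),
                stringsAsFactors = FALSE)

# Keep only the numeric columns and use v1 as row names
num_cols <- vapply(x, is.numeric, logical(1))
x1 <- as.matrix(x[, num_cols, drop = FALSE])
rownames(x1) <- x$v1

if (requireNamespace("e1071", quietly = TRUE)) {
  fit <- e1071::cmeans(x1, centers = 2)
  # Pair each v1 label with its closest hard cluster
  labelled <- data.frame(label = rownames(x1),
                         cluster = unname(fit$cluster))
  print(labelled)
}
```

The cluster numbers (1 or 2) can swap between runs because the initial centers are chosen randomly, so calling `set.seed()` first is advisable if the result needs to be reproducible.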

score 0 · Answer 2 · answered May 10 '19 at 06:17

The k-means family of partitional clustering algorithms is built on means, which by their nature can only be computed for numeric values. You are getting the error because the data frame contains both numeric and categorical values, which cmeans() cannot handle. Also, there is no need to convert the data frame to a matrix, because that is not the actual problem.
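
The warning "NAs introduced by coercion" can be reproduced in base R. The sketch below assumes that `cmeans()` ends up coercing its input to numeric values (its documentation describes `x` as a data matrix). It shows that a data frame with a character column becomes a character matrix, and that only the letters fail to parse as numbers. The names `m`, `num`, and `na_per_column` are illustrative.

```r
x <- data.frame(v1 = c("a", "b", "c", "d", "e"),
                v2 = c(97, 90, 93, 97, 90),
                v3 = c(85, 91, 87, 91, 93),
                stringsAsFactors = FALSE)

# A data frame with mixed column types becomes a character matrix
m <- as.matrix(x)
typeof(m)

# Converting that matrix to numbers turns the letters into NA
num <- suppressWarnings(as.numeric(m))

# Count the NAs per original column: only v1 is affected
na_per_column <- colSums(is.na(matrix(num, nrow = nrow(x))))
names(na_per_column) <- names(x)
na_per_column
```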

Alternative approach

Convert the character variable to numeric codes and then apply clustering. This way there is no need to drop any variable.

# create empty data frame
df<- setNames(data.frame(matrix(ncol = 5, nrow = 5)), c("a" ,"b" ,"c" ,"d" ,"e"))

# fill values
df$a<- c("aaaa" ,"bbbb" ,"cccc" ,"dddd" ,"eeee")
df$b<- c(97 ,90 ,93 ,97 ,90)
df$c<- c(97 ,90 ,93 ,97 ,90)
df$d<- c( 85 ,91 ,87 ,91 ,93)
df$e<- c( 85 ,91 ,87 ,91 ,93)

# show the dataframe
df
     a  b  c  d  e
1 aaaa 97 97 85 85
2 bbbb 90 90 91 91
3 cccc 93 93 87 87
4 dddd 97 97 91 91
5 eeee 90 90 93 93

# Encode the character variable as 0-based integer codes
df$a <- as.numeric( factor(df$a) ) -1
df
  a  b  c  d  e
1 0 97 97 85 85
2 1 90 90 91 91
3 2 93 93 87 87
4 3 97 97 91 91
5 4 90 90 93 93

# Apply clustering
library(e1071)
cmeans(df, 2)
Fuzzy c-means clustering with 2 clusters

Cluster centers:
      a     b     c     d     e
1 1.406 95.72 95.72 87.18 87.18
2 2.510 90.36 90.36 91.85 91.85

Memberships:
           1       2
[1,] 0.92728 0.07272
[2,] 0.04014 0.95986
[3,] 0.80061 0.19939
[4,] 0.72009 0.27991
[5,] 0.03544 0.96456

Closest hard clustering:
[1] 1 2 1 1 2

Available components:
[1] "centers"     "size"        "cluster"     "membership"  "iter"       
[6] "withinerror" "call"
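
A note on the encoding step `as.numeric(factor(df$a)) - 1` used above: `factor()` sorts its levels by default, so the codes follow the sorted order of the labels, not their order of appearance. A small sketch with hypothetical labels (`"low"`, `"high"`, `"mid"` are made up for illustration):

```r
labels <- c("low", "high", "mid")   # hypothetical labels, not from the question

f <- factor(labels)
levels(f)                           # levels are sorted: "high" "low" "mid"

# 0-based integer codes, in the same order as the input labels
codes <- as.numeric(f) - 1
setNames(codes, labels)
```

Here `"low"` gets code 1, `"high"` gets 0, and `"mid"` gets 2. The code column then enters the distance computation like any other numeric variable, so the numbers carry an ordering and spacing that the original labels may not have.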
