Clustering for Categorical and Numerical data

Question

I have a collection of alerts and I want to group them based on similarity/distance. Since some of the data is non-numeric, how can I perform clustering on this kind of data?

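  set.seed(42)
  # Sample alerts: four host fields plus one numeric field
  alerts <- data.frame(
    Host1 = rep("del", 10),
    Host2 = c(rep("cpp", 4), rep("sscp", 3), rep("portal", 3)),
    Host3 = c(rep("web", 5), rep("apache", 3), rep("app", 2)),
    # Mixing numbers with "con" coerces the whole column to character
    Host4 = c(sample(3, 8, replace = TRUE), rep("con", 2)),
    Date1 = abs(round(1:10 + rnorm(10), 2))
  )
  alerts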



   Host1  Host2  Host3 Host4 Date1
1    del    cpp    web     3  1.40
2    del    cpp    web     3  1.89
3    del    cpp    web     1  4.51
4    del    cpp    web     3  3.91
5    del   sscp    web     2  7.02
6    del   sscp apache     2  5.94
7    del   sscp apache     3  8.30
8    del portal apache     1 10.29
9    del portal    app   con  7.61
10   del portal    app   con  9.72

Looking forward to building clusters.

score 2 · Answer 1 · answered Nov 13 '15 at 12:05

K-means only works for numerical (continuous) data

By definition, it minimizes squared deviations. Minimizing squared deviations only makes sense on continuous data. Any kind of one-hot encoding is only a hack; it makes the data types compatible, but it does not make the approach sensible.
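To make the point about one-hot encoding concrete, here is a minimal sketch in base R. It one-hot encodes the `Host2` values from the question and runs `kmeans()` on the result. The choice of two clusters and the seed are arbitrary assumptions made only for illustration.

```r
# Host2 values copied from the question's sample data
host2 <- factor(c(rep("cpp", 4), rep("sscp", 3), rep("portal", 3)))

# One-hot encode: one 0/1 indicator column per level
onehot <- model.matrix(~ host2 - 1)

set.seed(1)  # k-means starts from randomly chosen centers
fit <- kmeans(onehot, centers = 2)  # 2 clusters is an arbitrary choice

# Each center is a column-wise mean of 0/1 indicators
print(round(fit$centers, 2))
```

Because there are three distinct categories and only two clusters, one center has to average over two different categories. Its coordinates are fractions, so it is a "mean host" that corresponds to no real category, which is why the result is hard to interpret.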

What is your similarity / distance?

Hierarchical clustering would work, provided you can define a meaningful distance function that quantifies how dissimilar two alerts are. But this is application-dependent. We do not have your data and do not understand your problem, so we cannot solve this for you.
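As an illustration only, the sketch below shows what plugging a hand-written distance into hierarchical clustering might look like in base R. The function `mixed_dist` is hypothetical: it assumes a categorical mismatch counts as 1 and a numeric difference is scaled by the column's range, which is essentially the idea behind Gower's dissimilarity. Whether that is meaningful for your alerts is exactly the application-dependent question. The data values are typed in from the table printed in the question, and `Host1` is left out because it is constant.

```r
# Data typed in from the question's printed sample (Host1 omitted: constant)
alerts <- data.frame(
  Host2 = c(rep("cpp", 4), rep("sscp", 3), rep("portal", 3)),
  Host3 = c(rep("web", 5), rep("apache", 3), rep("app", 2)),
  Host4 = c("3", "3", "1", "3", "2", "2", "3", "1", "con", "con"),
  Date1 = c(1.40, 1.89, 4.51, 3.91, 7.02, 5.94, 8.30, 10.29, 7.61, 9.72)
)

# Hypothetical distance (an assumption, not a recommendation):
# average the per-column dissimilarities, where a categorical column
# contributes 0 (match) or 1 (mismatch) and a numeric column contributes
# |difference| / range of that column.
mixed_dist <- function(df) {
  n <- nrow(df)
  total <- matrix(0, n, n)
  for (col in names(df)) {
    v <- df[[col]]
    if (is.numeric(v)) {
      part <- abs(outer(v, v, "-")) / diff(range(v))
    } else {
      part <- outer(as.character(v), as.character(v), "!=") * 1
    }
    total <- total + part
  }
  as.dist(total / ncol(df))
}

d <- mixed_dist(alerts)
tree <- hclust(d, method = "average")  # average linkage, chosen arbitrarily
groups <- cutree(tree, k = 3)          # k = 3 is also an arbitrary choice
print(groups)
```

Every column gets equal weight here. If some fields matter more for your notion of "similar alerts", the weights (or the whole function) should change accordingly. The `cluster` package's `daisy()` function with `metric = "gower"` is a commonly used packaged version of the same idea.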

answered Nov 13 '15 at 12:05

Has QUIT--Anony-Mousse


I know this obvious fact. That is why I asked this question; I wonder whether there is something we can do to perform clustering. – Navin Manaswi Nov 13 '15 at 16:26

Before you edited your question, you asked whether k-means works... anyway, I already mentioned that **you need to define what is similar**. There is no way around this, and it is specific to your use case. – Has QUIT--Anony-Mousse Nov 13 '15 at 16:30
