
I have a collection of alerts and I want to group them based on similarity/distance. Since the data is non-numeric, how can I perform clustering on this kind of data?

    set.seed(42)
    data.frame(Host1 = rep("del", 10),
               Host2 = c(rep("cpp", 4), rep("sscp", 3), rep("portal", 3)),
               Host3 = c(rep("web", 5), rep("apache", 3), rep("app", 2)),
               # mixing numbers with "con" coerces the whole column to character
               Host4 = c(sample(3, 8, replace = TRUE), rep("con", 2)),
               Date1 = abs(round(1:10 + rnorm(10), 2)))



   Host1  Host2  Host3 Host4 Date1
1    del    cpp    web     3  1.40
2    del    cpp    web     3  1.89
3    del    cpp    web     1  4.51
4    del    cpp    web     3  3.91
5    del   sscp    web     2  7.02
6    del   sscp apache     2  5.94
7    del   sscp apache     3  8.30
8    del portal apache     1 10.29
9    del portal    app   con  7.61
10   del portal    app   con  9.72

Looking forward to building clusters.

Navin Manaswi

1 Answer


K-means only works for numerical (continuous) data

By definition, it minimizes squared deviations. Minimizing squared deviations only makes sense on continuous data. Any kind of one-hot encoding is only a hack; it makes the data types compatible, but not the approach sensible.
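To illustrate why one-hot encoding does not make the approach sensible, here is a small sketch in base R. The `host2` values are made up for illustration. It computes the mean of a group of one-hot rows, which is what k-means would use as a centroid:

```r
# Four hypothetical alerts, described by a single categorical column.
host2 <- factor(c("cpp", "cpp", "sscp", "portal"))

# One-hot encode: one 0/1 indicator column per level, no intercept.
onehot <- model.matrix(~ host2 - 1)

# The "centroid" k-means would compute for these four rows.
centroid <- colMeans(onehot)
print(centroid)
```

The resulting centroid has fractional entries (half "cpp", a quarter each of "portal" and "sscp"), which corresponds to no actual host. The squared deviations from such a point are computable, but it is hard to say what they mean.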

What is your similarity / distance?

Hierarchical clustering would work, provided you can define a meaningful distance function that quantifies how dissimilar two alerts are. But this is application-dependent. We do not have your data and do not understand your problem, so we cannot solve this for you.
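As a rough sketch of what this could look like in base R, assume a Gower-style distance: a 0/1 mismatch for categorical columns and a range-scaled absolute difference for numeric ones, averaged over columns. The helper name `alert_dist`, the selection of columns, the average linkage and the cut at `k = 3` are all assumptions for illustration. Whether this distance is meaningful for your alerts is exactly the application-dependent part.

```r
# Toy alert data shaped like the question's table (values typed by hand).
alerts <- data.frame(
  Host2 = c("cpp", "cpp", "cpp", "cpp", "sscp",
            "sscp", "sscp", "portal", "portal", "portal"),
  Host3 = c("web", "web", "web", "web", "web",
            "apache", "apache", "apache", "app", "app"),
  Date1 = c(1.40, 1.89, 4.51, 3.91, 7.02, 5.94, 8.30, 10.29, 7.61, 9.72),
  stringsAsFactors = FALSE
)

# Hypothetical helper: average of per-column dissimilarities.
alert_dist <- function(df) {
  n <- nrow(df)
  total <- matrix(0, n, n)
  for (col in names(df)) {
    x <- df[[col]]
    if (is.numeric(x)) {
      # numeric column: absolute difference scaled by the column's range
      rng <- diff(range(x))
      part <- if (rng > 0) abs(outer(x, x, "-")) / rng else matrix(0, n, n)
    } else {
      # categorical column: 1 if the values differ, 0 if they match
      part <- outer(x, x, "!=") * 1
    }
    total <- total + part
  }
  as.dist(total / ncol(df))
}

d <- alert_dist(alerts)

# Hierarchical clustering on the custom distance, cut into 3 groups.
hc <- hclust(d, method = "average")
groups <- cutree(hc, k = 3)
print(groups)
```

Any other distance can be swapped in, as long as it ends up as a `dist` object that `hclust` accepts. For example, you could weight some columns more heavily than others if they matter more for your alerts.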

Has QUIT--Anony-Mousse
  • I know this obvious fact. That is why I asked this question, wondering whether we can do something to perform clustering – Navin Manaswi Nov 13 '15 at 16:26
  • Before you edited your question, you asked whether k-means works... anyway, I already mentioned that **you need to define what is similar**. There is no way around this, and it is specific to your use case. – Has QUIT--Anony-Mousse Nov 13 '15 at 16:30