I've been trying to perform clustering with the NbClust library. My data set includes both categorical and numerical variables, and I one-hot encoded the categorical ones. The results made sense, but I have been told that when the data includes categorical variables, K-modes should be used instead of NbClust. Can anyone explain why K-modes is better when categorical variables are involved, and how to choose the most suitable number of iterations for it?
1 Answer
K-modes is more appropriate for categorical data because it represents each cluster by its mode (the most frequent category of each attribute) rather than by a mean.
With one-hot encoding, the problem is that the vectors the algorithm computes (for example, cluster means) no longer correspond to actual categories. You'll get vectors such as (0.3, 0.3, 0.1, 0.3), which you cannot really interpret as categories, can you? So what are these algorithms actually doing? What are they optimizing?
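To make this concrete, here is a minimal sketch in plain Python. The color attribute and the ten observations are made up for illustration, and the code only contrasts a mean-based center with a mode-based one; it is not an implementation of NbClust or of any particular K-modes package.

```python
from collections import Counter

# Hypothetical categorical attribute and a hypothetical cluster of 10 observations.
levels = ["red", "green", "blue", "yellow"]
cluster = ["red"] * 4 + ["green"] * 3 + ["blue"] * 1 + ["yellow"] * 2


def one_hot(value, levels):
    """Encode a single category as a 0/1 indicator vector."""
    return [1.0 if value == level else 0.0 for level in levels]


encoded = [one_hot(value, levels) for value in cluster]

# Mean-based center (what a k-means style update computes on one-hot data):
# the per-column average, which is a vector of category proportions.
mean_center = [sum(column) / len(encoded) for column in zip(*encoded)]

# Mode-based center (the K-modes idea): the most frequent category itself.
mode_center = Counter(cluster).most_common(1)[0][0]

print(mean_center)  # proportions per level, not a valid one-hot vector
print(mode_center)  # an actual category from the data
```

The mean-based center is a vector of fractions that matches none of the encoded observations, so it does not name a category. The mode-based center is always one of the original categories, which is why it stays interpretable.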
One-hot encoding data is an ugly hack, not a solution.

Has QUIT--Anony-Mousse