Deciding on a clustering algorithm for a dataset containing both categorical and numerical variables

Question

I am new to machine learning and am trying to do segmentation with clustering algorithms. However, since my dataset has both categorical variables (such as gender, marital status, preferred social media platform, etc.) and numerical variables (average expenditure, age, income, etc.), I cannot decide which algorithms are worth focusing on. Which should I try in order to compare with k-means++: fuzzy c-means, k-medoids, or latent class? Which ones would yield better results for this type of mixed dataset?

Bonus question: Should I try clustering without dimensionality reduction, or should I use PCA or K-PCA in any case to reduce the dimensions? Also, how can I understand and interpret the results without visualization if the dataset has more than 3 dimensions?

Unless you have a programming-related question, this question is better suited for [Cross Validated](https://stats.stackexchange.com/). — Mihai Chelaru, Apr 26 '18 at 15:33
@MihaiChelaru if you suggest a different site, tell them to *not* post a duplicate, but flag for moderator migration to *move* the question, please! — Has QUIT--Anony-Mousse, Apr 27 '18 at 06:29
Will do. Most of the time I do one of those things but I realize now I should do both together. Thanks for the heads up. — Mihai Chelaru, Apr 27 '18 at 14:08

score 1 · Answer 1 · answered Apr 27 '18 at 06:14

The best thing to try is hierarchical agglomerative clustering with a distance metric such as Gower's.

Mixed data with different scales usually does not work in any statistically meaningful way. You have too many weights to choose, so no result will be statistically well founded; it will largely be a product of your weighting. So it's impossible to argue that some result is the "true" clustering. Don't expect the results to be very good.
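As a rough illustration of the suggestion above, here is a minimal sketch of Gower's distance written by hand in plain Python. The `gower_matrix` helper, the column names, and the sample rows are all hypothetical, and every feature gets equal weight, which is exactly the kind of weighting choice that shapes the result.

```python
def gower_matrix(rows, numeric_cols, categorical_cols):
    """Return the n x n Gower distance matrix with equal feature weights."""
    # Range of each numeric column, used to scale absolute differences into [0, 1].
    ranges = {}
    for col in numeric_cols:
        values = [row[col] for row in rows]
        ranges[col] = max(values) - min(values)

    n_features = len(numeric_cols) + len(categorical_cols)
    n = len(rows)
    dist = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            total = 0.0
            # Numeric features: range-normalised absolute difference.
            for col in numeric_cols:
                if ranges[col] > 0:
                    total += abs(rows[i][col] - rows[j][col]) / ranges[col]
            # Categorical features: 0 if the values match, 1 otherwise.
            for col in categorical_cols:
                total += 0.0 if rows[i][col] == rows[j][col] else 1.0
            # Average over all features (equal weights are an assumption).
            dist[i][j] = dist[j][i] = total / n_features
    return dist


# Made-up sample data, for illustration only.
customers = [
    {"age": 25, "income": 30000, "gender": "F", "platform": "twitter"},
    {"age": 45, "income": 90000, "gender": "M", "platform": "facebook"},
    {"age": 27, "income": 32000, "gender": "F", "platform": "twitter"},
]
D = gower_matrix(customers, ["age", "income"], ["gender", "platform"])
```

The resulting square matrix can be handed to any hierarchical clustering routine that accepts precomputed distances.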

Has QUIT--Anony-Mousse

Can I use the Gower results directly as input to the hierarchical agglomerative clustering algorithm, or is there a procedure in between? – Beg May 31 '18 at 06:03
Most linkages will work fine with any matrix of "distance-like" values. In some cases, you can even just use -1*similarity if you want to run it with a similarity matrix instead of a distance matrix. – Has QUIT--Anony-Mousse May 31 '18 at 09:45
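To make the point about precomputed matrices concrete, here is a small sketch using SciPy, assuming a symmetric distance matrix with a zero diagonal is already available. The matrix values below are invented for illustration; average linkage is chosen because it does not assume Euclidean geometry, whereas Ward and centroid linkage do.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform

# Hypothetical 4 x 4 distance matrix (symmetric, zero diagonal),
# standing in for precomputed Gower distances.
D = np.array([
    [0.0, 0.1, 0.8, 0.9],
    [0.1, 0.0, 0.7, 0.8],
    [0.8, 0.7, 0.0, 0.2],
    [0.9, 0.8, 0.2, 0.0],
])

# linkage() expects the condensed (upper-triangle) form, not the square matrix.
condensed = squareform(D, checks=True)

# Agglomerative clustering with average linkage on the precomputed distances.
Z = linkage(condensed, method="average")

# Cut the dendrogram into two flat clusters.
labels = fcluster(Z, t=2, criterion="maxclust")
```

The only step in between is the conversion to condensed form; passing the square matrix straight to `linkage` would make SciPy treat each row as a point in feature space instead.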

score 0 · Answer 2 · answered Apr 26 '18 at 16:47

Generally, when you have categorical data, you try to encode it as "numerical" values. In your case, consider social media: twitter, facebook, google-plus. You might be tempted to encode them as twitter: 0, facebook: 1, google-plus: 2. But this encoding has a problem: it implies to the machine learning algorithm that google-plus is twice facebook, which is not what you want.

Enter one-hot encoding: it converts categorical data into a vector of bits, so you will have a number of bits equal to the number of categories present in your data:

social media  |  binary vector (bits in order: is_twitter, is_facebook, is_google_plus)
twitter       |  1, 0, 0
facebook      |  0, 1, 0
google-plus   |  0, 0, 1

Now you can apply any ML algorithm, since all of your data is numerical.
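The table above can be reproduced with a short sketch using `pandas.get_dummies`. The DataFrame, its column names, and the `is` prefix are assumptions made for illustration. Note that the numeric column is left on its original scale.

```python
import pandas as pd

# Made-up sample data, for illustration only.
df = pd.DataFrame({
    "social_media": ["twitter", "facebook", "google-plus"],
    "age": [25, 45, 31],
})

# Replace the categorical column with one 0/1 indicator column per category.
encoded = pd.get_dummies(df, columns=["social_media"], prefix="is", dtype=int)
```

Each row ends up with exactly one indicator set to 1, so no artificial ordering between the platforms is introduced.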

More here: One hot encoding in scikit
