-1

I am new to machine learning and am trying to build a segmentation with clustering algorithms. However, since my dataset has both categorical variables (such as gender, marital status, preferred social media platform, etc.) and numerical variables (average expenditure, age, income, etc.), I cannot decide which algorithms are worth focusing on. Which should I try to compare with k-means++: fuzzy c-means, k-medoids, or latent class? Which of these would yield better results for this type of mixed dataset?

Bonus question: Should I cluster without dimensionality reduction, or should I use PCA or K-PCA in any case to reduce the dimensions? Also, how can I understand and interpret the results without visualization if the dataset has more than 3 dimensions?

Beg

2 Answers

1

The best thing to try is hierarchical agglomerative clustering with a distance metric such as Gower's.

Mixed data with different scales usually does not work in any statistically meaningful way. You have too many weights to choose, so no result will be statistically well founded; it will largely be a product of your weighting. It is therefore impossible to argue that any particular result is the "true" clustering, so don't expect the results to be very good.
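To make the suggestion concrete, here is a minimal sketch of Gower's distance followed by agglomerative clustering with SciPy. The toy customer data, the `gower_matrix` helper, the equal feature weights, and the choice of average linkage with two clusters are all assumptions made for illustration; the equal weighting in particular is exactly the kind of arbitrary choice described above.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform


def gower_matrix(numeric, categorical):
    """Hypothetical helper: Gower distance with equal feature weights.

    Numeric features contribute |a - b| / range, categorical features
    contribute 0 when equal and 1 when different; the per-feature
    contributions are then averaged.
    """
    numeric = np.asarray(numeric, dtype=float)
    categorical = np.asarray(categorical).astype(str)

    # Range of each numeric column (guard against constant columns).
    ranges = numeric.max(axis=0) - numeric.min(axis=0)
    ranges[ranges == 0] = 1.0

    # Pairwise per-feature dissimilarities, shape (n, n, n_features).
    num_part = np.abs(numeric[:, None, :] - numeric[None, :, :]) / ranges
    cat_part = (categorical[:, None, :] != categorical[None, :, :]).astype(float)

    n_features = numeric.shape[1] + categorical.shape[1]
    return (num_part.sum(axis=2) + cat_part.sum(axis=2)) / n_features


# Made-up data: columns are (age, income in thousands) and (gender, platform).
numeric = [[22, 30], [25, 32], [24, 31], [55, 90], [60, 95], [58, 88]]
categorical = [
    ["F", "twitter"], ["F", "twitter"], ["F", "twitter"],
    ["M", "facebook"], ["M", "facebook"], ["M", "facebook"],
]

D = gower_matrix(numeric, categorical)

# SciPy's linkage expects the condensed (upper-triangle) form of the matrix.
condensed = squareform(D, checks=False)
Z = linkage(condensed, method="average")

# Cut the dendrogram into at most two clusters.
labels = fcluster(Z, t=2, criterion="maxclust")
print(labels)
```

Average linkage is used here because it only needs pairwise dissimilarities; Ward and centroid linkage assume Euclidean distances, so they are less natural for a Gower matrix.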

Has QUIT--Anony-Mousse
  • Can I use the Gower's results directly as an input to the hierarchical agglomerative clustering algorithm or is there any procedure in between? – Beg May 31 '18 at 06:03
  • Most linkages will work fine with any matrix of "distance-like" values. In some cases, you can even just use -1*similarity if you want to run it with a similarity matrix instead of a distance matrix. – Has QUIT--Anony-Mousse May 31 '18 at 09:45
0

Generally, when you have categorical data, you try to encode it as numerical values. In your case, consider social media: twitter, facebook, google-plus. You might be tempted to encode them as twitter: 0, facebook: 1, google-plus: 2. But this encoding has a problem: it implies to the machine learning algorithm that the categories are ordered and that google-plus is twice facebook, which is not what you want.

Enter one-hot encoding: it converts categorical data into a vector of bits. You will have as many bits as there are categories in your data:

social media  |  binary vector (bits in order: is_twitter, is_facebook, is_google_plus)
twitter       |  1, 0, 0
facebook      |  0, 1, 0
google-plus   |  0, 0, 1

Now you can apply any ML algorithm, since all of your data is numerical.
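As a rough sketch, the table above can be reproduced with scikit-learn's `OneHotEncoder`. The sample `platforms` array is made up, and the explicit `categories` argument is only there to fix the column order to match the table (by default the categories are sorted alphabetically).

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Made-up sample column: one row per customer.
platforms = np.array([["twitter"], ["facebook"], ["google-plus"], ["facebook"]])

# Fix the column order to (is_twitter, is_facebook, is_google_plus).
encoder = OneHotEncoder(categories=[["twitter", "facebook", "google-plus"]])

# fit_transform returns a sparse matrix by default; convert it to a dense array.
encoded = encoder.fit_transform(platforms).toarray()
print(encoded)
```

The encoded columns can then be stacked next to the numerical columns (for example with `np.hstack`) before clustering.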

More here: One hot encoding in scikit

bits