K-Means clustering for multivariate data (with both discrete and continuous attributes)

Question

I would like to know how I can cluster a multivariate dataset using K-means. Each sample in this dataset corresponds to a Person (I have 6000 people), and each Person has both continuous and discrete attributes (10 attributes/Person). An example:

person_id: 1234

name: "John Doe"

age: 30

height: '5 ft 10 in'

salary_value: 5000

Salary_currency: USD

is_customer: False

Company: "Testing Inc."

...

I have read an existing answer on multidimensional k-means clustering, but the attributes in the dataset there are all continuous. An even more helpful read was a post about clustering algorithms for continuous and discrete variables. As mentioned in the latter, I accept that I may have to find a function that assigns numeric values to the discrete states. However, I cannot use ROCK or COBWEB for clustering, only k-means.

Which functions can I use to convert the discrete values to continuous ones? Furthermore, is there any way to prioritize some attributes (say, clustering on Salary/Age is more important than on height), or should I just revamp the whole approach?

If k-means is not a strict requirement, then you should also look at clustering algorithms like HDBSCAN, DBSCAN etc. I hope you have looked [here](http://scikit-learn.org/stable/modules/clustering.html). — jar, Oct 18 '18 at 06:30

Keval Dave · Answer 1 · 2018-10-18T06:48:41.830


The k-means algorithm clusters data points with continuous features.

One way to convert discrete features into continuous ones is one-hot encoding. This converts a categorical feature such as the company name into a numerical array. You can see the documentation here.

You also need to normalize every feature to bring them all into the same range, say 0 to 1. To give more importance to some features, give those features a larger range.
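As a rough sketch of the three steps above (one-hot encoding, scaling to 0..1, then stretching the important features), here is a plain-Python version. The toy records, the weights, and the helper names `one_hot` and `min_max` are made up for illustration; in practice a library encoder and scaler would do the same job.

```python
def one_hot(values):
    """Map each category to a 0/1 indicator vector (columns in sorted order)."""
    categories = sorted(set(values))
    encoded = [[1.0 if v == c else 0.0 for c in categories] for v in values]
    return encoded, categories

def min_max(values):
    """Rescale numbers linearly to [0, 1]; a constant column maps to 0.0."""
    lo, hi = min(values), max(values)
    span = hi - lo
    return [(v - lo) / span if span else 0.0 for v in values]

# Hypothetical toy records, loosely following the question's example.
ages = [30, 45, 22]
salaries = [5000, 12000, 3000]
companies = ["Testing Inc.", "Acme", "Testing Inc."]

# Hypothetical weights: a weight w stretches a feature's range to [0, w].
AGE_WEIGHT, SALARY_WEIGHT = 1.0, 2.0

age_col = [AGE_WEIGHT * a for a in min_max(ages)]
salary_col = [SALARY_WEIGHT * s for s in min_max(salaries)]
company_cols, company_names = one_hot(companies)

# One numeric row per person: [age, salary, company indicators...]
rows = [[a, s] + c for a, s, c in zip(age_col, salary_col, company_cols)]
for row in rows:
    print(row)
```

Because k-means minimizes squared distances, a feature stretched by a weight w contributes w² times as much to the objective, so doubling the salary range makes salary differences count four times as much.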

edited Oct 18 '18 at 06:48

answered Oct 18 '18 at 06:24


This answer is the right one, you should use the one hot encoding. – Daneel R. Oct 18 '18 at 08:26
No, one hot encoding is a hack that does *not* work well for k-means. While it makes the code run, it does more harm than good for the quality of the result. – Has QUIT--Anony-Mousse Oct 19 '18 at 08:31

Has QUIT--Anony-Mousse · Accepted Answer · 2018-10-19T08:41:17.563

Don't use k-means on such data!

K-means is built around three important assumptions:

1. The mean of each attribute is representative of the data
2. The squared deviations are to be minimized
3. All attributes are equally important

These assumptions in k-means imply that you should only use it on interval scale variables (1), that are not skewed (2), and that have comparable value domains (don't mix different units / scales; such as salary, age and height) (3).

One-hot encoding of categories does not make them interval scaled. If you just cast the data into some ℝ^p vector space, you will get "some output", but it is not good in any objective way. You answer the wrong question, because you did not bother to formulate the question in the first place.
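To make the point about one-hot encoded categories concrete, here is a small sketch with a made-up cluster of four people and three illustrative company names. It shows what k-means computes as the "mean" of a categorical attribute, and what distances it sees between categories.

```python
# Hypothetical cluster of four people; company is one-hot encoded as
# [Acme, Initech, Testing Inc.] (illustrative names only).
cluster = [
    [1, 0, 0],
    [1, 0, 0],
    [0, 1, 0],
    [0, 0, 1],
]

# The k-means centroid is the per-column mean of the cluster members.
centroid = [sum(col) / len(cluster) for col in zip(*cluster)]
print(centroid)  # [0.5, 0.25, 0.25]

def sq_dist(a, b):
    """Squared Euclidean distance, the quantity k-means minimizes."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

# Any two different categories are exactly the same distance apart.
print(sq_dist([1, 0, 0], [0, 1, 0]))  # 2
print(sq_dist([1, 0, 0], [0, 0, 1]))  # 2
```

The centroid is a vector of category frequencies, not a company, and every pair of distinct companies is equally far apart, so the encoding gives the numbers no interval structure for the mean to summarize.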

If you are lucky there is a single attribute (in your case probably salary) that dominates the result, and all the others do not affect the result anyway...
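A quick way to see this dominance effect is to compute squared distances on unscaled attributes. The three people below are invented for illustration, with age in years, height in cm, and salary in USD as assumed units.

```python
def sq_dist(a, b):
    """Squared Euclidean distance, the quantity k-means minimizes."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

# Hypothetical people as (age, height in cm, salary in USD).
alice = (30, 178, 5000)
bob = (65, 150, 5100)    # very different age and height, similar salary
carol = (31, 177, 6000)  # nearly identical age and height, different salary

d_bob = sq_dist(alice, bob)
d_carol = sq_dist(alice, carol)
print(d_bob)    # 12009
print(d_carol)  # 1000002

# Share of the alice-carol distance that comes from salary alone.
salary_share = (alice[2] - carol[2]) ** 2 / d_carol
print(round(salary_share, 6))  # 0.999998
```

On this toy data alice ends up far closer to bob than to carol, purely because salary has the largest numeric range; age and height barely register.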
