I would like to know how I can cluster a multivariate dataset using K-means. Each sample in this dataset corresponds to a Person (I have 6000 people), and each Person has both continuous and discrete attributes (10 attributes/Person). An example:

  • person_id: 1234
  • name: "John Doe"
  • age: 30
  • height: '5 ft 10 in'
  • salary_value: 5000
  • Salary_currency: USD
  • is_customer: False
  • Company: "Testing Inc."
  • ...

I have read an existing answer on multidimensional k-means clustering, but the attributes in the dataset there are all continuous. An even more helpful read was a post about clustering algorithms for continuous and discrete variables. As mentioned in the latter, I accept that I may have to find a function that assigns values to discrete states. However, I cannot use ROCK or COBWEB for clustering, only k-means.

Which functions can I use to convert the discrete values to continuous ones? Also, is there a way to prioritize certain attributes (say, clustering on salary and age matters more than height), or should I just revamp the whole approach?

Mojtaba Ahmadi
darthbhyrava
  • If k-means is not a strict requirement, then you should also look at clustering algorithms like HDBSCAN, DBSCAN etc. I hope you have looked [here](http://scikit-learn.org/stable/modules/clustering.html). – jar Oct 18 '18 at 06:30

2 Answers

The k-means algorithm clusters data points that have continuous features.

The usual way to convert discrete features into continuous ones is one-hot encoding. This converts a categorical feature such as the company name into a numerical array. You can see the documentation here.

You also need to normalize every feature to bring them all into the same range, say 0 to 1. To give some features more importance, scale them to a larger range than the others.
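The three steps above (one-hot encode, normalize, widen the range of important features) could be sketched with scikit-learn roughly as follows. This is a minimal sketch, not a definitive recipe: the toy values, the `height_cm` column (height already converted to a number), the extra company names, the weight of 3.0, and `n_clusters=2` are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder

# Toy data: column names follow the question, values are invented.
people = pd.DataFrame({
    "age": [30, 45, 23, 52, 37, 29],
    "height_cm": [178, 165, 182, 170, 160, 175],
    "salary_value": [5000, 7200, 3100, 9800, 6400, 4500],
    "is_customer": [False, True, False, True, True, False],
    "Company": ["Testing Inc.", "Acme", "Acme", "Testing Inc.", "Globex", "Globex"],
})

numeric_cols = ["age", "height_cm", "salary_value"]
categorical_cols = ["is_customer", "Company"]

# Scale numeric columns to [0, 1] and one-hot encode the discrete ones.
preprocess = ColumnTransformer(
    [
        ("num", MinMaxScaler(), numeric_cols),
        ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_cols),
    ],
    sparse_threshold=0,  # always return a dense array
)
X = preprocess.fit_transform(people)

# Numeric columns come first in the output, in the order of numeric_cols.
# Hypothetical weights: stretch age and salary to the range [0, 3].
weights = np.ones(X.shape[1])
weights[numeric_cols.index("age")] = 3.0
weights[numeric_cols.index("salary_value")] = 3.0
X_weighted = X * weights

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = kmeans.fit_predict(X_weighted)
print(labels)
```

Because k-means minimizes squared distances, multiplying a column by a weight w multiplies its contribution to the objective by w squared, so moderate weights already have a strong effect.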

Keval Dave

Don't use k-means on such data!

K-means is built around three important assumptions:

  1. The mean of each attribute is representative of the data
  2. The squared deviations are to be minimized
  3. All attributes are equally important

These assumptions in k-means imply that you should only use it on interval scale variables (1), that are not skewed (2), and that have comparable value domains (don't mix different units / scales; such as salary, age and height) (3).

One-hot encoding of categories does not make them interval scaled. If you just cast the data into some ℝ^p vector space, you will get "some output", but it is not good in any objective way. You answer the wrong question, because you did not bother to formulate the question in the first place.
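A small NumPy sketch may help illustrate the point about one-hot vectors. The three company names are placeholders (only "Testing Inc." appears in the question). Every pair of distinct categories ends up at exactly the same distance, and the mean that k-means would compute is a fractional vector that corresponds to no actual company.

```python
import numpy as np

# Hypothetical category list; only "Testing Inc." comes from the question.
companies = ["Acme", "Globex", "Testing Inc."]

# One row per company: the one-hot vectors are the rows of an identity matrix.
one_hot = np.eye(len(companies))

# Pairwise Euclidean distances between the one-hot vectors.
dists = np.linalg.norm(one_hot[:, None, :] - one_hot[None, :, :], axis=2)

# The "mean company" of one person from each company.
centroid = one_hot.mean(axis=0)

print(dists)
print(centroid)
```

All off-diagonal distances equal the square root of 2, so the encoding carries no notion of one company being closer to another, and the centroid has the value 1/3 in every position.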

If you are lucky there is a single attribute (in your case probably salary) that dominates the result, and all the others do not affect the result anyway...
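To see why salary would likely dominate on unscaled data, here is a minimal sketch comparing two invented people. The attribute values and units (years, centimetres, USD) are assumptions made only for illustration; the code computes how much each attribute contributes to the squared Euclidean distance that k-means minimizes.

```python
import numpy as np

# Columns: age (years), height (cm), salary (USD). Values are invented.
a = np.array([30.0, 178.0, 5000.0])
b = np.array([55.0, 160.0, 5200.0])

# Per-attribute contribution to the squared Euclidean distance.
sq_diff = (a - b) ** 2
share = sq_diff / sq_diff.sum()

for name, s in zip(["age", "height", "salary"], share):
    print(f"{name}: {s:.3f}")
```

Even though the two people differ by 25 years in age and only 200 in salary, salary accounts for about 98% of the squared distance, simply because its numeric scale is larger.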

Has QUIT--Anony-Mousse