1

I've a question regarding k-means clustering. We have a dataset with 120,000 observations and need to compute a k-means cluster solution in R. The problem is that k-means usually uses Euclidean distance. Our dataset consists of 3 continuous variables, 11 ordinal variables (Likert 0-5; I think it would be okay to handle them like continuous ones) and 5 binary variables. Do you have any suggestion for a distance measure that we can use for our k-means approach, given the "large" dataset? We are sticking to k-means, so I really hope one of you has a good idea.

Cheers,

Martin

  • 1
    Take a look at Gower distance. It is implemented in the `daisy` function available in the `cluster` package. – G5W Aug 21 '18 at 18:35
  • Thank you. I already did that, but I guess it is based on a proximity matrix and thus I get the error "Error: cannot allocate vector of size 54.2 Gb". I guess Gower does not work for large data, or did I do something wrong? –  Aug 21 '18 at 19:01
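The memory error comes from `daisy` building the full 120,000 × 120,000 dissimilarity matrix. A common workaround, sketched below, is to compute the Gower dissimilarities on a random subsample and cluster that with PAM; the data frame name `df`, the subsample size and the number of clusters are assumptions, not taken from the question.

```r
## Sketch: Gower distance on a subsample so the dissimilarity matrix stays small
library(cluster)

set.seed(1)
idx     <- sample(nrow(df), 5000)             # df: your full data frame (hypothetical name)
gower_d <- daisy(df[idx, ], metric = "gower")

## k-medoids (PAM) on the precomputed Gower dissimilarities
fit <- pam(gower_d, k = 4, diss = TRUE)
table(fit$clustering)                         # cluster sizes on the subsample
```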

4 Answers

0

One approach would be to normalize the features and then just use the 19-dimensional Euclidean distance. Cast the binary values to 0/1 (well, it's R, so it does that anyway) and go from there.

I don't see an immediate problem with this method, other than that k-means in 19 dimensions will definitely be hard to interpret. You could try a dimensionality reduction technique to hopefully make the k-means output easier to read, but you know far more about the data set than we ever could, so our ability to help you is limited.
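A minimal sketch of this approach; the data frame name `df` and the number of clusters are assumptions.

```r
## Coerce everything to numeric (factors become integer codes), normalize,
## and run plain Euclidean k-means
X <- data.matrix(df)   # df: hypothetical data frame with all 19 variables
X <- scale(X)          # z-score so no single variable dominates the distance

set.seed(42)
fit <- kmeans(X, centers = 4, nstart = 25)
fit$size               # cluster sizes
```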

cyniikal
  • 134
  • 1
  • 6
0

You can certainly encode these binary variables as 0/1, too.

It is best practice in statistics not to treat Likert scale variables as numeric, because of their uneven distribution.

But I don't think you will get meaningful k-means clusters. That algorithm is all about computing means, which makes sense on continuous variables. Discrete variables usually lack the "resolution" for this to work well. The mean then degrades to a "frequency", and such data should be handled very differently.

Do not choose the problem by the hammer. Maybe your data is not a nail; and even if you would like to make it work with k-means, it won't solve your problem... Instead, formulate your problem, then choose the right tool. So given your data, what is a good cluster? Until you have an equation that measures this, hammering the data won't solve anything.

Has QUIT--Anony-Mousse
  • 76,138
  • 12
  • 138
  • 194
0

Encoding the variables as binary will not solve the underlying problem; it will only increase the data dimensionality, which is an added burden. It is best practice in statistics not to convert the original data into another form, e.g. continuous to categorical or vice versa. If you do make such a conversion, it must be in line with the question you are trying to answer, and you must provide a valid justification for it.

Continuing further, as others have stated, try to reduce the dimensionality of the dataset first. Check for issues such as missing values, outliers and zero-variance variables, and consider principal component analysis (for the continuous variables) or correspondence analysis (for the categorical variables). This can help you reduce the dimensionality. After all, data preprocessing tasks constitute about 80% of an analysis.
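A rough sketch of those checks in base R; `df` and the continuous column names are hypothetical.

```r
## Quick preprocessing checks before any clustering
colSums(is.na(df))                                           # missing values per variable
sapply(df, function(col) var(as.numeric(col), na.rm = TRUE)) # spot (near-)zero-variance variables

## PCA on the three continuous variables only
cont <- df[, c("x1", "x2", "x3")]                            # hypothetical column names
pca  <- prcomp(cont, center = TRUE, scale. = TRUE)
summary(pca)                                                 # variance explained per component
```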

Regarding the distance measure for mixed data types: you do understand that the mean in k-means only works for continuous variables, so I do not see the logic of using k-means on mixed data types. Consider choosing another algorithm such as k-modes. k-modes is an extension of k-means. Instead of distances it uses dissimilarities (that is, a quantification of the total mismatches between two objects: the smaller this number, the more similar the two objects). And instead of means, it uses modes. A mode is a vector of elements that minimizes the dissimilarities between the vector itself and each object in the data.
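A minimal k-modes sketch, assuming the `klaR` package and a hypothetical data frame `df_cat` in which every variable has been recoded as categorical (the Likert levels and the binary codes as factors):

```r
## k-modes clustering on purely categorical data
library(klaR)

set.seed(1)
fit <- kmodes(df_cat, modes = 4, iter.max = 10)
fit$cluster   # cluster assignment per observation
fit$modes     # most frequent level of each variable within each cluster
```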

0

Mixture models can be used to cluster mixed data.

You can use the R package VarSelLCM, which models, within each cluster, the continuous variables with Gaussian distributions and the ordinal/binary variables with multinomial distributions.

Moreover, missing values can be managed by the model at hand.

A tutorial is available at: http://varsellcm.r-forge.r-project.org/
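A minimal sketch following that tutorial; the data frame name `df` and the range of cluster numbers are assumptions (continuous variables stored as numerics, ordinal/binary ones as factors):

```r
## Model-based clustering of mixed data with VarSelLCM
library(VarSelLCM)

res <- VarSelCluster(df, gvals = 2:6, nbcores = 2)  # tries 2 to 6 clusters
summary(res)
head(fitted(res))   # cluster membership for each observation
```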

user200668
  • 21
  • 2