I want to use the k-prototypes algorithm (a k-means-style clustering algorithm designed for mixed data: numerical and categorical) for a clustering problem.
The algorithm handles categorical values directly, so I don't need to encode them as numerical values.
My question is: do we need to standardize the numerical columns before applying k-prototypes? For example, I have the following columns: age (float), salary (float), gender (object), city (object), profession (object).
Do I need to apply standardization like this?

from sklearn.preprocessing import StandardScaler

# Scale only the numerical columns; the categorical columns stay untouched.
scaled_X = StandardScaler().fit_transform(X[['salary', 'age']])
X[['salary', 'age']] = scaled_X


But I think that standardization has no value if it is not applied to all columns, because its goal is to put all variables on the same scale, not just some of them. So in this case, we don't need to apply it!
I hope I explained my question well. Thank you.
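
Edit: here is a minimal sketch of the workflow I have in mind, assuming the KPrototypes implementation from the kmodes package; the data frame below is made up just for illustration:

import pandas as pd
from sklearn.preprocessing import StandardScaler
from kmodes.kprototypes import KPrototypes

# Made-up example data with the same column layout as in the question.
X = pd.DataFrame({
    'age': [25.0, 40.0, 31.0, 58.0, 36.0, 29.0],
    'salary': [30000.0, 72000.0, 45000.0, 90000.0, 52000.0, 38000.0],
    'gender': ['F', 'M', 'F', 'M', 'F', 'M'],
    'city': ['Paris', 'Lyon', 'Paris', 'Nice', 'Lyon', 'Paris'],
    'profession': ['nurse', 'engineer', 'teacher', 'lawyer', 'engineer', 'nurse'],
})

# Standardize only the numerical columns; the categorical ones are left as-is.
X[['salary', 'age']] = StandardScaler().fit_transform(X[['salary', 'age']])

# Fit k-prototypes, telling it which column positions are categorical
# (the gamma hyperparameter, left at its default here, weights the
# categorical part of the distance).
kproto = KPrototypes(n_clusters=2, init='Cao', n_init=5, random_state=42)
clusters = kproto.fit_predict(X.values, categorical=[2, 3, 4])
print(clusters)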

  • More of a question for the Cross Validated Stack Exchange. The short answer is "it depends on the problem". k-Prototypes has a hyperparameter controlling how much the categorical features matter, so I wouldn't say the categorical columns go unscaled. – David Eisenstat Aug 02 '22 at 12:18
  • @DavidEisenstat Can we standardize just the numerical columns and leave the categorical ones as they are? Would standardization still be useful in this case? – anotherUser Aug 02 '22 at 13:56
  • It could be. In essence by standardizing you're saying that all of the columns should have the same weight, deservedly or not. It's not an unreasonable assumption if you don't know anything about your data, but I bet you can often do better in practice. – David Eisenstat Aug 03 '22 at 17:06
  • I had to look up k-prototypes because I wasn't familiar with it, and, if you squint, it's basically one-hot encoding the categorical features (with a scaling hyperparameter) and running a k-means/k-medians type algorithm. Not groundbreaking research IMO. – David Eisenstat Aug 03 '22 at 17:08
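
To illustrate the comments above, here is a rough sketch of the per-point cost k-prototypes minimizes: squared Euclidean distance on the numeric columns plus gamma times the number of categorical mismatches. The helper name and values are hypothetical; this is not the kmodes implementation.

# Hypothetical helper, for illustration only.
def kprototypes_distance(x_num, x_cat, proto_num, proto_cat, gamma):
    # Squared Euclidean distance over the numeric features.
    numeric_part = sum((a - b) ** 2 for a, b in zip(x_num, proto_num))
    # Simple matching dissimilarity: count of categorical mismatches.
    categorical_part = sum(1 for a, b in zip(x_cat, proto_cat) if a != b)
    return numeric_part + gamma * categorical_part

# With standardized numeric columns, gamma controls how much one categorical
# mismatch "costs" relative to differences measured in standard deviations.
d = kprototypes_distance([0.4, -1.2], ['Paris', 'nurse'],
                         [0.1, -0.8], ['Paris', 'engineer'], gamma=0.5)
print(d)  # approximately 0.09 + 0.16 + 0.5 * 1 = 0.75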

0 Answers