Encoding the variables to binary will not solve the underlying problem. Rather, it will only aid in increasing the data dimensionality, an added burden. It's best practice in statistics to not alter the original data to any other form like continuous to categorical or vice versa. However, if you are doing so, i.e. the data conversion then it must be in sync with the question to solve as well as you must provide valid justification.
Continuing further, as others have stated, try to reduce the dimensionality of the dataset first. Check for issues like, missing values
, outliers
, zero variance
, principal component analysis (continuous variables)
, correspondence analysis (for categorical variables)
etc. This can help you reduce the dimensionality. After all, data preprocessing tasks constitute 80%
of analysis.
Regarding the distance measure for mixed data type, you do understand the mean
in k
will work only for continuous
variable. So, I do not understand the logic
of using the algorithm k-means
for mixed datatypes?
Consider choosing other algorithm like k-modes
. k-modes is an extension of k-means. Instead of distances it uses dissimilarities (that is, quantification of the total mismatches between two objects: the smaller this number, the more similar the two objects). And instead of means, it uses modes. A mode is a vector of elements that minimizes the dissimilarities between the vector itself and each object of the data.