K-means clustering on data set with mixed data using Scikit-learn

Question

I am experimenting with machine learning algorithms and have a fairly large data set containing both numerical and categorical data. I followed this post, http://www.ritchieng.com/machinelearning-one-hot-encoding/, to encode the categorical features as numerical ones.

For instance, I want to try K-means clustering on the whole data set. I am not sure how to combine the encoded data array I now have with the original data frame in order to run machine learning algorithms on it. I would really appreciate an example.

score 3 · Accepted Answer · answered Feb 21 '18 at 16:54

I suppose that you have one-hot-encoded your data. Before using K-means clustering, it is important to rescale your data: K-means relies on distances between rows, so numerical features with large ranges may otherwise dominate the clustering. The sklearn.preprocessing module provides several rescalers (the best known are MinMaxScaler and StandardScaler).
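As a minimal sketch of the rescaling step, the snippet below applies both scalers to a small made-up array. The array X_num and its values (an age-like and an income-like column) are assumptions for illustration only, not data from the question.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Hypothetical numeric features on very different scales (age, income).
X_num = np.array([[25.0, 40000.0],
                  [35.0, 60000.0],
                  [45.0, 80000.0]])

# StandardScaler: each column ends up with zero mean and unit variance.
X_std = StandardScaler().fit_transform(X_num)

# MinMaxScaler: each column is mapped onto the range [0, 1].
X_minmax = MinMaxScaler().fit_transform(X_num)
```

Without rescaling, the income-like column would contribute almost all of the distance between two rows. Which scaler suits a given data set better is a judgment call, so it may be worth trying both.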

After that you can use KMeans from sklearn. In general the steps are the following:

You import KMeans:

from sklearn.cluster import KMeans

You instantiate a KMeans object, specifying at least the number of clusters (here I arbitrarily chose 8):

kmeans = KMeans(n_clusters = 8)

Then you fit the object with the data (here my data is named X):

kmeans.fit(X)

After that you can see the cluster assigned to each row using .labels_:

kmeans.labels_

You may also predict the cluster for new, unseen data (named, let's say, new_X) using .predict:

kmeans.predict(new_X)
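Put together, the steps above might look like the following sketch. The data here is synthetic (two well-separated blobs, generated only for illustration), and n_clusters=2, n_init=10 and random_state=0 are assumed settings chosen to make the toy example reproducible, not recommendations for the real data set.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)

# Hypothetical data, standing in for an already encoded and rescaled X:
# 20 points near (0, 0) followed by 20 points near (5, 5).
X = np.vstack([rng.normal(0.0, 0.1, size=(20, 2)),
               rng.normal(5.0, 0.1, size=(20, 2))])

# Instantiate and fit the model.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
kmeans.fit(X)

# One cluster index per row of X.
labels = kmeans.labels_

# Assign new, unseen rows to the nearest learned cluster centre.
new_X = np.array([[0.1, -0.1], [4.9, 5.2]])
new_labels = kmeans.predict(new_X)
```

The cluster indices themselves are arbitrary (which blob gets 0 and which gets 1 can vary), so only the grouping should be interpreted, not the numbers.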

Thanks @elkoul for pointing out rescaling. What I am mostly wondering is how to use this one-hot encoded data with the normal data frame, i.e. how to append this big array of 0s and 1s to the data frame that contains floating point numbers. — moirK, Feb 22 '18 at 08:55
One easy way to do one hot encoding is to use `pandas.get_dummies`. You may have a look [here](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.get_dummies.html). — ekoulier, Feb 23 '18 at 09:08
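A rough sketch of both routes follows: letting `pandas.get_dummies` encode the categorical column inside the data frame, or appending an already computed 0/1 array with `pandas.concat`. The frame df, its column names and the array onehot are hypothetical placeholders, not taken from the question.

```python
import numpy as np
import pandas as pd

# Hypothetical mixed data frame: two float columns, one categorical column.
df = pd.DataFrame({
    "height": [1.62, 1.75, 1.80, 1.68],
    "weight": [55.0, 72.5, 90.1, 60.3],
    "color": ["red", "blue", "red", "green"],
})

# Route 1: get_dummies keeps the numeric columns unchanged and replaces
# "color" with one 0/1 indicator column per category.
encoded = pd.get_dummies(df, columns=["color"], dtype=float)

# Route 2: a one-hot array that already exists (hypothetical values here,
# columns ordered blue, green, red) is wrapped in a DataFrame that shares
# the original index, then joined column-wise to the numeric columns.
onehot = np.array([[0, 0, 1],
                   [1, 0, 0],
                   [0, 0, 1],
                   [0, 1, 0]], dtype=float)
onehot_df = pd.DataFrame(
    onehot,
    columns=["color_blue", "color_green", "color_red"],
    index=df.index,
)
combined = pd.concat([df[["height", "weight"]], onehot_df], axis=1)
```

Reusing df.index in route 2 matters: `pandas.concat` with axis=1 aligns rows by index, so a mismatched index would produce rows filled with NaN. Either resulting frame can then be rescaled and passed to KMeans.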
