
I have taken this dataframe, train_df, and run it through PCA to reduce it down to two dimensions.

Dataframe

To generate the principal components I used:

from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
pca = PCA(n_components=2)
pipe = Pipeline([('scaler', StandardScaler()), ('pca', pca)])
train_pca = pipe.fit_transform(train_df)
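
Side note (not in the original snippet): since only two components are kept, it may be worth checking how much of the variance they actually retain — this comes up again in the comments below. A minimal check would be:

# share of the original variance captured by each of the two components
print(pipe.named_steps['pca'].explained_variance_ratio_)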

Next I wanted to run this through a KMeans Clustering algorithm. To assess how many clusters to use I plotted the following

from sklearn.cluster import KMeans
import matplotlib.pyplot as plt

# elbow method: within-cluster sum of squares (inertia) for k = 1..19
wcss = []
r = range(1, 20)
for k in r:
    km = KMeans(n_clusters=k, init='k-means++', max_iter=300, n_init=10)
    km.fit(train_pca)
    wcss.append(km.inertia_)

plt.plot(r, wcss)
plt.title('Within Cluster Sum of Squares')
plt.xlabel('Number of Clusters')
plt.ylabel('WCSS')
plt.show()

Cluster graph
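
I didn't include the KneeLocator code; a minimal sketch, assuming the kneed package and guessing at the exact arguments, would look like this:

from kneed import KneeLocator

# WCSS vs. k is a decreasing, convex curve, so look for the elbow in that direction
kl = KneeLocator(list(r), wcss, curve='convex', direction='decreasing')
print(kl.elbow)  # suggested number of clusters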

I used KneeLocator on this curve (roughly as sketched above) to land on 5 clusters, then ran KMeans and plotted the result below:

labels = ['Cluster 1', 'Cluster 2', 'Cluster 3', 'Cluster 4', 'Cluster 5']
km = KMeans(n_clusters=5, init='k-means++', max_iter=300, n_init=10)
pred_y = km.fit_predict(train_pca)  # cluster label for every row of train_pca

plt.figure(figsize=(12, 10))
plot = plt.scatter(train_pca[:, 0], train_pca[:, 1], c=pred_y)
plt.scatter(km.cluster_centers_[:, 0], km.cluster_centers_[:, 1], s=300, c='red')
plt.xlabel('Principal Component 1')
plt.ylabel('Principal Component 2')
plt.title('Principal Components with Clusters')
plt.legend(handles=plot.legend_elements()[0], labels=labels)
plt.show()

Kmeans clusters

Finally I made a dataframe with the Principal Components and added back the date column:

import pandas as pd

whatever_df = pd.DataFrame(train_pca, columns=['PC1', 'PC2'])
whatever_df['Date'] = train_df['SRCDate'].values  # positional copy avoids index-alignment surprises

PC DF

And I split it into train and test sets by date:

from dateutil.relativedelta import relativedelta
# note: this reassigns the name train_df, so the original train_df (with SRCDate, CprTarget, etc.) is no longer reachable under that name
train_df = whatever_df[whatever_df['Date'] <= max(whatever_df['Date']) - relativedelta(months=3)]
test_df = whatever_df[whatever_df['Date'] > max(whatever_df['Date']) - relativedelta(months=3)]

My question is: I would like to run these train/test dataframes through random forests, but grouped by the clusters generated by the KMeans algorithm. How could I do that? Also, the target variable I'd like to predict is CprTarget from the original dataframe, so would I need to add that back into the train and test sets for the random forest to work, since it is a supervised model?

  • You feed the target to the `fit` method. So, if your model object is called `model`, you do `model.fit(training_features, target)`. Don't merge it back into the training set, since it would then be treated as a training feature. The scikit-learn documentation has lots of examples of this. – Ignatius Reilly Aug 30 '22 at 16:41
  • Hey there Ignatius, thanks for your reply. Yes that is a better way to put it. I would like to train a separate model for each cluster. Or I suppose I could just add the cluster identity as an additional feature instead. How could I do that? Does it make sense to have a model whose features are just: PC1, PC2, Target, and Cluster Identity? – Hefe Aug 30 '22 at 16:45
  • Just add a column with the cluster identity for each row... then you decide whether you want to separate it in different DFs based on the cluster and train a model for each dataset, or keep them together... – Ignatius Reilly Aug 30 '22 at 16:49
  • Got it. How can I access cluster identity? – Hefe Aug 30 '22 at 16:50
  • Why are you using the principal components for training?? You are losing information, and it doesn't seem like you have so many dimensions as to justify doing so. Did you try to do the regression without the PCA first? – Ignatius Reilly Aug 30 '22 at 16:50
  • Great question. I did try doing a linear regression with all the features. That worked okay, but not excellent. This is more of a learning exercise. I may just try clustering without the PCA and then run regressions or Random Forests from there. – Hefe Aug 30 '22 at 16:52
  • You need to tell the training algorithm which columns are the training features and which one is the target, so you pass in two separate variables. You should have done this for the linear regression. – Ignatius Reilly Aug 30 '22 at 16:53
  • "How can I access cluster identity?" The same way you did it for the plot. – Ignatius Reilly Aug 30 '22 at 16:55
  • Yes sorry, I actually did do that in the regression. Left out the target variable in X_train/test, put it into y_train and test. Just had a brain fart. – Hefe Aug 30 '22 at 16:55
  • Solution is [here](https://stackoverflow.com/a/73548612/15975987) as the answer of another question I asked – Hefe Aug 31 '22 at 02:25
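
Pulling the suggestions from the comments together, here is a minimal sketch of the per-cluster random forest idea. The assumptions are mine, not from the post: the original, unsplit dataframe is still available (called original_df below, i.e. train_df before it was reassigned by the date split), CprTarget is numeric so RandomForestRegressor is appropriate, and the rows of train_pca line up positionally with original_df.

from sklearn.ensemble import RandomForestRegressor

# attach the cluster identity (from km.fit_predict) and the target to the component dataframe
whatever_df['Cluster'] = pred_y
whatever_df['CprTarget'] = original_df['CprTarget'].values  # original_df is a hypothetical name for the pre-split dataframe

# same date-based split as before
cutoff = max(whatever_df['Date']) - relativedelta(months=3)
train_part = whatever_df[whatever_df['Date'] <= cutoff]
test_part = whatever_df[whatever_df['Date'] > cutoff]

features = ['PC1', 'PC2']
models = {}
for cluster_id, grp in train_part.groupby('Cluster'):
    rf = RandomForestRegressor(n_estimators=100, random_state=42)
    rf.fit(grp[features], grp['CprTarget'])  # the target is passed separately to fit, not merged into the features
    models[cluster_id] = rf

# score each test row with the model trained on its own cluster
for cluster_id, grp in test_part.groupby('Cluster'):
    if cluster_id in models:  # a cluster could be absent from the training period
        preds = models[cluster_id].predict(grp[features])

The alternative mentioned in the comments — keeping everything in one dataframe and using the cluster identity as an extra feature — would just mean features = ['PC1', 'PC2', 'Cluster'] and a single fit on train_part.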
