I have taken this dataframe, train_df, and run it through PCA to reduce it to two dimensions. To generate the principal components I used:
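from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Standardise the features, then project onto the first two principal components
pca = PCA(n_components=2)
pipe = Pipeline([('scaler', StandardScaler()), ('pca', pca)])
train_pca = pipe.fit_transform(train_df)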
Next I wanted to run this through a KMeans clustering algorithm. To assess how many clusters to use, I plotted the following:
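import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

# Within-cluster sum of squares (inertia) for k = 1..19
wcss = []
r = range(1, 20)
for k in r:
    km = KMeans(n_clusters=k, init='k-means++', max_iter=300, n_init=10)
    km.fit(train_pca)
    wcss.append(km.inertia_)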
plt.plot(r, wcss)
plt.title('Within Cluster Sum of Squares')
plt.xlabel('Number of Clusters')
plt.ylabel('WCSS')
plt.show()
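For reference, the elbow of a WCSS curve like this can also be picked programmatically. The following is only a minimal sketch on synthetic blob data (the data, `ks`, `gap` and `elbow_k` are all assumptions for illustration), and it uses a simple "largest gap below the chord" heuristic, which is similar in spirit to a knee finder but is not KneeLocator itself:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic 2-D data standing in for the PCA output (assumption, not the real data).
X, _ = make_blobs(n_samples=500, centers=4, cluster_std=0.6, random_state=0)

ks = np.arange(1, 11)
wcss = np.array(
    [KMeans(n_clusters=int(k), n_init=10, random_state=0).fit(X).inertia_ for k in ks]
)

# Rescale both axes to [0, 1] so the curve runs from (0, 1) to (1, 0).
x = (ks - ks[0]) / (ks[-1] - ks[0])
y = (wcss - wcss[-1]) / (wcss[0] - wcss[-1])

# The chord between the endpoints is y = 1 - x; the elbow is taken as the
# point where the curve lies furthest below that chord.
gap = (1 - x) - y
elbow_k = int(ks[np.argmax(gap)])
print("Suggested number of clusters:", elbow_k)
```

The heuristic only suggests a value; it is still worth checking the suggestion against the plot.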
I used KneeLocator to land on 5 clusters, then ran KMeans and plotted the result below:
labels = ['Cluster 1', 'Cluster 2', 'Cluster 3', 'Cluster 4', 'Cluster 5']
km = KMeans(n_clusters = 5, init = 'k-means++', max_iter = 300, n_init = 10)
pred_y = km.fit_predict(train_pca)
plt.figure(figsize = (12,10))
plot = plt.scatter(train_pca[:,0], train_pca[:,1], c = pred_y)
plt.scatter(km.cluster_centers_[:, 0], km.cluster_centers_[:, 1], s=300, c='red')
plt.xlabel('Principal Component 1')
plt.ylabel('Principal Component 2')
plt.title('Principal Components with Clusters')
plt.legend(handles=plot.legend_elements()[0], labels = labels)
plt.show()
Finally, I made a dataframe with the principal components and added back the date column:
whatever_df = pd.DataFrame(train_pca, columns=['PC1','PC2'])
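# .to_numpy() assigns by position, so a non-default index on train_df cannot misalign the dates
whatever_df['Date'] = train_df['SRCDate'].to_numpy()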
And I split it into train and test sets by date:
cutoff = whatever_df['Date'].max() - relativedelta(months=3)
train_df = whatever_df[whatever_df['Date'] <= cutoff]
test_df = whatever_df[whatever_df['Date'] > cutoff]
My question is: I would like to run these train/test dataframes through random forests, but grouped by the clusters generated by the KMeans algorithm. How could I do that? Also, the target variable I'd like to predict is CprTarget from the original dataframe, so would I need to add it back to the train and test sets for the Random Forest to work, since it is a supervised model?
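One possible way to set this up, as a minimal sketch rather than a definitive answer: keep the cluster label and the target alongside the principal components, split by date, and fit one random forest per cluster. A supervised model does need the target at fit time, so CprTarget has to be carried along. Everything below runs on synthetic data; `raw_df`, `model_df`, `train_part`, `test_part` and the feature names `f0`..`f5` are hypothetical, the target is assumed to be continuous (hence `RandomForestRegressor`; a classifier would be the analogue for a categorical target), and `pd.DateOffset` is used in place of `relativedelta` only to avoid an extra import.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the original dataframe (hypothetical columns).
rng = np.random.default_rng(0)
n = 600
raw_df = pd.DataFrame(rng.normal(size=(n, 6)), columns=[f"f{i}" for i in range(6)])
raw_df["SRCDate"] = pd.date_range("2020-01-01", periods=n, freq="D")
raw_df["CprTarget"] = 2.0 * raw_df["f0"] + rng.normal(scale=0.1, size=n)

# Only the feature columns go through scaling, PCA and clustering.
feature_cols = [c for c in raw_df.columns if c not in ("SRCDate", "CprTarget")]
pipe = Pipeline([("scaler", StandardScaler()), ("pca", PCA(n_components=2))])
pcs = pipe.fit_transform(raw_df[feature_cols])
km = KMeans(n_clusters=5, init="k-means++", max_iter=300, n_init=10, random_state=0)
clusters = km.fit_predict(pcs)

# Keep components, date, cluster label and target together in one frame.
model_df = pd.DataFrame(pcs, columns=["PC1", "PC2"])
model_df["Date"] = raw_df["SRCDate"].to_numpy()
model_df["Cluster"] = clusters
model_df["CprTarget"] = raw_df["CprTarget"].to_numpy()

# Hold out the last three months as the test set.
cutoff = model_df["Date"].max() - pd.DateOffset(months=3)
train_part = model_df[model_df["Date"] <= cutoff]
test_part = model_df[model_df["Date"] > cutoff]

# Fit one forest per cluster on the training rows of that cluster.
models = {}
for cluster_id, group in train_part.groupby("Cluster"):
    rf = RandomForestRegressor(n_estimators=100, random_state=0)
    rf.fit(group[["PC1", "PC2"]], group["CprTarget"])
    models[cluster_id] = rf

# Predict each test row with the forest belonging to its cluster.
predictions = pd.Series(np.nan, index=test_part.index, name="prediction")
for cluster_id, group in test_part.groupby("Cluster"):
    if cluster_id in models:  # a cluster absent from training has no model
        predictions.loc[group.index] = models[cluster_id].predict(group[["PC1", "PC2"]])
```

Note that this sketch, like the code in the question, fits the scaler, PCA and KMeans on all rows before splitting. If leakage from the test period is a concern, an alternative is to split first, fit `pipe` and `km` on the training rows only, and assign test clusters with `km.predict(pipe.transform(...))`.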