I'm trying to find clusters in a data set using the K-means method. I got the number of clusters from the elbow method, but I don't know how to identify and separate these clusters for further analysis on each one, such as fitting a linear regression per cluster. My data set contains more than two variables.

Applying Kmeans

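import numpy as np
import matplotlib.pyplot as plt
from scipy.spatial.distance import cdist
from sklearn.cluster import KMeans

distortions = []
K = range(1, 10)
for k in K:
    kmeanModel = KMeans(n_clusters=k).fit(df)    # fit once per candidate k
    # Distortion: mean squared distance from each point to its nearest centroid
    distortions.append(sum(np.min(cdist(df, kmeanModel.cluster_centers_, 'euclidean'), axis=1) ** 2) / df.shape[0])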

Elbow method for number of clusters

plt.plot(K, distortions, 'bx-')
plt.xlabel('k')
plt.ylabel('Distortion')
plt.title('The Elbow Method showing the optimal k')
plt.show()
Bhavishya

1 Answer

Suppose the elbow method gave you k as the optimal number of clusters for your data.

You can then use the following code to assign each data point to a cluster:

from sklearn.cluster import KMeans

kmeans = KMeans(n_clusters=k, random_state=0).fit(df)
y = kmeans.labels_    # Cluster index (0 to k-1) for each row of df
y_pred = kmeans.predict(new_samples)    # Optional: new_samples is a placeholder for unseen data with the same columns as df

After that you can separate the data based on the clusters as:

for i in range(k):
    cluster_i = df[y == i]    # Rows assigned to cluster i (a boolean mask works for a pandas DataFrame or a NumPy array)
    # Do analysis on this subset of datapoints.
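
Since the question mentions linear regression on each cluster, here is a minimal sketch of how that per-cluster analysis could look. It is only an illustration under assumptions: the data is synthetic, the column names x1, x2 and y are hypothetical, k = 2 is assumed to come from your elbow plot, and clustering on the predictor columns only (not the target) is a design choice you may want to change.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.linear_model import LinearRegression

# Synthetic stand-in for df (assumption): two predictors and a target column "y".
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(10, 1, (50, 2))])
df = pd.DataFrame(X, columns=["x1", "x2"])
df["y"] = 2.0 * df["x1"] - 1.0 * df["x2"] + rng.normal(0, 0.1, len(df))

features = ["x1", "x2"]   # hypothetical predictor columns
k = 2                     # assumed to come from the elbow plot

# Cluster on the predictors only, then keep the label of every row.
kmeans = KMeans(n_clusters=k, n_init=10, random_state=0).fit(df[features])
labels = kmeans.labels_

# Fit one independent linear regression per cluster.
models = {}
for i in range(k):
    cluster_i = df[labels == i]    # rows assigned to cluster i
    models[i] = LinearRegression().fit(cluster_i[features], cluster_i["y"])
    print(i, len(cluster_i), models[i].coef_, models[i].intercept_)
```

For a new sample you would first call kmeans.predict on its predictor columns to get the cluster index, and then use the model stored under that index in models.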

You can find more details related to different parameters in this link: https://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html

ranka47