I have a dataset that has 10 different groups and weekly sales (the first 3 weeks of the year are shown below). I am trying to run a clustering algorithm that clusters the items within each group, with the number of clusters depending on how many items are present in that group. Basically, I want to treat each group differently.
I tried a manual approach where I set the number of clusters for each group relative to the group with the highest number of items, but I want the code to find the optimal k for k-means in each group automatically. I am not sure how to determine the best k for each of the groups; a simplified sketch of the manual approach is below.
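To make the manual approach concrete, it was roughly something like this (a simplified sketch, not my exact code; max_k and the proportional rounding rule here are illustrative):

    # Illustrative sketch of the manual approach: scale each group's k
    # by its item count relative to the largest group.
    max_k = 10  # assumed k for the biggest group, illustrative only
    group_sizes = item_df.groupby('LOB_LABEL').size()
    manual_ks = ((group_sizes / group_sizes.max()) * max_k).round().astype(int).clip(lower=2)
    print(manual_ks)  # one k per LOB_LABEL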
Here is the data:
items_scaled =
PROD    LOB_LABEL         1         2         3
100001  Books      0.022556  0.020326  0.020556
100002  Books      0.023756  0.080306  0.020656
100003  Candles    0.022966  0.020178  0.020291
100004  Shoes      0.021067  0.020485  0.019420
100005  Candles    0.020403  0.021067  0.020556
100006  Shoes      0.023634  0.026219  0.029357
100007  Books      0.022472  0.017218  0.016454
100008  Pens       0.023670  0.027971  0.029763
100009  Pens       0.037894  0.026664  0.031777
100010  Shoes      0.015929  0.015205  0.015446
....
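If it helps to reproduce, the sample above can be rebuilt like this (only the three weeks shown here; the real data has more week columns, which my code slices with iloc[:, 2:54]):

    import pandas as pd

    # Minimal reconstruction of the sample rows shown above.
    item_df = pd.DataFrame({
        'PROD': [100001, 100002, 100003, 100004, 100005,
                 100006, 100007, 100008, 100009, 100010],
        'LOB_LABEL': ['Books', 'Books', 'Candles', 'Shoes', 'Candles',
                      'Shoes', 'Books', 'Pens', 'Pens', 'Shoes'],
        1: [0.022556, 0.023756, 0.022966, 0.021067, 0.020403,
            0.023634, 0.022472, 0.023670, 0.037894, 0.015929],
        2: [0.020326, 0.080306, 0.020178, 0.020485, 0.021067,
            0.026219, 0.017218, 0.027971, 0.026664, 0.015205],
        3: [0.020556, 0.020656, 0.020291, 0.019420, 0.020556,
            0.029357, 0.016454, 0.029763, 0.031777, 0.015446],
    })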
Here is my attempt at finding the best k, but the runtime is really high; at this rate it might take a day or more to finish. I am working with about 3,500 rows of data. Is there a better/more optimal way to achieve my result?
import pandas as pd
from tqdm import tqdm
from tslearn.clustering import TimeSeriesKMeans, silhouette_score

for lob in item_df['LOB_LABEL'].unique():
    items = item_df[item_df['LOB_LABEL'] == lob]
    items_scaled = items.iloc[:, 2:54]  # the weekly sales columns
    K = range(2, len(items))  # candidate cluster counts for this group
    silhouette = []  # reset the scores for each group
    for k in tqdm(K):
        kmeanModel = TimeSeriesKMeans(n_clusters=k, metric="dtw",
                                      n_jobs=6, max_iter=10, n_init=5)
        kmeanModel.fit(items_scaled)
        silhouette.append(silhouette_score(items_scaled, kmeanModel.labels_,
                                           metric="dtw"))
    # best k for this group = the k with the highest silhouette score
    best_k = K[silhouette.index(max(silhouette))]
    print(lob, max(silhouette), best_k)