I have a dataset that has 10 different groups and weekly sales (the first 3 weeks of the year are shown below). I am trying to run a clustering algorithm that clusters the items within each group, with the number of clusters depending on how many items are present in that group. Basically, I want to treat each group differently.
I tried a manual approach where I set the number of clusters for each group relative to the group with the highest number of items, but I want the code to find the optimal k for k-means in each group automatically. I am not sure how to determine the best k for each of the groups; a simplified sketch of the manual approach is below.
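To make the manual approach concrete, it was roughly something like this (a simplified sketch, not my exact code; max_k and the proportional rounding rule here are illustrative):

    # Illustrative sketch of the manual approach: scale each group's k
    # by its item count relative to the largest group.
    max_k = 10  # assumed k for the biggest group, illustrative only
    group_sizes = item_df.groupby('LOB_LABEL').size()
    manual_ks = ((group_sizes / group_sizes.max()) * max_k).round().astype(int).clip(lower=2)
    print(manual_ks)  # one k per LOB_LABEL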
Here is the data:
items_scaled =
PROD    LOB_LABEL         1         2         3
100001  Books      0.022556  0.020326  0.020556
100002  Books      0.023756  0.080306  0.020656
100003  Candles    0.022966  0.020178  0.020291
100004  Shoes      0.021067  0.020485  0.019420
100005  Candles    0.020403  0.021067  0.020556
100006  Shoes      0.023634  0.026219  0.029357
100007  Books      0.022472  0.017218  0.016454
100008  Pens       0.023670  0.027971  0.029763
100009  Pens       0.037894  0.026664  0.031777
100010  Shoes      0.015929  0.015205  0.015446
....
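If it helps to reproduce, the sample above can be rebuilt like this (only the three weeks shown here; the real data has more week columns, which my code slices with iloc[:, 2:54]):

    import pandas as pd

    # Minimal reconstruction of the sample rows shown above.
    item_df = pd.DataFrame({
        'PROD': [100001, 100002, 100003, 100004, 100005,
                 100006, 100007, 100008, 100009, 100010],
        'LOB_LABEL': ['Books', 'Books', 'Candles', 'Shoes', 'Candles',
                      'Shoes', 'Books', 'Pens', 'Pens', 'Shoes'],
        1: [0.022556, 0.023756, 0.022966, 0.021067, 0.020403,
            0.023634, 0.022472, 0.023670, 0.037894, 0.015929],
        2: [0.020326, 0.080306, 0.020178, 0.020485, 0.021067,
            0.026219, 0.017218, 0.027971, 0.026664, 0.015205],
        3: [0.020556, 0.020656, 0.020291, 0.019420, 0.020556,
            0.029357, 0.016454, 0.029763, 0.031777, 0.015446],
    })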
Here is my attempt at finding the best k, but the runtime is really high; at this rate it might take a day or more to finish. I am working with about 3,500 rows of data. Is there a better/more optimal way to achieve my result?
import pandas as pd
from tqdm import tqdm
from tslearn.clustering import TimeSeriesKMeans, silhouette_score

for lob in item_df['LOB_LABEL'].unique():
    items = item_df[item_df['LOB_LABEL'] == lob]
    items_scaled = items.iloc[:, 2:54]  # the weekly sales columns
    K = range(2, len(items))  # candidate cluster counts for this group
    silhouette = []  # reset the scores for each group
    for k in tqdm(K):
        kmeanModel = TimeSeriesKMeans(n_clusters=k, metric="dtw",
                                      n_jobs=6, max_iter=10, n_init=5)
        kmeanModel.fit(items_scaled)
        silhouette.append(silhouette_score(items_scaled, kmeanModel.labels_,
                                           metric="dtw"))
    # best k for this group = the k with the highest silhouette score
    best_k = K[silhouette.index(max(silhouette))]
    print(lob, max(silhouette), best_k)