I'm using KBinsDiscretizer to cluster my data into four categories with the kmeans strategy, as follows. The goal is to have 4 clusters based on the value of avg_error. The code works properly and returns 4 clusters:
0: very low error rate,
1: low error rate,
2: high error rate, and
3: very high error rate.
The number of data points in the last two clusters (2: high error rate and 3: very high error rate) is very low. I need a way to influence the result so that more data points are assigned to these two clusters. Is this possible, and if so, how?
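import pandas as pd
from sklearn.preprocessing import KBinsDiscretizer

enc = KBinsDiscretizer(n_bins=4, encode='ordinal', strategy="kmeans")
parts = []
for name, group in df.groupby('day'):
    group = group.copy()  # work on a copy so the assignment below is safe
    # Refit the bins on each day's avg_error values; ravel() flattens the (n, 1) result
    group["cluster"] = enc.fit_transform(group.avg_error.values.reshape(-1, 1)).ravel()
    parts.append(group)
clustered = pd.concat(parts)  # DataFrame.append was removed in pandas 2.0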
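
For reference, here is a minimal, self-contained sketch of how the per-cluster counts can be inspected. The data are synthetic (a right-skewed lognormal sample standing in for avg_error, not my real data), and the helper name bin_counts is made up for this example. It also runs strategy="quantile" purely as a point of comparison, since that strategy places the bin edges at quantiles and therefore gives roughly equal counts per bin; whether that trade-off is acceptable depends on the use case.

```python
import numpy as np
from sklearn.preprocessing import KBinsDiscretizer


def bin_counts(values, strategy, n_bins=4):
    """Return the number of points that fall into each bin (hypothetical helper)."""
    enc = KBinsDiscretizer(n_bins=n_bins, encode="ordinal", strategy=strategy)
    # KBinsDiscretizer expects a 2D array of shape (n_samples, n_features)
    labels = enc.fit_transform(np.asarray(values, dtype=float).reshape(-1, 1))
    return np.bincount(labels.ravel().astype(int), minlength=n_bins)


# Synthetic, right-skewed stand-in for avg_error (an assumption, not real data)
rng = np.random.default_rng(0)
avg_error = rng.lognormal(mean=0.0, sigma=1.0, size=1000)

print("kmeans  :", bin_counts(avg_error, "kmeans"))
print("quantile:", bin_counts(avg_error, "quantile"))
```

Printing the counts for each day's group in the same way should show how unbalanced the kmeans bins are compared with the quantile ones.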