
I'm using KBinsDiscretizer to cluster my data into four categories with the kmeans strategy, as follows. The goal is to have 4 clusters based on the value of avg_error. The code works properly and returns 4 clusters:

0: very low error rate,

1: low error rate,

2: high error rate, and

3: very high error rate.

The number of data points in the last two clusters (2: high error rate and 3: very high error rate) is very low. I need a way to influence the results so that more data points are assigned to these two clusters. Is that possible, and if so, how?

import pandas as pd
from sklearn.preprocessing import KBinsDiscretizer

enc = KBinsDiscretizer(n_bins=4, encode='ordinal', strategy="kmeans")

# bin avg_error into 4 ordinal clusters separately for each day
parts = []
for name, group in df.groupby('day'):
    group = group.copy()  # work on a copy to avoid SettingWithCopyWarning
    group["cluster"] = enc.fit_transform(group.avg_error.values.reshape(-1, 1))
    parts.append(group)
clustered = pd.concat(parts)
Birish
    Did you check out https://imbalanced-learn.org/ to use balancing techniques and increase your number of samples? – Ankur Sinha Aug 16 '19 at 10:29
    Another reference: https://github.com/scikit-learn-contrib/imbalanced-learn/blob/master/examples/over-sampling/plot_comparison_over_sampling.py – Ankur Sinha Aug 16 '19 at 10:31
  • While you can use some balancing techniques, I find this question strange. It sounds like you have numbers `1,1,...,1, 10,10,...,10, 100,...,100,1000,...,1000`, and since you have too few `100`-s and `1000`-s, you want to also group some `10`-s with them. If I understand the situation correctly, it doesn't make much sense. Why do you need it? –  Aug 16 '19 at 10:33

1 Answer


The kmeans strategy optimizes a particular statistical quantity: the within-bin squared error. So what quantity would you want to optimize instead?
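For instance, if the quantity you actually care about is the number of points per bin rather than the squared error, a minimal sketch (reusing the df and avg_error column from the question) would swap in the equal-frequency "quantile" strategy instead:

import pandas as pd
from sklearn.preprocessing import KBinsDiscretizer

# 'quantile' puts roughly the same number of points into each of the 4 bins
enc = KBinsDiscretizer(n_bins=4, encode='ordinal', strategy='quantile')
df["cluster"] = enc.fit_transform(df[["avg_error"]]).ravel()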

On your data, you might then just as well predefine the thresholds manually rather than optimizing them.
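A minimal sketch of that manual-threshold approach, using pd.cut; the edge values below are placeholders, not derived from the question's data, and would need to be chosen to match what you consider "low" and "high" error:

import pandas as pd

# hypothetical, hand-picked thresholds for avg_error -- tune these to your data
edges = [0.0, 0.01, 0.05, 0.20, float("inf")]
df["cluster"] = pd.cut(df["avg_error"], bins=edges,
                       labels=[0, 1, 2, 3], include_lowest=True)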

Has QUIT--Anony-Mousse