I'm using chatintents (https://github.com/dborrelli/chat-intents) for automatic clustering. To embed the sentences I use sentence transformers. The problem is that when I set the minimum and maximum number of clusters and run the search, the number of clusters it finds can fall outside that range.

The code:

X = model.encode(utterances["FCD_COG_INPUT_TEXT"].to_list()) 

hspace = {
    "n_neighbors": hp.choice('n_neighbors', range(3,16)),
    "n_components": hp.choice('n_components', range(100,115)),
    "min_cluster_size": hp.choice('min_cluster_size', range(50,65)),
    "random_state": 42
}

label_lower = 20
label_upper = 30
max_evals = 100

best_params_use, best_clusters_use, trials_use = bayesian_search(X, 
                                                                 space=hspace, 
                                                                 label_lower=label_lower, 
                                                                 label_upper=label_upper, 
                                                                 max_evals=max_evals) 

And the results:

100%|██████████| 100/100 [59:49<00:00, 35.90s/trial, best loss: 0.15540102619497703] 
best:
{'min_cluster_size': 51, 'n_components': 106, 'n_neighbors': 7, 'random_state': 42}
label count: 3 

In this case it found 3 clusters, but sometimes it finds more than 100.

1 Answer

Chat-intents adds a penalty term to the objective function when the label count falls outside the label range. This is a soft constraint: it does not guarantee that the number of clusters will always fall inside the given range.
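
To illustrate why a penalty does not enforce the range, here is a minimal sketch of a penalized loss. It is not the library's actual code: the function name `penalized_loss`, the fixed penalty of 0.15, and the trial values are all assumptions made up for the example.

```python
def penalized_loss(cost, label_count, label_lower, label_upper, penalty=0.15):
    """Soft constraint: add a fixed penalty (assumed value) when the
    label count is outside [label_lower, label_upper]."""
    out_of_range = label_count < label_lower or label_count > label_upper
    return cost + (penalty if out_of_range else 0.0)


# Hypothetical trial results as (label_count, cost) pairs.
trials = [(3, 0.005), (25, 0.30), (120, 0.20)]

# Score every trial against the requested range of 20 to 30 labels.
losses = [penalized_loss(cost, n, 20, 30) for n, cost in trials]

# The search keeps the trial with the lowest total loss,
# whether or not its label count is inside the range.
best_label_count = trials[losses.index(min(losses))][0]
print(best_label_count)
```

In this made-up example the trial with 3 labels wins, because its cost is so low that it still has the smallest loss after the penalty is added. An in-range result is only selected when some in-range trial has a lower total loss than every penalized out-of-range trial.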

CnydoX
  • 1