Clustering text with the chat-intents Python library (HDBSCAN, UMAP)

Question

I'm using chat-intents (https://github.com/dborrelli/chat-intents) to cluster text automatically. I embed the sentences with sentence transformers. The problem is that when I set the minimum and maximum number of clusters and run the search, the number of clusters it finds can be lower or higher than that range.

The code:

from hyperopt import hp

# Embed the utterances with a sentence-transformers model
X = model.encode(utterances["FCD_COG_INPUT_TEXT"].to_list())

hspace = {
    "n_neighbors": hp.choice('n_neighbors', range(3,16)),
    "n_components": hp.choice('n_components', range(100,115)),
    "min_cluster_size": hp.choice('min_cluster_size', range(50,65)),
    "random_state": 42
}

label_lower = 20
label_upper = 30
max_evals = 100

best_params_use, best_clusters_use, trials_use = bayesian_search(
    X,
    space=hspace,
    label_lower=label_lower,
    label_upper=label_upper,
    max_evals=max_evals,
)

And the results:

100%|██████████| 100/100 [59:49<00:00, 35.90s/trial, best loss: 0.15540102619497703] 
best:
{'min_cluster_size': 51, 'n_components': 106, 'n_neighbors': 7, 'random_state': 42}
label count: 3

In this case it found 3 clusters, but sometimes it finds more than 100.

score 0 · Answer 1 · answered Apr 13 '23 at 11:33


chat-intents adds a penalty term to the objective function when the label count is outside the label range. This does not guarantee that the number of clusters will always fall inside the given range.
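To illustrate why a soft penalty does not enforce the range, here is a minimal sketch. It is not the library's actual code: the function name `penalized_loss`, the fixed penalty of 0.15, and the cost values are assumptions chosen for illustration.

```python
def penalized_loss(cost, label_count, label_lower, label_upper, penalty=0.15):
    """Return the cost plus a fixed penalty when label_count is outside the range.

    The penalty value is an assumption for this sketch.
    """
    out_of_range = label_count < label_lower or label_count > label_upper
    return cost + (penalty if out_of_range else 0.0)


# Two hypothetical trials: one inside the 20-30 range with a high cost,
# one outside the range with a very low cost.
in_range = penalized_loss(cost=0.30, label_count=25, label_lower=20, label_upper=30)
out_of_range = penalized_loss(cost=0.005, label_count=3, label_lower=20, label_upper=30)

# The search minimizes the loss, so the trial with the lower value is kept,
# even if its label count is outside the requested range.
best = min(("in range", in_range), ("out of range", out_of_range), key=lambda t: t[1])
print(best)
```

If every in-range configuration in the search space has a cost higher than some out-of-range configuration's cost plus the penalty, the out-of-range one is reported as best. Widening the search space (for example, allowing a smaller min_cluster_size) may give the search in-range candidates to choose from.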

answered Apr 13 '23 at 11:33

CnydoX
