How to visualise top terms on each HDBSCAN cluster

Question

I'm currently trying to use HDBSCAN to cluster a bunch of movie data, in order to group similar content together and come up with 'topics' that describe those clusters. I'm interested in HDBSCAN because I understand it supports soft clustering, unlike K-Means, which makes it more suitable for my goal.

After performing HDBSCAN, I was able to find which movies were placed in each cluster. What I now want is to find which terms/words represent each cluster.

I've done something similar with KMeans (code below):

from sklearn.cluster import KMeans
import pandas as pd

model = KMeans(n_clusters=70)
model.fit(text)
clusters=model.predict(text)
model_labels=model.labels_
output= model.transform(text)

titles = list(data['title'])
genres = list(data['genres'])

films_kmeans = { 'title': titles, 'info': dataset_list2, 'cluster': clusters, 'genre': genres }
frame_kmeans= pd.DataFrame(films_kmeans, index=[clusters])

print("Top terms per cluster:")
print()
# for each centroid, sort feature indices by descending weight
order_centroids = model.cluster_centers_.argsort()[:, ::-1]
for i in range(70):
    print("Cluster %d:" % i, end='')
    for ind in order_centroids[i, :5]:
        print(' %s' % tfidf_feature_names[ind], end='')
    print()
    print()

    print("Cluster %d titles:" % i, end='')
    for title in frame_kmeans.loc[i]['title'].values.tolist():
        print(' %s,' % title, end='')
    print() #add whitespace
    print() #add whitespace

print()

While this works fine for KMeans, I couldn't find a similar way to do this for HDBSCAN, as I'm aware it doesn't have cluster centers. I have been looking at the documentation, but I'm pretty new at this and I haven't been able to fix my issue.

Any ideas would be very much appreciated! Thank you for your time.

Rricha Jalota · Accepted Answer · 2019-09-05T20:58:42.507

I ran into a similar problem and, taking the lead from @ajmartin's advice, the code below worked for me. It assumes you have a list, label, holding the original (text) label of each point, and a fitted HDBSCAN object, clusterer = hdbscan.HDBSCAN(min_cluster_size=10).fit(X):

from operator import itemgetter
from collections import defaultdict

def get_top_terms(k):
    top_terms = defaultdict(list)
    for c_lab, prob, text_lab in zip(clusterer.labels_, clusterer.probabilities_, label):
        top_terms[c_lab].append((prob, text_lab))

    for c_lab in top_terms:
        top_terms[c_lab].sort(reverse=True, key=itemgetter(0)) # sort the pair based on probability 

    # -- print the top k terms per cluster --    
    for c_lab in top_terms:
        print(c_lab, top_terms[c_lab][:k])
    return top_terms

# -- for visualization (add this snippet before plt.scatter(..)) --
from collections import Counter
import matplotlib.pyplot as plt

plt.figure(figsize=(16, 16))
plt.title('min_cluster_size=10')

plot_top=Counter() # to get only distinct labels, replace with a set and add a check here [1] 
top_terms = get_top_terms(10)

for i, (lab, prob) in enumerate(zip(clusterer.labels_, clusterer.probabilities_)): # pointwise iteration
    # annotate point i only if its own (probability, label) pair is among its cluster's top 10;
    # matching on the probability alone would confuse points that share the same probability
    if plot_top[lab] < 10 and (prob, label[i]) in top_terms[lab][:10]: # [1]
        plot_top[lab] += 1
        # x[i], y[i] are the projected points in 2D space
        plt.annotate(label[i], (x[i], y[i]), horizontalalignment='center', verticalalignment='center', size=9.5)
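
If what you are after is terms rather than labelled points, one possible approach (not something HDBSCAN provides itself, just an analogue of the cluster_centers_ ordering in the question) is to build a pseudo-centroid for each cluster by averaging the TF-IDF rows of its members, weighted by their membership probabilities, and then take the highest-weighted features. The sketch below is only that, a sketch: top_terms_per_cluster is a hypothetical helper, the toy arrays stand in for clusterer.labels_, clusterer.probabilities_ and your TF-IDF matrix, and a sparse matrix would need .toarray() first.

```python
import numpy as np

def top_terms_per_cluster(tfidf, labels, probabilities, feature_names, k=5):
    """Hypothetical helper: top-k terms of a probability-weighted mean TF-IDF vector per cluster."""
    tfidf = np.asarray(tfidf, dtype=float)          # dense (n_samples, n_features) matrix
    labels = np.asarray(labels)
    probabilities = np.asarray(probabilities, dtype=float)
    result = {}
    for lab in np.unique(labels):
        if lab == -1:                               # HDBSCAN labels noise points as -1
            continue
        mask = labels == lab
        weights = probabilities[mask]
        if weights.sum() == 0:                      # nothing to weight by, skip the cluster
            continue
        # pseudo-centroid: weighted average of the member rows
        pseudo_centroid = np.average(tfidf[mask], axis=0, weights=weights)
        order = np.argsort(pseudo_centroid)[::-1][:k]   # feature indices, highest weight first
        result[int(lab)] = [feature_names[i] for i in order]
    return result

# Toy data (assumed for illustration only)
feature_names = ["space", "alien", "family", "music"]
tfidf = [[0.9, 0.8, 0.0, 0.0],
         [0.6, 0.9, 0.1, 0.0],
         [0.0, 0.0, 0.8, 0.6],
         [0.1, 0.0, 0.9, 0.7],
         [0.5, 0.5, 0.5, 0.5]]
labels = [0, 0, 1, 1, -1]
probabilities = [1.0, 0.5, 1.0, 0.8, 0.0]

terms = top_terms_per_cluster(tfidf, labels, probabilities, feature_names, k=2)
for lab, words in terms.items():
    print("Cluster %d: %s" % (lab, ", ".join(words)))
```

With real data you would pass the array you clustered on (or the TF-IDF matrix it was derived from) together with tfidf_feature_names. Weighting by probability is a design choice: it lets core members dominate the pseudo-centroid, whereas an unweighted mean would treat fringe members equally.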

score 0 · Answer 2 · answered Aug 01 '19 at 09:19

Refer to the HDBSCAN tutorial. For each sample it clusters, the algorithm also returns a probability, which can be thought of as how strongly the sample is associated with its cluster. You can filter the samples for each cluster along with their probabilities, and use the probabilities to determine the top points for each cluster. The link has more details.
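
The filtering described above can be sketched roughly as follows. This is a minimal illustration, not code from the tutorial: top_points_per_cluster is a hypothetical helper, and the three toy lists stand in for clusterer.labels_, clusterer.probabilities_ and your own per-point names.

```python
from collections import defaultdict

def top_points_per_cluster(labels, probabilities, names, k=5):
    """Hypothetical helper: the k members with the highest probability in each cluster."""
    members = defaultdict(list)
    for lab, prob, name in zip(labels, probabilities, names):
        if lab == -1:                      # HDBSCAN labels noise points as -1
            continue
        members[int(lab)].append((float(prob), name))
    # sort each cluster's members by probability, strongest first, and keep k of them
    return {
        lab: sorted(pairs, key=lambda pair: pair[0], reverse=True)[:k]
        for lab, pairs in members.items()
    }

# Toy stand-ins (assumed for illustration only)
labels = [0, 0, 1, -1, 1, 0]
probabilities = [1.0, 0.4, 0.9, 0.0, 0.7, 0.8]
names = ["Alien", "Heat", "Up", "Brazil", "Coco", "Aliens"]

top = top_points_per_cluster(labels, probabilities, names, k=2)
for lab, pairs in top.items():
    print(lab, pairs)
```

Skipping the -1 label keeps the noise points, which always have probability 0, out of the result.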
