I'm trying to use HDBSCAN to cluster a set of movie data, so that I can group similar content together and come up with 'topics' that describe those clusters. I'm interested in HDBSCAN because I understand it supports soft clustering, unlike K-Means, and soft clustering would be more suitable for my goal.
After running HDBSCAN, I was able to find which movies were placed in each cluster. What I want now is to find which terms/words represent each cluster.
I've done something similar with KMeans (code below):
import pandas as pd
from sklearn.cluster import KMeans

# `text`, `data`, `dataset_list2` and `tfidf_feature_names` are defined earlier (not shown)
model = KMeans(n_clusters=70)
model.fit(text)
clusters = model.predict(text)
model_labels = model.labels_
output = model.transform(text)

titles = list(data['title'])
genres = list(data['genres'])

films_kmeans = {'title': titles, 'info': dataset_list2, 'cluster': clusters, 'genre': genres}
frame_kmeans = pd.DataFrame(films_kmeans, index=[clusters])

print("Top terms per cluster:")
print()
# for each centroid, sort the feature indices by weight, highest first
order_centroids = model.cluster_centers_.argsort()[:, ::-1]
for i in range(70):
    print("Cluster %d:" % i, end='')
    for ind in order_centroids[i, :5]:
        print(' %s' % tfidf_feature_names[ind], end='')
    print()
    print()
    print("Cluster %d titles:" % i, end='')
    for title in frame_kmeans.loc[i]['title'].values.tolist():
        print(' %s,' % title, end='')
    print()  # add whitespace
    print()  # add whitespace
print()
While this works fine for KMeans, I couldn't find a similar way to do this for HDBSCAN, since as far as I know it doesn't have cluster centers. I have been looking at the documentation, but I'm pretty new at this and I haven't been able to solve my issue.
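To make the goal concrete, here is a minimal sketch of one possible substitute for cluster centers: average the TF-IDF rows of each cluster and take the highest-weighted terms of that mean vector. This is only an illustration under assumptions, not something taken from the HDBSCAN documentation. The helper name `top_terms_per_cluster` and the toy data are hypothetical; the sketch assumes the labels come from a fitted clusterer's `labels_` attribute, where HDBSCAN uses `-1` for noise points.

```python
import numpy as np

def top_terms_per_cluster(X, labels, feature_names, n_terms=5):
    """Return {cluster_label: [top terms]} from the mean feature vector of each cluster.

    Hypothetical helper: X is a (documents x terms) matrix, dense or scipy sparse CSR.
    """
    labels = np.asarray(labels)
    top_terms = {}
    for label in np.unique(labels):
        if label == -1:  # HDBSCAN marks noise points with -1, so skip them
            continue
        mask = labels == label
        # Mean row of the cluster; np.asarray(...).ravel() handles dense and sparse input
        mean_vector = np.asarray(X[mask].mean(axis=0)).ravel()
        # Indices of the highest-weighted terms, largest first
        best = mean_vector.argsort()[::-1][:n_terms]
        top_terms[int(label)] = [feature_names[i] for i in best]
    return top_terms

# Toy data standing in for the real TF-IDF matrix and HDBSCAN labels
X_demo = np.array([
    [0.9, 0.1, 0.0],
    [0.8, 0.2, 0.0],
    [0.0, 0.1, 0.9],
    [0.5, 0.5, 0.5],  # treated as noise below
])
labels_demo = [0, 0, 1, -1]
terms_demo = ["space", "love", "war"]

demo_result = top_terms_per_cluster(X_demo, labels_demo, terms_demo, n_terms=2)
print(demo_result)
```

With the real data this would presumably be called as `top_terms_per_cluster(text, clusterer.labels_, tfidf_feature_names)`, where `clusterer` is the fitted HDBSCAN object (a name assumed here). A plain mean treats every member equally; weighting rows by membership strength would be another option, but that is not shown.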
Any ideas would be very much appreciated! Thank you for your time.