
There are similar questions and libraries like ELI5 and LIME, but I couldn't find a solution to my problem. I have a set of documents and I am trying to cluster them using scikit-learn's DBSCAN. First, I use TfidfVectorizer to vectorize the documents. Then I simply cluster the data and receive the predicted labels. My question is: how can I explain why a cluster has formed? I mean, imagine there are 2 predicted clusters (cluster 1 and cluster 2). Which features (since our input data is vectorized documents, our features are the vectorized "words") are important for the creation of cluster 1 (or cluster 2)?

Below you can find a minimal example of what I am currently working on. This is not a minimal working example of what I am trying to achieve (since I don't know how).

import numpy as np
import pandas as pd
from sklearn.cluster import DBSCAN
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer

categories = ['alt.atheism', 'soc.religion.christian',
              'comp.graphics', 'sci.med']
twenty_train = fetch_20newsgroups(
    subset='train',
    categories=categories,
    shuffle=True,
    random_state=42,
    remove=('headers', 'footers'),
)

visualize_train_data = pd.DataFrame(
    data=np.c_[twenty_train['data'], twenty_train['target']])
print(visualize_train_data.head())

vec = TfidfVectorizer(min_df=3, stop_words='english',
                      ngram_range=(1, 2))
vectorized_train_data = vec.fit_transform(twenty_train.data)

clustering = DBSCAN(eps=0.6, min_samples=2).fit(vectorized_train_data)
print(f"Unique labels are {np.unique(clustering.labels_)}")

Side notes: The question I referred to focuses specifically on the k-means algorithm, and its answer isn't very intuitive (to me). ELI5 and LIME are great libraries, but the examples they provide are either regression or classification related (not clustering), and their regressors and classifiers support "predict" directly. DBSCAN doesn't...

MehmedB
  • Maybe unrelated to your question, but I think the main question, before answering yours, is: why did you use DBSCAN? Any special reason? Do you think the geometric structure of your data means something, so instead of k-means you went with DBSCAN? Or any other reason? – alift Sep 17 '20 at 08:46
  • @alift Because my datasets might contain outliers. DBSCAN is really good at finding them. – MehmedB Sep 17 '20 at 08:48
  • I see your point; however, for overcoming outliers, DBSCAN is not a general solution IMHO. When you use DBSCAN, you have some pre-assumptions that the connectivity between the data means something, as in manifold examples. Anyway, I think reasoning about the clusters is a common question after clustering. A rough guess is to check the clusters and see if, for some, the concepts the docs are talking about are close together, e.g. in one cluster all have politics-ish subjects, while another one is all about sports. – alift Sep 17 '20 at 08:53
  • @alift That is what I want to get. I want to know feature ranking (important "words", or topics) for "cluster 1" and "cluster 2". – MehmedB Sep 17 '20 at 09:01
  • If we assume that you are sure about the quality of the clustering, and you just want to explain which words lead to cluster 1 and cluster 2, why not start with a distribution of the words inside each cluster? Imagine you have 5 words: cluster 1 has a distribution like 100, 0, 0, 200, 0 and cluster 2 has 10, 20, 1000, 30, 4; then you can guess that word 1 and word 4 explain cluster 1 and word 3 is the lead for cluster 2. – alift Sep 17 '20 at 09:07

2 Answers


DBSCAN, like most clustering algorithms in sklearn, doesn't provide a predict method or feature importances. So you can either (1) reconstruct the decision process by training a logistic regression, or any other interpretable classifier, on the cluster labels (see the sketch below), or (2) switch to another text clustering method, such as NMF or LDA. The first approach is exactly what LIME and the likes do.
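
A minimal sketch of option (1), assuming the vectorized_train_data, clustering and vec objects from the question; the choice of LogisticRegression and the top-10 cutoff are illustrative, not prescribed:

import numpy as np
from sklearn.linear_model import LogisticRegression

labels = clustering.labels_
mask = labels != -1                                    # drop DBSCAN noise points
X, y = vectorized_train_data[mask], labels[mask]

clf = LogisticRegression(max_iter=1000).fit(X, y)

# coef_ has one row per cluster in the multiclass case; with exactly two clusters
# it has a single row whose positive weights point towards clf.classes_[1]
coefs = clf.coef_ if len(clf.classes_) > 2 else np.vstack([-clf.coef_[0], clf.coef_[0]])
feature_names = np.array(vec.get_feature_names())      # get_feature_names_out() in newer scikit-learn
for cluster, weights in zip(clf.classes_, coefs):
    top = np.argsort(weights)[::-1][:10]               # 10 highest-weighted terms per cluster
    print(cluster, feature_names[top].tolist())

The highest-weighted terms per cluster then give a rough, classifier-based explanation of what separates that cluster from the others.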

nad_rom

First, let's understand the embedding space you work with. TfidfVectorizer creates a very sparse matrix, one dimension of which corresponds to the documents and the other to your vocabulary (all the words in the text, except stop words and very uncommon ones; see min_df and stop_words). When you ask DBSCAN to cluster the documents, it takes those tf-idf representations and finds documents which are close to each other using the Euclidean distance metric. So your clusters should hopefully be formed from documents which share common words. In order to find which words (or "features") are most important in a specific cluster, just take the documents that belong to the same cluster (rows of the matrix), and find the top K (say ~10) column indices with the most non-zero values. Then look up what those words are using vec.get_feature_names().

update

cluster_id = 55   # select some cluster
# in how many documents of the cluster each word appears (non-zero tf-idf)
feat_freq = (vectorized_train_data[clustering.labels_ == cluster_id] > 0).astype(int).sum(axis=0)
# column indices of the words with maximum frequency
max_idx = np.argwhere(feat_freq == feat_freq.max())[:, 1]
feature_names = vec.get_feature_names()   # get_feature_names_out() in newer scikit-learn
for i in max_idx:
    print(i, feature_names[i])

Please note that the clusters you get here are really small. Cluster 55 has only 4 documents; most of the others have only 2.
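
If you prefer to rank by tf-idf weight rather than by raw non-zero counts (and stay in the sparse format throughout), here is a sketch along the same lines; top_k is just an illustrative choice:

top_k = 10
cluster_id = 55
rows = vectorized_train_data[clustering.labels_ == cluster_id]   # still a sparse matrix
mean_tfidf = np.asarray(rows.mean(axis=0)).ravel()               # average tf-idf weight per term
top_idx = np.argsort(mean_tfidf)[::-1][:top_k]
feature_names = vec.get_feature_names()                          # get_feature_names_out() in newer scikit-learn
for i in top_idx:
    print(i, feature_names[i], round(float(mean_tfidf[i]), 3))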

igrinis
  • I'm still trying to do this using scipy.sparse.csr_matrix, because this is the format I get from TfidfVectorizer (and memory-wise it is more efficient than dense arrays), but I am not good with sparse formats. Your answer makes sense, but I have some concerns about speed and memory usage. If you have any other suggestions, please tell me. (Or a piece of code would be awesome :D) – MehmedB Sep 23 '20 at 07:15
  • OK, I've made some progress. I am trying to use tf-idf features instead of 'raw most common words'. I will probably give you the bounty. Thanks for the help. – MehmedB Sep 24 '20 at 06:38