I'm trying to cluster similar messages within machine log files (where, e.g., I can't ignore numbers). While debugging my code with a subset of messages that all have the same "degree of similarity", I came across a very strange finding: below a certain number of these messages HDBScan produces the expected result (all messages belong to one cluster or to none), but above that number HDBScan suddenly starts finding different clusters, which intuitively doesn't make sense to me.

And even more strange: that limit where I start seeing multiple clusters is 18 messages for the 'generic' algorithm within HDBScan but 61 when using 'best'. Well, maybe 'best' selects 'generic' above 60 messages, not sure how to verify that...

I've tried various settings for min_cluster_size and min_samples and various distance metrics but the issue remains the same. Below please find some self-contained code to see if you can reproduce the issue. Just change n_msg to any number >= 18 (or >=61 when using 'best') and you should get more than one cluster ID. The code also prints the cosine-similarity matrix so you can see how symmetrical this example is.

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity
import hdbscan
import pandas as pd

# set number of dummy messages to be created
n_msg = 17

# create dummy messages with three identical and one variable term
# (built from a list, since DataFrame.append was removed in pandas 2.0)
msgs = pd.DataFrame({'msg': ['bli bla blub ' + str(i) for i in range(n_msg)]})

# tokenizer to split at space only so numbers will not be ignored
def space_tokenizer(msg):
    return msg.split()

# Vectorize dummy messages
TFvectorizer = CountVectorizer(tokenizer = space_tokenizer)
msgs_vect = TFvectorizer.fit_transform(msgs.msg)

# Compute cosine similarity between message vectors
msgs_CosSim = pd.DataFrame(cosine_similarity(msgs_vect, msgs_vect))
print(msgs_CosSim,'\n')

# Cluster the cosine similarity results
hdbs = hdbscan.HDBSCAN(min_cluster_size=2, min_samples=1, metric='euclidean',
                       algorithm='generic').fit(msgs_CosSim)
CosSim_clstr_ID = pd.DataFrame(hdbs.labels_)
CosSim_clstr_ID.columns = ['msg_ID']
print(CosSim_clstr_ID,'\n')

# Check number of cluster IDs generated
print('Number of cluster IDs:', len(CosSim_clstr_ID.msg_ID.unique()))

So again, with the dummy messages above I would expect the same result (all messages belong to one cluster ID or to the outlier cluster -1) regardless of the number of messages, but I start getting different clusters above a certain number of messages (the threshold depends on the algorithm chosen within HDBScan).
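To put a number on how symmetrical the example is, here is a minimal standard-library sketch that rebuilds the count vectors and the cosine-similarity matrix by hand. The helper names (`count_vectors`, `similarity_matrix`, `row_distances`) are made up for illustration and are not part of scikit-learn or hdbscan. It assumes, as in the code above, that the similarity matrix is passed to `fit` with `metric='euclidean'`, so each row is treated as a point.

```python
import math

def count_vectors(n_msg, fixed_terms=("bli", "bla", "blub")):
    """Bag-of-words count vectors for the messages '<fixed terms> <i>'."""
    vocab = list(fixed_terms) + [str(i) for i in range(n_msg)]
    vectors = []
    for i in range(n_msg):
        tokens = list(fixed_terms) + [str(i)]
        vectors.append([tokens.count(term) for term in vocab])
    return vectors

def cosine(u, v):
    """Cosine similarity of two equally long count vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def similarity_matrix(n_msg):
    """Full cosine-similarity matrix of the dummy messages."""
    vecs = count_vectors(n_msg)
    return [[cosine(u, v) for v in vecs] for u in vecs]

def row_distances(matrix):
    """Euclidean distances between all pairs of distinct rows."""
    n = len(matrix)
    return [math.dist(matrix[i], matrix[j])
            for i in range(n) for j in range(i + 1, n)]

for n_msg in (17, 18, 61):
    sim = similarity_matrix(n_msg)
    off_diag = {round(sim[i][j], 12)
                for i in range(n_msg) for j in range(n_msg) if i != j}
    dists = {round(d, 12) for d in row_distances(sim)}
    # Each set should contain exactly one value, whatever n_msg is.
    print(n_msg, off_diag, dists)
```

Every off-diagonal similarity is 0.75 (three shared terms out of four), and every pair of rows is the same Euclidean distance apart, sqrt(0.125) ≈ 0.354, regardless of `n_msg`. So the input itself gives no reason to prefer one grouping over another.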

Any idea what's going on?

Update: I did some more research on this using only (!) allow_single_cluster=True, looping through all algorithm values as well as different message types (between one and four fixed terms plus the number). In the results, 'generic', whilst being the fastest, seems to be the most likely to produce strange results at random: [image: results per algorithm and message type]
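For reference, the different message types differ only in how similar every pair of messages is, not in how symmetrical the data are. A small sketch of that arithmetic, assuming each message consists of `n_fixed` distinct fixed words plus one unique number, each occurring once (the function names are hypothetical):

```python
from fractions import Fraction
import math

def pairwise_similarity(n_fixed):
    """Cosine similarity of two messages sharing n_fixed terms.

    The dot product is n_fixed and each vector has squared norm
    n_fixed + 1, so the similarity is n_fixed / (n_fixed + 1).
    """
    return Fraction(n_fixed, n_fixed + 1)

def row_distance(n_fixed):
    """Euclidean distance between two rows of the similarity matrix.

    Two rows differ in exactly two positions, each by (1 - similarity).
    """
    gap = 1 - pairwise_similarity(n_fixed)
    return math.sqrt(2) * float(gap)

for n_fixed in range(1, 5):
    print(n_fixed, pairwise_similarity(n_fixed), round(row_distance(n_fixed), 4))
```

With one fixed term every pair has similarity 1/2, with four fixed terms 4/5. In each case all pairs of rows are equally far apart.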

MarkH
  • BTW, if you allow HDBScan to build single clusters via `allow_single_cluster=True`, the problem is gone, with these dummy messages at least... – MarkH Sep 19 '19 at 06:33
  • More testing revealed that `allow_single_cluster=True` does not do the trick reliably... If you make the messages shorter (`msg = ['bli ' + str(i)]`) and use `algorithm='generic'`, then even `n_msg = 10` produces a bizarre result where messages 0-3 and 6-9 are in cluster `0` and messages 4-5 are outliers (cluster `-1`) – MarkH Sep 20 '19 at 06:42
