
I want to cluster 3.5M 300-dimensional word2vec vectors from my custom gensim model to determine whether I can use the resulting clusters to find topic-related words. This is not the same as model.most_similar_..., as I hope to capture quite distant, but still related, words.

The overall size of the model (after normalization of vectors, i.e. model.init_sims(replace=True)) in memory is 4GB:

import sys
import numpy as np

words = sorted(model.wv.vocab.keys())
vectors = np.array([model.wv[w] for w in words])
sys.getsizeof(vectors)
4456416112
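
As a sanity check, 3.5M * 300 float32 values alone come to roughly that figure:

n_words, dim = 3500000, 300
print(n_words * dim * 4 / 1e9)   # ≈ 4.2 GB of raw float32 data, in line with getsizeof above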

I tried both scikit-learn's DBSCAN and some other implementations from GitHub, but they seem to consume more and more RAM during processing and eventually crash with std::bad_alloc. I have 32 GB of RAM and 130 GB of swap.

The metric is Euclidean; since the vectors are unit-normalized, I convert my cosine similarity threshold cos=0.48 to eps=sqrt(2-2*0.48), so all the usual optimizations for the Euclidean metric should apply.
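
For unit-normalized vectors the squared Euclidean distance is 2 - 2*cos(a, b), so the conversion (with the 0.48 value from above) looks like this:

import numpy as np

cos_sim = 0.48                   # cosine similarity threshold
eps = np.sqrt(2 - 2 * cos_sim)   # equivalent Euclidean distance for unit vectors
print(eps)                       # ≈ 1.0198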

The problem is that I don't know the number of clusters and want to determine them by setting a threshold for closely related words (say, cosine similarity above 0.48, i.e. d_l2 < sqrt(2-2*0.48)). DBSCAN seems to work on small subsets, but I can't run the computation on the full data.
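
Something like the following runs fine on a subset (the subset size and min_samples=5 are just example values, not tuned):

from sklearn.cluster import DBSCAN
import numpy as np

eps = np.sqrt(2 - 2 * 0.48)
subset = vectors[:100000]        # example subset; the full 3.5M does not fit
labels = DBSCAN(eps=eps, min_samples=5, metric='euclidean').fit_predict(subset)
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)   # label -1 marks noise
print(n_clusters)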

Is there any algorithm or workaround in Python which can help with that?

EDIT: The full pairwise distance matrix, at size(float)=4 bytes, would take 3.5M*3.5M*4/1024(KB)/1024(MB)/1024(GB)/1024(TB) ≈ 44.5 TB, so it's impossible to precompute it.

EDIT2: Currently trying ELKI, but I cannot get it to cluster the data properly on a toy subset.
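
In case it helps to reproduce: a minimal way to dump the vectors to a whitespace-separated text file for ELKI (vectors.tsv is just an example name; ELKI's default parser should treat the trailing non-numeric column as a label):

with open('vectors.tsv', 'w') as f:
    for word, vec in zip(words, vectors):
        f.write(' '.join('%.6f' % x for x in vec) + ' ' + word + '\n')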

Slowpoke
  • sklearn's DBSCAN is, unfortunately, memory intensive. If you choose too large a threshold, it needs O(n²) memory. Usual implementations of DBSCAN should only use O(n) memory. However, with 300-dimensional vectors, choosing the epsilon parameter for DBSCAN *will* be difficult with any implementation. Maybe that is where your problem is, because I find ELKI easy to use (make sure to add an index for large data, it does not do this automatically!) – Has QUIT--Anony-Mousse Jan 17 '18 at 09:07
  • What does not "properly" work in ELKI? Can you be more precise? – Erich Schubert Jan 17 '18 at 13:39
  • @ErichSchubert Sorry, everything seems to be working correctly today; perhaps I did something wrong yesterday. – Slowpoke Jan 17 '18 at 14:50
  • Thanks @Anony-Mousse, I have already achieved correct results on a toy dataset and will try to process the full data in ELKI and post the results here – Slowpoke Jan 17 '18 at 14:51

0 Answers