I want to cluster 3.5M 300-dimensional word2vec vectors from my custom gensim model to determine whether I can use the resulting clusters to find topic-related words. This is not the same as model.most_similar_..., since I hope to catch quite distant, but still related, words.
The overall size of the model in memory (after normalizing the vectors, i.e. model.init_sims(replace=True)) is about 4 GB:
import sys
import numpy as np

words = sorted(model.wv.vocab.keys())
vectors = np.array([model.wv[w] for w in words])   # shape (n_words, 300), float32
sys.getsizeof(vectors)
4456416112
I tried both scikit-learn's DBSCAN and some other implementations from GitHub, but they seem to consume more and more RAM during processing and eventually crash with std::bad_alloc. I have 32 GB of RAM and 130 GB of swap.
The metric is Euclidean; since the vectors are unit-normalized, I convert my cosine similarity threshold cos=0.48 to eps=sqrt(2-2*0.48), so all of the usual index-based optimizations should apply.
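To make the conversion explicit, here is a minimal sketch (the sanity check on two arbitrary rows is just for illustration and relies on the vectors being unit-normalized by init_sims(replace=True)):

import numpy as np

# on unit vectors: ||a - b||^2 = 2 - 2*cos(a, b)
cos_threshold = 0.48
eps = np.sqrt(2 - 2 * cos_threshold)   # ~1.0198

# sanity check on two arbitrary rows of the (already normalized) vectors array
a, b = vectors[0], vectors[1]
assert np.isclose(np.linalg.norm(a - b), np.sqrt(2 - 2 * np.dot(a, b)), atol=1e-5)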
The problem is that I don't know the number of clusters in advance and want to determine it by setting a threshold for closely related words (say, cosine similarity above 0.48, i.e. d_l2 < sqrt(2-2*0.48)). DBSCAN seems to work on small subsets, but I can't push the computation through the full data.
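For reference, a rough sketch of how the subset runs look (the 100k subset size and min_samples=5 are placeholder assumptions, not tuned values):

import numpy as np
from sklearn.cluster import DBSCAN

eps = np.sqrt(2 - 2 * 0.48)   # same threshold as above

# a random subset of ~100k vectors finishes; the full 3.5M does not
rng = np.random.default_rng(0)
idx = rng.choice(len(vectors), size=100_000, replace=False)
subset = vectors[idx]

labels = DBSCAN(eps=eps, min_samples=5, metric='euclidean', n_jobs=-1).fit_predict(subset)
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)   # -1 marks noise
print(n_clusters)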
Is there any algorithm or workaround in Python which can help with that?
EDIT: The full pairwise distance matrix, at sizeof(float) = 4 bytes, would take 3.5M * 3.5M * 4 bytes ≈ 44.5 TB, so it's impossible to precompute it.
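Quick check of that arithmetic:

n = 3.5e6
print(n * n * 4 / 1024**4)   # ~44.56 (TB)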
EDIT2: Currently trying ELKI, but I cannot get it to cluster the data properly even on a toy subset.
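For completeness, this is how I dump the vectors to a text file for ELKI (assuming its default whitespace-separated numeric input format with a trailing label column; the filename is arbitrary):

# one line per word: 300 space-separated floats followed by the word as a label
with open('vectors_for_elki.txt', 'w') as f:
    for word, vec in zip(words, vectors):
        f.write(' '.join('%.6f' % x for x in vec) + ' ' + word + '\n')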