
I'm new to Agglomerative Clustering and doc2vec, so I hope somebody can help me with the following issue.

This is my code:

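from sklearn.cluster import AgglomerativeClustering

model = AgglomerativeClustering(linkage='average',
                                connectivity=None, n_clusters=2)
X = model_dm.docvecs.doctag_syn0  # document vectors from the trained doc2vec model
model.fit(X, y=None)
model.fit_predict(X, y=None)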

What I want is to predict the average of the distances of each observation. I got the following error:

MemoryError                               Traceback (most recent call last)
<ipython-input-22-d8b93bc6abe1> in <module>()
      2 model = AgglomerativeClustering(linkage='average',connectivity=None,n_clusters=2)
      3 X = model_dm.docvecs.doctag_syn0
----> 4 model.fit(X, y=None)
      5 

/usr/local/lib64/python2.7/site-packages/sklearn/cluster/hierarchical.pyc in fit(self, X, y)
    763                                        n_components=self.n_components,
    764                                        n_clusters=n_clusters,
--> 765                                        **kwargs)
    766         # Cut the tree
    767         if compute_full_tree:

/usr/local/lib64/python2.7/site-packages/sklearn/externals/joblib/memory.pyc in __call__(self, *args, **kwargs)
    281 
    282     def __call__(self, *args, **kwargs):
--> 283         return self.func(*args, **kwargs)
    284 
    285     def call_and_shelve(self, *args, **kwargs):

/usr/local/lib64/python2.7/site-packages/sklearn/cluster/hierarchical.pyc in _average_linkage(*args, **kwargs)
    547 def _average_linkage(*args, **kwargs):
    548     kwargs['linkage'] = 'average'
--> 549     return linkage_tree(*args, **kwargs)
    550 
    551 

/usr/local/lib64/python2.7/site-packages/sklearn/cluster/hierarchical.pyc in linkage_tree(X, connectivity, n_components, n_clusters, linkage, affinity, return_distance)
    428             i, j = np.triu_indices(X.shape[0], k=1)
    429             X = X[i, j]
--> 430         out = hierarchy.linkage(X, method=linkage, metric=affinity)
    431         children_ = out[:, :2].astype(np.int)
    432 

/usr/local/lib64/python2.7/site-packages/scipy/cluster/hierarchy.pyc in linkage(y, method, metric)
    669                          'matrix looks suspiciously like an uncondensed '
    670                          'distance matrix')
--> 671         y = distance.pdist(y, metric)
    672     else:
    673         raise ValueError("`y` must be 1 or 2 dimensional.")

/usr/local/lib64/python2.7/site-packages/scipy/spatial/distance.pyc in pdist(X, metric, p, w, V, VI)
   1375 
   1376     m, n = s
-> 1377     dm = np.zeros((m * (m - 1)) // 2, dtype=np.double)
   1378 
   1379     # validate input for multi-args metrics

MemoryError: 
Nathan Vērzemnieks

1 Answer


You are getting a MemoryError. This is a reliable indicator that you are running out of memory, on the line indicated.

The line indicates an attempt to allocate an np.zeros() array of (m * (m - 1)) // 2 values of type double (8 bytes). Looking at the scipy source, m, here, is the number of vectors in X, aka model_dm.docvecs.doctag_syn0.shape[0].

So, how many docvecs are you working with? If it's 200,000, you will need...

((200000 * 199999) // 2) * 8 bytes

...or about 160 GB of RAM for that np.zeros() allocation to succeed. (If you have more docvecs, you will need even more RAM.)
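To make the arithmetic concrete, here is a minimal sketch that evaluates the same (m * (m - 1)) // 2 formula for a few values of m. The function name condensed_distance_bytes and the sample sizes are made up for illustration, and the result is only the size of that one array, not the total memory the clustering call would use.

```python
def condensed_distance_bytes(m, itemsize=8):
    """Bytes needed to hold the m*(m-1)//2 pairwise distances as doubles."""
    return (m * (m - 1)) // 2 * itemsize


# Illustrative sizes only; substitute your own X.shape[0].
for m in (10_000, 50_000, 200_000):
    gb = condensed_distance_bytes(m) / 1e9
    print(f"{m:>7} vectors -> {gb:,.1f} GB")
```

Because the requirement grows with the square of m, halving the number of docvecs cuts the allocation to roughly a quarter.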

(Agglomerative clustering needs to know all the pairwise distances. The scipy implementation calculates and stores them all at the beginning, which is very space-consuming.)

You may need more RAM, fewer docvecs, a different clustering algorithm, or an implementation that is lazier about calculating distances (though that will be much slower, because it will often recalculate, rather than reuse, distances it needs repeatedly).
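For the "fewer docvecs" option, one rough way to pick a subset size is to invert the formula above: given a memory budget, find the largest m whose pairwise-distance array still fits, then draw that many row indices at random. This is only a sketch. The helper names, the 16 GB budget, and the 200,000-row count are assumptions for illustration, and the real clustering call will likely need working memory beyond this one array, so treat the result as an upper bound.

```python
import math
import random


def max_vectors_for_budget(budget_bytes, itemsize=8):
    """Largest m such that (m*(m-1)//2) * itemsize <= budget_bytes."""
    pairs = budget_bytes // itemsize
    # Solve m*(m-1)/2 <= pairs using the integer square root.
    m = (1 + math.isqrt(1 + 8 * pairs)) // 2
    # Defensive adjustment; normally the closed form is already exact.
    while m > 0 and m * (m - 1) // 2 > pairs:
        m -= 1
    return m


def sample_row_indices(n_rows, budget_bytes, seed=0):
    """Pick a random, sorted subset of row indices that fits the budget."""
    m = min(n_rows, max_vectors_for_budget(budget_bytes))
    return sorted(random.Random(seed).sample(range(n_rows), m))


# Hypothetical numbers: 200,000 docvecs and a 16 GB budget for the distances.
idx = sample_row_indices(200_000, 16 * 10**9)
print(len(idx), "rows selected")
# With a NumPy array X, the subset would then be X[idx].
```

Clustering a random subset changes the problem somewhat (the remaining documents are left unassigned), so whether this is acceptable depends on what the clusters are for.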

gojomo