
I want to cluster 2D points (latitude/longitude) on a map. There are about 400K points, so the input matrix is 400K x 2.

When I run scikit-learn's AgglomerativeClustering, I run out of memory, even though the machine has about 500 GB of RAM.

class sklearn.cluster.AgglomerativeClustering(n_clusters=2, affinity='euclidean', memory=Memory(cachedir=None), connectivity=None, n_components=None, compute_full_tree='auto', linkage='ward', pooling_func=<function mean at 0x2b8085912398>)
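Roughly, my call looks like the following minimal sketch (the data here is just random placeholder coordinates, and n_clusters/linkage are simply the defaults from the signature above, not necessarily my real settings):

    import numpy as np
    from sklearn.cluster import AgglomerativeClustering

    # Placeholder for the real input: ~400,000 (latitude, longitude) pairs.
    X = np.random.uniform(low=-90.0, high=90.0, size=(400000, 2))

    # Works on small subsets, but on the full 400K x 2 matrix this
    # exhausts the ~500 GB of RAM on the machine.
    model = AgglomerativeClustering(n_clusters=2, linkage='ward')
    labels = model.fit_predict(X)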

I also tried the memory=Memory(cachedir) option with no success. Does anybody have a suggestion (another library, or a change to the scikit-learn code) so that I can run the clustering algorithm on this data?
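For completeness, the caching variant I tried looks roughly like this (the cache directory is one I created beforehand; depending on the scikit-learn version, Memory may need to be imported from sklearn.externals.joblib instead of joblib):

    import numpy as np
    from joblib import Memory  # older scikit-learn: from sklearn.externals.joblib import Memory
    from sklearn.cluster import AgglomerativeClustering

    # Cache directory created beforehand; this caches the computed tree
    # to disk, but in my runs it did not avoid the memory blow-up.
    memory = Memory('/tmp/memory_cache')

    X = np.random.uniform(low=-90.0, high=90.0, size=(400000, 2))  # placeholder coordinates

    model = AgglomerativeClustering(n_clusters=2, linkage='ward', memory=memory)
    labels = model.fit_predict(X)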

I have run the algorithm successfully on small datasets.

Ash
    Are you interested in some out-of-the-box solution, or do you want to solve the problem? I'm not sure about scikit-learn, but I can imagine some boosting-based method to solve this issue and I can write it down if you'd want me to. – Filip Malczak Aug 30 '15 at 08:01
  • @Salvador Dali: for the memory parameter, I have created a directory /tmp/memory_cache and set the memory parameter to memory=Memory('/tmp/memory_cache'). – Ash Sep 01 '15 at 08:19
  • @FilipMalczak: I'm mostly interested in an out-of-the-box solution, but it could be another toolbox in C++ or any other language, and I could add it to my pipeline. – Ash Sep 01 '15 at 08:21
  • Then sorry, I've got nothing. – Filip Malczak Sep 01 '15 at 19:18

0 Answers