1

I want to cluster 1.5 million chemical compounds. This means having a 1.5 million x 1.5 million distance matrix...

I think I can generate such a big table using pyTables, but now, having such a table, how would I cluster it?

I guess I can't just pass a pyTables object to one of the scikit-learn clustering methods...

Are there any Python-based frameworks that would take my huge table and do something useful with it (like clustering)? Perhaps in a distributed manner?
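For reference, this is roughly what I had in mind for generating the table with pyTables - just a sketch, assuming PyTables 3.x, float32 distances and Blosc compression (the file name and settings are placeholders):

    import tables

    N = 1500000  # number of compounds

    # A full float32 N x N matrix is ~9 TB before compression, so this only
    # shows the PyTables side of things.
    h5 = tables.open_file("distances.h5", mode="w")
    dist = h5.create_carray(
        h5.root, "dist",
        atom=tables.Float32Atom(),
        shape=(N, N),
        filters=tables.Filters(complevel=5, complib="blosc"),
    )

    # Distances would then be written block by block, e.g.:
    # dist[i:i + block, :] = block_of_distances
    h5.close()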

Has QUIT--Anony-Mousse
mnowotka
  • Why does it need to be python? For this size of data, the natural way to go about it is to solve it in a standalone process via dedicated software. Normally a matrix such as this would either be very sparse, or it can easily be considered sparse by applying some weight threshold. In that case it can also be considered a graph clustering problem. – micans Jan 15 '14 at 14:43
  • Because that's the question I asked. If you don't know the answer, why are you commenting? – mnowotka Jan 15 '14 at 14:45
  • I happen to know a bit about clustering, and it seems odd that you are hung up on a particular software language for what is a large-scale data mining problem. Are you trying to solve a problem or do you just like being snarky? It is a genuine question -- why does it have to be python? – micans Jan 15 '14 at 14:56
  • It doesn't necessarily have to be Python, but it would be nice, as almost all of my environment is Python based. I don't like saying Python is not good for handling large amounts of data, because it's not true - pyTables is the best example. It needs to be based on open-source software and it needs to run on a standard headless Linux machine. And you just said that Python is bad (because it's Python, or what?) but you didn't say what a good solution would be. – mnowotka Jan 15 '14 at 15:29

2 Answers

4

Maybe you should look at algorithms that don't need a full distance matrix.

I know that it is popular to formulate algorithms as matrix operations, because tools such as R are rather fast at matrix operations (and slow at other things). But there is a whole ton of methods that don't require O(n^2) memory...
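As one illustration (a sketch only - the algorithm, cluster count and batch size are placeholders, and X stands for whatever fixed-length fingerprint/descriptor vectors you have): scikit-learn's MiniBatchKMeans works directly on the feature vectors and never materializes an n x n matrix.

    import numpy as np
    from sklearn.cluster import MiniBatchKMeans

    # X: (n_samples, n_features) fingerprints/descriptors; random placeholder here
    X = np.random.rand(100000, 64).astype(np.float32)

    # Memory use is O(n * n_features + k * n_features), not O(n^2)
    mbk = MiniBatchKMeans(n_clusters=1000, batch_size=10000, random_state=0)
    labels = mbk.fit_predict(X)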

Has QUIT--Anony-Mousse
  • For this answer to be constructive - can you give some examples of such methods? – mnowotka Jan 15 '14 at 14:50
  • I'll add RNSC, Louvain method, and MCL. – micans Jan 23 '14 at 13:28
  • Mean Shift, SLINK, CLINK, GMM-EM, Canopy preclustering, ... actually I believe the majority of clustering algorithms isn't distance-matrix based. – Has QUIT--Anony-Mousse Jan 23 '14 at 20:38
  • Conceptually there is a distance or similarity matrix in most (all?) algorithms I would say, but there is no need to keep track of all of it. It is fine to do a one-off all vs all computation and store a sparse representation, or alternatively build an index of some sort. – micans Jan 24 '14 at 15:14
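A minimal sketch of the one-off, thresholded all-vs-all computation described in the comment above (the threshold, block size and `compute_block` function are assumptions, not a fixed recipe):

    import numpy as np
    from scipy import sparse

    def thresholded_similarity(X, compute_block, threshold=0.7, block=10000):
        """Build a sparse all-vs-all similarity matrix, keeping only values >= threshold.

        compute_block(X, i, j, block) is assumed to return the dense
        (block x block) similarity sub-matrix of rows i:i+block vs rows j:j+block.
        """
        n = X.shape[0]
        rows, cols, vals = [], [], []
        for i in range(0, n, block):
            for j in range(0, n, block):
                S = compute_block(X, i, j, block)
                r, c = np.nonzero(S >= threshold)
                rows.append(r + i)
                cols.append(c + j)
                vals.append(S[r, c])
        return sparse.coo_matrix(
            (np.concatenate(vals), (np.concatenate(rows), np.concatenate(cols))),
            shape=(n, n),
        ).tocsr()

The resulting sparse graph could then be handed to a graph clustering method such as MCL or Louvain, as mentioned above.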
1

I think the main problem is memory: 1.5 million x 1.5 million elements x 10 B per element is about 22.5 TB, i.e. more than 20 TB. You can use a big-data store such as pyTables or Hadoop (http://en.wikipedia.org/wiki/Apache_Hadoop) together with the MapReduce algorithm.
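Spelled out (assuming 10 bytes per stored element, as above):

    n = 1.5e6                       # compounds
    elements = n * n                # 2.25e12 matrix entries
    bytes_per_element = 10          # assumed element size
    total_tb = elements * bytes_per_element / 1e12
    print(total_tb)                 # 22.5 (TB) -- far beyond RAM, and a lot of disk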

Here are some guides: http://strata.oreilly.com/2013/03/python-data-tools-just-keep-getting-better.html

Or use the Google App Engine Datastore with MapReduce https://developers.google.com/appengine/docs/python/dataprocessing/ - but it is not a production version yet.
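Whatever backend you choose, the computation itself splits naturally into independent blocks. Here is a plain-Python sketch of such a "map" step (the chunk size, metric and `write_block` sink are placeholders - the sink could be a pyTables CArray, HDFS files, etc.):

    from scipy.spatial.distance import cdist

    def map_distance_blocks(X, write_block, block=5000, metric="jaccard"):
        """Compute the distance matrix block by block, so that only one
        (block x block) dense chunk is held in memory at a time."""
        n = X.shape[0]
        for i in range(0, n, block):
            for j in range(0, n, block):
                D = cdist(X[i:i + block], X[j:j + block], metric=metric)
                write_block(i, j, D)  # hypothetical sink (pyTables, HDFS, ...)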

jacek2v