I am trying to create a distance matrix to run the DBSCAN algorithm for clustering purposes. The final distance matrix has 174,000 × 174,000 entries, all floating-point numbers between 0 and 1. I have the individual lists (all 174,000 of them) saved, with the numbers stored as int, but when I try to consolidate them into a single array, I keep running out of memory.

Is there a way to compress the data that can deal with such a large data set? I have tried HDF5, but that also seems to struggle.
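
For a sense of scale, here is a rough back-of-the-envelope sketch of how much space the matrix would need uncompressed. The dtype names and the helper functions `dense_size_gb` and `condensed_size_gb` are illustrative, not part of any library, and the "condensed" figure assumes the distances are symmetric so that only the upper triangle has to be kept, which the question does not state.

```python
# Rough size estimate for an N x N distance matrix (no compression).
N = 174_000
ENTRIES = N * N  # number of cells in the full square matrix

# Bytes per value for a few common floating-point widths.
BYTES_PER_ENTRY = {"float64": 8, "float32": 4, "float16": 2}


def dense_size_gb(n, itemsize):
    """Size in GB (10**9 bytes) of the full n x n matrix."""
    return n * n * itemsize / 1e9


def condensed_size_gb(n, itemsize):
    """Size in GB of the upper triangle only.

    Assumption: d(i, j) == d(j, i) and d(i, i) == 0, so only
    n * (n - 1) / 2 values need to be stored.
    """
    return (n * (n - 1) // 2) * itemsize / 1e9


for name, itemsize in BYTES_PER_ENTRY.items():
    print(
        f"{name}: full {dense_size_gb(N, itemsize):.1f} GB, "
        f"upper triangle {condensed_size_gb(N, itemsize):.1f} GB"
    )
```

Even at two bytes per value, the full matrix is far larger than the RAM of a typical workstation, so a narrower dtype alone is unlikely to make it fit in memory.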

  • Are most of them zeros? because then you could store it as a dense matrix. Furthermore do you use `numpy`? – Willem Van Onsem Jan 24 '17 at 18:52
  • @HighPerformanceMark: of course not *per se*, but usually it will be more compact, and most algorithm collections, like LAPACK for instance, have specific implementations for sparse matrices that can outperform their dense counterparts since the loops tend to process fewer elements. – Willem Van Onsem Jan 24 '17 at 19:02
  • Argh, I wrote dense instead of sparse, my mistake. fuuuu.... – Willem Van Onsem Jan 24 '17 at 19:03
  • Are these python floats and lists or are you using `numpy` or `pandas`? – tdelaney Jan 24 '17 at 19:52
  • Hi, These are python lists as of now. Would converting them to numpy arrays help with the memory issue? Also, all the values are between 0 and 1, but none of them are exactly 0 or 1, so storing it as a sparse matrix is not an option, I don't think. – Devdeepta Bose Jan 24 '17 at 21:56
  • Do switch to `numpy`, check this out: http://stackoverflow.com/questions/993984/why-numpy-instead-of-python-lists – Shane Jan 25 '17 at 10:36
  • Thank you so much, Shane. I will try converting everything into numpy arrays of dtype f2 or similar, and report back. I think this, along with some hdf5 compression, will seriously help. Thank you to everyone that replied, I sincerely appreciate your help! – Devdeepta Bose Jan 25 '17 at 17:02
  • DBSCAN does **not use a distance matrix**. So no need to solve this problem. – Has QUIT--Anony-Mousse Jun 23 '17 at 20:47
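
To illustrate the last comment, here is a minimal sketch of DBSCAN that computes distances on demand rather than reading them from a precomputed matrix. It is a simplified textbook-style version, not the implementation of any particular library. It assumes each item is available as a tuple of coordinates and that Euclidean distance is appropriate. Neither assumption comes from the question, and the names `region_query` and `dbscan` are hypothetical.

```python
import math

NOISE = -1
UNVISITED = None


def region_query(points, i, eps):
    """Return indices of all points within eps of points[i].

    Distances are computed when needed, so memory use stays O(n)
    instead of the O(n^2) required by a full distance matrix.
    """
    p = points[i]
    return [j for j, q in enumerate(points) if math.dist(p, q) <= eps]


def dbscan(points, eps, min_pts):
    """Label each point with a cluster id (0, 1, ...) or NOISE (-1)."""
    labels = [UNVISITED] * len(points)
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not UNVISITED:
            continue
        neighbors = region_query(points, i, eps)
        if len(neighbors) < min_pts:
            labels[i] = NOISE  # may later be relabelled as a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = list(neighbors)
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point reached from a core point
            if labels[j] is not UNVISITED:
                continue  # already assigned to a cluster
            labels[j] = cluster
            j_neighbors = region_query(points, j, eps)
            if len(j_neighbors) >= min_pts:
                queue.extend(j_neighbors)  # j is a core point: keep expanding
    return labels


# Tiny usage example with made-up 2-D points.
demo_points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10), (50, 50)]
demo_labels = dbscan(demo_points, eps=1.5, min_pts=3)
print(demo_labels)
```

This brute-force neighbor search still takes quadratic time. For 174,000 items a spatial index or a library implementation would likely be needed, but the sketch shows that the pairwise distances never have to be stored all at once.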

0 Answers