
I have a large data frame (375,000 rows and 5 columns), all variables numerical. I would like to perform spatio-temporal clustering on this data frame using hierarchical clustering in R. However, when I try to calculate the distance matrix, I get the error "negative length vectors are not allowed" from the distance function. Is it because I am exceeding the maximum memory of my computer (16 GB of RAM), or because I am exceeding the maximum length of a vector in R, which is 2^31 - 1 (around 2.1 billion) elements? By the way, how do I calculate the length of the distance matrix I am trying to compute? Is it 375,000^2, which is about 140 billion? In any case, what can I do about this problem? Can I somehow still use hierarchical clustering in this case?
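For reference, this is essentially what I am running (df is my data frame; the linkage method is just an example):

    # df: 375,000 rows, 5 scaled numeric columns
    d <- dist(df)                       # fails here with the negative length error
    hc <- hclust(d, method = "ward.D2") # linkage method is only illustrative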

Clustering using kmeans works perfectly, but my supervisor prefers hierarchical clustering.
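For comparison, the kmeans call that runs without problems looks like this (the number of clusters is a placeholder):

    # k-means never builds the full pairwise distance matrix; it only
    # computes distances from each point to the k cluster centers
    km <- kmeans(df, centers = 10, nstart = 25)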

Any hints/suggestions would be greatly appreciated.

P.S. Rows represent vehicle trip IDs, and the columns are: longitude of the starting point, latitude of the starting point, longitude of the end point, latitude of the end point, and the time of the trip on a specific day (all variables are scaled).

  • This link may be useful: https://stackoverflow.com/questions/13077476/hclust-size-limit – Shirin Yavari Sep 24 '18 at 22:48
  • Thank you. I think the OP there had a memory problem in the first place, as he got a "cannot allocate vector of 5GB" error in addition to the negative length vectors error. In the end he used a machine with more RAM and solved the problem. However, I get only the negative length error. – Rayan Sep 25 '18 at 18:04
  • It's mentioned there that: "You might want to look into more modern clustering algorithms. Anything that runs in O(n log n) or better should work for you. There are plenty of good reasons to not use hierarchical clustering: usually it is rather sensitive to noise (i.e. it doesn't really know what to do with outliers) and the results are hard to interpret for large data sets (dendrograms are nice, but only for small data sets). Many people have had success with DBSCAN apparently" – Shirin Yavari Sep 25 '18 at 19:25
  • The thing is, as mentioned in the original post above, my supervisor insists on using hierarchical clustering since it is "well-structured". I have tried DBSCAN, but in my case it doesn't work: the data points are close to each other, so the density is nearly the same everywhere. DBSCAN results in one very big cluster that contains most of the points, and a very few other clusters with very few elements! – Rayan Sep 25 '18 at 20:37
  • ... which may well be the correct answer. There is no rule in clustering that says clusters must be small. – Has QUIT--Anony-Mousse Sep 26 '18 at 06:43
  • Maybe, but then it is useless to analyze the clusters and I will not get any useful results. Maybe there is no point in clustering at all. – Rayan Sep 26 '18 at 10:08

1 Answer


Yes: 375,000^2 is about 140 billion, which far exceeds the maximum length of an R vector (2^31 - 1, about 2.1 billion). Even the n(n-1)/2 ≈ 70 billion entries that dist() actually stores (it only keeps the lower triangle) are far beyond that limit.
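You can check this directly in R:

    n <- 375000
    n * (n - 1) / 2                         # entries dist() stores: ~7.03e10
    .Machine$integer.max                    # maximum vector length: 2147483647
    n * (n - 1) / 2 > .Machine$integer.max  # TRUE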

The size of a dense matrix is roughly rows * cols * size of the data type (8 bytes for a double in R).

Compute the amount of memory you need, then go back to your supervisor with that result.
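For example, assuming 8-byte doubles:

    n <- 375000
    (n * n * 8) / 2^30            # full n x n matrix: ~1048 GiB, about a terabyte
    (n * (n - 1) / 2) * 8 / 2^30  # dist()'s triangular storage: still ~524 GiB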

  • Thank you. What do you mean by size of the data type? Do you mean that if I run the model on a computer with much larger RAM, then this will work? I still can't get one thing: is the problem with my computer's specifications or with the R software itself? – Rayan Sep 26 '18 at 10:06
  • sizeof(double) vs. sizeof(float). No, adding RAM won't be enough: you'd need to modify all the code to use 64-bit indexes instead of 32-bit signed integers (hence the 2^31 - 1 limit). Plus, did you compute yet how much RAM you'd need? I doubt you'll "just" add a terabyte (and you probably need 1 or 2 working copies, too), nor wait for that matrix to be initialized and the O(n³) algorithm to finish. – Has QUIT--Anony-Mousse Sep 26 '18 at 22:17