
I have a large dataset (~188,000 rows). I want to compute the pairwise distances between my rows so that I can apply the hclust function to determine cluster centers, and then use the kmeans function to classify my data.

My problem is with the first step, computing the distance matrix. Using the dist function from the stats package gives me this error:

Error: cannot allocate vector of size 132.0 Gb

It's clearly a RAM problem: dist tries to allocate the entire lower triangle of the distance matrix, n(n-1)/2 doubles, as a single vector.
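
A quick sanity check of that figure (just arithmetic, not part of my code):

```r
## dist() stores the lower triangle of the n-by-n distance matrix
## as doubles (8 bytes each), all in one vector:
n <- 188000
n * (n - 1) / 2 * 8 / 2^30   # ~131.7 GiB, matching the "132.0 Gb" in the error
```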

So I need another way to compute my distance matrix, or a way to avoid materializing it entirely.

Any clear answer would be very helpful.
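
For context, here is a minimal sketch of the pipeline I have in mind, run on a random subsample small enough for dist to handle. The data matrix `x`, the sample size (2000), the "ward.D2" linkage, and k = 10 are all placeholder assumptions:

```r
## Sketch only: subsample so that dist() fits in memory, cluster the
## subsample hierarchically, derive k centers, then run kmeans on all rows.
set.seed(42)
idx <- sample(nrow(x), 2000)          # 2000^2/2 distances instead of 188000^2/2
hc  <- hclust(dist(x[idx, ]), method = "ward.D2")
k   <- 10
grp <- cutree(hc, k = k)              # provisional labels for the subsample
ctrs <- do.call(rbind,
                lapply(split(as.data.frame(x[idx, ]), grp), colMeans))
fit <- kmeans(x, centers = ctrs)      # classify the full dataset
```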

– sarah
  • The problem is the O(N^2) memory complexity, not the `dist` implementation. – zero323 Feb 19 '16 at 14:56
  • @zero323 Thank you for the clarification. But how can I proceed in that case? – sarah Feb 19 '16 at 15:14
  • Truth be told, I am not sure why you want to use hclust before kmeans. If you want to optimize initialization, just use kmeans++ / kmeans|| seeding. I won't point to any particular R implementation, but I am pretty sure one exists. – zero323 Feb 19 '16 at 15:19
  • Possible duplicate of [dist function with large number of points](http://stackoverflow.com/questions/16190214/dist-function-with-large-number-of-points) – sebastian-c Feb 19 '16 at 15:26
  • @zero323 The kmeans++ method sounds great to me, but when I tried it in R I encountered this error, which I don't understand: `Error in kmeans(data, centers = data[center_ids, ], iter.max = iter.max, : initial centers are not distinct`. – sarah Feb 19 '16 at 16:43
  • @sebastian-c I saw that question before posting my own; the answers there don't address my needs. – sarah Feb 19 '16 at 16:44
  • I am having the same problem; how did you solve it? – Amaranta_Remedios Oct 03 '20 at 15:06
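
Following up on the kmeans++ suggestion in the comments, here is a hand-rolled sketch of D²-weighted seeding, assuming a numeric data matrix `x`; the function name `kmeanspp_centers` and k = 10 are placeholders, not an existing R API. Sampling row indices means each chosen row gets weight zero on later draws, which sidesteps the "initial centers are not distinct" error unless the data itself contains duplicate rows:

```r
## Hand-rolled kmeans++ seeding (a sketch, not a library function).
kmeanspp_centers <- function(x, k) {
  n      <- nrow(x)
  ids    <- integer(k)
  ids[1] <- sample.int(n, 1)                       # first center uniformly at random
  d2     <- rowSums((x - x[rep(ids[1], n), ])^2)   # squared distance of each row to it
  for (i in 2:k) {
    ids[i] <- sample.int(n, 1, prob = d2)          # D^2-weighted draw; chosen rows have prob 0
    d2     <- pmin(d2, rowSums((x - x[rep(ids[i], n), ])^2))
  }
  x[ids, ]                                         # k distinct rows to use as centers
}

fit <- kmeans(x, centers = kmeanspp_centers(x, 10))
```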

0 Answers