1

I have a large dataset of around 20 million points (x, y, z) in 3-dimensional space. I know these points are organized in dense regions, and that these regions vary in size. I think a standard unsupervised 3D clustering should solve my problem.

Since I can't estimate the number of clusters a priori, I tried running k-means over a wide range of k, but it is slow, and I would still have to estimate how significant each k-partition is. Basically, my question is: how can I extract the most significant partition of my points into clusters?

user1883163

3 Answers

5

k-means is probably not the best algorithm for such data.

DBSCAN should be closer to your intuition of dense regions.

Try on a sample first, then figure out how to scale up.
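To make the density idea concrete, here is a minimal pure-Python sketch of DBSCAN that you could run on a small random sample. It is only an illustration, not a tuned implementation: `eps` (neighbourhood radius) and `min_pts` (minimum neighbours, counting the point itself, for a core point) are parameters you would have to choose for your data, and the grid lookup is just one simple way to find neighbours. For real work a library implementation such as scikit-learn's `sklearn.cluster.DBSCAN` is the more usual choice.

```python
from collections import defaultdict, deque
from itertools import product

NOISE = -1


def dbscan(points, eps, min_pts):
    """Label each 3D point with a cluster id (0, 1, ...) or NOISE."""
    # Bucket points into cubic cells of side eps, so every neighbour
    # within eps of a point lies in its own cell or one of the 26 around it.
    grid = defaultdict(list)
    for i, p in enumerate(points):
        grid[tuple(int(c // eps) for c in p)].append(i)

    def neighbours(i):
        p = points[i]
        cx, cy, cz = (int(c // eps) for c in p)
        found = []
        for dx, dy, dz in product((-1, 0, 1), repeat=3):
            for j in grid.get((cx + dx, cy + dy, cz + dz), ()):
                q = points[j]
                if sum((a - b) ** 2 for a, b in zip(p, q)) <= eps * eps:
                    found.append(j)
        return found  # includes i itself

    labels = [None] * len(points)
    cluster = 0
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbours(i)
        if len(seeds) < min_pts:
            labels[i] = NOISE  # may later be claimed as a border point
            continue
        # i is a core point: grow a new cluster outwards from it.
        labels[i] = cluster
        queue = deque(seeds)
        while queue:
            j = queue.popleft()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point, do not expand from it
            if labels[j] is not None:
                continue
            labels[j] = cluster
            reachable = neighbours(j)
            if len(reachable) >= min_pts:  # j is itself a core point
                queue.extend(reachable)
        cluster += 1
    return labels
```

On the full data you would first draw a sample, for example `random.sample(points, 100_000)`, run `dbscan` on it with a few values of `eps`, and only then worry about scaling up.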

Has QUIT--Anony-Mousse
0

It is not clear to me from the question whether you are going to use k-means or not, but if you are, you should follow the answers to the post below, which show how to measure the variance of the clusters.

Calculating the percentage of variance measure for k-means?

Additionally, you can get a good fit using 'the elbow method' by trying values of k from 2 to 15 and picking the k after which the explained variance stops improving noticeably. See the answer from Amro for the process on this.
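As a rough sketch of the elbow method (not the code from the linked answer), the snippet below runs a plain Lloyd-style k-means for each k and reports the fraction of total variance explained. The function names, the `seed` parameter and the fixed iteration cap are my own assumptions for illustration; on millions of points you would use an optimized library rather than this pure-Python loop.

```python
import random


def sq_dist(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))


def mean(points):
    n = len(points)
    return tuple(sum(c) / n for c in zip(*points))


def kmeans(points, k, iters=50, seed=0):
    """Plain Lloyd iterations from k randomly chosen starting points."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    labels = None
    for _ in range(iters):
        # Assignment step: each point goes to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: sq_dist(p, centroids[c])) for p in points
        ]
        if new_labels == labels:
            break  # assignments stable, so centroids are stable too
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old centroid if a cluster went empty
                centroids[c] = mean(members)
    return centroids, labels


def explained_variance(points, ks, seed=0):
    """Map each k to 1 - (within-cluster SS / total SS)."""
    grand = mean(points)
    total = sum(sq_dist(p, grand) for p in points)
    curve = {}
    for k in ks:
        centroids, labels = kmeans(points, k, seed=seed)
        within = sum(sq_dist(p, centroids[lab]) for p, lab in zip(points, labels))
        curve[k] = 1.0 - within / total
    return curve
```

Calling `explained_variance(sample, range(2, 16))` and plotting the result against k gives the curve whose bend is the "elbow". Since k-means depends on its starting points, it is worth repeating this with a few different seeds.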

Community
unique_beast
0

One simple idea in this case is to use three separate 1-D clusterings, one along each dimension. That might speed things up.

So you find clusters along the X axis (project all the points onto the X axis), then split each of those clusters into sub-clusters along the Y axis, and then again along the Z axis.

I think 1-D k-means can be solved very efficiently using dynamic programming: http://www.sciencedirect.com/science/article/pii/0025556473900072.
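Here is a small sketch of what I mean. `kmeans_1d` is a straightforward O(k·n²) dynamic program over the sorted values, which is not necessarily the algorithm from the linked paper, and `cluster_by_axes` with its `k_per_axis` parameter is just a hypothetical helper to show the X, then Y, then Z nesting. Note that it always splits each group into k parts, even if the group has no real structure along that axis, so you would still need some significance check on top.

```python
def kmeans_1d(values, k):
    """Optimal 1-D k-means by dynamic programming; returns a label per value."""
    order = sorted(range(len(values)), key=values.__getitem__)
    xs = [values[i] for i in order]
    n = len(xs)
    k = min(k, n)
    # Prefix sums give the squared error of any sorted run xs[i:j] in O(1).
    s1 = [0.0] * (n + 1)
    s2 = [0.0] * (n + 1)
    for i, x in enumerate(xs):
        s1[i + 1] = s1[i] + x
        s2[i + 1] = s2[i] + x * x

    def sse(i, j):  # cost of putting xs[i:j] into a single cluster
        s = s1[j] - s1[i]
        return s2[j] - s2[i] - s * s / (j - i)

    inf = float("inf")
    # cost[m][j]: best cost of splitting xs[:j] into m contiguous clusters.
    cost = [[inf] * (n + 1) for _ in range(k + 1)]
    cut = [[0] * (n + 1) for _ in range(k + 1)]
    cost[0][0] = 0.0
    for m in range(1, k + 1):
        for j in range(m, n + 1):
            for i in range(m - 1, j):
                c = cost[m - 1][i] + sse(i, j)
                if c < cost[m][j]:
                    cost[m][j], cut[m][j] = c, i
    # Walk the cut points backwards to label each value in original order.
    labels = [0] * n
    j = n
    for m in range(k, 0, -1):
        i = cut[m][j]
        for t in range(i, j):
            labels[order[t]] = m - 1
        j = i
    return labels


def cluster_by_axes(points, k_per_axis=(2, 2, 2)):
    """Cluster along x, then sub-cluster each group along y, then along z."""
    groups = [list(range(len(points)))]
    for axis, k in enumerate(k_per_axis):
        next_groups = []
        for group in groups:
            labels = kmeans_1d([points[i][axis] for i in group], k)
            split = {}
            for i, lab in zip(group, labels):
                split.setdefault(lab, []).append(i)
            next_groups.extend(split[lab] for lab in sorted(split))
        groups = next_groups
    return groups  # each group is a list of indices into points
```

The quadratic inner loop is fine for a sample but too slow for 20 million values as written, so treat it as a way to test the idea before investing in a faster 1-D solver.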

jhegedus