
I have around 200k latitude/longitude data points. How can I cluster them so that every point in a cluster lies strictly within a radius of 1 km from the cluster centroid?

I tried the leaderCluster algorithm/package in R, but even though I specify radius = 1 km, it does not strictly enforce it: it produces clusters containing many points that are, say, 5-10 km from the cluster centroid. So it does not meet my requirement.

The number of points in a cluster can vary; that is not a problem.

Is there a way to enforce a strict radius constraint in hierarchical or another clustering algorithm? I am looking for the steps and an implementation in R or Python. I searched Stack Overflow but couldn't find a solution in R or Python.

How can I visualize the cluster centroids in Google Maps after the clustering is done?

EDIT

These are the parameters I am using in ELKI. Please verify. [screenshot of the ELKI parameters]

GeorgeOfTheRF

1 Answer


This is not so much a clustering problem as a set-cover type of problem, at least if you are looking for a good cover. A clustering algorithm is about finding structure in your data, whereas you are looking for a forced quantization.

Anyway, here are two strategies you can try e.g. in ELKI:

  • Canopy preclustering with T1=T2=your radius. This should yield a greedy approximation to the cover scenario.
  • Complete linkage hierarchical agglomerative clustering, cut at the desired height. This is fairly expensive (O(n^3) time in the naive implementation, and typically O(n^2) memory for the distance matrix). Any two points in the same cluster are at most this distance apart, so this is a bit stricter than your requirement.
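To make the first strategy concrete, here is a minimal Python sketch of a canopy-style greedy pass with T1 = T2 = radius. It is not ELKI's implementation; the function name `greedy_radius_cover` and its arguments are made up for illustration, and the demo uses planar coordinates in km with `math.dist` only to keep it short. For real latitude/longitude data you would pass a great-circle distance as `dist`. Note that the guarantee is relative to the seed point of each group, not to the mean of its members.

```python
import math

def greedy_radius_cover(points, radius, dist):
    """Canopy-style pass with T1 == T2 == radius (illustrative sketch).

    Returns (centers, labels): centers holds the indices of the seed points,
    and labels[i] is the position in centers of the seed covering points[i].
    Every point ends up within `radius` of its seed point.
    """
    labels = [-1] * len(points)
    centers = []
    for i, p in enumerate(points):
        if labels[i] != -1:
            continue  # already covered by an earlier seed
        cid = len(centers)
        centers.append(i)  # first uncovered point becomes a new seed
        for j in range(i, len(points)):
            # Claim every still-uncovered point within the radius of this seed.
            if labels[j] == -1 and dist(p, points[j]) <= radius:
                labels[j] = cid
    return centers, labels

# Hypothetical planar points (units: km), radius 1 km.
points = [(0.0, 0.0), (0.5, 0.0), (0.9, 0.2), (3.0, 3.0), (3.4, 3.1), (10.0, 10.0)]
centers, labels = greedy_radius_cover(points, 1.0, math.dist)
print(centers, labels)
```

This naive version compares each new seed against all later points, so its cost grows with the number of seeds times the number of points; a spatial index is what makes the same idea fast on large data.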

Beware that you should be using haversine ("geo") distances, not Euclidean!
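
As an illustration of why this matters, here is a small Python sketch of the haversine formula. The function name `haversine_km` and the mean Earth radius constant are assumptions chosen for this example. One degree of longitude is about 111 km at the equator but only about half that at 60° N, so Euclidean distance on raw degrees distorts a fixed 1 km radius.

```python
import math

EARTH_RADIUS_KM = 6371.0088  # assumed mean Earth radius

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    # min() guards against rounding pushing sqrt(a) slightly above 1.
    return 2 * EARTH_RADIUS_KM * math.asin(min(1.0, math.sqrt(a)))

# The same one-degree step in longitude, measured at two latitudes.
at_equator = haversine_km(0.0, 0.0, 0.0, 1.0)
at_60n = haversine_km(60.0, 0.0, 60.0, 1.0)
print(round(at_equator, 1), round(at_60n, 1))
```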

Has QUIT--Anony-Mousse
  • Thanks. For hierarchical clustering, do you mean I should set height = radius (i.e. 1 km), since I am using haversine distance? I tried hierarchical clustering, but it runs out of memory even though I have 16 GB of RAM. – GeorgeOfTheRF May 28 '16 at 17:48
  • Can you explain what canopy preclustering with T1=T2 does? I am new to ELKI. When I visited the ELKI site I couldn't find an installer for Windows. Is it a GUI, or do I need to program in Java? Can you share the Windows installer link? – GeorgeOfTheRF May 28 '16 at 17:49
  • T1 and T2 are parameters of canopy clustering; see the publication for their definition. They are not meant to be the same, but you can abuse it this way. ELKI does not need an installer, just download and run the jar (although for full functionality, you will want to work with the sources eventually - GUIs have limitations). – Has QUIT--Anony-Mousse May 28 '16 at 20:11
  • OK. In hierarchical clustering, do you mean I should set height = radius? – GeorgeOfTheRF May 28 '16 at 21:07
  • For complete linkage, height=radius is a safe bet, and height=2*radius should be valid, but you won't know the exact center. But it's not very scalable. If you enable cover trees in ELKI, the first approach will likely be *much* faster. – Has QUIT--Anony-Mousse May 28 '16 at 21:11
  • I ran ELKI on a sample of my data, but I can't figure out how to export the results with cluster IDs to CSV. Can you please help? – GeorgeOfTheRF May 29 '16 at 08:56
  • I use `-resulthandler ResultWriter` which produces CSV like files, or my own Java code. I also always add `-db.index something.CoverTree` because that improves performance a lot. – Has QUIT--Anony-Mousse May 29 '16 at 09:00
  • I am using the MiniGUI to run the clustering. I have added a screenshot. Please verify the parameters I am passing. I want radius = 500 meters. – GeorgeOfTheRF May 29 '16 at 09:04
  • Where should I enter the commands you mentioned to export the result? Where can I find the exported CSV? – GeorgeOfTheRF May 29 '16 at 09:07
  • Set `-db.index` for performance and `-resulthandler` to write to a file instead of visualizing. The `-out` or so parameter is the output directory. – Has QUIT--Anony-Mousse May 29 '16 at 11:34
  • Can you explain why complete linkage should be used, rather than centroid or average linkage, if I want a circle with radius = 500 m? Out of complete, centroid, and average, which one best suits my requirement? Please explain. – GeorgeOfTheRF Jun 05 '16 at 09:25
  • Average may clearly have points outside the 500m radius. Centroid is only to be used with squared Euclidean I believe. Otherwise the equations don't work. Complete linkage guarantees that any two points have at most this distance. If you want a hard guarantee, use this. But of course you are welcome to try the others, too! – Has QUIT--Anony-Mousse Jun 05 '16 at 10:43