I need to cluster a data set of lat,long coordinates. I am using Python and plan to use DBSCAN, since I don't want to have to specify the number of clusters.

The goal is to take a large data set of lat,long coordinates, each with many features attached, and return a cluster group for every point. The original database, which contains entries of the form [lat, long, feature1, feature2, ...], needs to be amended with a new field called "cluster group": [lat, long, clustergroup, feature1, feature2, ...]. This will help me identify which data points are grouped closely together without having to plot them on a map. I am hoping that outliers will be given separate group IDs and that points which are tightly clustered together will be given the same group ID.

My input to DBSCAN would be x,y coordinates, after I convert lat,long --> x,y and neglect the z coordinate. I am using:

http://scikit-learn.org/stable/modules/generated/sklearn.cluster.DBSCAN.html#sklearn.cluster.DBSCAN
http://scikit-learn.org/stable/auto_examples/index.html

I am having difficulty understanding how to set up the input for this function. Am I able to input x,y coordinates? Would this be a list of tuples? If someone could help me visualize this, it would be a great help.

Also, can you explain how DBSCAN would be different from hierarchical clustering?

Has QUIT--Anony-Mousse
bud

1 Answer

First of all, it's DBSCAN, not DB scan - it's an acronym.

DBSCAN requires dense areas to contain at least minPts objects within the eps radius (minPts is called min_samples in scikit-learn). If you choose too low a minPts value (1 or 2), the results will indeed match single-linkage hierarchical clustering cut at distance eps. So use a higher value.
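To make the effect of minPts concrete, here is a small sketch using scikit-learn's `DBSCAN`. The 1-D data (two dense groups joined by a sparse "bridge" of points) and the values `eps=0.65`, `min_samples=2` and `min_samples=5` are made up for illustration only.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Made-up 1-D data: two dense groups linked by three sparse bridge points.
group_a = [0.0, 0.1, 0.2, 0.3, 0.4]
bridge = [1.0, 1.6, 2.2]
group_b = [2.8, 2.9, 3.0, 3.1, 3.2]

# scikit-learn expects shape (n_samples, n_features), hence the reshape.
X = np.array(group_a + bridge + group_b).reshape(-1, 1)

# Low min_samples: the bridge points count as dense, so everything chains
# together, much like single-linkage clustering cut at distance eps.
labels_low = DBSCAN(eps=0.65, min_samples=2).fit_predict(X)

# Higher min_samples: the bridge is no longer dense enough to connect the groups.
labels_high = DBSCAN(eps=0.65, min_samples=5).fit_predict(X)

print("min_samples=2:", labels_low)
print("min_samples=5:", labels_high)
```

With the low setting the sparse bridge is enough to merge both groups into one cluster; with the higher setting the two dense groups come out as separate clusters and the middle bridge point is reported as noise.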

The scikit-learn implementation can use a precomputed distance matrix (metric="precomputed"). So just compute all the pairwise distances, choose the parameters, and run the function. The scikit-learn documentation is also pretty good; have you read it?
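A minimal sketch of that workflow with scikit-learn's `DBSCAN` (the class linked in the question) could look like the following. The sample coordinates, the Earth-radius constant, and the values `eps=1.0` (km) and `min_samples=3` are assumptions chosen for illustration, not recommendations. In general the input is an array-like of shape (n_samples, n_features), so a list of (x, y) tuples is accepted as well; here the great-circle distances are computed directly from lat,long instead of projecting to x,y first.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.metrics.pairwise import haversine_distances

EARTH_RADIUS_KM = 6371.0  # mean Earth radius; an approximation

# Hypothetical sample rows: (lat, long) in degrees, other features omitted.
coords_deg = np.array([
    [40.7128, -74.0060],   # three points close together
    [40.7138, -74.0050],
    [40.7120, -74.0070],
    [34.0522, -118.2437],  # a second tight group
    [34.0530, -118.2440],
    [34.0515, -118.2430],
    [51.5074, -0.1278],    # an isolated point
])

# haversine_distances expects [lat, long] in radians and returns distances
# on the unit sphere, so scale by the Earth radius to get kilometres.
dist_km = haversine_distances(np.radians(coords_deg)) * EARTH_RADIUS_KM

# eps is in the same unit as the distance matrix (km here).
db = DBSCAN(eps=1.0, min_samples=3, metric="precomputed")
labels = db.fit_predict(dist_km)

# One label per input row, in the same order as coords_deg.
print(labels)
```

The returned `labels` array lines up with the input rows, so it can be written back as the new "cluster group" field. Note that DBSCAN does not give each outlier its own group ID: every point it treats as noise gets the label -1. The full distance matrix needs memory quadratic in the number of points, so for a large data set this precomputed route may not scale.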

Has QUIT--Anony-Mousse