I'm trying to specify a custom distance function (metric) for the scikit-learn DBSCAN implementation:

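import numpy as np
from sklearn.cluster import DBSCAN
# vincenty was removed in geopy 2.0; geopy.distance.geodesic is its replacement there
from geopy.distance import vincenty

def geodistance(latLngA, latLngB):
    # Debug output: show the arguments scikit-learn passes to the metric
    print(latLngA, latLngB)
    return vincenty(latLngA, latLngB).miles

cluster_labels = DBSCAN(
    eps=500,
    # integer division keeps min_samples an int on Python 3
    min_samples=max(2, len(found_geopoints) // 10),
    metric=geodistance
).fit(np.array(found_geopoints)).labels_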

However, when I print out the arguments to my distance function they aren't at all what I would expect:

[ 0.53084126  0.19584111  0.99640966  0.88013373  0.33753788  0.79983037
  0.71716144  0.85832664  0.63559538  0.23032912]
[ 0.53084126  0.19584111  0.99640966  0.88013373  0.33753788  0.79983037
  0.71716144  0.85832664  0.63559538  0.23032912]

This is what my found_geopoints array looks like:

[[  4.24680600e+01   1.40868060e+02]
 [ -2.97677600e+01  -6.20477000e+01]
 [  3.97550400e+01   2.90069000e+00]
 [  4.21144200e+01   1.43442500e+01]
 [  8.56111000e+00   1.24771390e+02]
...

So why aren't the arguments to the distance function latitude/longitude pairs?

Nathan Breit
  • What does `vincenty` do? Why are you looking at `len(found_geopoints)`? – Floris May 02 '14 at 04:14
  • The minimum samples to form a cluster depends on the number of geopoints available. `vincenty` is an implementation of https://en.wikipedia.org/wiki/Vincenty's_formulae – Nathan Breit May 02 '14 at 04:20
  • Do I conclude that `latLngA` and `latLngB` are identical - and both are 10 elements long? What do you know about DBSCAN? What is the total size of your `found_geopoints`? What are the units you are working in? Degrees? – Floris May 02 '14 at 04:31
  • `latLngA` and `latLngB` are identical. `len(found_geopoints) == 83`. I'm working in degrees. I first read about DBSCAN a few days ago. – Nathan Breit May 02 '14 at 05:27
  • ELKI (not python, though, but Java) has built-in support for geodetic distance, as well as full index acceleration (using R*-trees) for it. This will run in `O(n log n)` instead of `O(n^2)`. – Has QUIT--Anony-Mousse May 02 '14 at 14:04
  • I have a similar issue with KNN. Did you find a resolution that allowed you to use the custom function? – user2757902 May 03 '15 at 05:04
  • Other than using a distance matrix, I did not. – Nathan Breit May 04 '15 at 06:50

2 Answers

I seem to have found a workaround: compute a distance matrix with pairwise_distances (http://scikit-learn.org/stable/modules/generated/sklearn.metrics.pairwise.pairwise_distances.html), then pass it to DBSCAN(metric='precomputed').fit(distance_matrix).
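A minimal sketch of that workaround, under some assumptions: the sample points are made up, and a hand-written haversine function (`haversine_miles`, a hypothetical helper) stands in for `vincenty` so the example does not depend on geopy. It is less accurate than Vincenty's formulae but shows the same pattern.

```python
import math

import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.metrics import pairwise_distances

EARTH_RADIUS_MILES = 3958.8  # mean Earth radius; an assumption for this sketch


def haversine_miles(a, b):
    """Great-circle distance in miles between two (lat, lng) pairs in degrees."""
    lat1, lng1, lat2, lng2 = map(math.radians, (a[0], a[1], b[0], b[1]))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lng2 - lng1) / 2) ** 2)
    return 2 * EARTH_RADIUS_MILES * math.asin(math.sqrt(h))


# Hypothetical sample data: three points near Tokyo, two near London.
points = np.array([
    [35.68, 139.69],
    [35.44, 139.64],
    [35.61, 140.11],
    [51.51, -0.13],
    [51.75, -1.26],
])

# The callable is applied to every pair of rows, giving an (n, n) matrix in miles.
distance_matrix = pairwise_distances(points, metric=haversine_miles)

# With metric='precomputed', eps is in the same unit as the matrix (miles here).
cluster_labels = DBSCAN(eps=500, min_samples=2, metric='precomputed').fit(distance_matrix).labels_
```

Because the matrix holds all n² pairwise distances, this is practical for small inputs such as the 83 points mentioned in the comments, but memory grows quadratically with the number of points.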

Nathan Breit

You can do this with scikit-learn: use the haversine metric with the ball-tree algorithm, and pass radian units into the DBSCAN fit method.

This tutorial demonstrates how to cluster spatial lat-long data with scikit-learn's DBSCAN using the haversine metric to cluster based on accurate geodetic distances between lat-long points:

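import numpy as np
import pandas as pd
from sklearn.cluster import DBSCAN
df = pd.read_csv('gps.csv')
coords = df[['lat', 'lon']].to_numpy()  # DataFrame.as_matrix was removed in pandas 1.0
db = DBSCAN(eps=eps, min_samples=ms, algorithm='ball_tree', metric='haversine').fit(np.radians(coords))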

Notice that the coordinates are passed to the .fit() method in radians, and that the epsilon parameter value must be in radians as well.

If you want epsilon to be, say, 1.5 km, then the epsilon parameter value in radians would be 1.5/6371, where 6371 km is the Earth's mean radius.
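The unit conversion above can be sketched in plain Python. The helper names (`km_to_radians`, `central_angle`) and the two sample coordinates are hypothetical, chosen only to illustrate that a haversine distance in "radian units" is the central angle between two points, which becomes kilometres when multiplied by the Earth's radius.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius, as used in the answer


def km_to_radians(km):
    """Convert a distance along the Earth's surface to a central angle in radians."""
    return km / EARTH_RADIUS_KM


def central_angle(a, b):
    """Haversine central angle in radians between two (lat, lon) pairs given in degrees."""
    lat1, lon1, lat2, lon2 = map(math.radians, (a[0], a[1], b[0], b[1]))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * math.asin(math.sqrt(h))


eps_radians = km_to_radians(1.5)  # the eps value to pass to DBSCAN for a 1.5 km radius

# Two hypothetical points in central Paris, a little over 1 km apart.
point_a = (48.8566, 2.3522)
point_b = (48.8606, 2.3376)

angle = central_angle(point_a, point_b)
distance_km = angle * EARTH_RADIUS_KM
are_neighbours = angle <= eps_radians  # True when the points lie within 1.5 km of each other
```

Comparing angles directly, as DBSCAN does with metric='haversine', avoids converting every pairwise distance back to kilometres; only eps needs converting once.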

eos