Clustering using a custom distance metric for lat/long pairs

Question

I'm trying to specify a custom distance function for the scikit-learn DBSCAN implementation:

def geodistance(latLngA, latLngB):
    print(latLngA, latLngB)
    return vincenty(latLngA, latLngB).miles

cluster_labels = DBSCAN(
    eps=500,
    min_samples=max(2, len(found_geopoints) // 10),
    metric=geodistance
).fit(np.array(found_geopoints)).labels_

However, when I print out the arguments to my distance function they aren't at all what I would expect:

[ 0.53084126  0.19584111  0.99640966  0.88013373  0.33753788  0.79983037
  0.71716144  0.85832664  0.63559538  0.23032912]
[ 0.53084126  0.19584111  0.99640966  0.88013373  0.33753788  0.79983037
  0.71716144  0.85832664  0.63559538  0.23032912]

This is what my found_geopoints array looks like:

[[  4.24680600e+01   1.40868060e+02]
 [ -2.97677600e+01  -6.20477000e+01]
 [  3.97550400e+01   2.90069000e+00]
 [  4.21144200e+01   1.43442500e+01]
 [  8.56111000e+00   1.24771390e+02]
...

So why aren't the arguments to the distance function latitude longitude pairs?

What does `vincenty` do? Why are you looking at `len(found_geopoints)`? — Floris, May 02 '14 at 04:14
The minimum samples to form a cluster depends on the number of geopoints available. vincenty is an implementation of https://en.wikipedia.org/wiki/Vincenty's_formulae — Nathan Breit, May 02 '14 at 04:20
Do I conclude that `latLngA` and `latLngB` are identical - and both are 10 elements long? What do you know about DBSCAN? What is the total size of your `found_geopoints`? What are the units you are working in? Degrees? — Floris, May 02 '14 at 04:31
latLngA and latLngB are identical. len(found_geopoints) == 83. I'm working in degrees. I first read about DBSCAN a few days ago. — Nathan Breit, May 02 '14 at 05:27
ELKI (not python, though, but Java) has built-in support for geodetic distance, as well as full index acceleration (using R*-trees) for it. This will run in `O(n log n)` instead of `O(n^2)`. — Has QUIT--Anony-Mousse, May 02 '14 at 14:04
I have a similar issue with KNN. Did you find a resolution that allowed you to use the custom function? — user2757902, May 03 '15 at 05:04

score 4 · Answer 1 · answered May 02 '14 at 05:38

I seem to have found a workaround: compute a distance matrix with pairwise_distances (http://scikit-learn.org/stable/modules/generated/sklearn.metrics.pairwise.pairwise_distances.html), then pass it to DBSCAN(metric='precomputed').fit(distance_matrix).
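
A minimal sketch of this workaround, under some assumptions: the sample coordinates are made up, `haversine_miles` is a hypothetical stand-in for the `vincenty(...).miles` call in the question (so the example needs only NumPy and scikit-learn), and `eps=100` miles is chosen only to suit the sample data. `pairwise_distances` calls the metric on each pair of rows, so the function receives actual lat/long pairs.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.metrics import pairwise_distances

EARTH_RADIUS_MILES = 3958.8  # approximate mean Earth radius (assumption)


def haversine_miles(a, b):
    """Great-circle distance in miles between two (lat, lon) pairs in degrees.

    Hypothetical stand-in for the question's vincenty-based geodistance.
    """
    lat1, lon1, lat2, lon2 = np.radians([a[0], a[1], b[0], b[1]])
    h = (np.sin((lat2 - lat1) / 2) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_MILES * np.arcsin(np.sqrt(h))


# Made-up sample data in degrees: three points around Paris,
# two around Tokyo, and one isolated point (Sydney).
found_geopoints = np.array([
    [48.8566, 2.3522],
    [48.8049, 2.1204],
    [47.9030, 1.9093],
    [35.6762, 139.6503],
    [35.4437, 139.6380],
    [-33.8688, 151.2093],
])

# Build the full n x n distance matrix with the custom metric.
distance_matrix = pairwise_distances(found_geopoints, metric=haversine_miles)

# eps is in the same unit as the matrix entries (miles here).
cluster_labels = DBSCAN(
    eps=100,
    min_samples=2,
    metric='precomputed',
).fit(distance_matrix).labels_

print(cluster_labels)
```

The matrix holds n x n distances, so memory and time grow quadratically with the number of points. That is fine for the 83 points mentioned in the comments but may not scale to large datasets.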

Nathan Breit

eos · Answer 2 · 2016-08-03T17:58:35.170

You can do this with scikit-learn: use the haversine metric with the ball-tree algorithm, and pass radian units into the DBSCAN fit method.

This tutorial demonstrates how to cluster spatial lat-long data with scikit-learn's DBSCAN, using the haversine metric so that clusters are based on great-circle distances between lat-long points:

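import numpy as np
import pandas as pd
from sklearn.cluster import DBSCAN
df = pd.read_csv('gps.csv')
coords = df[['lat', 'lon']].to_numpy()  # as_matrix() was removed in pandas 1.0
db = DBSCAN(eps=eps, min_samples=ms, algorithm='ball_tree', metric='haversine').fit(np.radians(coords))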

Notice that the coordinates are passed into the .fit() method in radians, and that the epsilon parameter value must be in radians as well.

If you want epsilon to be, say, 1.5 km, then its value in radians is 1.5/6371, where 6371 km is the Earth's mean radius.
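
To make the unit conversion concrete, here is a small sketch that uses only the standard library. The helper names are hypothetical, and 6371 km is the mean Earth radius assumed above. `haversine_radians` implements the standard haversine formula, which yields an angle in radians on a unit sphere; that is why eps has to be expressed in radians too.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius, as assumed in the answer


def km_to_radians(distance_km):
    """Convert a surface distance in km to an angle in radians."""
    return distance_km / EARTH_RADIUS_KM


def haversine_radians(a, b):
    """Angular great-circle distance between two (lat, lon) pairs in radians."""
    lat1, lon1 = a
    lat2, lon2 = b
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * math.asin(math.sqrt(h))


eps = km_to_radians(1.5)  # 1.5 km expressed in radians

# Two made-up points 0.01 degrees of latitude apart, converted to radians.
p = (math.radians(48.85), math.radians(2.35))
q = (math.radians(48.86), math.radians(2.35))

angle = haversine_radians(p, q)
print(angle * EARTH_RADIUS_KM)  # separation converted back to km
print(angle <= eps)             # whether q lies within eps of p
```

Multiplying an angular distance by the same radius converts it back to kilometres, which is handy for sanity-checking a chosen eps.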
