5

I have objects and a distance function, and want to cluster them using the DBSCAN implementation in scikit-learn. My objects don't have a representation in Euclidean space. I know that it is possible to use a precomputed metric, but in my case that is very impractical because the distance matrix would be too large. Is there any way to overcome this in scikit-learn? Or are there other Python implementations of DBSCAN that can do this?

Has QUIT--Anony-Mousse
Sergey Sosnin
  • 2
    Why don't you want to use metric parameter in constructor? – Ibraim Ganiev Oct 05 '15 at 11:41
  • 1
    Following @Olologin's comment, the `metric` parameter in the constructor of DBSCAN accepts either a string (for an already implemented distance) or a callable (a function that, given two elements, returns a distance measure). Write your own and initialize DBSCAN with `metric=my_func`. – Imanol Luengo Oct 05 '15 at 15:00

2 Answers

8

scikit-learn has support for a large variety of metrics.

Some of them can be accelerated using the k-d tree (very fast) or the ball tree (fast). Others rely on precomputed distance matrices (fast, but needs a lot of memory), on Cython implementations without precomputation (quadratic runtime), or even on Python callbacks (very slow).

This last option is implemented, but extremely slow:

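import numpy
from sklearn.cluster import DBSCAN

def mydistance(x, y):
    # Euclidean distance as a plain Python callback (sqrt added so it matches metric='euclidean')
    return numpy.sqrt(numpy.sum((x - y) ** 2))

labels = DBSCAN(eps=eps, min_samples=minpts, metric=mydistance).fit_predict(X)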

is, unfortunately, much slower than

labels = DBSCAN(eps=eps, min_samples=minpts, metric='euclidean').fit_predict(X)
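For objects that have no vector representation at all, one common workaround is to cluster the object indices and let the callable look the objects up. The following is only a minimal sketch under assumptions: the `objects` list, the toy `hamming_like` distance, and the `eps`/`min_samples` values are all hypothetical stand-ins for your own data and distance function.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Hypothetical objects with no Euclidean representation: plain strings.
objects = ["apple", "apply", "ample", "banana", "bandana", "zebra"]

def hamming_like(a, b):
    """Toy distance (assumption): mismatched positions plus length difference."""
    mismatches = sum(ca != cb for ca, cb in zip(a, b))
    return mismatches + abs(len(a) - len(b))

def index_metric(i, j):
    # scikit-learn passes one-element numeric rows; map them back to objects.
    return hamming_like(objects[int(i[0])], objects[int(j[0])])

# The "data" handed to DBSCAN is just a column of indices 0..n-1.
X = np.arange(len(objects)).reshape(-1, 1)

labels = DBSCAN(
    eps=2.0, min_samples=2, metric=index_metric, algorithm="brute"
).fit_predict(X)
print(labels)
```

This avoids storing a full distance matrix, but every distance still goes through the Python interpreter, so the speed caveat about Python callbacks applies unchanged.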

I found ELKI to perform much better when you need to use your own distance functions. Java can compile them to near-native speed using the HotSpot JIT compiler. Python (currently) cannot do this.
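On the memory cost of a precomputed matrix: DBSCAN with `metric='precomputed'` also accepts a sparse neighborhood graph, so only the distances within `eps` need to be stored. Below is a rough sketch, assuming toy 2-D data and the Euclidean `mydistance` callback as placeholders for real objects and a real distance function. The graph is only small when `eps` is small relative to the data, and building it is still slow with a Python callback.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.neighbors import NearestNeighbors

# Toy data (assumption): two well-separated blobs of 50 points each.
rng = np.random.RandomState(0)
X = np.vstack([rng.normal(0, 0.1, (50, 2)), rng.normal(5, 0.1, (50, 2))])
eps, minpts = 0.5, 5

def mydistance(x, y):
    # Placeholder for a custom distance function.
    return np.sqrt(np.sum((x - y) ** 2))

# Sparse CSR graph holding only the pairwise distances that are <= eps.
nn = NearestNeighbors(radius=eps, metric=mydistance).fit(X)
graph = nn.radius_neighbors_graph(X, mode="distance")

# DBSCAN treats only the stored entries of the sparse graph as neighbors.
labels = DBSCAN(eps=eps, min_samples=minpts, metric="precomputed").fit_predict(graph)
print(graph.shape, graph.nnz, set(labels))
```

The same kind of sparse matrix could also be assembled by hand (for example with `scipy.sparse`) if the neighbors within `eps` are found by some other index structure.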

Has QUIT--Anony-Mousse
  • I don't think DBSCAN will work with user defined metric such as 'mydistance'. The documentation says: If metric is a string or callable, it must be one of the options allowed by metrics.pairwise.calculate_distance for its metric parameter. – vdesai Sep 03 '16 at 06:38
  • It works (otherwise, why would they mention "callable"). I used it. It's just really slow because of the python interpreter compared to the Cython metrics. – Has QUIT--Anony-Mousse Sep 03 '16 at 09:34
0

I wrote my own distance function following the top answer and, just as it says, it was extremely slow; the built-in distance code was much faster. I'm wondering how to speed it up.

Terence Yang