
I was trying to use the DBSCAN algorithm from the scikit-learn library with the cosine metric, but got stuck on an error. The line of code is

db = DBSCAN(eps=1, min_samples=2, metric='cosine').fit(X)    

where X is a csr_matrix. The error is the following:

Metric 'cosine' not valid for algorithm 'auto',

though the documentation says that this metric is supported. I tried the options algorithm='kd_tree' and algorithm='ball_tree', but got the same error. However, there is no error if I use the euclidean or, say, the l1 metric.

The matrix X is large, so I can't use a precomputed matrix of pairwise distances.

I use Python 2.7.6 and scikit-learn 0.16.1. My dataset doesn't have any all-zero rows, so the cosine metric is well-defined.

  • This is arguably a bug in sklearn, frankly. Cosine similarity isn't a metric: it doesn't obey the triangle inequality, which is why it won't work with a KD-tree and you have no choice but to brute-force it. All of which raises the question of why, when you set algorithm to 'auto', it attempts to use a method it should know it can't use. – Adam Acosta Nov 17 '15 at 05:33
  • @AdamAcosta: If I understand correctly, you're arguing that the `'auto'` `algorithm`-keyword should use `'brute'` rather than try and fail with `'ball_tree'`? (I'd agree.) – Nikana Reklawyks Jan 19 '17 at 23:58

2 Answers


The indexes in sklearn cannot accelerate the cosine metric (this may change in newer versions).

Try algorithm='brute'.
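
A minimal sketch of this workaround, reusing the setup from the question (the eps value here is illustrative and would need tuning for real data):

from sklearn.cluster import DBSCAN

# Brute-force neighbor search works with any supported metric, including cosine
db = DBSCAN(eps=0.3, min_samples=2, metric='cosine', algorithm='brute').fit(X)
labels = db.labels_  # cluster labels per sample; -1 marks noise points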

For a list of metrics that your version of sklearn can accelerate, see the supported metrics of the ball tree:

from sklearn.neighbors.ball_tree import BallTree

# List the distance metrics the ball tree index can accelerate
print(BallTree.valid_metrics)
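
In the versions discussed here, 'cosine' does not appear in that list, which is why the tree-based algorithms reject it and 'brute' remains the only option.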
  • Thanks! Now it works. At first it gave me an error because I used `np.float32` instead of `np.double` for my dataset. I suppose DBSCAN requires such precision for the cosine metric, since the latter has a small range (between 0 and 1). – cheyp Sep 23 '15 at 18:27
  • That should not be necessary in general, but the sklearn implementation may have such limitations. – Has QUIT--Anony-Mousse Sep 23 '15 at 19:53
  • As of today (October 2019) the 'brute' algorithm does not work, but the 'generic' one does. As noted before, the .fit method needs double precision – aless80 Oct 03 '19 at 12:32

If you want a normalized distance like the cosine distance, you can also normalize your vectors first and then use the euclidean metric. Notice that for two unit vectors u and v, the squared euclidean distance is ||u - v||^2 = ||u||^2 + ||v||^2 - 2*u·v = 2 - 2*cos(u, v), so the euclidean distance equals sqrt(2 - 2*cos(u, v)) (see this discussion).

You can hence do something like:

import numpy as np

# Normalize each row of X to unit length, then cluster with the euclidean metric
Xnorm = np.linalg.norm(X, axis=1)
Xnormed = np.divide(X, Xnorm.reshape(Xnorm.shape[0], 1))
db = DBSCAN(eps=0.5, min_samples=2, metric='euclidean').fit(Xnormed)

The distances will lie in [0, 2], so make sure you adjust your parameters accordingly.
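
For example, if eps_cos were the threshold you would have used with the cosine metric, a sketch of the conversion, together with a sparse-friendly normalization for the csr_matrix from the question (eps_cos is a hypothetical value; sklearn.preprocessing.normalize handles sparse input), might look like:

import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.preprocessing import normalize

eps_cos = 0.25                     # hypothetical cosine-distance threshold
eps_euclid = np.sqrt(2 * eps_cos)  # since d_euclid = sqrt(2 * d_cos) on unit vectors

# normalize() accepts scipy sparse matrices, unlike np.linalg.norm on a csr_matrix
Xnormed = normalize(X, norm='l2')
db = DBSCAN(eps=eps_euclid, min_samples=2, metric='euclidean').fit(Xnormed)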

  • Could you expand a little more on why DBSCAN with euclidean distance on normalized vectors would yield the same result as with the cosine distance directly, if that is the case? In particular, what's with the squaring/square root, and does it matter that cosine really measures *similarity* and not distance (the distance is `1 - cos(.,.)`)? – Nikana Reklawyks Jan 20 '17 at 00:12
  • For instance, if you know that `eps` should be set to `x` with the cosine distance, then it should be set to `sqrt(2x)` when using DBSCAN with the euclidean metric on normalized vectors. And, if such is the data, does the sklearn indexing still deliver its intended speed-up? – Nikana Reklawyks Jan 20 '17 at 00:28