
I was trying to use the DBSCAN algorithm from the scikit-learn library with the cosine metric, but got stuck on an error. The line of code is

db = DBSCAN(eps=1, min_samples=2, metric='cosine').fit(X)    

where X is a csr_matrix. The error is the following:

Metric 'cosine' not valid for algorithm 'auto',

though the documentation says that this metric is supported. I tried the options algorithm='kd_tree' and algorithm='ball_tree', but got the same error. However, there is no error if I use the euclidean or, say, the l1 metric.

The matrix X is large, so I can't use a precomputed matrix of pairwise distances.

I use Python 2.7.6 and scikit-learn 0.16.1. My dataset doesn't have any all-zero rows, so the cosine metric is well-defined.

  • This is arguably a bug in sklearn, frankly. Cosine similarity isn't a metric: it doesn't obey the triangle inequality, which is why it won't work with a KD-tree and you have no choice but to brute-force it. All of which raises the question of why, when you set algorithm to 'auto', it attempts to use a method it should know it can't use. – Adam Acosta Nov 17 '15 at 05:33
  • @AdamAcosta: If I understand correctly, you're arguing that the `'auto'` `algorithm`-keyword should use `'brute'` rather than try and fail with `'ball_tree'`? (I'd agree.) – Nikana Reklawyks Jan 19 '17 at 23:58

2 Answers


The indexes in sklearn cannot accelerate the cosine metric (this may change in newer versions).

Try algorithm='brute'.
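
A minimal sketch of this workaround, reusing the setup from the question (the eps value here is illustrative and would need tuning for real data):

from sklearn.cluster import DBSCAN

# Brute-force neighbor search works with any supported metric, including cosine
db = DBSCAN(eps=0.3, min_samples=2, metric='cosine', algorithm='brute').fit(X)
labels = db.labels_  # cluster labels per sample; -1 marks noise points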

For a list of metrics that your version of sklearn can accelerate, see the supported metrics of the ball tree:

from sklearn.neighbors.ball_tree import BallTree

# List the distance metrics the ball tree index can accelerate
print(BallTree.valid_metrics)
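
In the versions discussed here, 'cosine' does not appear in that list, which is why the tree-based algorithms reject it and 'brute' remains the only option.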
  • Thanks! Now it works. At first it gave me an error because I used `np.float32` instead of `np.double` for my dataset. I suppose DBSCAN requires such precision for the cosine metric, since the latter has a small range (between 0 and 1). – cheyp Sep 23 '15 at 18:27
  • That should not be necessary in general, but the sklearn implementation may have such limitations. – Has QUIT--Anony-Mousse Sep 23 '15 at 19:53
  • As of today (October 2019) the 'brute' algorithm does not work, but the 'generic' one does. As noted before, the .fit method needs double precision – aless80 Oct 03 '19 at 12:32

If you want a normalized distance like the cosine distance, you can also normalize your vectors first and then use the euclidean metric. Notice that for two unit vectors u and v, the squared euclidean distance is ||u - v||^2 = ||u||^2 + ||v||^2 - 2*u·v = 2 - 2*cos(u, v), so the euclidean distance equals sqrt(2 - 2*cos(u, v)) (see this discussion).

You can hence do something like:

import numpy as np

# Normalize each row of X to unit length, then cluster with the euclidean metric
Xnorm = np.linalg.norm(X, axis=1)
Xnormed = np.divide(X, Xnorm.reshape(Xnorm.shape[0], 1))
db = DBSCAN(eps=0.5, min_samples=2, metric='euclidean').fit(Xnormed)

The distances will lie in [0, 2], so make sure you adjust your parameters accordingly.
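
For example, if eps_cos were the threshold you would have used with the cosine metric, a sketch of the conversion, together with a sparse-friendly normalization for the csr_matrix from the question (eps_cos is a hypothetical value; sklearn.preprocessing.normalize handles sparse input), might look like:

import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.preprocessing import normalize

eps_cos = 0.25                     # hypothetical cosine-distance threshold
eps_euclid = np.sqrt(2 * eps_cos)  # since d_euclid = sqrt(2 * d_cos) on unit vectors

# normalize() accepts scipy sparse matrices, unlike np.linalg.norm on a csr_matrix
Xnormed = normalize(X, norm='l2')
db = DBSCAN(eps=eps_euclid, min_samples=2, metric='euclidean').fit(Xnormed)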

  • Could you expand a little more on why DBSCAN with euclidean distance on normalized vectors would yield the same result as with the cosine distance directly, if that is the case? In particular, what's with the squaring/square root, and does it matter that cosine really measures *similarity* and not distance (the distance is `1 - cos(.,.)`)? – Nikana Reklawyks Jan 20 '17 at 00:12
  • For instance, if you know that `eps` should be set to `x` with the cosine distance, then it should be set to `sqrt(2x)` when using DBSCAN with the euclidean metric on normalized vectors. And, if such is the data, does the sklearn indexing still deliver its intended speed-up? – Nikana Reklawyks Jan 20 '17 at 00:28