Alternative similarity measure in DBSCAN?

Question

I am testing my image set with the DBSCAN algorithm in the scikit-learn Python module. Are there alternatives to computing the similarities like this:

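import numpy as np
from scipy.spatial import distance

# Compute similarities (X holds one feature vector per image)
D = distance.squareform(distance.pdist(X))  # pairwise Euclidean distances
S = 1 - (D / np.max(D))  # rescale to [0, 1]: 1 = identical, 0 = farthest pair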

Is there a weighted measure or something similar that I could try? Any examples?

Has QUIT--Anony-Mousse · Answer 1 · 2013-02-14T07:14:36.293

There exists a generalization of DBSCAN, known as "Generalized DBSCAN".

Actually, DBSCAN does not even need a distance, which is why it does not really make sense to compute a similarity matrix in the first place.

All you need is a predicate "getNeighbors" that returns the objects you consider to be neighbors.

See: in DBSCAN, the distance is not really used, except to test whether an object is a neighbor or not. So all you need is this boolean decision.
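
To make that concrete, here is a minimal sketch of DBSCAN driven purely by a boolean neighbor test. It is only an illustration of the idea, not the published Generalized DBSCAN algorithm and not scikit-learn's implementation; the names `generalized_dbscan`, `is_neighbor`, and `min_pts`, and the toy data, are made up for this example.

```python
from collections import deque

NOISE = -1


def generalized_dbscan(items, is_neighbor, min_pts):
    """Cluster `items` using only a boolean neighbor predicate (hypothetical helper)."""
    n = len(items)

    def get_neighbors(i):
        # The point itself counts as its own neighbor, as in classic DBSCAN.
        return [j for j in range(n) if j == i or is_neighbor(items[i], items[j])]

    labels = [None] * n  # None means "not visited yet"
    cluster = -1
    for i in range(n):
        if labels[i] is not None:
            continue
        neighbors = get_neighbors(i)
        if len(neighbors) < min_pts:
            labels[i] = NOISE  # may be relabeled later as a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = deque(neighbors)
        while queue:
            j = queue.popleft()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point reached from a core point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_neighbors = get_neighbors(j)
            if len(j_neighbors) >= min_pts:
                queue.extend(j_neighbors)  # j is a core point: expand through it
    return labels


# Toy data: integers, "neighbors" if they differ by at most 3 (an assumed rule).
values = [10, 12, 14, 50, 51, 90]
labels = generalized_dbscan(values, lambda a, b: abs(a - b) <= 3, min_pts=2)
print(labels)
```

Note that no distance value is ever stored or compared against a cluster-wide scale here; the predicate could just as well encode any application-specific rule.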

You can try the following approach: initialize the distance matrix with all 1s. For any two objects that you consider similar for your application (we can't help much with that without knowing your application and data), fill the corresponding cells with 0. Then run DBSCAN with epsilon = 0.5, and DBSCAN will treat exactly the pairs marked 0 as neighbors.
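
A rough sketch of that recipe, assuming a current scikit-learn release where `DBSCAN` accepts `metric="precomputed"`. The list `similar_pairs` is a made-up placeholder for whatever "similar" means in your application, and `min_samples=2` is chosen only so the toy example forms clusters. The diagonal is also set to 0 here so that each object counts as its own neighbor.

```python
import numpy as np
from sklearn.cluster import DBSCAN

n_objects = 6
# Hypothetical application-specific decision: index pairs considered "similar".
similar_pairs = [(0, 1), (1, 2), (0, 2), (3, 4)]

# Start with all 1s ("not neighbors"), then mark similar pairs with 0.
D = np.ones((n_objects, n_objects))
for i, j in similar_pairs:
    D[i, j] = D[j, i] = 0.0
np.fill_diagonal(D, 0.0)  # every object is a neighbor of itself

# eps=0.5 separates the 0 entries (neighbors) from the 1 entries (non-neighbors).
labels = DBSCAN(eps=0.5, min_samples=2, metric="precomputed").fit_predict(D)
print(labels)  # objects labeled -1 are noise
```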

score 0 · Answer 2 · answered Feb 13 '13 at 13:39

You can use whatever similarity matrix you like. It just needs to be based on a valid distance (symmetric, positive semi-definite).

ogrisel

I don't know any other similarity matrices. Any examples, or a list I can choose from? – postgres Feb 13 '13 at 14:55
Cosine similarity between sparse positive vectors (e.g. word frequencies), heat kernel or RBF kernel, similarities based on the l1 (Manhattan) norm instead of the Euclidean norm... – ogrisel Feb 13 '13 at 22:20
Actually no, it doesn't need to be a valid distance/metric. DBSCAN requires just binary "isNeighbor" information. There is *no* symmetry requirement, technically. You could use a random matrix and DBSCAN would still work. (The results with proper distances are usually better, though.) @postgres you need to figure out what is "similar" for your particular task! – Has QUIT--Anony-Mousse Feb 14 '13 at 07:13
Actually, the `DBSCAN` estimator wants distances, not similarities. – Fred Foo Feb 14 '13 at 11:33
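
For reference, the measures named in these comments can be sketched with SciPy as follows. Since the estimator expects distances, the RBF similarity is turned into a distance at the end. The random `X` is a stand-in for real image features, and `gamma` plus the `1 - S` conversion are assumptions made for illustration, not something prescribed in this thread.

```python
import numpy as np
from scipy.spatial import distance

rng = np.random.default_rng(0)
X = rng.random((5, 3))  # made-up stand-in for non-negative feature vectors

# Pairwise distance matrices under different metrics.
D_euclid = distance.squareform(distance.pdist(X, metric="euclidean"))
D_manhattan = distance.squareform(distance.pdist(X, metric="cityblock"))  # l1 norm
D_cosine = distance.squareform(distance.pdist(X, metric="cosine"))  # 1 - cosine similarity

# RBF (heat) kernel similarity, then one simple way to get a distance from it.
gamma = 1.0  # assumed bandwidth; would need tuning on real data
S_rbf = np.exp(-gamma * D_euclid ** 2)
D_rbf = 1.0 - S_rbf  # 0 for identical points, approaching 1 for distant ones
```

Any of these matrices could then be handed to a DBSCAN implementation that accepts precomputed distances, with epsilon chosen on the scale of that particular matrix.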

score 0 · Answer 3 · answered Aug 26 '13 at 06:41

I believe the DBSCAN estimator wants distances, not similarities. But when it comes to strings, it will need a similarity matrix, which can be as simple as one line of code that tests two strings for equality. So it is up to you how you use the similarity matrix to distinguish neighbor objects from non-neighbor objects.
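
As a small sketch of the string case, assuming exact equality is the only notion of "similar", the one-liner below builds a 0/1 matrix with NumPy broadcasting. It is written as a distance (0 = equal, 1 = different) so it matches what the estimator expects; the `words` array is invented for the example.

```python
import numpy as np

words = np.array(["cat", "dog", "cat", "bird", "dog"])  # made-up sample strings

# Compare every string with every other one: 0.0 if equal, 1.0 otherwise.
D = (words[:, None] != words[None, :]).astype(float)
print(D)
```

With a matrix like this, any epsilon strictly between 0 and 1 makes equal strings neighbors and all others non-neighbors.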
