I have implemented the DBSCAN algorithm in R, and I am comparing its cluster assignments with those of the DBSCAN implementation in the fpc library. Testing is done on synthetic data generated as in the fpc library's dbscan example:

    n <- 600
    x <- cbind(runif(10, 0, 10)+rnorm(n, sd=0.2), runif(10, 0, 10)+rnorm(n, sd=0.3))

Clustering is done with parameters as below:

    eps = 0.2
    MinPts = 5

I am comparing the cluster assignments of fpc::dbscan with those of my implementation. In most runs, both implementations classify every point identically.
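A comparison of this kind can be sketched in a few lines. The two label vectors below are made-up stand-ins, not real output from either implementation, and the sketch assumes both use 0 for noise and number clusters in the same order:

```r
# Hypothetical label vectors for six points; 0 is assumed to mean noise.
mine <- c(1, 1, 2, 2, 0, 3)   # labels from my implementation (made up)
ref  <- c(1, 1, 2, 1, 0, 3)   # labels from the reference run (made up)

# Indices of the points whose labels disagree.
mismatch <- which(mine != ref)

# Side-by-side view of the disagreeing points only.
data.frame(index = mismatch, mine = mine[mismatch], ref = ref[mismatch])
```

If the two implementations numbered their clusters differently, a cross-tabulation such as `table(mine, ref)` would be a safer comparison than element-wise equality.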

However, in some cases 1 or 2 points, and on rare occasions 5 or 6 points, are assigned to a different cluster by my implementation than by the fpc implementation. I have noticed that only the classification of border points differs. After plotting, I saw that each mismatched point lies in a position where it could be assigned to any of its surrounding clusters, depending on which cluster's seed point discovers it first.

Below is an image with 150 points (to avoid clutter) in which the classification of 1 point differs. Note that the cluster number of a mismatched point is always greater in my implementation than in the fpc implementation.

Plot of clusters. Top inset is fpc::dbscan, bottom inset is my dbscan implementation.

Note: the point that differs in my implementation is marked with an exclamation mark (!). I am also uploading zoomed images of the mismatch section:


My dbscan implementation output

  • + are core points
  • o are border points
  • - are noise points
  • ! highlights the differing point


fpc::dbscan implementation output

  • triangles are core points
  • coloured circles are border points
  • black circles are noise points


Another example:

My dbscan implementation output


fpc::dbscan implementation output


EDIT

Equal x-y scaled example

As requested by Anony-Mousse

In different cases, sometimes my implementation seems to have classified the mismatched point correctly, and sometimes the fpc implementation seems to have done so. See below:

fpc::dbscan (the plot with triangles) seems to have classified the mismatched point correctly

my dbscan implementation (the plot with + symbols) seems to have classified the mismatched point correctly

Question

  • I am new to cluster analysis, so I have another question: is this type of difference acceptable?

  • In my implementation I scan from the first point to the last in the order supplied, and fpc::dbscan scans the points in the same order. In that case both implementations should have discovered the mismatched point (marked by !) from the same cluster center. I have also generated some cases in which fpc::dbscan marks a point as noise but my implementation assigns it to a cluster. Why does this difference occur?

Code segments on request.

phoxis
  • Can you show an example where fpc marks a point as noise that shouldn't? Try to scale the plots such that x and y have the same scale, this makes the distances more intuitive. – Has QUIT--Anony-Mousse Jun 02 '12 at 08:40
  • I have added an equal x-y scaled image of a new run – phoxis Jun 02 '12 at 09:00
  • There also seems to be a black cluster in the plot next to the cyan one. So two of the points there might look like noise points on the left, but actually are border points of the black cluster (the black triangle seems to be a core point). – Has QUIT--Anony-Mousse Jun 02 '12 at 09:07
  • The sixth image from the top is ambiguous, as the fpc plot uses black to mark both the cluster and noise. Yes, the triangle is a core point. In this image, use the right plot to identify the noise points. – phoxis Jun 02 '12 at 09:15
  • So for the two new plots, in my opinion both are correct. The mismatches are border points, and they may be assigned to any nearby cluster. There is no tie-breaking rule in DBSCAN that says that border points must be assigned to the nearest core point (plus, there could be two equally far). This could easily be done in a postprocessing step. – Has QUIT--Anony-Mousse Jun 02 '12 at 09:18
  • Yes, I agree on this issue. But I have confirmed that fpc scans the points in the same order as I do, so if a point is classified as a border point of the i-th cluster in fpc, it should also be classified as a border point of the i-th cluster in my implementation. Also, let me know whether classifying noise as border points, or vice versa, is acceptable (I don't think so). – phoxis Jun 02 '12 at 09:26
  • DBSCAN results should agree on the noise/border/core status of points; they may only vary in the cluster assignments of border points. Do you have an example where a border point becomes noise? As for `fpc::dbscan`, it looks to me as if it overwrites cluster assignments, thus keeping points in the last cluster found? At least that's how I read the `fpc` source code, but I'm not an R expert. `cv[reachables] <- cn` says "overwrite cluster assignment for all reachable points" (eventually stealing them from other clusters) to me. – Has QUIT--Anony-Mousse Jun 02 '12 at 09:28
  • Probably this is what is happening. The mismatched points always have higher cluster values in my implementation than in the fpc implementation. Therefore it seems to be the other way round, i.e. fpc keeps the first assignment and my code keeps the last one. I started with R less than a week ago, and it would be strange, because I coded it not to change assignments, so possibly I misinterpreted something. Probably this is the issue; I will have a look at my R code. – phoxis Jun 02 '12 at 09:40
  • Actually I have the impression that your code might be closer to the literal DBSCAN. But this difference is not relevant, you should get the same result by processing the data set backwards. So it is just a matter of choice whether to keep the first or last. – Has QUIT--Anony-Mousse Jun 02 '12 at 09:52
  • let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/12058/discussion-between-phoxis-and-anony-mousse) – phoxis Jun 02 '12 at 09:55

1 Answer

DBSCAN is known to be order dependent for border points: each is assigned to the cluster it is first discovered from. A border point is not dense itself, but if it lies in the vicinity of two dense points from different clusters, it can be assigned to either.

This is why DBSCAN is often described as "order independent, except for border points".

Try shuffling the data (or reversing!), then rerunning your algorithm. The results should change.
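The effect of reversing can be reproduced on a tiny hand-made data set. The following is a minimal base R sketch, not the fpc code or your code: `dbscan_sketch` is a hypothetical function that keeps the first assignment, counts a point as its own neighbour, and uses 0 for noise. The ten 1-D points, `eps = 1` and `min_pts = 4` are made up so that the point at 1.8 is a border point of both clusters.

```r
# Minimal DBSCAN sketch with a "keep first assignment" policy (hypothetical).
dbscan_sketch <- function(x, eps, min_pts) {
  d <- as.matrix(dist(x))                      # all pairwise distances, O(n^2)
  n <- nrow(d)
  # eps-neighbourhood of every point, including the point itself
  nbrs <- lapply(seq_len(n), function(i) unname(which(d[i, ] <= eps)))
  is_core <- lengths(nbrs) >= min_pts
  label <- integer(n)                          # 0 = noise or not yet assigned
  cn <- 0L
  for (i in seq_len(n)) {
    if (label[i] != 0L || !is_core[i]) next    # only unassigned core points seed
    cn <- cn + 1L
    label[i] <- cn
    queue <- i
    while (length(queue) > 0) {
      p <- queue[1]
      queue <- queue[-1]
      if (!is_core[p]) next                    # border points are not expanded
      new <- nbrs[[p]][label[nbrs[[p]]] == 0L] # never overwrite a label
      label[new] <- cn
      queue <- c(queue, new)
    }
  }
  label
}

# Two dense groups, a shared border point at 1.8, and a noise point at 10.
x <- c(0, 0.3, 0.6, 0.9, 1.8, 2.7, 3.0, 3.3, 3.6, 10)

fwd <- dbscan_sketch(x, eps = 1, min_pts = 4)            # scan first to last
bwd <- rev(dbscan_sketch(rev(x), eps = 1, min_pts = 4))  # scan last to first

fwd  # 1 1 1 1 1 2 2 2 2 0: the point at 1.8 joins the left group
bwd  # 2 2 2 2 1 1 1 1 1 0: the point at 1.8 joins the right group
```

The cluster numbers swap because clusters are numbered in discovery order. The only change in membership is the border point at 1.8; the core points and the noise point are grouped the same way in both runs.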

As I assume neither your implementation nor the fpc one has index support (to speed up range queries and make the algorithm run in O(n log n)), I'd guess that one of the implementations is processing the points in forward order, the other one in backward order. **Update: indexes should not play much of a role, as they don't change the order across clusters, only within one cluster.**

Another option for "generating" this difference is to

  • keep the first (non-noise) cluster assignment of each point (IIRC official DBSCAN pseudocode)
  • keep the last cluster assignment of each point (fpc::dbscan seems to do this)

These will also generate different results on objects that are border points to more than one cluster. There is also the possibility of assigning these points to both clusters, which yields a non-strict partitioning of the data set. Usually, the benefits of having a strict partitioning are more important than having a fully deterministic result.
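The three options above (keep first, keep last, assign to every cluster) can be illustrated for a single border point. This is only a sketch of the policy, not a full DBSCAN: `assign_border` is a hypothetical helper, and it assumes cluster ids are numbered in the order the clusters are discovered, so the smallest id is the first assignment and the largest is the last.

```r
# Hypothetical helper: pick the cluster(s) for one border point, given the
# cluster ids of the core points inside its eps-neighbourhood.
assign_border <- function(core_clusters, policy = c("first", "last", "all")) {
  policy <- match.arg(policy)
  ids <- sort(unique(core_clusters))  # ids assumed to follow discovery order
  switch(policy,
         first = ids[1],              # keep the first assignment
         last  = ids[length(ids)],    # overwrite, so the last cluster wins
         all   = ids)                 # non-strict partitioning
}

# A border point with core neighbours in clusters 1 and 3:
assign_border(c(1, 3), "first")  # 1
assign_border(c(1, 3), "last")   # 3
assign_border(c(1, 3), "all")    # 1 3
```

Under this assumption, an implementation that overwrites always reports the higher cluster number for a contested border point.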

Don't get me wrong: the "overwrite" strategy of fpc::dbscan doesn't substantially change the results. I would probably even implement it that way myself.

Are any non-border points affected?
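One way to check this is to compute the core/border/noise status separately from the cluster labels, since the status depends only on eps and MinPts and not on the scan order. A rough base R sketch follows. `point_status` is a hypothetical helper, the seed is arbitrary, the data follow the recipe from the question with 150 points, and MinPts is assumed to count the point itself.

```r
# Hypothetical helper: core/border/noise status of each point.
point_status <- function(x, eps, min_pts) {
  close_by <- as.matrix(dist(x)) <= eps        # TRUE on the diagonal (self)
  is_core <- rowSums(close_by) >= min_pts
  # a non-core point is a border point if any core point lies within eps
  near_core <- rowSums(close_by[, is_core, drop = FALSE]) > 0
  unname(ifelse(is_core, "core", ifelse(near_core, "border", "noise")))
}

set.seed(1)  # arbitrary seed, only for reproducibility
n <- 150
x <- cbind(runif(10, 0, 10) + rnorm(n, sd = 0.2),
           runif(10, 0, 10) + rnorm(n, sd = 0.3))

status <- point_status(x, eps = 0.2, min_pts = 5)

# Shuffle the rows and recompute: each point keeps its status.
perm <- sample(n)
shuffled <- point_status(x[perm, ], eps = 0.2, min_pts = 5)
identical(shuffled, status[perm])  # TRUE
table(status)
```

If two implementations disagree on this status for any point, the cause is something other than scan order, for example a different neighbourhood query or a different MinPts convention.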

Has QUIT--Anony-Mousse
  • So far I have done many runs and no seed points were affected, but once I saw noise points affected in my implementation. I am processing the points in forward order and, as far as I have checked the fpc code, it also processes them in forward order. For the epsilon-neighborhood query I am using the `nn2` function from the RANN library, and I am trying to improve the processing so that it becomes O(n lg n) (it is not right now). Can you suggest how to improve the eps-neighborhood queries? – phoxis Jun 02 '12 at 08:32
  • I don't use R, so I can't help you there. I'm an ELKI user, and it has various indexes that will automatically be used when I do a range or NN query. How about assigning points to clusters, do you *overwrite* the assignment or *keep* it when it is anything except unassigned and noise? This can also lead to this difference, even when processing in the same order. Because border points can be assigned to clusters more than once. – Has QUIT--Anony-Mousse Jun 02 '12 at 08:39
  • I am assigning all center and border points only once. Noise points are being reassigned. My implementation strictly follows the logic in the original paper. – phoxis Jun 02 '12 at 08:54
  • Does fpc use the same logic? Oh, and from the documentation, `nn2` is only approximate, so it may miss some neighbors? – Has QUIT--Anony-Mousse Jun 02 '12 at 08:57
  • I am using the `eps = 0.0` parameter in `nn2`, therefore it is exact. I have also run these tests with my manual eps-neighborhood calculation in R (which is terribly costly), which gives the same result. – phoxis Jun 02 '12 at 09:04
  • I also want to note that in these examples my implementation's mismatched classifications look visually wrong, but there are other examples where the mismatches in fpc dbscan look wrongly classified. Have a look at the images in the post update to see what I mean. – phoxis Jun 02 '12 at 09:05
  • When I stopped overwriting cluster assignments in my implementation and kept only the first one, fpc and my implementation gave identical results every time. – phoxis Jun 02 '12 at 10:27