I have implemented the DBSCAN algorithm in R, and i am matching the cluster assignments with the DBSCAN implementation of the fpc library. Testing is done on synthetic data which is generated as given in the fpc library dbscan example:
n <- 600
x <- cbind(runif(10, 0, 10)+rnorm(n, sd=0.2), runif(10, 0, 10)+rnorm(n, sd=0.3))
Clustering is done with parameters as below:
eps = 0.2
MinPts = 5
I am comparing the cluster assignments of the fpc::dbscan
with my implementation of dbscan
. Maximum of the runs shows every point was classified identically by both implementations.
But there are some cases where 1 or 2 points and some rare times 5 or 6 points are assigned to different clusters in my implementation than that in the fpc implementation. I have noticed that only border points classification differs. After plotting i have seen that the points whose cluster membership does not match in the implementations are in such a position, such that it can be assigned to any of its surrounding clusters, depending on from which cluster's seed point it was discovered first.
I am showing an image with 150 points (to avoid clutter), where 1 point classification differs. Note that mismatch point cluster number is always greater in my implementation than the fpc implementation.
Plot of clusters.
Top inset is fpc::dbscan, bottom inset is my dbscan implementation
Note The point which differs in my implementation is marked with an exclamation mark (!) I am also uploading zoomed images of the mismatch section:
My dbscan implementation output
+
are core points
o
are border points
-
are noise points
!
highlights the differing point
fpc::dbscan implementation output
triangles are core points
coloured circles are border points
black circles are noise points
Another example:
My dbscan implementation output
fpc::dbscan implementation output
EDIT
Equal x-y scaled example
As requested by Anony-Mousse
In different cases sometimes it seems that my implementation has classified the mismatch point correctly and sometimes it seems fpc implementation has classified the mismatch correctly. See below:
fpc::dbscan (with the triangle plot ones) seems to have classified the mismatch point correctly
my dbscan implementation (with + plot ones) seems to have classified the mismatch point correctly
Question
I am new into cluster analysis therefore i have another question: is these type of difference allowable?
In my implementation i am scanning from the first point to the last point as it is supplied, also in
fpc::dbscan
the points are scanned in the same order. In such case both of the implementation should have discovered the mismatch point (marked by!
) from the same cluster center. Also i have generates some cases in whichfpc::dbscan
marks a point as noise, but my implementation assigns it to some clusters. In this case why is this difference occurring?
Code segments on request.