I am trying to reduce the size of a spatial data set by clustering the points and finding the center point of each cluster. I followed this article (which uses DBSCAN) and it helped, but now that the data set has grown I can no longer proceed because of memory errors. So I switched to the next best thing, HDBSCAN, but I am getting some strange results.
First, I am using the following:
clusterer = hdbscan.HDBSCAN(min_samples=1, min_cluster_size=25, algorithm='prims_balltree', metric='haversine')
This does produce clusters, but when I dig into them, some are practically the same, e.g. two clusters made up of nearly identical geo-locations. I would have expected those to be a single cluster.
Second, to resolve the above problem, I tried using cluster_selection_epsilon=0.1/6371 to put geo-locations within 100m of each other in the same cluster.
clusterer = hdbscan.HDBSCAN(min_samples=5, min_cluster_size=10, metric='haversine', cluster_selection_epsilon=0.1/6371)
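To make the units in that epsilon explicit, here is a minimal sketch of the arithmetic behind 0.1/6371. It assumes a mean Earth radius of 6371 km and that the coordinates passed to HDBSCAN are [latitude, longitude] in radians, which is what the haversine metric in scikit-learn expects. The helper names (km_to_radians, haversine_radians) and the two sample points are hypothetical and only there for illustration.

```python
import math

EARTH_RADIUS_KM = 6371.0  # assumed mean Earth radius, same constant as in 0.1/6371


def km_to_radians(km):
    """Convert a surface distance in km to an angle in radians on the sphere."""
    return km / EARTH_RADIUS_KM


def haversine_radians(lat1, lon1, lat2, lon2):
    """Great-circle distance on the unit sphere; all inputs in radians."""
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    a = (math.sin(dlat / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin(dlon / 2) ** 2)
    return 2 * math.asin(math.sqrt(a))


# 100 m expressed in the units the haversine metric works in (radians)
epsilon = km_to_radians(0.1)

# Two hypothetical points 0.001 degrees of latitude apart (a bit over 100 m)
p1 = (math.radians(52.000), math.radians(13.0))
p2 = (math.radians(52.001), math.radians(13.0))

d = haversine_radians(*p1, *p2)          # distance in radians
d_m = d * EARTH_RADIUS_KM * 1000         # same distance in meters

print(f"epsilon = {epsilon:.3e} rad, distance = {d:.3e} rad ({d_m:.1f} m)")
print("within epsilon:", d <= epsilon)
```

If the input array is in degrees rather than radians, or ordered [longitude, latitude], the distances the metric computes will not correspond to meters on the ground, so an epsilon of 0.1/6371 would no longer mean 100m.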
But then I get one big cluster with over a hundred thousand points, and when plotting it with folium I found that those points are not within 100m of each other. Rather, they form separate groups of points that are more than 100m apart.
I am probably not using min_cluster_size correctly in terms of the haversine metric.
Can someone explain what's happening? How can I achieve my goal of clustering similar geo-locations and narrowing each cluster down to one center point?
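For the "one center point per cluster" step, this is roughly what I have in mind, as a minimal sketch. It assumes the cluster labels come from something like clusterer.labels_, where HDBSCAN marks noise points with -1, and that each cluster is small enough that a plain average of latitude and longitude is an acceptable center. The function name cluster_centers and the sample data are hypothetical.

```python
from collections import defaultdict


def cluster_centers(points, labels):
    """Return {label: (mean_lat, mean_lon)} for every cluster, skipping noise.

    points: iterable of (lat, lon) pairs
    labels: iterable of integer cluster labels, -1 meaning noise
    """
    # Per label: running sum of lat, running sum of lon, point count
    sums = defaultdict(lambda: [0.0, 0.0, 0])
    for (lat, lon), label in zip(points, labels):
        if label == -1:
            continue  # noise points do not belong to any cluster
        acc = sums[label]
        acc[0] += lat
        acc[1] += lon
        acc[2] += 1
    return {label: (s[0] / s[2], s[1] / s[2]) for label, s in sums.items()}


# Hypothetical data: two clusters and one noise point
points = [(10.0, 20.0), (10.2, 20.2), (50.0, 60.0), (0.0, 0.0)]
labels = [0, 0, 1, -1]

centers = cluster_centers(points, labels)
for label, (lat, lon) in sorted(centers.items()):
    print(label, round(lat, 4), round(lon, 4))
```

A plain average is only a reasonable center for compact clusters that do not straddle the antimeridian or sit near the poles. For larger clusters, picking the member point closest to the average would at least guarantee the center is a real location from the data set.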