I am trying to perform DBSCAN on 18 million data points, so far just 2D but hoping to go up to 6D. I have not been able to find a way to run DBSCAN on that many points. The closest I got was 1 million with ELKI and that took an hour. I have used Spark before but unfortunately it does not have DBSCAN available.
Therefore, my first question is if anyone can recommend a way of running DBSCAN on this much data, likely in a distributed way?
Next, the nature of my data is that the ~85% lies in one huge cluster (anomaly detection). The only technique I have been able to come up with to allow me to process more data is to replace a big chunk of that huge cluster with one data point in a way that it can still reach all its neighbours (the deleted chunk is smaller than epsilon).
Can anyone provide any tips whether I'm doing this right or if there is a better way to reduce the complexity of DBSCAN when you know that most data is in one cluster centered around (0.0,0.0)?