I have a big data problem and very limited experience with parallel processing and big data tooling. I have hundreds of millions of rows of latitude/longitude data, keyed by an ID column; each ID can have anywhere from 10,000 to 10 million rows.
I am implementing the density-based clustering algorithm DBSCAN to meet a business requirement. The clustering runs independently for each ID.
Current Implementation
The current implementation is plain Python using the scikit-learn machine learning library, but it takes a day or more to run the clustering plus the other business logic on approximately 50 million data points (the current code is sketched below).
I could optimize the Python code and shave off some time, but I am looking for a solution that scales more fundamentally.
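Stripped down, the current code is a serial per-ID loop roughly like this (the file path, column names, and eps/min_samples values are placeholders, not my real settings):

```python
import pandas as pd
from sklearn.cluster import DBSCAN

# Placeholder input with columns: id, lat, lon
df = pd.read_csv("points.csv")

results = []
for device_id, group in df.groupby("id"):
    coords = group[["lat", "lon"]].to_numpy()
    # eps / min_samples are illustrative, not tuned values
    labels = DBSCAN(eps=0.01, min_samples=10).fit_predict(coords)
    results.append(group.assign(cluster=labels))

clustered = pd.concat(results)
```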
Availability
I have a Spark cluster distributed across roughly 20 machines, but PySpark has no implementation of DBSCAN. Some searching turned up Scala implementations, but they seem less reliable. The links from my search:
https://github.com/irvingc/dbscan-on-spark
DBSCAN on Spark: which implementation
Since all my code is written in Python, I would like to stick with a more Pythonic solution.
As I mentioned, the clustering runs independently for each ID (device), so one way to cut the runtime is to farm the per-ID computations out in parallel to all 20 machines; that should give me at least a 20x speedup. But I have no idea how to achieve this. All I can think of is MapReduce, along the lines of the sketch below.
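To make the question concrete, here is a minimal sketch of what I have in mind: use Spark only to group the rows by ID, then run plain scikit-learn DBSCAN on each group locally on whichever executor it lands on. This assumes scikit-learn is installed on every worker and that any single ID's points fit in one executor's memory; the input path, column names, and eps/min_samples values are placeholders.

```python
import numpy as np
from pyspark.sql import SparkSession
from sklearn.cluster import DBSCAN

spark = SparkSession.builder.appName("per-id-dbscan").getOrCreate()

# Placeholder input with columns: id, lat, lon
points = spark.read.parquet("points.parquet")

def cluster_one_id(pair):
    # pair = (id, iterable of Rows): all points for one ID arrive here
    device_id, rows = pair
    coords = np.array([(r.lat, r.lon) for r in rows])
    # eps / min_samples are illustrative, not tuned values
    labels = DBSCAN(eps=0.01, min_samples=10).fit_predict(coords)
    return [(device_id, float(lat), float(lon), int(label))
            for (lat, lon), label in zip(coords, labels)]

clustered = (points.rdd
             .map(lambda r: (r.id, r))  # key every row by its ID
             .groupByKey()              # collect each ID's points on one executor
             .flatMap(cluster_one_id))  # run sklearn DBSCAN per ID, locally

result = spark.createDataFrame(clustered, ["id", "lat", "lon", "cluster"])
```

The obvious caveat is that groupByKey ships all of an ID's points to a single executor, so the largest IDs (up to ~10 million points) would still cluster single-threaded; if those skewed IDs are the real bottleneck, advice on handling them is part of what I am asking.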
I am open to any solution that is more robust. Any help would be greatly appreciated.