I have a big data problem and very limited experience with parallel processing and big data tooling. I have hundreds of millions of rows of latitude/longitude data, keyed by an ID column; each ID can have anywhere from 10,000 to 10 million rows.
I am implementing the density-based clustering algorithm DBSCAN to meet a business requirement. The clustering runs independently for each ID.
Current Implementation
The current implementation is plain Python using the scikit-learn machine learning library, but it takes a day or more to run the clustering plus the other business logic on approximately 50 million data points (the current code is sketched below).
I could optimize the Python code and shave off some time, but I am looking for a solution that scales more fundamentally.
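Stripped down, the current code is a serial per-ID loop roughly like this (the file path, column names, and eps/min_samples values are placeholders, not my real settings):

```python
import pandas as pd
from sklearn.cluster import DBSCAN

# Placeholder input with columns: id, lat, lon
df = pd.read_csv("points.csv")

results = []
for device_id, group in df.groupby("id"):
    coords = group[["lat", "lon"]].to_numpy()
    # eps / min_samples are illustrative, not tuned values
    labels = DBSCAN(eps=0.01, min_samples=10).fit_predict(coords)
    results.append(group.assign(cluster=labels))

clustered = pd.concat(results)
```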
Availability
I have a Spark cluster distributed across roughly 20 machines, but PySpark has no implementation of DBSCAN. Some searching turned up Scala implementations, but they seem less reliable. The links from my search:
https://github.com/irvingc/dbscan-on-spark
DBSCAN on Spark: which implementation
Since all my code is written in Python, I would like to stick with a more Pythonic solution.
As I mentioned, the clustering runs independently for each ID (device), so one way to cut the runtime is to farm the per-ID computations out in parallel to all 20 machines; that should give me at least a 20x speedup. But I have no idea how to achieve this. All I can think of is MapReduce, along the lines of the sketch below.
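To make the question concrete, here is a minimal sketch of what I have in mind: use Spark only to group the rows by ID, then run plain scikit-learn DBSCAN on each group locally on whichever executor it lands on. This assumes scikit-learn is installed on every worker and that any single ID's points fit in one executor's memory; the input path, column names, and eps/min_samples values are placeholders.

```python
import numpy as np
from pyspark.sql import SparkSession
from sklearn.cluster import DBSCAN

spark = SparkSession.builder.appName("per-id-dbscan").getOrCreate()

# Placeholder input with columns: id, lat, lon
points = spark.read.parquet("points.parquet")

def cluster_one_id(pair):
    # pair = (id, iterable of Rows): all points for one ID arrive here
    device_id, rows = pair
    coords = np.array([(r.lat, r.lon) for r in rows])
    # eps / min_samples are illustrative, not tuned values
    labels = DBSCAN(eps=0.01, min_samples=10).fit_predict(coords)
    return [(device_id, float(lat), float(lon), int(label))
            for (lat, lon), label in zip(coords, labels)]

clustered = (points.rdd
             .map(lambda r: (r.id, r))  # key every row by its ID
             .groupByKey()              # collect each ID's points on one executor
             .flatMap(cluster_one_id))  # run sklearn DBSCAN per ID, locally

result = spark.createDataFrame(clustered, ["id", "lat", "lon", "cluster"])
```

The obvious caveat is that groupByKey ships all of an ID's points to a single executor, so the largest IDs (up to ~10 million points) would still cluster single-threaded; if those skewed IDs are the real bottleneck, advice on handling them is part of what I am asking.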
I am open to any solution that is more robust. Any help would be greatly appreciated.