Parallel DBSCAN in ELKI

Question

Here I can see that a class clustering.gdbscan.parallel.ParallelGeneralizedDBSCAN exists, but when I tried to invoke it, I got an error:

java -cp elki.jar de.lmu.ifi.dbs.elki.application.KDDCLIApplication -algorithm clustering.gdbscan.parallel.ParallelGeneralizedDBSCAN  -algorithm.distancefunction EuclideanDistanceFunction -dbc.in infile.txt -dbscan.epsilon 1.0 -dbscan.minpts 1 -verbose -out OUTFOLDER

Class 'clustering.gdbscan.parallel.ParallelGeneralizedDBSCAN' not found for given value. Must be a subclass / implementation of de.lmu.ifi.dbs.elki.algorithm.Algorithm

And this class is indeed absent from the list of available classes that was printed with the error message:

-> clustering.CanopyPreClustering
-> clustering.DBSCAN
-> clustering.affinitypropagation.AffinityPropagationClusteringAlgorithm
-> clustering.em.EM
-> clustering.gdbscan.GeneralizedDBSCAN
-> clustering.gdbscan.LSDBC
-> clustering.GriDBSCAN
-> clustering.hierarchical.extraction.HDBSCANHierarchyExtraction
-> clustering.hierarchical.extraction.SimplifiedHierarchyExtraction
-> clustering.hierarchical.extraction.ExtractFlatClusteringFromHierarchy
-> clustering.hierarchical.SLINK
-> clustering.hierarchical.AnderbergHierarchicalClustering
-> clustering.hierarchical.AGNES
-> clustering.hierarchical.CLINK
-> clustering.hierarchical.SLINKHDBSCANLinearMemory
-> clustering.hierarchical.HDBSCANLinearMemory
-> clustering.kmeans.KMeansSort
-> clustering.kmeans.KMeansCompare
-> clustering.kmeans.KMeansHamerly
-> clustering.kmeans.KMeansElkan
-> clustering.kmeans.KMeansLloyd
-> clustering.kmeans.parallel.ParallelLloydKMeans
-> clustering.kmeans.KMeansMacQueen
-> clustering.kmeans.KMediansLloyd
-> clustering.kmeans.KMedoidsPAM
-> clustering.kmeans.KMedoidsEM
-> clustering.kmeans.CLARA
-> clustering.kmeans.BestOfMultipleKMeans
-> clustering.kmeans.KMeansBisecting
-> clustering.kmeans.KMeansBatchedLloyd
-> clustering.kmeans.KMeansHybridLloydMacQueen
-> clustering.kmeans.SingleAssignmentKMeans
-> clustering.kmeans.XMeans
-> clustering.NaiveMeanShiftClustering
-> clustering.optics.DeLiClu
-> clustering.optics.OPTICSXi
-> clustering.optics.OPTICSHeap
-> clustering.optics.OPTICSList
-> clustering.optics.FastOPTICS
-> clustering.SNNClustering
-> clustering.biclustering.ChengAndChurch
-> clustering.correlation.CASH
-> clustering.correlation.COPAC
-> clustering.correlation.ERiC
-> clustering.correlation.FourC
-> clustering.correlation.HiCO
-> clustering.correlation.LMCLUS
-> clustering.correlation.ORCLUS
-> clustering.onedimensional.KNNKernelDensityMinimaClustering
-> clustering.subspace.CLIQUE
-> clustering.subspace.DiSH
-> clustering.subspace.DOC
-> clustering.subspace.HiSC
-> clustering.subspace.P3C
-> clustering.subspace.PreDeCon
-> clustering.subspace.PROCLUS
-> clustering.subspace.SUBCLU
-> clustering.meta.ExternalClustering
-> clustering.trivial.ByLabelClustering
-> clustering.trivial.ByLabelHierarchicalClustering
-> clustering.trivial.ByModelClustering
-> clustering.trivial.TrivialAllInOne
-> clustering.trivial.TrivialAllNoise
-> clustering.trivial.ByLabelOrAllInOneClustering
-> clustering.uncertain.FDBSCAN
-> clustering.uncertain.CKMeans
-> clustering.uncertain.UKMeans
-> clustering.uncertain.RepresentativeUncertainClustering
-> clustering.uncertain.CenterOfMassMetaClustering

I thought that perhaps this method is internal and is invoked by clustering.gdbscan.GeneralizedDBSCAN, but that runs on a single core for me. Maybe I need to add some command line parameter to enable multiprocessing?

EDIT: thanks to @erich-schubert, I can now see the time estimate. I used an M-tree index, as shown in the docs:

java -Xmx32000M -cp elki-bundle-0.7.2-SNAPSHOT.jar de.lmu.ifi.dbs.elki.application.KDDCLIApplication -algorithm clustering.gdbscan.parallel.ParallelGeneralizedDBSCAN -db.index tree.metrical.mtreevariants.mtree.MTreeFactory -treeindex.pagesize 4096 -mtree.distancefunction EuclideanDistanceFunction -algorithm.distancefunction EuclideanDistanceFunction -dbc.in dump_txt.txt -dbscan.epsilon 1.0 -dbscan.minpts 1 -verbose -out RES

I got a warning about an ignored parameter: The following parameters were not processed: [-treeindex.pagesize, 4096]

and a rather depressing time estimate, which keeps growing:

de.lmu.ifi.dbs.elki.datasource.FileBasedDatabaseConnection.load: 553728 ms
Relation does not have a dimensionality -- simulating M-tree as external index!
de.lmu.ifi.dbs.elki.index.tree.metrical.mtreevariants.mtree.MTreeIndex.directory.capacity: 200
de.lmu.ifi.dbs.elki.index.tree.metrical.mtreevariants.mtree.MTreeIndex.directory.minfill: 0
de.lmu.ifi.dbs.elki.index.tree.metrical.mtreevariants.mtree.MTreeIndex.leaf.capacity: 333
de.lmu.ifi.dbs.elki.index.tree.metrical.mtreevariants.mtree.MTreeIndex.leaf.minfill: 0
de.lmu.ifi.dbs.elki.index.tree.metrical.mtreevariants.mtree.MTreeIndex.construction: 806160 ms
Index statistics before running algorithms:
de.lmu.ifi.dbs.elki.persistent.MemoryPageFile.reads: 22344677
de.lmu.ifi.dbs.elki.persistent.MemoryPageFile.writes: 3831053
de.lmu.ifi.dbs.elki.persistent.MemoryPageFile.numpages: 17472
de.lmu.ifi.dbs.elki.index.tree.metrical.mtreevariants.mtree.MTreeIndex.height: 2
de.lmu.ifi.dbs.elki.index.tree.metrical.mtreevariants.AbstractMTree$Statistics.distancecalcs: 1773733054
de.lmu.ifi.dbs.elki.index.tree.metrical.mtreevariants.AbstractMTree$Statistics.knnqueries: 0
de.lmu.ifi.dbs.elki.index.tree.metrical.mtreevariants.AbstractMTree$Statistics.rangequeries: 0
DBSCAN clustering:     708 [  0%] 33738 min remaining

My data is 3.5M 300-dimensional word2vec vectors (float). Can I optimize it somehow to run in a reasonable time?

I use -dbscan.minpts 1 because I have simply determined the distance between vectors that corresponds to similarity.

EDIT2: R-tree index is a bit faster: DBSCAN clustering: 4423 [ 0%] 17248 min remaining

@Anony-Mousse The latest 0.7.1 (2016, February 11) from [https://elki-project.github.io/releases/](https://elki-project.github.io/releases/) – Slowpoke Jan 23 '18 at 11:04
@Anony-Mousse OK, I built the jar from the GitHub repository and it works. Unfortunately, it does not output any progress indication (though all the cores are loaded, so it is running), so I can't estimate whether or when it will finish – Slowpoke Jan 23 '18 at 14:04

Erich Schubert · Accepted Answer · 2018-01-24T12:32:22.287

The parallel DBSCAN version is not in the 0.7.1 release; you need to compile it yourself.

It currently does not include progress logging, and it is a rather naive parallelization. It works okay as long as the majority of the time is spent in neighbor search, because the cluster labeling is synchronized. (But if all your cores are loaded, synchronization should be fine.)

I just pushed a change that adds progress logging to Parallel GDBSCAN.

Make sure to add an index. For most data sets, indexes yield considerable speedups. With indexes, the rather poor parallelization of this implementation will surface, and you will see more and more threads waiting for synchronization.
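To illustrate the structure described in this answer (neighbor search done per point without coordination, cluster labeling done under a lock), here is a minimal Python sketch. It is not ELKI's code: the function name `parallel_dbscan_sketch`, the linear-scan range query, and the union-find labeling are all assumptions made for illustration. CPython threads will not actually speed up this CPU-bound loop; the sketch only shows where the synchronization sits.

```python
import math
import threading
from concurrent.futures import ThreadPoolExecutor


def parallel_dbscan_sketch(points, eps, minpts, workers=4):
    """Illustrative only: range queries per task, labeling under one lock."""
    n = len(points)
    parent = list(range(n))   # union-find forest over point indices
    core = [False] * n        # core flags, written only while holding the lock
    neigh = [None] * n        # stored neighbor lists (memory-hungry, sketch only)
    lock = threading.Lock()

    def find(i):
        # Root lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    def process(i):
        # Unsynchronized part: linear-scan range query (an index would go here).
        found = [j for j in range(n) if math.dist(points[i], points[j]) <= eps]
        # Synchronized part: every task has to queue up here to update labels.
        with lock:
            neigh[i] = found
            if len(found) >= minpts:
                core[i] = True
                for j in found:
                    if core[j]:
                        parent[find(j)] = find(i)

    with ThreadPoolExecutor(max_workers=workers) as pool:
        list(pool.map(process, range(n)))

    # Final pass: core points take their root as label, border points take
    # the label of some core neighbor, everything else stays noise (-1).
    labels = [-1] * n
    for i in range(n):
        if core[i]:
            labels[i] = find(i)
        else:
            for j in neigh[i]:
                if core[j]:
                    labels[i] = find(j)
                    break
    return labels


if __name__ == "__main__":
    pts = [(0, 0), (0, 0.5), (0, 1.0), (10, 10), (10, 10.5), (50, 50)]
    print(parallel_dbscan_sketch(pts, eps=0.6, minpts=2))
```

The faster the range query becomes (for example with an index), the larger the share of time each task spends waiting for the lock, which is the effect described above.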

answered Jan 24 '18 at 10:07, edited Jan 24 '18 at 12:32

Thank you very much! I'm trying to run the updated source right now. Could you please help me a bit with the index: I've tried both R-trees and M-trees from the [docs](https://elki-project.github.io/howto/use_indexes), but both ignore the page size: `The following parameters were not processed: [-treeindex.pagesize, 4096]`. Is that OK? – Slowpoke Jan 24 '18 at 13:37
I've updated my posts with the results so far, can you please tell me if I am doing everything right? – Slowpoke Jan 24 '18 at 14:08

M-trees are really slow to build, so you probably won't have much fun with them. Try *bulk-loaded* R-trees and cover trees first. The parameter was renamed because it does not apply only to trees (use `--help`, since online documentation covers different and even unreleased versions and cannot know what you have installed). But at 300 dimensions, you are seeing the "curse of dimensionality", and indexes will likely not help much, because the distance functions no longer discriminate well between near and far points. – Erich Schubert Jan 24 '18 at 15:40
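The loss of discrimination mentioned in this comment can be seen in a small experiment. The following Python sketch is an illustration under stated assumptions, not a statement about the asker's data: it uses uniformly random points (word2vec vectors are distributed differently), and the function name `relative_contrast` is made up for this example. It measures how much the farthest point differs from the nearest one, relative to the nearest distance.

```python
import math
import random


def relative_contrast(dim, n_points=500, seed=0):
    """(d_max - d_min) / d_min for Euclidean distances from one random
    query point to n_points uniformly random points in the unit cube."""
    rng = random.Random(seed)
    query = [rng.random() for _ in range(dim)]
    dists = []
    for _ in range(n_points):
        p = [rng.random() for _ in range(dim)]
        dists.append(math.dist(query, p))
    return (max(dists) - min(dists)) / min(dists)


if __name__ == "__main__":
    # The contrast shrinks as the dimensionality grows.
    for dim in (2, 10, 300):
        print(dim, round(relative_contrast(dim), 3))
```

When the contrast is small, an epsilon range query has little room between "almost nothing matches" and "almost everything matches", so an index can prune only a small part of the data.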
