Implementing a fast DBSCAN in C#

Question

I tried to implement DBSCAN in C# using kd-trees. I followed the implementation from http://www.yzuzun.com/2015/07/dbscan-clustering-algorithm-and-c-implementation/. My code is:

public class DBSCANAlgorithm
{
    private readonly Func<PointD, PointD, double> _metricFunc;


    public DBSCANAlgorithm(Func<PointD, PointD, double> metricFunc)
    {
        _metricFunc = metricFunc;
    }

    public void ComputeClusterDbscan(ScanPoint[] allPoints, double epsilon, int minPts, out HashSet<ScanPoint[]> clusters)
    {
        clusters = null;
        var allPointsDbscan = allPoints.Select(x => new DbscanPoint(x)).ToArray();

        var tree = new KDTree.KDTree<DbscanPoint>(2);
        for (var i = 0; i < allPointsDbscan.Length; ++i)
        {
            tree.AddPoint(new double[] { allPointsDbscan[i].ClusterPoint.point.X, allPointsDbscan[i].ClusterPoint.point.Y }, allPointsDbscan[i]);
        }

        var C = 0;
        for (int i = 0; i < allPointsDbscan.Length; i++)
        {
            var p = allPointsDbscan[i];
            if (p.IsVisited)
                continue;
            p.IsVisited = true;

            DbscanPoint[] neighborPts = null;
            RegionQuery(tree, p.ClusterPoint.point, epsilon, out neighborPts);
            if (neighborPts.Length < minPts)
                p.ClusterId = (int)ClusterIds.NOISE;
            else
            {
                C++;
                ExpandCluster(tree, p, neighborPts, C, epsilon, minPts);
            }
        }
        clusters = new HashSet<ScanPoint[]>(
            allPointsDbscan
                .Where(x => x.ClusterId > 0)
                .GroupBy(x => x.ClusterId)
                .Select(x => x.Select(y => y.ClusterPoint).ToArray())
            );

        return;
    }

    private void ExpandCluster(KDTree.KDTree<DbscanPoint> tree, DbscanPoint p, DbscanPoint[] neighborPts, int c, double epsilon, int minPts)
    {
        p.ClusterId = c;
        for (int i = 0; i < neighborPts.Length; i++)
        {
            var pn = neighborPts[i];
            if (!pn.IsVisited)
            {
                pn.IsVisited = true;
                DbscanPoint[] neighborPts2 = null;
                RegionQuery(tree, pn.ClusterPoint.point, epsilon, out neighborPts2);
                if (neighborPts2.Length >= minPts)
                {
                    neighborPts = neighborPts.Union(neighborPts2).ToArray();
                }
            }
            if (pn.ClusterId == (int)ClusterIds.UNCLASSIFIED)
                pn.ClusterId = c;
        }
    }

    private void RegionQuery(KDTree.KDTree<DbscanPoint> tree, PointD p, double epsilon, out DbscanPoint[] neighborPts)
    {
        int totalCount = 0;
        var pIter = tree.NearestNeighbors(new double[] { p.X, p.Y }, 10, epsilon);
        while (pIter.MoveNext())
        {
            totalCount++;
        }
        neighborPts = new DbscanPoint[totalCount];
        int currCount = 0;
        pIter.Reset();
        while (pIter.MoveNext())
        {
            neighborPts[currCount] = pIter.Current;
            currCount++;
        }

        return;
    }
}

//Dbscan clustering identifiers
public enum ClusterIds
{
    UNCLASSIFIED = 0,
    NOISE = -1
}

//Point container for Dbscan clustering
public class DbscanPoint
{
    public bool IsVisited;
    public ScanPoint ClusterPoint;
    public int ClusterId;

    public DbscanPoint(ScanPoint point)
    {
        ClusterPoint = point;
        IsVisited = false;
        ClusterId = (int)ClusterIds.UNCLASSIFIED;
    }
}

I modified regionQuery(P, eps) to invoke the nearest-neighbour function of a kd-tree. To do so, I used the kd-sharp library for C#, which is one of the fastest kd-tree implementations out there.

However, on a dataset of about 20,000 2D points it takes about 40 s, whereas the scikit-learn Python implementation of DBSCAN, given the same parameters, takes about 2 s.

Since this algorithm is for a C# program that I am writing, I am stuck using C#. What am I still missing in terms of optimizing the algorithm?
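
A side note on `RegionQuery` as posted: it walks the iterator twice (once to count, once to copy) and depends on `Reset()`. The sketch below shows a single-pass alternative. `DrainToArray` is a hypothetical helper name, not part of kd-sharp, and it assumes the object returned by `NearestNeighbors` implements `IEnumerator<T>`, which the `MoveNext()`/`Current` usage above suggests but does not confirm. This may well not be the main cost, so treat it as cleanup rather than a fix.

```csharp
using System;
using System.Collections.Generic;

public static class EnumeratorExtensions
{
    // Hypothetical helper (not part of kd-sharp): walks the enumerator once
    // and returns everything it yields, so no counting pass or Reset() is needed.
    public static T[] DrainToArray<T>(this IEnumerator<T> source)
    {
        var buffer = new List<T>();
        while (source.MoveNext())
        {
            buffer.Add(source.Current);
        }
        return buffer.ToArray();
    }
}
```

Under that assumption, the body of `RegionQuery` could shrink to `neighborPts = tree.NearestNeighbors(new double[] { p.X, p.Y }, 10, epsilon).DrainToArray();`. Separately, if the second argument (`10`) caps the number of neighbours returned, as its position suggests, then `neighborPts.Length` can never exceed 10, which would matter for any `minPts` above that. That is worth checking against the library source.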

Might be better over on code review. Also, you should probably show us your implementation, and just provide a link to the algorithm on wikipedia. It's possible you've made a mistake while writing the algorithm (or we can point out flaws/improvements in your code specific to C#) — Rob, Oct 28 '15 at 03:39
Looks a lot better. However, I'd still recommend posting your question on http://codereview.stackexchange.com/ - It's likely to be closed here, and the people at `codereview` are probably better equipped to find performance issues for you. — Rob, Oct 28 '15 at 03:48
Try running some profiling tools, see where the bottleneck is. Your code looks fine, it could be your library is letting you down — mksteve, Oct 28 '15 at 07:47
@JohnTan - Check to make sure you're not doing something silly like compiling your C# code in DEBUG mode instead of RELEASE mode; that could account for some slow-down. — Louis Ricci, Oct 28 '15 at 11:02
Thanks for the comments. I got the solution from here: http://codereview.stackexchange.com/questions/108965/implementing-a-fast-dbscan-in-c. Apparently the line `neighborPts = neighborPts.Union(neighborPts2).ToArray();` is slowing the system down, rather than `RegionQuery()` — John Tan, Oct 29 '15 at 02:09
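
The last comment points at `neighborPts = neighborPts.Union(neighborPts2).ToArray();`. That line rebuilds the whole neighbour array every time a core point is found, so cluster expansion can become quadratic in cluster size. A common way around this is a work queue plus a per-point "already queued" flag, so each point is enqueued at most once. The following is a minimal sketch under stated assumptions, not the code from the linked review: `DbscanSketch`, `Cluster`, and the `regionQuery` delegate are hypothetical names, points are referred to by index, and `regionQuery` stands in for the kd-tree lookup (it should return the indices of all points within epsilon, including the query point itself). Unlike the posted code, it relabels noise points that turn out to be reachable from a core point, which is how DBSCAN is usually described.

```csharp
using System;
using System.Collections.Generic;

public static class DbscanSketch
{
    public const int Unclassified = 0;
    public const int Noise = -1;

    // regionQuery is a stand-in for the kd-tree lookup (an assumption of this sketch):
    // given a point index, it returns the indices of all points within epsilon.
    public static int[] Cluster(int pointCount, int minPts,
                                Func<int, IReadOnlyList<int>> regionQuery)
    {
        var labels = new int[pointCount];   // every entry starts as Unclassified (0)
        var visited = new bool[pointCount];
        var queued = new bool[pointCount];  // true once a point has entered any work queue
        var clusterId = 0;

        for (var i = 0; i < pointCount; i++)
        {
            if (visited[i]) continue;
            visited[i] = true;

            var neighbors = regionQuery(i);
            if (neighbors.Count < minPts)
            {
                labels[i] = Noise;          // may be relabelled later as a border point
                continue;
            }

            clusterId++;
            labels[i] = clusterId;
            queued[i] = true;

            var queue = new Queue<int>();
            EnqueueNew(neighbors, queued, queue);

            while (queue.Count > 0)
            {
                var q = queue.Dequeue();
                if (!visited[q])
                {
                    visited[q] = true;
                    var qNeighbors = regionQuery(q);
                    // Only core points extend the frontier.
                    if (qNeighbors.Count >= minPts)
                        EnqueueNew(qNeighbors, queued, queue);
                }
                // Unclassified and former-noise points join the current cluster.
                if (labels[q] <= 0) labels[q] = clusterId;
            }
        }
        return labels;
    }

    // Appends only points that have never been queued, replacing Union(...).ToArray().
    private static void EnqueueNew(IReadOnlyList<int> points, bool[] queued, Queue<int> queue)
    {
        foreach (var n in points)
        {
            if (queued[n]) continue;
            queued[n] = true;
            queue.Enqueue(n);
        }
    }
}
```

In the posted class, `regionQuery` would wrap the existing `RegionQuery` call and map each `DbscanPoint` back to its array index. Cluster IDs start at 1, so the final `Where(x => x.ClusterId > 0)` grouping would carry over unchanged.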
