
I'm using ELKI's LngLatDistanceFunction to cluster longitude/latitude points with DBSCAN, but it returns only one cluster (I got more clusters when I used Euclidean distance). I tried multiple epsilon values, but I still get one cluster.

    int minPts=20;
    double eps=10;
    ListParameterization params = new ListParameterization();
    params.addParameter(DBSCAN.DISTANCE_FUNCTION_ID, LngLatDistanceFunction.class);
    params.addParameter(DBSCAN.Parameterizer.MINPTS_ID, minPts);
    params.addParameter(DBSCAN.Parameterizer.EPSILON_ID, eps);

    params.addParameter(AbstractDatabase.Parameterizer.DATABASE_CONNECTION_ID, dbcon);
    params.addParameter(AbstractDatabase.Parameterizer.INDEX_ID, RStarTreeFactory.class);
    params.addParameter(RStarTreeFactory.Parameterizer.BULK_SPLIT_ID, SortTileRecursiveBulkSplit.class);
    params.addParameter(AbstractPageFileFactory.Parameterizer.PAGE_SIZE_ID, 600);

    Database db = ClassGenericsUtil.parameterizeOrAbort(StaticArrayDatabase.class, params);
    db.initialize();

    GeneralizedDBSCAN dbscan = ClassGenericsUtil.parameterizeOrAbort(GeneralizedDBSCAN.class, params);
MTA
  • You are aware of the *scale* of distances returned by LatLngDistanceFunction? They (necessarily) are not the same scale as Euclidean distance. Randomly trying epsilon values is not a good strategy. – Has QUIT--Anony-Mousse Nov 01 '15 at 18:24
  • I think meters according to http://stackoverflow.com/questions/23684070/using-a-geo-distance-function-on-elki. I know there is a better strategy compared to randomly selecting epsilon but I am just not aware of it right now. Could you please give me some idea on how to achieve that? – MTA Nov 02 '15 at 00:32
  • @Anony-Mousse I did modify the epsilon value to 2 miles = 3218.69 meters but this did not improve the results – MTA Nov 02 '15 at 00:46
  • What is the average distance of your observations? Is a dense cluster found, or do you get a noise cluster? – Has QUIT--Anony-Mousse Nov 02 '15 at 06:51
  • All points are in one cluster. I used the KNNDistanceSampler and this is the result: "MaxX: 760.0 MaxY: 1653.8102316360676 MinX: 1.0 MinY: 299.5734514358746". I'm currently trying to modify the parameters based on the answer below, but no luck yet. – MTA Nov 02 '15 at 16:52

1 Answer


LngLatDistanceFunction returns distances in meters. Therefore, you need to choose epsilon (in meters) such that some, but not all, points have at least minPts neighbors.
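
To get a feeling for the scale, here is a minimal sketch that compares a Euclidean distance on raw degrees with a great-circle distance in meters. It is not ELKI code: the class `GeoScaleSketch`, the method `haversineMeters`, and the mean Earth radius are assumptions made for illustration, and ELKI's own Earth model may give slightly different values.

```java
// Sketch only: great-circle (haversine) distance on a sphere.
// This illustrates the scale of the values; it is not ELKI's implementation.
class GeoScaleSketch {
  // Assumed mean Earth radius in meters.
  static final double EARTH_RADIUS_M = 6_371_008.8;

  /** Distance in meters between two points given as longitude/latitude in degrees. */
  static double haversineMeters(double lng1, double lat1, double lng2, double lat2) {
    double phi1 = Math.toRadians(lat1);
    double phi2 = Math.toRadians(lat2);
    double sinHalfDPhi = Math.sin((phi2 - phi1) / 2);
    double sinHalfDLambda = Math.sin(Math.toRadians(lng2 - lng1) / 2);
    double a = sinHalfDPhi * sinHalfDPhi
        + Math.cos(phi1) * Math.cos(phi2) * sinHalfDLambda * sinHalfDLambda;
    // Clamp to 1.0 to guard against rounding before taking the arcsine.
    return 2 * EARTH_RADIUS_M * Math.asin(Math.min(1.0, Math.sqrt(a)));
  }

  public static void main(String[] args) {
    // Two points that are 0.01 degrees of latitude apart.
    double meters = haversineMeters(0.0, 0.0, 0.0, 0.01);
    double degrees = Math.hypot(0.0, 0.01); // Euclidean distance on raw coordinates
    System.out.printf("Euclidean on degrees: %.4f, great-circle: %.1f m%n", degrees, meters);
  }
}
```

A difference of 0.01 degrees of latitude is roughly 1.1 km, so an epsilon that worked with Euclidean distance on degrees is several orders of magnitude off once distances are in meters. The `eps=10` in the question means a radius of 10 meters.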

You can use the KNNDistancesSampler class to estimate the parameter. It is not an automatic estimation, but you can plot the resulting distances and check for a "knee" in this plot.
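
The curve in question can be sketched in plain Java without ELKI: compute each point's distance to its k-th nearest neighbor, sort these values in descending order, and plot rank against distance. The class `KDistanceSketch`, its methods, and the toy coordinates are hypothetical, and whether the query point itself counts as a neighbor is an assumption (here it does not). This brute-force version uses every point rather than a sample.

```java
import java.util.Arrays;

// Sketch only: brute-force k-distance curve, not the KNNDistancesSampler implementation.
class KDistanceSketch {
  static final double EARTH_RADIUS_M = 6_371_008.8; // assumed mean Earth radius

  /** Haversine distance in meters; inputs are longitude/latitude in degrees. */
  static double haversineMeters(double lng1, double lat1, double lng2, double lat2) {
    double phi1 = Math.toRadians(lat1), phi2 = Math.toRadians(lat2);
    double s1 = Math.sin((phi2 - phi1) / 2);
    double s2 = Math.sin(Math.toRadians(lng2 - lng1) / 2);
    double a = s1 * s1 + Math.cos(phi1) * Math.cos(phi2) * s2 * s2;
    return 2 * EARTH_RADIUS_M * Math.asin(Math.min(1.0, Math.sqrt(a)));
  }

  /** k-NN distance of every point in meters, sorted descending; points[i] = {lng, lat}. */
  static double[] sortedKDistances(double[][] points, int k) {
    int n = points.length;
    if (k < 1 || k >= n) {
      throw new IllegalArgumentException("need 1 <= k < number of points");
    }
    double[] kdist = new double[n];
    double[] row = new double[n - 1];
    for (int i = 0; i < n; i++) {
      int c = 0;
      for (int j = 0; j < n; j++) {
        if (j != i) { // the query point is not its own neighbor here (assumption)
          row[c++] = haversineMeters(points[i][0], points[i][1], points[j][0], points[j][1]);
        }
      }
      Arrays.sort(row);
      kdist[i] = row[k - 1];
    }
    Arrays.sort(kdist);
    // Reverse in place so that rank 0 holds the largest k-distance.
    for (int lo = 0, hi = n - 1; lo < hi; lo++, hi--) {
      double tmp = kdist[lo];
      kdist[lo] = kdist[hi];
      kdist[hi] = tmp;
    }
    return kdist;
  }

  public static void main(String[] args) {
    // Toy data: a tight group, a looser group, and one isolated point.
    double[][] points = {
      {0.000, 0.000}, {0.001, 0.000}, {0.000, 0.001}, {0.001, 0.001},
      {0.100, 0.100}, {0.105, 0.100}, {0.100, 0.105},
      {5.000, 5.000}
    };
    double[] curve = sortedKDistances(points, 2);
    System.out.println("rank\t2-NN distance (m)");
    for (int rank = 0; rank < curve.length; rank++) {
      System.out.println(rank + "\t" + curve[rank]);
    }
  }
}
```

Plotting the two printed columns gives one 2d line. A candidate epsilon is the distance at which the curve bends from the steep part (sparse points that would become noise) into the flat part (points in dense regions).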

Pay attention to the "noise" flag.

  • If you get a single cluster, and it is "noise", then epsilon is too small.
  • If you get a single cluster, and it is a "cluster" (not noise), then epsilon is too large.
  • If you get a single cluster, and it is "noise", then minPts may be too large.
  • If you get a single cluster, and it is a cluster, then minPts may be too small.

For most applications, it is easier to fix minPts to 4, or 10, or 20, and then adjust the epsilon parameter as desired. For geographic applications like yours, it may be much easier to fix the epsilon parameter and vary the minPts parameter instead. For example, you may know that a distance of less than 10000 meters indicates objects to be "neighbors".
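
A minimal sketch of this approach, again in plain Java rather than ELKI: fix epsilon at 10000 meters, count the neighbors of every point, and check what fraction of the points would be core points for several minPts values. The class `MinPtsSketch`, its methods, and the toy coordinates are hypothetical. The sketch assumes the convention that a point counts as its own neighbor and is a core point when it has at least minPts neighbors.

```java
// Sketch only: choosing minPts for a fixed epsilon by counting neighbors.
class MinPtsSketch {
  static final double EARTH_RADIUS_M = 6_371_008.8; // assumed mean Earth radius

  /** Haversine distance in meters; inputs are longitude/latitude in degrees. */
  static double haversineMeters(double lng1, double lat1, double lng2, double lat2) {
    double phi1 = Math.toRadians(lat1), phi2 = Math.toRadians(lat2);
    double s1 = Math.sin((phi2 - phi1) / 2);
    double s2 = Math.sin(Math.toRadians(lng2 - lng1) / 2);
    double a = s1 * s1 + Math.cos(phi1) * Math.cos(phi2) * s2 * s2;
    return 2 * EARTH_RADIUS_M * Math.asin(Math.min(1.0, Math.sqrt(a)));
  }

  /** Number of points within epsMeters of each point, counting the point itself. */
  static int[] neighborCounts(double[][] points, double epsMeters) {
    int n = points.length;
    int[] counts = new int[n];
    for (int i = 0; i < n; i++) {
      for (int j = 0; j < n; j++) {
        double d = haversineMeters(points[i][0], points[i][1], points[j][0], points[j][1]);
        if (d <= epsMeters) {
          counts[i]++;
        }
      }
    }
    return counts;
  }

  /** Fraction of points with at least minPts neighbors, i.e. would-be core points. */
  static double coreFraction(int[] counts, int minPts) {
    int core = 0;
    for (int c : counts) {
      if (c >= minPts) {
        core++;
      }
    }
    return counts.length == 0 ? 0.0 : (double) core / counts.length;
  }

  public static void main(String[] args) {
    // Toy data as {lng, lat}: one dense group, one pair, one isolated point.
    double[][] points = {
      {0.00, 0.00}, {0.01, 0.00}, {0.00, 0.01}, {0.01, 0.01}, {0.02, 0.01},
      {1.00, 1.00}, {1.01, 1.00},
      {5.00, 5.00}
    };
    int[] counts = neighborCounts(points, 10000.0); // epsilon fixed at 10 km
    for (int minPts : new int[] {2, 3, 4, 5, 6}) {
      System.out.println("minPts=" + minPts + " coreFraction=" + coreFraction(counts, minPts));
    }
  }
}
```

If the fraction is 1.0 for a given minPts, every point is a core point and you will likely get one big cluster; if it is 0.0, everything becomes noise. Useful minPts values lie in between.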

Algorithms such as OPTICS are also helpful to choose the parameter visually. (Use the MiniGUI!)

Erich Schubert
  • How can I use KNNDistanceSampler to plot distances? Thanks. – MTA Nov 02 '15 at 16:54
  • The MiniGUI should automatically visualize the graph, or you can use your favorite tool to visualize the resulting curve - it's a simple 2d curve (rank vs. distance). Judging from the numbers you gave above, something like 400 or 500 may be good - but you really should look at the curve, not the bounding box. – Erich Schubert Nov 03 '15 at 10:35
  • I got the XY values of the distances using (XYCurve.Itr it = distanceOrderResult.iterator(); it.valid(); it.advance()), but I don't know where the rank comes from. What characteristic of the curve should I be looking at? – MTA Nov 04 '15 at 13:03
  • See the labels of the axes. – Erich Schubert Nov 04 '15 at 13:09
  • x = Objects and y = 10-NN-distance. I know that it's 10 because of params.addParameter(KNNDistancesSampler.Parameterizer.K_ID, 10); so I'm guessing that x = rank and y = distance, and that I should plot both lines and look at the intersection between them? Sorry, I'm only a beginner trying to familiarise myself with this stuff. – MTA Nov 04 '15 at 13:27
  • It is *one* 2d line, as seen in the DBSCAN paper. – Erich Schubert Nov 05 '15 at 14:42